An alert condition is the core element that defines when an incident is created. It acts as the essential starting point for building any meaningful alert. Alert conditions contain the parameters or thresholds met before you're informed. They can mitigate excessive alerting or tell your team when new or unusual behavior appears.
The Alert conditions list page is the universal hub for all your alert conditions.
Create a new alert condition
An alert condition is a continuously running query that measures a given set of events against a defined threshold and opens an incident when the threshold is met for a specified window of time.
This example demonstrates manually creating a new alert condition using the Alert condition details page. But there are a lot of ways to create an alert condition. You can create an alert condition from:
You can use a NRQL query to define the signals you want an alert condition to use as the foundation for your alert. For this example, you will be using this query:
SELECT average(duration)
FROM PageView
WHERE appName = 'WebPortal'
Using this query for your alert condition tells New Relic you want to know the average pageviews from your WebPortal application. Monitoring pageviews can help you look out for any latency issues in your application.
You can learn more about how to use NRQL, New Relic's query language, see our NRQL documentation.
Fine-tune advanced signal settings
After you've defined your signal, click Run. A chart will appear and display the parameters that you've set.
For this example, the chart will show the average pageviews for your WebPortal application. Click Next and begin configuring your alert condition.
For this example, you will customize these advanced signal settings for the condition you created to monitor an unusual activity for pageviews in your WebPortal application.
The window duration defines how New Relic groups your data for analysis in an alert condition. Choosing the right setting depends on your data's frequency and your desired level of detail:
High-frequency data (for example, pageviews every minute): Set the window duration to match the data frequency (1 minute in this case) for real-time insights into fluctuations and trends.
Low-frequency data (for example, hourly signals): Choose a window duration that captures enough data to reveal patterns and anomalies (for example, 60 minutes for hourly signals).
Remember, you can customize the window duration based on your needs and experience. We recommend using the defaults when starting and experimenting as you become more comfortable creating alert conditions.
Traditional aggregation methods can fall short when dealing with data that's sparsely populated or exhibits significant fluctuations between intervals. Here's how to use sliding window aggregation to analyze such data and trigger timely alerts effectively:
Smooth out the noise: Start by creating a large aggregation window. This window (for example, 5 minutes) acts as a buffer, smoothing out the inherent "noise" or variability in your data. This helps prevent spurious alerts triggered by isolated spikes or dips.
Avoid lag with a sliding window: While a large window helps in data analysis, if you wait for the entire interval to elapse before checking thresholds, you can experience significant delays in alert notifications. We recommend smaller sliding windows (for example, one minute). Imagine this sliding window as a moving frame scanning your data within the larger aggregation window. Each time the frame advances by its smaller interval, it calculates an aggregate value (for example, average).
Set your threshold duration: Now, you can define your alert threshold within the context of the smaller sliding window. This allows you to trigger alerts quickly when the aggregate value in the current frame deviates significantly from the desired range without sacrificing the smoothing effect of the larger window.
You can learn more about sliding window aggregation in this NRQL tutorial.
In general, we recommend using the event flow streaming method. This is best for data that comes into your system frequently and steadily. There are specific cases where event timer might be a better method to choose, but for your first alert, we recommend our default, event flow. We recommend watching this brief video (approx. 5:31 minutes) to understand better which streaming method to choose.
The delay feature in alert conditions safeguards against potential issues arising from inconsistent data collection. It acts as a buffer, allowing extra time for data to arrive and be processed before triggering an alert. This helps prevent false positives and ensures more accurate incident creation.
How it works:
The appropriate delay setting is determined by evaluating the consistency of your incoming data:
Consistent data: A lower delay setting is sufficient if data points consistently arrive with timestamps within a single minute.
Inconsistent data: If data points arrive with timestamps spanning multiple minutes in the past or future, a higher delay setting is necessary to accommodate the inconsistency.
Creating a buffer:
The selected delay setting introduces a waiting period before the alert condition assesses data against defined thresholds.
This buffer allows time for data discrepancies to settle, reducing the likelihood of misleading alerts.
You're creating an alert condition to notify your team of any latency issues with the WebPortal application. In this example, your application consistently sends New Relic data. There is a constant stream of signals being sent from your application to New Relic, and there is no expected gap in signal, so you won't need to select a gap-filling strategy.
Gap-filling strategies address scenarios where data collection might be intermittent or incomplete. They provide a method for substituting missing data points with estimated values, ensuring that alert conditions can still function effectively even with gaps in the data stream.
When to leave gap-filling off:
Consistent data flow: If your application consistently sends data to New Relic without expected gaps, as in the case of the WebPortal application, gap-filling is generally unnecessary. Leaving the gap-filling strategy set to none is often the most appropriate approach in such cases.
Key considerations:
Popular use case: A common use of gap filling is to insert a value of 0 for windows with no data received.
Anomaly thresholds: The gap-filling value is interpreted as the number of standard deviations from the last observed value when using anomaly thresholds. For example, filling gaps with a value of 0 would replicate the last value seen, effectively assuming no change.
Learn more about gap-filling strategies in our lost signal docs.
Set thresholds for alert conditions
If an alert condition is a container, then thresholds are the rules each alert condition must follow. As data streams into your system, the alert condition searches for any incidents of these rules. If the alert condition sees data from your system that has met all the conditions you've set, it will create an incident. An incident signals that something is off in your system, and you should look.
Anomaly thresholds are ideal when you're more concerned about deviations from expected patterns than specific numerical values. They enable you to monitor for unusual activity without needing to set predefined limits. New Relic's anomaly detection dynamically analyzes your data over time, adapting thresholds to reflect evolving system behavior.
Setting up anomaly detection:
Choose upper or lower:
Select upper and lower to be alerted about any higher and lower deviations than expected.
Select upper only to focus solely on unusually high values.
Assign priority:
Set the priority level to critical for your initial alert to ensure prompt attention to potential issues.
Refer to the alert condition docs for more details on priority levels.
Define breach criteria:
Start with the default settings: open an incident when a query returns a value that deviates from the predicted value for at least five minutes by three standard deviations.
Customize these settings as needed to align with your specific application and alerting requirements.
Learn more about anomaly threshold and model behaviors in our anomaly documentation.
Unlike anomaly thresholds, a static threshold doesn't look at your data set as a whole and determines what behavior is unusual based on your system's history. Instead, a static threshold will open an incident whenever your system behaves differently than the criteria that you set.
You need to set the priority level for both anomaly and static thresholds. See the section above for more details.
The loss signal threshold determines how long to wait before considering a missing signal lost. If the signal doesn't return within that time, you can choose to open a new incident or close any related ones. Set the threshold based on your system's expected behavior and data collection frequency. For example, if a lost signal for pageviews could indicate a latency issue, set a threshold you're comfortable with and check the box to open a new lost signal incident.
Add alert condition details
At this point in the process, you now have a fully defined condition and set all the rules to ensure an incident is opened when you want it to be. Based on the settings above, if your alert condition recognizes this behavior in your system that breaches the thresholds that you've set, it will create an incident. Now, all you need to do is to name this condition and attach it to a policy.
The policy is the sorting system for the incident. When you create a policy, you create the tool that organizes all your incoming incidents. You can connect policies to workflows that tell New Relic where you want all this incoming information to go, how often you want it to be sent, and where.
It's essential to give your alert condition a descriptive name. Let's say you name this condition pageviews, and then you create another condition for a completely different application and label that condition pageviews as well. If this occurs, you won't be able to distinguish which condition is for which application. So, make sure to give your condition a specific and unique name. In this case, you'd name this condition pageviews: WebPortal App.
If you already have a policy you want to connect to an alert condition, then select the existing policy.
Balancing responsiveness and fatigue in your alerting strategy is crucial, and you've laid out the key considerations regarding pageview monitoring for your WebPortal application. Let's explore the policy options:
One issue per policy (default):
Pros: Reduces noise and ensures immediate action.
Cons: Groups all incidents under one issue, even if triggered by different conditions. It's not ideal for multiple pageview concerns.
One issue per condition:
Pros: Creates separate issues for each condition, ideal for isolating and addressing specific latency issues.
Cons: Can generate more alerts, potentially leading to fatigue.
An issue for every incident:
Pros: Provides granular detail for external systems but is not optimal for internal consumption due to potential overload.
Cons: It is the noisiest option, and it is challenging to track broader trends and prioritize effectively.
An incident automatically closes when the targeted signal returns to a non-breaching state for the period indicated in the condition's thresholds. This wait time is called the recovery period.
For example, if the breaching behavior is "pageviews are lower than 300 at least once in 5 minutes," the incident will automatically close when pageviews are equal to or higher than 300 for five consecutive minutes.
When an incident closes automatically:
The closing timestamp is backdated to the start of the recovery period.
The evaluation resets and restarts from when the previous incident ended.
All conditions have an incident time limit setting that automatically force-close a long-lasting incident.
New Relic automatically defaults to 3 days and recommends that you use our default settings for your first alert.
Another way to close an open incident when the signal does not return data is by configuring a loss of signal threshold. Refer to the lost signal threshold section above for more details.
Since you're creating an alert condition that lets you know if there are any latency issues with your WebPortal application, you want to make sure your developers have all the information they need when notified about this incident. You will use workflows to notify a team Slack channel when an incident is created.
Learn more about custom incident descriptions in our docs.
If you'd like to link to a runbook, you can put the URL in the runbook URL field.
Edit or improve an existing alert condition
If you want to edit an alert condition you've already created, you can:
From there, you will see the Alert conditions details page. This page contains all the elements you set when you created your condition. You can edit specific aspects of the alert condition by clicking the pencil in the top right of each section.
Signal history
Under Signal history, you can see the most recent results for the NRQL query you used to create your alert condition. For this example, you would see the average pageviews on the WebPortal app for the specific time frame you've set.
For all alert conditions built with NRQL queries, the Signal history will be presented with a line chart.
Any alert condition built with a synthetic monitor will be a bit different. This is because synthetic monitors allow you to ping your application from multiple locations, which can produce positive or negative results each time the monitor runs. This data can only be presented with a table.
Troubleshoot
If you see an empty signal in the history chart, consider one of the following:
Review the condition's settings: Double-check that all elements are configured correctly.
Inspect NRQL queries: Ensure they target valid metrics or events and return data.
Examine entity configuration: Confirm that the entity is set up correctly to send signals.
Consult New Relic documentation: Refer to relevant guides for assistance with specific issues.