An alert condition is the core element that defines when an incident is created. It's the essential starting point for building any meaningful alert. Alert conditions contain the parameters or thresholds that must be met before you're notified. They can mitigate excessive alerting or tell your team when new or unusual behavior appears.
Create a new alert condition
An alert condition is a continuously running query that measures a given set of events against a defined threshold and opens an incident when the threshold is met for a specified window of time.
This example demonstrates manually creating a new alert condition from the Alert condition details page, but there are several other ways to create one.
You can use a NRQL query to define the signals you want an alert condition to use as the foundation for your alert. For this example, you will be using this query:
SELECT average(duration)
FROM PageView
WHERE appName = 'WebPortal'
Using this query for your alert condition tells New Relic you want to know the average latency, or duration, to load pages within your WebPortal application. Proactive alerting on latency, a core golden signal, provides early warnings of potential degradation.
To learn more about how to use NRQL, New Relic's query language, see our NRQL documentation.
Fine-tune advanced signal settings
After you've defined your signal, click Run. A chart will appear and display the parameters that you've set.
For this example, the chart will show the average latency for your WebPortal application. Click Next and begin configuring your alert condition.
For this example, you will customize these advanced signal settings for the condition you created to monitor latency in your WebPortal application.
The window duration defines how New Relic groups your data for analysis in an alert condition. Choosing the right setting depends on your data's frequency and your desired level of detail:
High-frequency data (for example, pageviews every minute): Set the window duration to match the data frequency (1 minute in this case) for real-time insights into fluctuations and trends.
Low-frequency data (for example, hourly signals): Choose a window duration that captures enough data to reveal patterns and anomalies (for example, 60 minutes for hourly signals).
Remember, you can customize the window duration based on your needs and experience. We recommend starting with the defaults and experimenting as you become more comfortable creating alert conditions.
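If you want to preview how a particular window duration will group your data before you commit to it, you can run the underlying query with a matching TIMESERIES bucket. This is a sketch using the example WebPortal query, assuming a 1-minute window:
SELECT average(duration)
FROM PageView
WHERE appName = 'WebPortal'
TIMESERIES 1 minute SINCE 1 hour ago
Each point on the resulting chart approximates one aggregation window, which makes it easier to judge whether 1 minute gives you enough data per window.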
Traditional aggregation methods can fall short when dealing with data that's sparsely populated or exhibits significant fluctuations between intervals. Here's how to use sliding window aggregation to analyze such data and trigger timely alerts effectively:
Smooth out the noise: Start by creating a large aggregation window. This window (for example, 5 minutes) acts as a buffer, smoothing out the inherent "noise" or variability in your data. This helps prevent spurious alerts triggered by isolated spikes or dips.
Avoid lag with a sliding window: While a large window helps in data analysis, if you wait for the entire interval to elapse before checking thresholds, you can experience significant delays in alert notifications. We recommend smaller sliding windows (for example, one minute). Imagine this sliding window as a moving frame scanning your data within the larger aggregation window. Each time the frame advances by its smaller interval, it calculates an aggregate value (for example, average).
Set your threshold duration: Now, you can define your alert threshold within the context of the smaller sliding window. This allows you to trigger alerts quickly when the aggregate value in the current frame deviates significantly from the desired range without sacrificing the smoothing effect of the larger window.
You can learn more about sliding window aggregation in this NRQL tutorial.
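To get a feel for how the two intervals interact, here's a sketch of the same WebPortal query using NRQL's SLIDE BY clause, assuming a 5-minute aggregation window that advances every minute:
SELECT average(duration)
FROM PageView
WHERE appName = 'WebPortal'
TIMESERIES 5 minutes SLIDE BY 1 minute
Each plotted value averages the last 5 minutes of data but is recalculated every minute, which is the smoothing-without-lag behavior described above.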
In general, we recommend the event flow streaming method, which is best for data that arrives frequently and steadily. There are specific cases where event timer might be the better choice, but for your first alert we recommend the default, event flow. To better understand which streaming method to choose, watch this brief video (approx. 5:31 minutes).
The delay feature in alert conditions safeguards against potential issues arising from inconsistent data collection. It acts as a buffer, allowing extra time for data to arrive and be processed before triggering an alert. This helps prevent false positives and ensures more accurate incident creation.
How it works:
The appropriate delay setting is determined by evaluating the consistency of your incoming data:
Consistent data: A lower delay setting is sufficient if data points consistently arrive with timestamps within a single minute.
Inconsistent data: If data points arrive with timestamps spanning multiple minutes in the past or future, a higher delay setting is necessary to accommodate the inconsistency.
Creating a buffer:
The selected delay setting introduces a waiting period before the alert condition assesses data against defined thresholds.
This buffer allows time for data discrepancies to settle, reducing the likelihood of misleading alerts.
You're creating an alert condition to notify your team of any latency issues with the WebPortal application. In this example, your application consistently sends New Relic data. There is a constant stream of signals being sent from your application to New Relic, and there is no expected gap in signal, so you won't need to select a gap-filling strategy.
Gap-filling strategies address scenarios where data collection might be intermittent or incomplete. They provide a method for substituting missing data points with estimated values, ensuring that alert conditions can still function effectively even with gaps in the data stream.
When to leave gap-filling off:
Consistent data flow: If your application consistently sends data to New Relic without expected gaps, as in the case of the WebPortal application, gap-filling is generally unnecessary. Leaving the gap-filling strategy set to none is often the most appropriate approach in such cases.
Key considerations:
Popular use case: A common use of gap filling is to insert a value of 0 for windows with no data received.
Anomaly thresholds: The gap-filling value is interpreted as the number of standard deviations from the last observed value when using anomaly thresholds. For example, filling gaps with a value of 0 would replicate the last value seen, effectively assuming no change.
Learn more about gap-filling strategies in our lost signal docs.
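Before you decide whether you need a gap-filling strategy at all, it can help to check your signal for empty aggregation windows. A sketch, assuming the same WebPortal data and a 1-minute window:
SELECT count(*)
FROM PageView
WHERE appName = 'WebPortal'
TIMESERIES 1 minute SINCE 1 hour ago
If every window returns a non-zero count, leaving gap filling set to none is usually safe; frequent empty windows suggest you should choose a strategy deliberately.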
Set thresholds for alert conditions
If an alert condition is a container, then thresholds are the rules each alert condition must follow. As data streams into your system, the alert condition watches for any breach of these rules. If the alert condition sees data from your system that meets all the criteria you've set, it creates an incident. An incident signals that something is off in your system and you should investigate.
Anomaly thresholds are ideal when you're more concerned about deviations from expected patterns than specific numerical values. They enable you to monitor for unusual activity without needing to set predefined limits. New Relic's anomaly detection dynamically analyzes your data over time, adapting thresholds to reflect evolving system behavior.
Setting up anomaly detection:
Choose upper or lower:
Select upper and lower to be alerted when values deviate above or below what's expected.
Select upper only to focus solely on unusually high values.
Assign priority:
Set the priority level to critical for your initial alert to ensure prompt attention to potential issues.
Define breach criteria:
Start with the default settings: open an incident when a query returns a value that deviates from the predicted value by three standard deviations for at least five minutes.
Customize these settings as needed to align with your specific application and alerting requirements.
Learn more about anomaly threshold and model behaviors in our anomaly documentation.
Unlike an anomaly threshold, a static threshold doesn't look at your data set as a whole or determine what's unusual based on your system's history. Instead, a static threshold opens an incident whenever your system behaves differently from the criteria you set.
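When choosing the value for a static threshold, it helps to look at the signal's recent distribution first. As a sketch, assuming the same WebPortal data, a percentile query shows what your typical worst-case latency looks like over the past week:
SELECT percentile(duration, 95)
FROM PageView
WHERE appName = 'WebPortal'
SINCE 1 week ago
A static threshold set somewhat above that value will catch genuine degradation without firing on routine variation.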
You need to set the priority level for both anomaly and static thresholds. See the section above for more details.
The lost signal threshold determines how long to wait before considering a missing signal lost. If the signal doesn't return within that time, you can choose to open a new incident or close any related ones. You can also choose to skip opening an incident when a signal is expected to terminate. Set the threshold based on your system's expected behavior and data collection frequency. For example, if a website experiences a complete loss of traffic, or throughput, the corresponding telemetry data sent to New Relic will also cease. Monitoring for this loss of signal can serve as an early warning system for such outages.
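For example, a throughput-style query like this (a sketch using the same WebPortal data) is a natural candidate for a lost signal threshold, because if traffic stops entirely, the signal itself disappears:
SELECT count(*)
FROM PageView
WHERE appName = 'WebPortal'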
Add alert condition details
At this point, you have a fully defined condition with all the rules needed to ensure an incident opens when you want it to. If your alert condition sees behavior in your system that breaches the thresholds you've set, it creates an incident. Now all you need to do is name the condition and attach it to a policy.
The policy is the sorting system for the incident. When you create a policy, you create the tool that organizes all your incoming incidents. You can connect policies to workflows that tell New Relic where you want all this incoming information to go, how often you want it to be sent, and where.
A best practice for condition naming involves a structured format that conveys essential information at a glance. Include the following elements in your condition names:
Priority: Indicate the severity or urgency of the alert, like P1, P2, P3.
Signal: Specify the metric or condition being monitored, like High Avg Latency or Low Throughput.
Entity: Identify the affected system, application, or component, like WebPortal App or Database Server.
An example of a well-formed condition name following this structure would be P2 | High Avg Latency | WebPortal App.
If you already have a policy you want to connect to an alert condition, then select the existing policy.
Balancing responsiveness and alert fatigue in your alerting strategy is crucial, especially when monitoring pageviews for your WebPortal application. Let's explore the policy options:
One issue per policy (default):
Pros: Reduces noise and ensures immediate action.
Cons: Groups all incidents under one issue, even if triggered by different conditions. It's not ideal for multiple pageview concerns.
One issue per condition:
Pros: Creates separate issues for each condition, ideal for isolating and addressing specific latency issues.
Cons: Can generate more alerts, potentially leading to fatigue.
An issue for every incident:
Pros: Provides the most granular detail, which can be useful for feeding external systems.
Cons: It's the noisiest option, it can overwhelm internal consumers, and it makes it challenging to track broader trends and prioritize effectively.
An incident automatically closes when the targeted signal returns to a non-breaching state for the period indicated in the condition's thresholds. This wait time is called the recovery period.
For example, if you're measuring latency and the breaching behavior is that duration to load pages in your WebPortal application has increased to more than 3 seconds, the incident will automatically close when duration is equal to or lower than 3 seconds for 5 consecutive minutes.
When an incident closes automatically:
The closing timestamp is backdated to the start of the recovery period.
The evaluation resets and restarts from when the previous incident ended.
All conditions have an incident time limit setting that automatically force-closes a long-lasting incident.
New Relic defaults this to 3 days, and we recommend using the default settings for your first alert.
Another way to close an open incident when the signal does not return data is by configuring a loss of signal threshold. Refer to the lost signal threshold section above for more details.
Since you're creating an alert condition that lets you know if there are any latency issues with your WebPortal application, you want to make sure your developers have all the information they need when notified about this incident. You will use workflows to notify a team Slack channel when an incident is created.
Learn more about custom incident descriptions in our docs.
Using the title template is optional but we recommend it. An alert condition defines a set of thresholds you want to monitor. If any of those thresholds are breached, an incident is created. Meaningful title templates help you pinpoint issues and resolve outages faster.
From there, you'll see the Alert condition details page. This page contains all the elements you set when you created your condition. You can edit specific aspects of the alert condition by clicking the pencil icon in the top right of each section.
Signal history
Under Signal history, you can see the most recent results for the NRQL query you used to create your alert condition. For this example, you would see the average latency on the WebPortal app for the specific time frame you've set.
For all alert conditions built with NRQL queries, the signal history is presented as a line chart.
Any alert condition built with a synthetic monitor will be a bit different. This is because synthetic monitors allow you to ping your application from multiple locations, which can produce positive or negative results each time the monitor runs. This data can only be presented with a table.
Types of conditions
The primary and recommended condition type is a NRQL alert condition, but there are other types of conditions. We've included a complete list of our condition types below.
Anomaly alerting allows you to create conditions that dynamically adjust to changing data and trends, such as weekly or seasonal patterns. This feature is available for APM and browser apps, as well as NRQL queries.
You can set thresholds that open an incident when they are breached by any of your Java app's instance metrics.
By scoping thresholds to specific instances, you can more quickly identify where potential problems are originating. This is useful, for example, to detect anomalies that are occurring only in a subset of your app's instances. These sorts of anomalies are easy to miss for apps that aggregate metrics across a large number of instances.
For Java apps monitored by APM, you can set thresholds that open an incident when the heap size or number of threads for a single JVM is out of the expected operating range.
We evaluate alerting threshold breaches individually for each of the app's selected instances. When creating your condition, select JVM health metric as the type of condition for your Java app's alert policy, then select any of the available thresholds:
Deadlocked threads
Heap memory usage
CPU utilization time
Garbage collection CPU time
Incidents automatically close when the inverse of the threshold is met, but you can also use the UI to change when an incident force-closes for a JVM health metric. The default is 24 hours.
We include the option to define a percentile as the threshold for your condition when your web app's response time is above, below, or equal to this value. This is useful, for example, when Operations personnel want to alert on a percentile for an app server's overall web transaction response time rather than the average web response time.
Select Web transactions percentiles as the type of condition for your app's condition, then select a single app. (To alert on more than one app, create an individual Web transactions percentiles condition for each.)
To define the thresholds that open the incident, type the Percentile nth response time value, then select its frequency (above, below, or equal to this value).
We store the transaction time in milliseconds, although the user interface shows the Critical and Warning values as seconds. If you want to define milliseconds, be sure to include the decimal point in your value.
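If you'd rather express a similar threshold as a NRQL condition, a percentile query gives you the signal to alert on. This is a sketch; the 95th percentile and the WebPortal app name are assumptions, not settings from the condition type described above:
SELECT percentile(duration, 95)
FROM Transaction
WHERE appName = 'WebPortal'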
By applying labels to applications, you can automatically link these entities to your condition. This makes it easy to manage all the applications within a dynamic environment. We recommend using the agent configuration file to best maintain entity labels.
A single label identifies all entities associated with that label (maximum 10,000 entities). Multiple labels only identify entities which share all the selected labels.
Using dynamic targeting with your condition also requires that you set an incident close timer.
To add, edit, or remove up to ten labels for a condition:
Select APM > Application metric as the product type.
When identifying entities, select the Labels tab. Search for a label by name, or select a label from the list of categories.
You can also create conditions directly within the context of what you're monitoring with Infrastructure.
For example, to get a notification when an infrastructure agent stops receiving data, use the host not reporting condition type. This allows you to dynamically alert on filtered groups of hosts and configure the time window from 5 to 60 minutes.
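A NRQL-based alternative, sketched here with an assumed hostname pattern, is to alert on loss of signal from the infrastructure agent's SystemSample events for a filtered group of hosts:
SELECT count(*)
FROM SystemSample
WHERE hostname LIKE 'web-%'
FACET hostname
Pairing this query with a lost signal threshold can behave much like the host not reporting condition type, while letting you reuse the NRQL workflow described earlier.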
From the list of alert conditions, click on the three dots icon of the alert you want to copy and select Duplicate condition.
From the Copy alert condition window, search or scroll the list to select the policy where you want to add this condition.
Optional: Change the condition's name if necessary.
Optional: Click the toggle switch to Enable on save.
Select Copy condition.
By default, the selected alert policy will add the copied condition in the Disabled state. Follow standard procedures to add or copy more conditions to the alert policy, and then Enable the condition as needed. Additionally, the new condition will not copy any tags added to the original condition.
Enable/disable a condition
To disable or re-enable a condition:
Go to one.newrelic.com > All capabilities > Alerts > Alert conditions, then find the condition in the list.
Click the On/Off switch to toggle it.
If you copy a condition, it's automatically saved in the new policy as disabled (Off), even if the condition was enabled (On) in the original policy.
Delete a condition
To turn a condition off but keep it with the policy, disable it. To delete one or more conditions: