Once you've connected your applications to New Relic and have started exploring our charts and dashboards, a good next step is to create an alert to keep your team updated about any unusual behavior in your data. Alerts elevate your New Relic experience from simply ingesting data to taking thoughtful and effective action.
Here, we'll show you how to create a simple alert so you can start learning New Relic alerting and applied intelligence features.
Create your alert condition from a chart
The easiest way to get started with alerts is to create an alert from a New Relic chart. This route is the same as creating a NRQL alert condition from scratch, but the chart already has a NRQL query for you to work with.
An alert condition is essentially a container that you create to define the criteria that must be met before you're notified of any unusual behavior. For this example, you're going to create an alert that notifies your team of any latency issues with web transaction time.
So, in this case, if you want to make sure that web transactions never take longer than 50 milliseconds, you'll build an alert condition that monitors web transaction time and creates an incident whenever it goes beyond 50 milliseconds.
First, go to the chart labeled Web transactions time and click create an alert condition.
It's important to give your alert condition a descriptive name. Let's say you name this condition response time, then later create another condition for a completely different application and label it response time as well. You'd no longer be able to tell which condition belongs to which application. So, make sure to give your condition a specific and unique name. In this case, we'd name this condition Response time: Example-app.
Once you've named your condition, you can make changes to the NRQL query if you'd like. For your first alert, we recommend leaving the NRQL query as is. If you'd like to learn more about how to use NRQL to customize your queries, visit our docs.
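For reference, the query behind a web transaction time chart usually looks something like the sketch below. Treat it as an illustration only: the event type, attribute, and appName value are assumptions based on this example app, and your chart's actual query may differ. Note that duration on Transaction events is reported in seconds, so a 50-millisecond target corresponds to a value of 0.05.

```sql
SELECT average(duration) FROM Transaction WHERE appName = 'Example-app'
```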
Set thresholds for alert conditions
If an alert condition is a container, then thresholds are the rules that each alert condition contains. As data streams into your system, the alert condition searches for any violations of these rules. If the alert condition sees data coming in from your system that has met all the conditions you've set, then it will create an incident. An incident is a signal that something is off in your system and you should take a look.
Your team is creating an alert condition to look for any latency issues in web transaction time. Now, you're going to create the rules this condition is going to look for.
Your team is creating this alert condition so you'll be notified if web transaction time takes longer than usual. But let's say you don't care how much longer it takes, and you just want to know if transaction time is behaving abnormally. For this specific use case, we'd recommend using our anomaly threshold. Our anomaly detection constantly evaluates your data to understand how your system normally behaves. By setting an anomaly threshold, you can use our anomaly detection to alert your team if web transaction time deviates from its expected performance. Since you want to know about any unusual behavior, you'd select upper and lower so you're notified of deviations in either direction. But if you only want alerts when web transaction time takes longer than usual, you'd select upper only.
Next, you need to set the priority level. The priority level determines what will create an incident. We recommend setting your priority level as critical for your first alert. You can learn more about priority levels in our alert condition docs.
Next, you must choose what defines a critical anomaly threshold breach. For this first alert, we recommend using our default settings and adjusting to your needs as necessary. So, leave the settings to open an incident when a query returns a value that deviates from the predicted value for at least 5 minutes by 3 standard deviations.
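As a rough illustration of what that means: if anomaly detection predicts a web transaction time of 40 milliseconds with a standard deviation of 5 milliseconds, a value that stays above 55 milliseconds (or, with upper and lower selected, below 25 milliseconds) for at least five minutes would breach this threshold. These numbers are hypothetical; the predicted value and standard deviation are calculated continuously from your own data.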
Learn more about anomalies in our anomaly documentation.
Unlike an anomaly threshold, a static threshold doesn't evaluate your data set as a whole to work out what counts as unusual based on your system's history. Instead, a static threshold opens an incident whenever your system behaves differently from the criteria that you set. Static thresholds are much more customizable, and we recommend them if you have a strong sense of your data and what you're looking for.
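For example, with a static threshold you could open an incident whenever the web transaction time query returns a value above 0.05 (50 milliseconds, matching the goal from earlier) for at least 5 minutes. The condition compares each aggregation window's result directly against that fixed number rather than against a predicted value, so you decide exactly where the line is.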
Learn more about our static alert conditions in our NRQL docs.
Sometimes an incoming signal can be lost, and it's important to understand whether it's simply a delay or an indication of a wider problem. Our lost signal threshold sets how many seconds the system should wait from the time the last data point was detected before considering the signal lost. If the signal doesn't return before that time limit, you can choose to open a new incident, close all open incidents, or both. Closing any related open incidents makes sense if the lost signal supersedes other incidents on this entity, or if the signal loss was expected.
Setting your lost signal threshold requires knowledge of your system and what you're looking to understand. In the case of web transaction time, let's say New Relic collects a signal every minute. A lost signal could indicate a much larger latency issue. So, we recommend setting the time to your comfort level and then checking the box to open a new lost signal incident.
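For example, if your application normally reports a data point every minute, you might set the expiration time to something like 10 minutes: long enough that a single delayed data point doesn't open a spurious incident, but short enough to catch a real outage quickly. The exact number is a judgment call based on how your system behaves; 10 minutes here is only a hypothetical starting point.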
Fine-tune advanced signal settings
New Relic constantly observes the data that streams from your applications into our system. But not all applications send signals at the same frequency or cadence: some send data every minute, while others might only report to New Relic once a day. An alert condition is a container designed for a specific use case, and this section is where you can best tailor it to the data you're evaluating.
We're going to customize these advanced signal settings for our condition that is looking for web transaction latency issues.
Setting the window duration for your alert condition tells New Relic how to group your data. If you're creating an alert condition for a data set that sends a signal to New Relic once every hour, you'd want to set the window duration to something closer to sixty minutes because it'll help spot patterns and unusual behavior. But, if you're creating an alert condition for web transaction time and New Relic collects a signal for that data every minute, we'd recommend setting the window duration to one minute.
For your first alert we recommend sticking with our default settings, but as you get more familiar with creating alert conditions, we encourage you to customize these fields based on your own experience.
Throughout the day, data streams from your application into New Relic. Instead of evaluating that data immediately for incidents, alert conditions collect the data over a period of time known as the aggregation window. An additional delay allows for slower data points to arrive before the window is aggregated.
Sliding windows are helpful when you need to smooth out "spiky" charts. A common use case is smoothing line graphs that show a lot of variation over short periods of time, when the rolling aggregate matters more than the aggregate of any one narrow window.
We recommend using our sliding window aggregation if you're not expecting to have a steady and consistent stream of data but are expecting some dips and spikes in data.
You can learn more about sliding window aggregation in this NRQL tutorial or by watching this video.
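If you'd like to preview the smoothing effect, you can run a chart query with a sliding window in the query builder. The sketch below reuses the assumed query from earlier and aggregates 5-minute windows that advance every 1 minute; in an alert condition itself, the sliding window is configured in the condition's signal settings rather than written into the query.

```sql
SELECT average(duration) FROM Transaction WHERE appName = 'Example-app' TIMESERIES 5 minutes SLIDE BY 1 minute
```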
In general, we recommend using the event flow streaming method. This is best for data that comes into your system frequently and steadily. There are specific cases where event timer might be a better method to choose, but for your first alert we recommend our default, event flow. To better understand which streaming method to choose, we recommend watching this brief video.
The delay feature protects you against inconsistent data collection. It gives the alert condition a little wiggle room before deciding to create an incident. If in any given minute your data arrives at New Relic with timestamps from only a single minute, then a low delay setting is enough. On the other hand, if during that minute New Relic receives data points with timestamps from several minutes past or several minutes forward, then your signal is more inconsistent and will need a higher delay setting.
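As a rule of thumb, the delay should at least cover how late your data typically arrives. For example, if data points for a given minute can show up with timestamps as much as two minutes old, a delay of two minutes or more gives those stragglers time to land in the correct aggregation window before it's evaluated. The two-minute figure is only an illustration; base yours on what you actually observe.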
We're creating an alert condition to notify our team of any latency issues with web transaction time. In this case, our application sends New Relic a constant stream of signals with no expected gaps, so we don't need to select a gap-filling strategy. For this use case, and for your first alert, we recommend leaving the gap-filling strategy set to none.
If you have a less frequent or more inconsistent data set, for example one that sends a signal to New Relic only once every twenty-four hours, then we'd recommend customizing this feature based on your team's specific needs.
Learn more about gap-filling strategies in our lost signal docs.
Connect your condition to a policy
If there are any latency issues with web transaction times, we'd like to receive a notification as soon as possible. The quickest and most efficient way to do that is to create an alert condition that opens an incident when web transaction times take too long.
This alert condition is a container that holds all the rules: are we using static or anomaly thresholds? Are we using a sliding window aggregation, or just leaving the evaluation period as is?
At this point, we have a fully defined container and we've set all the rules to make sure an incident opens when we want it to. If our alert condition sees behavior in our system that violates the thresholds we've set, it will create an incident. Now, all we need to do is attach this container to a policy.
The policy is the sorting system for incidents. When you create a policy, you're creating the tool that organizes all of your incoming incidents. You can connect policies to workflows, which tell New Relic where you want all this incoming information to go and how often you want it sent.
If you already have a policy you want to connect to an alert condition, then select the existing policy.
If we want to create a new policy for this alert condition, here's our chance. Remember, a policy is connected to workflows, and workflows control how often we're notified about any incidents. It's a fine balance between learning about issues with web transaction time as quickly as possible and not sending so many alerts that our developers get alert fatigue and start missing important information.
Policies can hold one or multiple conditions. If you're looking to monitor web transaction latency, you have a few options.
First, you could create one issue per policy (the default option). One issue per policy reduces noise, but it also requires immediate action. It means that if you've attached multiple conditions to this policy, not just Response time: Example-app, then all incidents in the policy will be grouped into one single issue, no matter which condition opened them.
Or we could create one issue per condition. This means that any time the Response time: Example-app condition opens an incident, those incidents will be rolled into one issue connected to that condition. For this specific use case, you should choose this option because it meets the primary goal: monitoring latency issues with web transaction time.
Or we could create an issue for every incident. This option is the noisiest, but it can work well if you send incident information to an external system.
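To put these options side by side with a hypothetical scenario: suppose this policy held both Response time: Example-app and a second condition watching error rates. One issue per policy would roll incidents from both conditions into a single issue; one issue per condition would keep latency incidents and error-rate incidents in separate issues; and one issue per incident would turn every individual incident into its own issue and its own notification.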
An incident will automatically close when the targeted signal returns to a non-violating state for the time period indicated in the condition's thresholds. This wait time is called the recovery period.
For example, if the violating behavior is "web transaction time is longer than .50 seconds at least once in 5 minutes," then the incident will automatically close when web transaction time is equal to or lower than .50 for 5 consecutive minutes.
When an incident closes automatically:
The closing timestamp is backdated to the start of the recovery period.
The evaluation resets and restarts from when the previous incident ended.
All conditions have an incident time limit setting that will automatically force-close a long-lasting incident.
We automatically default to 3 days and recommend that you use our default settings for your first alert.
Since we're creating an alert condition that lets us know if there are any latency issues with web transaction time, we want to make sure our developers have all the information they need when they are notified about this incident. We're going to use workflows to notify a team Slack channel when an incident is created.
Learn more about custom incident descriptions in our docs.
If you'd like to link to a runbook, you can put the URL in the runbook URL field.