With alerts, you can start detecting issues before they turn critical. Using NRQL, our New Relic query language, you can create and customize alert conditions that focus on what you're most concerned about. Use alerts to stay updated about unusual behavior in your data, route notifications to the people who need to see them, and make effective decisions while knowing the root cause of your issue.
Improve your alert coverage by using our recommendations about how to best use alert response, maintenance, quality, and advanced settings. See our Alert quality management guide to learn how to measure and improve your alerting quality.
You can follow these recommended actions to improve and get the most out of your alert configuration.
Improve alert response
What should I do
Have an explanatory name for the alert
The description and tags should make the alert self-descriptive, so you know which service is failing, which environment is involved, which team owns it, and whether end users are affected. This helps you respond faster and decide what to do.
Add tags to your alert conditions
Your issues and incidents carry these tags in their metadata. Use them to build flexible filters in workflows or add them to your notification payload.
Issues have 3 sources of tags:
These events are stored in NRDB. You don't need to worry about the consumption of the nr* tables because they aren't counted as ingest.
Categorize the alert conditions
Across your organization, define alert categories, expectations for handling their notifications, and a unique destination for each. For example: proactive alerts go to Slack to notify before an incident occurs; reactive alerts go to PagerDuty to detect and notify of an ongoing incident; informative alerts go to Jira.
Define the communication and escalation method
Decide how notifications are delivered. Common methods include email, Slack, PagerDuty, and Jira.
Add a responsible team
This team is in charge of handling the first notification.
Add a runbook URL to every alert condition
The runbook must describe remediation steps to follow and who to involve or escalate to.
Prioritize and triage your alert notifications faster by providing additional metrics specific to the issue.
Improve alert maintenance
What should I do
Organize your policies
We recommend creating a policy for each separate destination or audience that needs to receive a notification. Consider grouping by entity, service, or technology to match the focus of your respective teams.
Add an owning team to every alert condition
The owning team guarantees that the alert remains relevant. They approve any changes to the condition.
Schedule a periodic review of alert conditions
Use the Alerts overview page to check the incidents created and decide what action to take. We recommend tagging the condition with the last review date, which lets you identify obsolete alerts.
Automate your alert creation using Terraform
You'll prevent undocumented changes and gain clear traceability.
Improve alert quality
What should I do
Have SLIs and SLOs in a report
SLI and SLO breaches aren't always incidents and don't require an alert unless you've documented steps to prevent them. SLIs and SLOs can highlight the areas where the team should focus on improvements rather than responding to an event.
Mute alerts during maintenance
It will suppress noisy notifications.
Have a systemic approach when defining alerts for a service
This helps you make sure you cover your entire stack in a consistent way. You can organize your alerts by technology, by the teams involved, and so on.
Review suggested decisions
Every day we analyze your alert data and provide suggested decisions as well as feedback on existing decisions. This will improve noise reduction.
Identify and tune flapping alerts
These alerts indicate a poorly configured alert condition that creates noise. They may still indicate a long-running problem in your system, but this isn't an incident.
Increase the threshold or window duration and use sliding window aggregation
Alerts that self-resolve before your team can take any action clutter your inbox and create noise. Smooth out temporary spikes in the condition itself, and use a dashboard if you want to see short-lived spikes.
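As a rough illustration of why a wider window smooths out short spikes, the following Python sketch compares per-minute values against a threshold with and without a sliding average. The function name `sliding_average`, the sample values, and the threshold are hypothetical. This is a simplified model, not how New Relic computes sliding window aggregation internally.

```python
from collections import deque


def sliding_average(values, window):
    """Return the mean of the most recent `window` values at each step."""
    recent = deque(maxlen=window)
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages


# Hypothetical per-minute CPU readings with a single one-minute spike.
cpu_percent = [40, 42, 41, 95, 43, 40, 41]
threshold = 80

raw_breaches = [v for v in cpu_percent if v > threshold]
smoothed_breaches = [
    v for v in sliding_average(cpu_percent, window=3) if v > threshold
]

print(len(raw_breaches))       # 1: the spike crosses the raw threshold
print(len(smoothed_breaches))  # 0: the 3-minute average never crosses it
```

In this model, the one-minute spike would open and quickly close an incident on the raw values, while the averaged values stay below the threshold.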
You'll understand the impact this will have on your incidents closing.
Leverage your custom tags and metadata in decisions.
Related incidents will be correlated into a single, comprehensive issue.
Keep the default decisions enabled when you first get started. Within a couple of weeks, you'll be able to assess the efficacy of these decisions.
If you use decisions, increase the notification grace period
You'll get more incidents correlated with your issues. You'll get richer and more actionable context from the first notification. The longer the grace period, the later the notification.
Periodically review the issue feed
Scroll through the Notified column to make sure all issues are routed to at least one destination. If no routing is necessary, consider removing the condition as it may be noise.
Alert condition creation and advanced settings
If you're new to alerts or if you want suggestions that optimize your alert coverage, pay attention to these recommendations:
- Get recommended alerts by technology
- Let New Relic find your coverage gaps
- Get condition recommendations
Every alert condition generates a signal, or multiple signals if the condition contains a facet clause. Each facet value creates a distinct signal.
You can query all signals in NrAiSignal. This allows you to get details about the value that was observed, how many data points were considered, and in the case of anomaly conditions, the expected value and standard deviation. It also gives information about the time delta between New Relic time and your raw data time (if your data is timestamped) which can help you find the most accurate delay setting when creating your conditions.
Maintaining entity health
We use signals to infer the health and alert coverage of an entity. If the results of an alert condition contain data from only one entity, New Relic will tie it to the health of that entity, and these events will show in context in the New Relic UI.
For most conditions, we recommend making sure a signal always exists. Without a signal, New Relic may show a grey (unknown) health status for some entities and add them to the list of entities that aren't covered.
If the where clause of your condition excludes all the data, there will be no data left. This is a loss of signal for New Relic, and the alert condition can't be evaluated against any threshold. It means that the result of the NRQL query doesn't have data; it doesn't mean that New Relic isn't receiving data. If you want to receive a notification in this case, you must add a loss of signal threshold.
Use the most generic filters in your where section and the most specific ones in your select section. Use the filter function to accurately measure what you care about. For example:
SELECT filter(count(*), WHERE ErrorCode = 123) FROM Transaction WHERE AppName = 'Application1' AND Environment = 'Production'
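To see why the placement of the filter matters, here is a minimal Python sketch that models the two query shapes over a handful of made-up events. The function names and sample data are hypothetical, and `None` stands in for "no signal". It simplifies the behavior described above and doesn't reproduce New Relic's evaluation logic.

```python
# Hypothetical Transaction events; attribute names mirror the query above.
events = [
    {"AppName": "Application1", "Environment": "Production", "ErrorCode": 0},
    {"AppName": "Application1", "Environment": "Production", "ErrorCode": 0},
    {"AppName": "Application2", "Environment": "Production", "ErrorCode": 123},
]


def count_with_where_only(rows):
    """Every filter in the where clause: no matching rows means no signal."""
    matched = [
        r for r in rows
        if r["AppName"] == "Application1"
        and r["Environment"] == "Production"
        and r["ErrorCode"] == 123
    ]
    return len(matched) if matched else None


def count_with_filter(rows):
    """Generic filters in where, specific one in filter(): 0 is still a value."""
    scoped = [
        r for r in rows
        if r["AppName"] == "Application1" and r["Environment"] == "Production"
    ]
    if not scoped:
        return None  # the where clause excluded everything: still no signal
    return sum(1 for r in scoped if r["ErrorCode"] == 123)


print(count_with_where_only(events))  # None
print(count_with_filter(events))      # 0
```

In this model, the second shape keeps producing a value of 0 while the application is healthy, so the condition can still be evaluated against a threshold.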
Alert delay or timer duration
Adjust the delay or timer to match your data's behavior. A short delay may trigger false alerts because of incomplete data, and a long delay may increase the time it takes for you to be notified. New Relic doesn't know how much data to expect or how late that data might reach a New Relic endpoint. Depending on the log shipper and its configuration, log data can be batched, and low-volume logs can see significant delays.
Set your condition thresholds
Set meaningful threshold levels to optimize alerts for your business. Here are some suggested guidelines:
Set threshold levels
Avoid setting thresholds too low. For example, if you set a CPU condition threshold of 75% for 5 minutes on your production servers, and it routinely goes over that level, this will increase the likelihood of un-actionable alerts or false positives.
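One way to sanity-check a candidate threshold is to replay it against historical values and count how often it would have fired. The Python sketch below does this for a "above X for N consecutive minutes" rule. The function name `count_sustained_breaches` and the sample readings are hypothetical, and the logic is a simplified stand-in for real condition evaluation.

```python
def count_sustained_breaches(values, threshold, duration):
    """Count runs where values stay above `threshold` for at least `duration` samples."""
    breaches = 0
    run_length = 0
    for value in values:
        if value > threshold:
            run_length += 1
            if run_length == duration:
                breaches += 1  # count each sustained run once
        else:
            run_length = 0
    return breaches


# Hypothetical per-minute CPU readings for a busy production host.
cpu_percent = [78, 80, 79, 81, 77, 60, 76, 79, 82, 80, 78, 70]

print(count_sustained_breaches(cpu_percent, threshold=75, duration=5))  # 2
print(count_sustained_breaches(cpu_percent, threshold=90, duration=5))  # 0
```

If a threshold fires repeatedly on data that reflects normal operation, as the 75% rule does here, it's probably set too low to be actionable.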
Experimenting with settings
You don't need to edit files or restart the software, so feel free to make quick changes to your threshold levels and adjust as necessary.
Adjust your conditions over time.
You can disable any condition in a policy, if necessary. This is useful, for example, if you want to continue using other conditions in the policy while you experiment with other metrics or thresholds.
The color-coded health status indicator in the New Relic UI changes as the alerting threshold escalates or returns to normal. This allows you to monitor a situation through the UI before a critical threshold is reached, without needing to receive specific notifications about it. There are two incident thresholds: critical (red) and warning (yellow). Define these thresholds with different criteria, keeping in mind the suggestions above.
Ensure your daily batch jobs run
You can set up an alert condition to receive a notification if your batch jobs fail to run.
Assuming you're sending an event to New Relic as part of your batch job, you can set up an alert condition to notify you if your batch jobs fail to run.
Set up a simple count query on the event:
SELECT count(*) FROM MyBatchEvent
Set Loss of Signal to open a new incident after 24 hours and 30 minutes. You can adjust this, but it's a good idea to allow for a late-running batch job.
Make sure to use the Event Timer streaming aggregation method. Since you'll only ever get 1 data point every 24 hours, you can set the timer to its lowest setting, 5 seconds.
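The loss of signal setting above boils down to a simple time comparison. The following Python sketch shows that check with a 24 hour and 30 minute window. The function name `batch_job_is_overdue` and the timestamps are hypothetical. It illustrates the idea only and isn't part of any New Relic API.

```python
from datetime import datetime, timedelta

# Mirrors the setting above: 24 hours plus a 30-minute allowance for late jobs.
LOSS_OF_SIGNAL_WINDOW = timedelta(hours=24, minutes=30)


def batch_job_is_overdue(last_event_time, now):
    """Return True when no batch event has arrived within the window."""
    return now - last_event_time > LOSS_OF_SIGNAL_WINDOW


# Hypothetical timestamps: the last batch event arrived at 02:00 on January 1.
last_run = datetime(2024, 1, 1, 2, 0)

print(batch_job_is_overdue(last_run, datetime(2024, 1, 2, 2, 15)))  # False
print(batch_job_is_overdue(last_run, datetime(2024, 1, 2, 2, 45)))  # True
```

A job that finishes 15 minutes late stays inside the allowance, while one that is 45 minutes late would open an incident.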
Use non-null values when there's no signal
By default, gaps in data signals are filled with null values. In cases where you need to be able to create conditions based on those data gaps, you can fill gaps with a custom value or the last known value. You can configure this setting by condition in the UI or configure gap filling values via NerdGraph.
Configuring gap-filling doesn't prevent the 'Loss of Signal' from triggering.
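To make the three gap-filling behaviors concrete, here is a small Python sketch that applies each one to a series with missing data points. The function name `fill_gaps` and the strategy labels `"none"`, `"static"`, and `"last_value"` are illustrative names chosen for this example, not configuration values taken from the UI or NerdGraph.

```python
def fill_gaps(points, strategy="none", static_value=0):
    """Fill None entries with nothing, a fixed value, or the last known value."""
    filled = []
    last_seen = None
    for point in points:
        if point is not None:
            last_seen = point
            filled.append(point)
        elif strategy == "static":
            filled.append(static_value)
        elif strategy == "last_value":
            filled.append(last_seen)  # stays None until a first value is seen
        else:
            filled.append(None)
    return filled


# Hypothetical aggregated values; None marks a window with no data.
series = [5, None, None, 7, None]

print(fill_gaps(series))                # [5, None, None, 7, None]
print(fill_gaps(series, "static", 0))   # [5, 0, 0, 7, 0]
print(fill_gaps(series, "last_value"))  # [5, 5, 5, 7, 7]
```

A static fill of 0 suits counts where "no data" really means "nothing happened", while a last-value fill suits gauges that are expected to hold steady between reports.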
Define your issue creation preferences
Decide when you get issue notifications so you can respond to issues when they happen.
If you're new to alerts, learn more about your issue preferences options.
The default issue preference setting combines all conditions within a policy into one issue. Change your default issue preference setting to increase or decrease the number of issues and issue notifications you receive.
Each team within your organization can have different needs. Ask your team 2 important questions when deciding your issue preferences:
- Do we want to be notified every time something goes wrong?
- Do we want to group all similar notifications together and be notified once?
When a policy and its conditions have a broader scope, like managing the performance of several entities, increase the number of issues you receive. You may need more notifications because two issues aren't necessarily related to each other.
When a policy and its conditions have a focused scope, like managing the performance of one entity, opt for the default issue preference. You need fewer notifications when 2 issues are related to each other or when the team is already notified and fixing an existing problem.
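The effect of this choice on notification volume can be sketched as a grouping problem. The Python example below groups the same incidents first by policy, as the default preference does, and then by policy and condition. The finer-grained grouping key, the function name `group_incidents`, and the sample incidents are assumptions made for illustration.

```python
from collections import defaultdict


def group_incidents(incidents, key_fields):
    """Group incident ids into issues keyed by the chosen fields."""
    groups = defaultdict(list)
    for incident in incidents:
        key = tuple(incident[field] for field in key_fields)
        groups[key].append(incident["id"])
    return dict(groups)


# Hypothetical incidents opened under a single policy.
incidents = [
    {"id": 1, "policy": "checkout", "condition": "error rate"},
    {"id": 2, "policy": "checkout", "condition": "latency"},
    {"id": 3, "policy": "checkout", "condition": "latency"},
]

by_policy = group_incidents(incidents, ["policy"])
by_condition = group_incidents(incidents, ["policy", "condition"])

print(len(by_policy))     # 1 issue: everything in the policy is combined
print(len(by_condition))  # 2 issues: one per condition
```

Coarser grouping means fewer notifications for a focused policy, and finer grouping keeps unrelated problems from being hidden inside one issue.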
To learn more about using alerts:
- Learn about Alerting concepts and terms.
- Learn about the API.
- Read technical details about min/max limits and rules.
- Read more about when you might want to use loss-of-signal and gap-filling settings.