• English日本語한국어
  • Log inStart now

Alerts best practices

With alerts, you can start detecting issues before they turn critical. Using NRQL, our New Relic query language, you can create and customize your alerts conditions that focus on what you're most concerned about. Use alerts to be updated about unusual behavior in your data, route notifications to the people who need to see it, and make effective decisions while knowing the root cause of your issue.

Improve your alert coverage by using our recommendations about how to best use alert response, maintenance, quality, and advanced settings. See our Alert quality management guide to know how to measure and improve your alerting quality.

You can follow these recommended actions to improve and get the most out of your alert configuration.

Tip

Have you created your first alert yet? If not, check out our tutorial series for all the steps you need to get started.

Improve alert response

What should I do

Benefit

Have an explanatory name for the alert

The description and the tags must give you self-descriptive alerts to know which service is wrong, which environment is involved, which team owns it, and if it's impacting to end users. It helps you to answer faster and decide what to do.

Add tags to your alert conditions

Your issues and incidents have these tags in their metadata. Use them to do flexible filters in workflows or add them to your notification payload.

Issues have 3 sources of tags:

  • The tags defined on the alert condition.
  • The current values from the facet/where clause from the alert condition.
  • The tags from the breaching entity if the alerts results are scoped to a unique entity. If the incident isn't scoped to the entity, entity tags won't be brought over.

These events are stored in NRDB. Don't worry about the nr* tables consumption because they aren't counted as ingest.

Categorize the alert conditions

Across your organization define alert categories, expectations for handling their notifications, and a unique destination. For example, proactive to Slack to notify before the incident occurs; reactive to PagerDuty to detect and notify of an ongoing incident; or informative to Jira.

Define the communication and escalation method

Decide the means of notifications. Some methods of notifications are: email, Slack, PagerDuty, or Jira.

Add a responsible team

This team is in charge of handling the first notification.

Add a runbook url to every alert condition

The runbook must describe remediation steps to follow and who to involve or escalate to.

Use enrichments

Prioritize and triage your alert notifications faster by providing additional metrics specific to the issue.

Improve alert maintenance

What should I do

Benefit

Organize your policies

We recommend creating a policy for each separate destination or audience needs to receive a notification. Consider grouping by entity, service, or technology to match the focus of your respective teams.

Add an owning team to every alert condition

The owning team guarantees that the alert remains relevant. They approve any changes to the condition.

Schedule a periodical review of alert conditions

Use the Alerts overview page to check the incidents created and decide the action to do. We recommend you to tag the condition with the last review date, which will allow you to identify obsolete alerts.

Automate your alert creation using Terraform

You'll prevent undocumented changes and clear traceability.

Improve alert quality

What should I do

Benefits

Have SLIs and SLOs in a report

SLIs and SLOs breaches aren't always incidents and don't require an alert unless you've documented steps to prevent them. SLI/SLO can highlight the area where the team should focus on for improvements rather than responding to an event.

Mute alerts during maintenance

It will suppress noisy notifications.

Have a systemic approach when defining alerts for a service

Helps you make sure you cover your entire stack in a consistent way. You can organize your alerts by technology, teams involved, etc.

Review suggested decisions

Every day we analyze your alert data and provide suggested decisions as well as feedback on existing decisions. This will improve noise reduction.

Identify and tune flapping alerts

These alerts indicate a poor alert condition configuration that creates noise. They may still indicate a long going problem in your system, but this isn't an incident.

Increase the threshold or window duration and use sliding window aggregation

Alerts that self-resolve before your team could take any action can clutter your inbox and create noise. Use a dashboard if you want to see short-lived spikes and smooth out temporary spikes.

You'll understand the impact this will have on your incidents closing.

Use decisions

Leverage your custom tags and metadata in decisions.

Related incidents will be correlated into a single, comprehensive issue.

Keep the default decisions enabled when you first get started. Within a couple of weeks, you'll be able to assess the efficacy of these decisions.

If you use decisions, increase the notification grace period

You'll get more incidents correlated with your issues. You'll get richer and more actionable context from the first notification. The longer the grace period, the later the notification.

Periodically review the issue feed

Scroll through the Notified column to make sure all issues are routed to at least one destination. If no routing is necessary, consider removing the condition as it may be noise.

Alert condition creation and advanced settings

If you're new to alerts or if you want suggestions that optimize your alert coverage, pay attention to these recommendations:

Understand signal

Every alert condition generates a signal or multiple signals if the condition contains a facet clause. Each possible facet value will create a distinct signal.

You can query all signals in NrAiSignal. This allows you to get details about the value that was observed, how many data points were considered, and in the case of anomaly conditions, the expected value and standard deviation. It also gives information about the time delta between New Relic time and your raw data time (if your data is timestamped) which can help you find the most accurate delay setting when creating your conditions.

Maintaining entity health

We use signals to infer the health and alert coverage of an entity. If the results of an alert condition contain data from only one entity, New Relic will tie it to the health of that entity, and these events will show in context in the New Relic UI.

It's recommended, for most conditions, to maintain the existence of a signal. No signal may result in New Relic showing grey (unknown) health status for some entities as well as adding these entities to the list of not covered entities.

If the where clause of your condition excludes all the data, there will be no data left. This is a loss of signal for New Relic and the alert condition CAN'T be evaluated against ANY threshold. It means that the result of the NRQL query doesn't have data, but it doesn't mean that New Relic is not receiving data. If you want to receive a notification, you must add a loss of signal threshold.

Use the most generic filters in your where section and the most specific ones in your select section. Use the filter function to accurately measure what you care about. For example:

Select filter(count(*), where ErrorCode=123) from Transaction where AppName='Application1' and Environment='Production'

Alert delay or timer duration

Try to adjust the delay/time with your data's behavior. A short delay may trigger false alerts because of incomplete data and a large delay may increase the time you get notified. New Relic doesn't know how much data to expect nor how late that data might reach an endpoint of New Relic. Depending of the log shipper and configuration, log data can be batched and see significant delay for low volumes logs.

Set your condition thresholds

Set meaningful threshold levels to optimize alerts for your business. Here are some suggested guidelines:

Action

Recommendations

Set threshold levels

Avoid setting thresholds too low. For example, if you set a CPU condition threshold of 75% for 5 minutes on your production servers, and it routinely goes over that level, this will increase the likelihood of un-actionable alerts or false positives.

Experimenting with settings

You don't need to edit files or restart the software, so feel free to make quick changes to your threshold levels and adjust as necessary.

Adjust settings

Adjust your conditions over time.

  • Tighten your thresholds as long as you use New Relic to keep pace with your improved performance.
  • If you're rolling out something that you know will negatively impact your performance for a period of time, loosen your thresholds to allow for this.

Disable settings

You can disable any condition in a policy, if necessary. This is useful, for example, if you want to continue using other conditions in the policy while you experiment with other metrics or thresholds.

The color-coded health status indicator in the UI of New Relic changes as the alerting threshold escalates or returns to normal. This allows you to monitor a situation through the UI before having a critical threshold, without needing to receive specific notifications about it. There're two incident thresholds: critical (red) and warning (yellow). Define these thresholds with different criteria, keeping in mind the above-mentioned suggestions.

Ensure your daily batch jobs run

You can set up an alert condition to receive a notification if your batch jobs fail to run.

Assuming you're sending an event to New Relic as part of your batch job, you can set up an alert condition to notify you if your batch jobs fail to run.

  1. Set up a simple count query on the event.

    SELECT count(*) FROM MyBatchEvent
  2. Set Loss of Signal to open a new incident after 24 hours and 30 minutes. You can adjust this, but it's a good idea to allow for a late-running batch job.

  3. Make sure to use the Event Timer streaming aggregation method. Since you'll only ever get 1 data point every 24 hours, you can set the timer to its lowest setting, 5 seconds.

Use non-null values when there's no signal

By default, gaps in data signals are filled with null values. In cases where you need to be able to create conditions based on those data gaps, you can fill gaps with a custom value or the last known value. You can configure this setting by condition in the UI or configure gap filling values via NerdGraph.

Important

Configuring gap-filling doesn't prevent the 'Loss of Signal' from triggering.

Define your issue creation preferences

Decide when you get issue notifications so you can respond to issues when they happen.

If you're new to alerts, learn more about your issue preferences options.

The default issue preference setting combines all conditions within a policy into one issue. Change your default issue preference setting to increase or decrease the number of issues and issue notifications you receive.

Each team within your organization can have different needs. Ask your team 2 important questions when deciding your issue preferences:

  • Do we want to be notified every time something goes wrong?
  • Do we want to group all similar notifications together and be notified once?

When a policy and its conditions have a broader scope, like managing the performance of several entities, increase the number of issues you receive. You can need more notifications because 2 issues can't necessarily relate to each other.

When a policy and its conditions have a focused scope, like managing the performance of one entity, opt for the default issue preference. You need fewer notifications when 2 issues are related to each other or when the team is already notified and fixing an existing problem.

What's next?

To learn more about using alerts:

Copyright © 2023 New Relic Inc.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.