
Alert quality management: Optimize your alerts and reduce alert fatigue

This guide walks you through improving and optimizing the quality of your alerting. It's part of our series on observability maturity.

Overview

Teams suffer from alert fatigue when they experience high alert volumes and alerts that aren't aligned to business impact. High alert volumes train incident responders to assume that alerts are false and have no business impact. In turn, responders may start to prioritize easy-to-resolve alerts over others, and they may close unresolved incidents to stay within their SLA targets. The result is slower incident response, and greater scope and severity when issues with real business impact occur.

Alert quality management (AQM) focuses on reducing the number of nuisance incidents so that you focus only on alerts with true business impact. This reduces alert fatigue and ensures that you and your team focus your attention on the right places at the right times.

You're a good candidate for AQM if:

  • You have too many alerts.
  • You have alerts that stay open for long time periods.
  • Your alerts are not relevant.
  • Your customers discover your issues before your monitoring tools do.
  • You can't see the value of your observability tool(s).

Desired outcome

By using an alert strategy based on measuring business impact, you'll decrease response time and increase awareness of critical events. As you improve your alert signal-to-noise ratio, you'll reduce confusion and be able to rapidly identify and isolate the root cause of your problems.

The overall goal of alert quality management is to ensure that fewer, more valuable incidents are created. This will result in:

  • Increased uptime and availability
  • Reduced mean time to resolution (MTTR)
  • Decreased alert volume
  • The ability to easily identify alerts that are not valuable, so you can either make them valuable or remove them.

This guide steps you through the process of generating the key performance indicators that you will use to measure progress towards these goals. The KPIs are measured in real time, published in a dashboard, and are used to drive a continuous improvement process you'll use to identify and reduce nuisance alerts and increase engagement in incident investigation.

The alert quality management practice does not encompass anomaly detection or AIOps, which are designed to detect unknown or unexpected modes of failure. The two practices (AQM and ML/AI) work hand in hand; they're not mutually exclusive.

Key performance indicators

Note: Prior releases of this implementation guide relied on a custom event (nrAQMIncident) which was generated by a webhook to collect the required KPIs. That dependency has changed. AQM now uses the default NrAiIncident and NrAiIssue event types instead. See the AQM migration guide section for more details.

You'll use the AQM process to collect and measure incident volume and engagement KPIs.

Incident volume, which includes:

  • Incident count
  • Accumulated incident time
  • Mean time to close (MTTC)
  • Percent under 5 minutes

Incident engagement, which includes:

  • Mean time to investigate (MTTI)
  • % of incidents investigated

These KPIs will help you find the noisiest and least valuable alerts so you can improve their value or eliminate them. You will then use the long-term metric trends to show real business impact to management and stakeholders.

Detailed information on these metrics follows.

Incident volume

You should treat incidents (with or without alerts) like a queue of tasks. As with any healthy queue, the number of open incidents should spend most of its time near zero. Each incident should be a trigger for an investigatory or corrective action to resolve the condition. If an alert does not result in some sort of action, then you should question the value of the alert condition.

In particular, if you see specific incidents that are "always on", then you should question why. Are you in a constant state of business impact, or do you simply have a large volume of noise? The alert volume KPIs help you to answer those questions and to help you measure progress towards a healthy state of high quality alerting.
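As a rough illustration of how the four incident volume KPIs relate to one another, the following Python sketch computes them from a small batch of closed incidents. The `Incident` record and its field names are hypothetical stand-ins for exported incident data, not the schema of any New Relic event type.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    policy: str
    opened_at: datetime
    closed_at: datetime


def volume_kpis(incidents: list[Incident]) -> dict[str, float]:
    """Compute the four incident volume KPIs for a batch of closed incidents."""
    if not incidents:
        return {"count": 0, "accumulated_minutes": 0.0,
                "mttc_minutes": 0.0, "pct_under_5_min": 0.0}
    # Duration of each incident in minutes.
    durations = [(i.closed_at - i.opened_at).total_seconds() / 60
                 for i in incidents]
    short_lived = sum(1 for d in durations if d < 5)
    return {
        "count": len(incidents),
        "accumulated_minutes": sum(durations),
        "mttc_minutes": sum(durations) / len(durations),
        "pct_under_5_min": 100 * short_lived / len(durations),
    }


start = datetime(2024, 1, 1, 9, 0)
sample = [
    Incident("disk", start, start + timedelta(minutes=2)),
    Incident("disk", start, start + timedelta(minutes=4)),
    Incident("cpu", start, start + timedelta(minutes=60)),
    Incident("cpu", start, start + timedelta(minutes=14)),
]
kpis = volume_kpis(sample)
print(kpis)
```

In this sample, half of the incidents close in under 5 minutes, which is the kind of pattern that usually points to noise rather than sustained business impact.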

User engagement

You should measure the value of an incident by the amount of attention it receives. Here, we measure engagement by whether or not an incident has been acknowledged.

The amount of engagement an individual alert receives is a direct measurement of its value. More engagement implies a valuable alert. Less (or zero) engagement implies a nuisance alert that should be modified or disabled.

There is a significant difference between the moment of incident awareness and the moment resolution activity begins. If you are using an integration with New Relic alerts, be sure that the "acknowledge" event that is sent to New Relic is triggered when resolution activity begins, not when the incident is sent to the external incident management tool.

For more information regarding standard incident management processes, see "Incident management process: 5 steps to effective resolution," posted on August 31, 2020 by OnPage Corporation, in reference to ITIL4.
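The engagement KPIs can be sketched in the same spirit. This example assumes that MTTI is measured from the time an incident opens to the time it is acknowledged, and that an incident counts as investigated only if it was acknowledged. The record type and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    opened_at: datetime
    acknowledged_at: Optional[datetime]  # None means nobody acknowledged it


def engagement_kpis(incidents: list[Incident]) -> dict[str, Optional[float]]:
    """Return percent investigated and mean time to investigate (minutes)."""
    acked = [i for i in incidents if i.acknowledged_at is not None]
    pct = 100 * len(acked) / len(incidents) if incidents else 0.0
    if acked:
        total_seconds = sum(
            (i.acknowledged_at - i.opened_at).total_seconds() for i in acked
        )
        mtti = total_seconds / len(acked) / 60
    else:
        mtti = None  # No acknowledgements, so MTTI is undefined.
    return {"pct_investigated": pct, "mtti_minutes": mtti}


start = datetime(2024, 1, 1, 9, 0)
sample = [
    Incident(start, start + timedelta(minutes=3)),
    Incident(start, start + timedelta(minutes=9)),
    Incident(start, None),
    Incident(start, None),
]
engagement = engagement_kpis(sample)
print(engagement)
```

Note that unacknowledged incidents lower the percent investigated but are excluded from MTTI, so the two numbers should always be read together.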

Prerequisites

Before you begin, if you don't have equivalent experience, complete the New Relic University (NRU) Overview Course.

Also, you should have at least a basic understanding of:

  • New Relic alert policy and conditions configuration
  • NRQL (our query language)
  • Our alerting best practices
  • New Relic and infrastructure monitoring
  • How to baseline data in order to determine anomalies versus normal behavior

Establish current state

As with any continuous improvement process, AQM's first step is to establish the current state of your KPIs. To do so, perform the following tasks:

Install the AQM dashboard

The AQM dashboard is the primary asset that drives the AQM process. You need to install the AQM dashboard from our quickstart library by doing the following:

  1. Go to the Alert Quality Management instant observability page.
  2. Click on the Install now button.
  3. Follow the prompts to choose the account you want to install the dashboard into.
  4. View your dashboard.

The JSON dashboard definition is also available in the observability maturity resource center on GitHub. This version is identical to the one found in the quickstarts. If you're going to use the GitHub version, do the following:

  1. Update the accountId property to reflect the ID of the account you're importing the dashboard into. There are 13 instances of that property that need to be updated from 0000000 to the correct account ID.
  2. Import the definition into your account.
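If you'd rather not edit each instance by hand, the placeholder update in step 1 can be scripted. The sketch below is one possible approach, assuming the definition contains entries of the form `"accountId": 0000000`. It works on the raw text because a run of leading zeros is not a valid JSON number, so the file may not parse until the placeholder is replaced. The function name and sample text are illustrative.

```python
import json
import re

# Matches an accountId property whose value consists only of zeros.
PLACEHOLDER = re.compile(r'("accountId"\s*:\s*)0+\b')


def set_account_id(dashboard_text: str, account_id: int) -> tuple[str, int]:
    """Replace every all-zero accountId placeholder.

    Returns the updated text and the number of replacements made.
    """
    return PLACEHOLDER.subn(
        lambda m: f"{m.group(1)}{account_id}", dashboard_text
    )


# Tiny stand-in for the downloaded definition (hypothetical content).
sample = '{"widgets": [{"accountId": 0000000}, {"accountId": 0000000}]}'
updated, replaced = set_account_id(sample, 1234567)

json.loads(updated)  # Raises an error if the result is not valid JSON.
print(replaced, updated)
```

When run against the full definition, the replacement count should match the 13 instances mentioned above. If it doesn't, inspect the file manually before importing it.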

For more details on importing dashboards, see Introduction to dashboards.

Analyze event volume KPIs

After the dashboard is installed, you can immediately use it to analyze the four event volume KPIs.

Use the Alerting Count by Policy pane in the dashboard to find policies that generate high incident counts, high cumulative incident durations, long MTT-close, or a high percentage of incidents open for less than 5 minutes. In particular, you should identify policies:

  • That generate "always-on" incidents (i.e. incidents with thousands of minutes or more of cumulative duration).
  • Where 30% or more of incidents are open for less than 5 minutes.
  • Whose MTT-close is longer than 30 minutes.
  • That create more than 350 incidents per week.
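The criteria above can be expressed as a simple filter. The following sketch assumes you have exported one week of per-policy KPIs into records like the hypothetical `PolicyStats` below. The 1,000-minute cutoff for "always-on" and the ranking by number of matched criteria are assumptions made for illustration, not part of the AQM dashboard.

```python
from dataclasses import dataclass


@dataclass
class PolicyStats:
    """Hypothetical one-week KPI summary for a single alert policy."""
    name: str
    incident_count: int
    accumulated_minutes: float
    mttc_minutes: float
    pct_under_5_min: float


ALWAYS_ON_MINUTES = 1000  # Assumption: "thousands of minutes" starts here.


def review_reasons(p: PolicyStats) -> list[str]:
    """List which of the four review criteria a policy meets."""
    reasons = []
    if p.accumulated_minutes >= ALWAYS_ON_MINUTES:
        reasons.append("always-on")
    if p.pct_under_5_min >= 30:
        reasons.append("short-lived")
    if p.mttc_minutes > 30:
        reasons.append("slow to close")
    if p.incident_count > 350:
        reasons.append("high volume")
    return reasons


def top_candidates(policies: list[PolicyStats], limit: int = 4):
    """Return up to `limit` flagged policies, most criteria matched first."""
    flagged = [(p.name, review_reasons(p)) for p in policies]
    flagged = [(name, reasons) for name, reasons in flagged if reasons]
    flagged.sort(key=lambda item: len(item[1]), reverse=True)
    return flagged[:limit]


policies = [
    PolicyStats("disk-full", 400, 5000, 12.5, 10),
    PolicyStats("cpu-spike", 120, 300, 2.5, 80),
    PolicyStats("checkout-errors", 5, 100, 20, 0),
    PolicyStats("queue-depth", 500, 20000, 40, 35),
]
candidates = top_candidates(policies)
for name, reasons in candidates:
    print(name, reasons)
```

Keeping the reasons alongside each policy name makes it easier to explain to the owning team why a policy was selected for review.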

You should identify the top four policies that fit one or more of those criteria, review the policy conditions, and look at the resulting incident details to determine patterns. You should then ensure that the team(s) that created those policies and the team(s) that respond to the incidents those policies generate participate in the AQM enablement process.

Perform initial AQM orientation and enablement

During this phase, your incident management team(s) and other stakeholders will learn the goals of the AQM process, the scope of their involvement in it, and participate in an initial review of the policies that you identified in the event volume analysis step.

You can use the first session template presentation to communicate this material to your stakeholders.

The AQM dashboard will provide you with an initial baseline of your incident volume KPIs, which you should use as the starting point for the continuous improvement process. In addition, you should talk to the appropriate teams about ways to change the top four noisiest incident policies so they are more valuable. As you review the noisy policies, you should ask the following questions about each policy, in this order:

  1. Are the alerts telling us something about a resource that needs to be fixed? If so, then fix the problem and see if the alert volume decreases.
  2. Are the alerts telling us about something that actually requires an immediate response? If not, then adjust or disable the policy.
  3. Are the policy thresholds set properly? If not, then consider adjusting the thresholds.

One of the most significant parts of this task is educating your team on the importance of acknowledging incident alerts, because that's how an alert's value is determined. In general, instruct them to follow these guidelines:

  • If you look at an alert and decide to take any sort of further investigative action, acknowledge the alert.
  • If you typically close an alert without doing anything else, do not acknowledge the alert.
  • If the incident alert is always on, do not close or acknowledge it. For further details, see Second enablement session.

Accumulate engagement data

The process of measuring alert engagement requires at least two weeks of data before it can proceed. During this time, you should periodically check to see if incident responders are following the alert acknowledgement guidelines outlined in the first enablement session.

Perform second enablement session

You should perform the second enablement session two weeks after the first. During this phase, you'll introduce incident management teams and other stakeholders to the AQM trend data and the ongoing continuous improvement process you'll be following.

The process consists of five activities:

  1. Review AQM dashboard and KPI trends: Here you and the stakeholders will look at the AQM KPIs and identify their week-over-week trends. The team should identify areas where KPIs are not improving and develop strategies to drive improvement.
  2. Identify achievements, challenges, and opportunities: Here you and the stakeholders will map the current state of alert quality to business impact, identifying areas where improvement has resulted in better business outcomes and areas where problems are impacting business outcomes.
  3. Incident policy review: Using the AQM dashboard, you and the stakeholders will identify the noisiest incident policies. Once identified, those policies should be evaluated as detailed in step 4 below.
  4. Alert policy recommendations: In this step, you and the stakeholders will review the noisiest policies using the following criteria:
    • Do the alerts have any business impact?
    • Are the policies properly configured?
      • Are they telling us something about the resource that needs to be fixed?
      • Are the policies necessary? Do they have business impact?
      • Are the thresholds set properly?
  5. Technical recommendations: Here, you and the stakeholders will review any technical recommendations, including:
    • Are there application / system problems for engineering to review?
    • Are there poorly constructed policies that need to be fixed?
    • Are there instrumentation gaps?

You can use the second session template presentation to keep this part of the AQM process organized.

During this phase, you should also take the opportunity to show the value of AQM by reviewing the four noisiest policies you identified in the first enablement session and highlighting how their KPIs have improved.

Improvement process

This is the ongoing phase of the continuous improvement process where you periodically review your accumulated AQM data and make adjustments as needed to alert policies. You should perform this step on a weekly or bi-weekly basis until your alert volume is acceptable. You can then perform it less frequently.

During this phase you should:

  • Report your KPIs each week to upper management to ensure that the stakeholder teams are appropriately prioritizing the work and to show progress towards the promised business outcomes.
  • Record and retain your weekly KPIs over periods of months to years to establish a baseline and to show the rate of improvement.
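One simple way to show the rate of improvement from retained weekly KPIs is a week-over-week percent change. The sketch below assumes you have recorded one value per week for a KPI such as incident count. The function name and sample numbers are illustrative.

```python
from typing import Optional


def week_over_week(values: list[float]) -> list[Optional[float]]:
    """Percent change of each week relative to the week before it.

    Returns None for a week whose predecessor was zero, because the
    percent change is undefined in that case.
    """
    changes: list[Optional[float]] = []
    for previous, current in zip(values, values[1:]):
        if previous == 0:
            changes.append(None)
        else:
            changes.append(100 * (current - previous) / previous)
    return changes


weekly_incident_counts = [400, 300, 240, 240]
trend = week_over_week(weekly_incident_counts)
print(trend)  # [-25.0, -20.0, 0.0]
```

A run of negative values indicates that the KPI is falling. A value that settles near zero suggests the KPI has plateaued, which may be a signal to review the next set of noisy policies.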

You should keep in mind that this is a continuous improvement process. You will continue to collect and evaluate the KPIs over long periods of time to ensure you are meeting your AQM goals.

Value realization

Once the AQM process is established, you will see significant reductions in the volume of alerts while reliability and stability remain the same or improve. In addition, you should see that your alerts have a clear and unambiguous business impact. Your AQM KPIs will provide quantifiable proof of these improvements.

This process should also help alert creators to better understand the impact that new alert policies put on responders and help alert creators to build more meaningful alert policies.

Once you are firmly on the path to meeting the goals of AQM, consider moving to other use cases within the Uptime, performance, and reliability value stream, such as service level management or reliability engineering. You can also move to other observability maturity value streams, such as Customer experience.

KPI reference

Following are the descriptions of each KPI as well as sample NRQL queries that will extract them from the New Relic platform. These KPIs are also included in the AQM dashboard that can be downloaded from our observability maturity resource center on GitHub.

Incident volume

Incident engagement

Acknowledging alerts

In order for the engagement KPIs to be meaningful, it's necessary to acknowledge incidents either in the UI or through an API. Often, in a test environment this is not done. That can lead to no valid data being returned. If critical alerts are not acknowledged or rarely acknowledged in a production environment, this can be a sign of low value alerts, "flapping" alerts, or alerts that are not well routed to the right engineers.

Additional resources

Want to get your hands dirty before you start implementing this in your account? Check out the alert quality management lab.

AQM migration guide

The original release of AQM leveraged a custom event called nrAQMIncident to drive the process. The nrAQMIncident event was generated by a webhook associated with each alert policy in the account and generated both incident volume and engagement KPIs.

The current version of AQM uses two default events, NrAiIncident and NrAiIssue, which are automatically generated.

The legacy webhook method used by the original release of AQM is going to be deprecated in January 2023. You should transition to the new methodology prior to that time. If your organization is adopting AQM for the first time, you should use the new NrAiIncident-based methodology.

When you transition, you should be aware of the following:

  • Your existing AQM dashboard will need to be replaced with the new AQM dashboard that uses the NrAiIncident / NrAiIssue events.
  • These events are required to generate the incident volume KPIs and incident engagement KPIs.
  • As you transition from nrAQMIncident to NrAiIncident, your KPI numbers will change substantially. This is because the NrAiIncident events track the number of times threshold incidents have occurred, while the nrAQMIncident events tracked alert notifications. While the numbers have changed, the underlying relationships among the KPIs and the relative values between them have not. You will need to educate the AQM participants on this to reduce the risk of confusion.

Once your transition is complete, you can disable or delete the AQM webhook and the old dashboard.

Copyright © 2024 New Relic Inc.
