Alert quality management: optimize your alerts and reduce alert fatigue

This guide walks you through improving and optimizing the quality of your alerting. It's part of our series on observability maturity. It has been updated to reflect the changed alerting architecture and to use the stock NrAiIncident event type.

Overview

Teams suffer from alert fatigue when they experience high alert volumes and alerts that aren't aligned to business impact. High alert volumes train incident responders to assume the alerts are false and have no business impact. In turn, they may start to prioritize easy-to-resolve alerts over others, and they may close unresolved incidents so they can stay within their SLA targets. The result is slower incident response, and greater scope and severity when issues with true business impact occur.

Alert quality management (AQM) focuses on reducing the number of nuisance incidents so that you focus only on alerts with true business impact. This reduces alert fatigue and ensures that you and your team focus your attention on the right places at the right times.

You're a good candidate for AQM if:

  • You have too many alerts.
  • You have alerts that stay open for long time periods.
  • Your alerts are not relevant.
  • Your customers discover your issues before your monitoring tools do.
  • You can't see the value of your observability tool(s).

Desired outcome

By using an alert strategy based on measuring business impact, you'll decrease response time and increase awareness of critical events. As you improve your alert signal-to-noise ratio, you'll reduce confusion and be able to rapidly identify and isolate the root cause of your problems.

The overall goal of alert quality management is to ensure that fewer, more valuable incidents are created. This will result in:

  • Increased uptime and availability
  • Reduced MTTR
  • Decreased alert volume
  • The ability to easily identify alerts that are not valuable, so you can either make them valuable or remove them.

This guide steps you through the process of generating the key performance indicators that you will use to measure progress towards these goals. The KPIs are measured in real time, published in a dashboard, and are used to drive a continuous improvement process you'll use to identify and reduce nuisance alerts and increase engagement in incident investigation.

The alert quality management practice does not encompass anomaly detection or AIOps, which are designed to detect unknown or unexpected modes of failure. The two practices (AQM and ML/AI) work hand in hand; they're not mutually exclusive.

Key performance indicators

Note: Prior releases of this implementation guide relied on a custom event (nrAQMIncident) which was generated by a webhook to collect the required KPIs. That dependency has changed. AQM now uses the default NrAiIncident event type instead. NrAiIncident does not yet expose incident engagement KPIs, therefore all references to incident engagement have been removed from AQM. See the AQM migration guide section for more details.

You'll use the AQM process to collect and measure incident volume KPIs. These KPIs will help you to find the noisiest and least valuable alerts so you can improve their value or eliminate them. You will then use the long term metric trends to show real business impact to management and stakeholders.

You should treat incidents (with or without alerts) like a queue of tasks. As with any healthy queue, the number of open incidents should spend most of its time near zero. Each incident should be a trigger for an investigatory or corrective action to resolve the condition. If an alert does not result in some sort of action, then you should question the value of the alert condition.

In particular, if you see specific incidents that are "always on", then you should question why. Are you in a constant state of business impact, or do you simply have a large volume of noise? The alert volume KPIs help you to answer those questions and to measure progress towards a healthy state of high quality alerting.

Detailed information on these metrics follows.

Prerequisites

Before you begin, if you don't have equivalent experience, complete the New Relic University (NRU) Overview Course.

Also, you should have at least a basic understanding of:

  • New Relic alert policy and conditions configuration
  • NRQL (our query language)
  • Our alerting best practices
  • New Relic APM and infrastructure monitoring
  • How to baseline data in order to determine anomalies versus normal behavior

Establish current state

As with any continuous improvement process, AQM's first step is to establish the current state of your KPIs. To do so, perform the following tasks:

Install the AQM dashboard

The AQM dashboard is the primary asset that drives the AQM process. You need to install the AQM dashboard by doing the following:

  1. Download the dashboard definition JSON file from our observability maturity resource center on GitHub.
  2. Update the accountId property to reflect the ID of the account you're importing the dashboard into. There are 13 instances of that property that need to be updated from 0000000 to the correct account ID.
  3. Import the definition into your account.
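
If you'd rather not edit all of the accountId properties by hand, step 2 can be scripted. The following is a minimal sketch, not an official tool: it assumes the downloaded definition parses as valid JSON and that every property named accountId should receive the same value. The function names and file paths are hypothetical.

```python
import json

ACCOUNT_ID_KEY = "accountId"  # property name from step 2 above


def set_account_id(node, account_id):
    """Recursively set every accountId property in a parsed JSON tree.

    Returns the number of properties that were updated.
    """
    updated = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == ACCOUNT_ID_KEY:
                node[key] = account_id
                updated += 1
            else:
                updated += set_account_id(value, account_id)
    elif isinstance(node, list):
        for item in node:
            updated += set_account_id(item, account_id)
    return updated


def update_dashboard_file(src_path, dst_path, account_id):
    """Read a dashboard definition, update it, and write a new file.

    Assumes src_path is valid JSON; paths are supplied by the caller.
    """
    with open(src_path, encoding="utf-8") as f:
        dashboard = json.load(f)
    count = set_account_id(dashboard, account_id)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(dashboard, f, indent=2)
    return count
```

The returned count gives you a quick check: if it doesn't match the number of instances stated in step 2, inspect the file manually before importing it.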

For more details on importing dashboards, see Introduction to dashboards.

Analyze event volume KPIs

After the dashboard is installed, you can immediately use it to analyze the four event volume KPIs.

Use the Alerting Count by Policy pane in the dashboard to find policies that generate high incident counts, high cumulative incident durations, long MTT-close, or a high percentage of incidents open for less than 5 minutes. In particular, you should identify policies:

  • That generate "always-on" incidents (i.e. incidents with thousands of minutes or more of cumulative duration).
  • Where 30% or more of incidents are open for less than 5 minutes.
  • Whose MTT-close is longer than 30 minutes.
  • That create more than 350 incidents per week.
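
The four criteria above can be expressed as a simple filter over per-policy KPI values that you've copied out of the dashboard. This is only a sketch under stated assumptions: the PolicyKpis structure and function names are hypothetical, the 1,000-minute cut-off for "always-on" is an assumed reading of "thousands of minutes", and ranking by the number of matched criteria is one possible way to pick a top four.

```python
from dataclasses import dataclass

# Thresholds from the criteria above. ALWAYS_ON_MINUTES is an assumption:
# the guide only says "thousands of minutes or more".
ALWAYS_ON_MINUTES = 1000
SHORT_INCIDENT_PCT = 30.0
MTT_CLOSE_MINUTES = 30.0
INCIDENTS_PER_WEEK = 350


@dataclass
class PolicyKpis:
    """Hypothetical container for one policy's weekly KPI values."""
    policy_name: str
    incident_count: int          # incidents created in the week
    cumulative_minutes: float    # total minutes incidents were open
    mtt_close_minutes: float     # mean time to close, in minutes
    pct_under_5_minutes: float   # percent of incidents open < 5 minutes


def flag_policy(kpis):
    """Return the list of criteria that a policy matches."""
    reasons = []
    if kpis.cumulative_minutes >= ALWAYS_ON_MINUTES:
        reasons.append("always-on")
    if kpis.pct_under_5_minutes >= SHORT_INCIDENT_PCT:
        reasons.append("short-lived")
    if kpis.mtt_close_minutes > MTT_CLOSE_MINUTES:
        reasons.append("slow-close")
    if kpis.incident_count > INCIDENTS_PER_WEEK:
        reasons.append("high-volume")
    return reasons


def top_noisy_policies(all_kpis, limit=4):
    """Rank flagged policies by criteria matched, then by incident count."""
    flagged = [(k, flag_policy(k)) for k in all_kpis]
    flagged = [(k, reasons) for k, reasons in flagged if reasons]
    flagged.sort(key=lambda pair: (len(pair[1]), pair[0].incident_count),
                 reverse=True)
    return flagged[:limit]
```

Treat the output as a starting list for discussion with the owning teams, not as a verdict on any policy.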

You should identify the top four policies that fit one or more of those criteria, review the policy conditions, and look at the resulting incident details to determine patterns. You should then ensure that the team(s) that created those policies and the team(s) that respond to the incidents those policies create participate in the AQM enablement process.

Perform initial AQM orientation and enablement

During this phase, your incident management team(s) and other stakeholders will learn the goals of the AQM process, the scope of their involvement in it, and participate in an initial review of the policies that you identified in the event volume analysis step.

You can use the first session template presentation to communicate this material to your stakeholders.

The AQM dashboard will provide you with an initial baseline of your incident volume KPIs, which you should use as the starting point for the continuous improvement process. In addition, you should talk to the appropriate teams about ways to change the top four noisiest incident policies so they are more valuable. As you review the noisy policies, you should ask the following questions about each policy, in this order:

  1. Are the alerts telling us something about a resource that needs to be fixed? If so, then fix the problem and see if the alert volume decreases.
  2. Are the alerts telling us about something that actually requires an immediate response? If not, then adjust or disable the policy.
  3. Are the policy thresholds set properly? If not, then consider adjusting the thresholds.

Perform second enablement session

You should perform the second enablement session two weeks after the first. During this phase, you'll introduce incident management teams and other stakeholders to the AQM trend data and the ongoing continuous improvement process you'll be following.

The process consists of five activities:

  1. Review AQM dashboard and KPI trends: Here you and the stakeholders will look at the AQM KPIs and identify their week over week trends. The team should identify areas where KPIs are not improving and develop strategies to drive improvement.
  2. Identify achievements, challenges, and opportunities: Here you and the stakeholders will map the current state of alert quality to business impact, identifying areas where improvement has resulted in better business outcomes and areas where problems are impacting business outcomes.
  3. Incident policy review: Using the AQM dashboard, you and the stakeholders will identify the noisiest incident policies. Once identified, those policies should be evaluated as detailed in step 4 below.
  4. Alert policy recommendations: In this step, you and the stakeholders will review the noisiest policies using the following criteria:
    • Do the alerts have any business impact?
    • Are the policies properly configured?
      • Are they telling us something about the resource that needs to be fixed?
      • Are the policies necessary? Do they have business impact?
      • Are the thresholds set properly?
  5. Technical recommendations: Here, you and the stakeholders will review any technical recommendations, including:
    • Are there application / system problems for engineering to review?
    • Are there poorly constructed policies that need to be fixed?
    • Are there instrumentation gaps?

You can use the second session template presentation to keep this part of the AQM process organized.

During this phase, you should also take the opportunity to show the value of AQM by reviewing the four noisiest policies you identified in the first enablement session and highlighting how their KPIs have improved.

Improvement process

This is the ongoing phase of the continuous improvement process where you periodically review your accumulated AQM data and make adjustments as needed to alert policies. You should perform this step on a weekly or bi-weekly basis until your alert volume is acceptable. You can then perform it less frequently.

During this phase you should:

  • Report your KPIs each week to upper management to ensure that the stakeholder teams are appropriately prioritizing the work and to show progress towards the promised business outcomes.
  • Record and retain your weekly KPIs over periods of months to years to establish a baseline and to show the rate of improvement.
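
One simple way to show the rate of improvement from your retained weekly KPIs is a week-over-week percent change. The sketch below assumes you've recorded one value per week for a given KPI (for example, incident count) in chronological order. The function name is hypothetical.

```python
def week_over_week_change(weekly_values):
    """Return the percent change between each pair of consecutive weeks.

    A negative number means the KPI decreased. Where the previous week's
    value is zero, the change is undefined and None is returned instead.
    """
    changes = []
    for previous, current in zip(weekly_values, weekly_values[1:]):
        if previous == 0:
            changes.append(None)
        else:
            changes.append(100 * (current - previous) / previous)
    return changes


# Example: weekly incident counts for one policy, oldest first.
incident_counts = [400, 300, 150]
print(week_over_week_change(incident_counts))  # [-25.0, -50.0]
```

For volume KPIs such as incident count or MTT-close, a run of negative values is the trend you're looking for.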

You should keep in mind that this is a continuous improvement process. You will continue to collect and evaluate the KPIs over long periods of time to ensure you are meeting your AQM goals.

Value realization

Once the AQM process is established, you will see significant reductions in the volume of alerts while reliability and stability remain the same or improve. In addition, you should see that your alerts have a clear and unambiguous business impact. Your AQM KPIs will provide quantifiable proof of these improvements.

This process should also help alert creators to better understand the impact that new alert policies put on responders and help alert creators to build more meaningful alert policies.

Once you are firmly on the path to meeting the goals of AQM, consider moving to other use cases within the Uptime, performance, and reliability value stream, such as Service level management or reliability engineering. You can also move to other observability maturity value streams, such as Customer experience.

KPI reference

Following are the descriptions of each KPI as well as sample NRQL queries that will extract them from the New Relic platform. These KPIs are also included in the AQM dashboard that can be downloaded from our observability maturity resource center on GitHub.

Incident volume
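
As an illustration of how the four incident volume KPIs named earlier (incident count, cumulative incident duration, MTT-close, and percentage of incidents open for less than 5 minutes) relate to raw incident data, the sketch below computes them per policy from a list of closed incidents. It is not the dashboard's implementation: the input shape (policy name plus duration in seconds) and the function name are hypothetical, and MTT-close is assumed here to mean the mean open duration of closed incidents.

```python
from collections import defaultdict

SHORT_INCIDENT_SECONDS = 5 * 60  # "open for less than 5 minutes"


def incident_volume_kpis(incidents):
    """Compute per-policy volume KPIs.

    incidents: iterable of (policy_name, duration_seconds) tuples, one per
    closed incident. This input shape is an assumption for illustration.
    """
    durations = defaultdict(list)
    for policy_name, duration_seconds in incidents:
        durations[policy_name].append(duration_seconds)

    kpis = {}
    for policy_name, values in durations.items():
        count = len(values)
        total_minutes = sum(values) / 60
        short = sum(1 for v in values if v < SHORT_INCIDENT_SECONDS)
        kpis[policy_name] = {
            "incident_count": count,
            "cumulative_minutes": total_minutes,
            # Assumed definition: mean duration of closed incidents.
            "mtt_close_minutes": total_minutes / count,
            "pct_under_5_minutes": 100 * short / count,
        }
    return kpis
```

For example, a policy with three incidents lasting 120, 600, and 180 seconds has 15 cumulative minutes, an MTT-close of 5 minutes, and two of its three incidents open for less than 5 minutes.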

Additional resources

Want to get your hands dirty before you start implementing this in your account? Check out the alert quality management lab.

AQM Migration Guide

The original release of AQM leveraged a custom event called nrAQMIncident to drive the process. The nrAQMIncident event was generated by a webhook associated with each alert policy in the account and generated both incident volume and engagement KPIs.

The current version of AQM uses a default event called NrAiIncident which is automatically generated, but does not generate incident engagement (i.e. MTT-Investigate and percent investigated) KPIs.

If engagement KPIs are important to you, you should continue to use the webhook-based methodology. If the webhook maintenance tasks add too much toil, you can transition to the new NrAiIncident-based methodology. If your organization is adopting AQM for the first time, you should use the new NrAiIncident-based methodology.

If you opt to transition, you should be aware of the following:

  • Your existing AQM dashboard will need to be replaced with the new AQM dashboard that uses the NrAiIncident events.
  • As you transition from nrAQMIncident to NrAiIncident, your KPI numbers will change substantially. This is because the NrAiIncident events track the number of times threshold violations have occurred, while the nrAQMIncident events tracked alert notifications. While the numbers will change, the underlying relationships among the KPIs and the relative values between them will not. You will need to explain this to the AQM participants to reduce the risk of confusion.
  • You will no longer see incident engagement KPIs.

Once your transition is complete, you can disable or delete the AQM webhook and the old dashboard.

Copyright © 2022 New Relic Inc.