• /
  • Log in
  • Free account

Alert quality management use case implementation guide

Overview

Teams suffer from alert fatigue when they experience high alert volumes and alerts that are not aligned to business impact. As they start to believe that most alerts are false, they may prioritize easy to resolve alerts over others.  Also, they may close unresolved incidents so they can stay within their SLA targets.

The result will be slower incident responses, magnified issue scope, and increased severity when true business impacting issues occur.

Alert Quality Management (AQM) focuses on reducing the number of nuisance incidents so that you focus only on alerts with true business impact.  This reduces alert fatigue and ensures that you and your team focus your attention on the right places at the right times.

You are a good candidate for AQM if:

  • You have too many alerts.
  • You have alerts that stay open for long time periods.
  • Your alerts are not relevant.
  • Your customers discover your issues before your monitoring tools do.
  • You can't see the value of your observability tool(s).

Desired outcome

An alert strategy based on measuring business impact will result in faster response times and greater proactive awareness of critical events.  An improved alert signal to noise ratio reduces confusion and improves rapid identification and problem isolation.

AQM's overall goal is to ensure that fewer, more valuable, incidents are created, resulting in:

  • Increased uptime and availability
  • Reduced MTTR
  • Decreased alert volume
  • The ability to easily identify alerts that are not valuable, so you can either make them valuable or remove them.

The AQM process described in this guide generates the key performance indicators and metrics that you will use to measure progress towards these goals.  The metrics are measured in real time, published in a dashboard, and are used to drive a continuous improvement process that identifies and reduces nuisance alerts and increases user engagement in incident investigation.

AQM does not encompass anomaly detection or AIOps, which are designed to detect unknown or unexpected modes of failure.  The two practices (AQM and ML/AI) work hand in hand, they are not mutually exclusive.

Key performance indicators

You will use the AQM process to collect and measure the following KPIs:

Incident volume

  • Incident Count
  • Accumulated incident time
  • Mean Time to Close (MTTC)
  • Percent Under 5 Minutes

User Engagement

  • Mean Time to Investigate  (MTTI)
  • % of Incidents Investigated

These KPIs will help you to find the noisiest and least valuable alerts so you can improve their value or eliminate them.  You will then use the long term metric trends to show real business impact to management and stakeholders.  Detailed information on each metric follows.

Incident volume

You should treat incidents (with or without alerts) like a queue of tasks.  Just like a queue, the number of alerts should spend time near zero. Each incident should be a trigger for action to resolve the condition.  If an alert does not result in action, then you should question the value of the alert condition.

If you see a constant rate of incidents or specific incidents that are "always-on", then you should question why. Are you in a constant state of business impact, or do you simply have a large volume of noise? The alert volume KPIs help you to answer those questions and to measure progress towards a healthy state of high quality alerting.

User engagement

You should measure the value of an incident by the amount of attention it receives.  Engagement in this context is measured by whether or not an incident has been acknowledged.

The amount of engagement an individual alert receives is a direct measurement of its value.  More engagement implies a valuable alert, less (or zero) engagement implies a nuisance alert that should be modified or disabled.

There is a significant difference between measuring the moment of incident awareness vs. acknowledging the moment resolution activity begins. If you are using an integration with New Relic Alerting, be sure that the "acknowledge" event that is sent to New Relic is triggered when resolution activity begins, not when the incident is sent to the external incident management tool.  For more information regarding the standard Incident Management process, see "Incident Management Process: 5 Steps to Effective Resolution Posted on August 31, 2020 by OnPage Corporation. -- in reference to ITIL4"

Prerequisites

Before you begin, if you don't have equivalent experience, complete the New Relic University (NRU) Overview Course. Also, make sure you have a basic understanding of:

  • NR1 Alert policy and conditions configuration
  • NR1 incident notification channel webhook configuration
  • NR1 NRQL
  • NR1 alerting best practices
  • NR1 APM & Infrastructure
  • How to baseline data in order to determine anomalies vs. normal behavior.

Establish current state

As with any continuous improvement process, the first step of AQM is to establish the current state of your KPIs. To do so, perform the following tasks:

Install and configure the incident event webhook

The webhook will create New Relic events for each incident as it proceeds through its lifecycle (open, acknowledge, close). To ensure that the AQM process generates accurate and valuable findings, this webhook must be added as a notification channel to every alert policy.

The AQM process requires incident, not violation data. This is why you will not be using the default NrAiIncident event, which provides violation data only.  Instead, you will use this webhook to send the required incident data to New Relic.

To use the webhook, do the following:

  1. Identify your primary production account and each of your accounts that you will be analyzing with the AQM process.
  2. Install the incident event webhook into each account that will participate in the AQM process and configure the webhook to report nrAQMIncident events to your primary production account.
  3. Assign the webhook as a notification channel to every alert policy in each account.

Incident alert webhook for alert quality management in New Relic

This example shows a webhook notification channel assigned to each alert policy for a New Relic account with multiple sub-accounts.

The webhook, AQM dashboard, and detailed installation instructions can be found in the New Relic OMA resource center on GitHub.

Install the AQM dashboard

The AQM dashboard is the primary asset that drives the AQM process.  You need to install the AQM dashboard into the primary production account you identified in the "Install and configure incident event webhook" step you previously performed by doing the following:

  1. Download the dashboard definition JSON file from the New Relic OMA resource center GitHub repo.
  2. Import the definition into your primary production account.

For more details on importing dashboards, see the New Relic Introduction to dashboards documentation

Perform initial AQM orientation and enablement

During this phase, your incident management team(s) and other stakeholders will learn the goals of the AQM process and the scope of their involvement in it.

The most critical portion of this task is educating your team on the importance of acknowledging incident alerts, since that's how the alert's value is determined.  In general, instruct them to follow these guidelines:

  • If you look at an alert and decide to take any sort of further investigative action, acknowledge the alert.
  • If you typically close an alert without doing anything else, do not acknowledge the alert.
  • If the incident alert is always on, do not close or acknowledge it. For further details, see Second Enablement Session.

You can use the first session template presentation to communicate this material to your stakeholders.

Accumulate AQM data

The overall process requires at least two weeks of data before it can proceed.  During this time, you should periodically check the following items:

  • Confirm that incident alert event data is accumulating.
  • Confirm that the webhook is attached to every alert policy.
  • Ensure that incident responders are following the alert acknowledgement guidelines.

Perform second enablement session

During this phase, you will introduce incident management teams and other stakeholders to the initial AQM data and the ongoing continuous  improvement process you'll be following.

The process consists of four activities:

  1. Review AQM Dashboard and KPI Trending: Here you and the stakeholders will look at the AQM KPIs and identify their week over week trends.  The team should identify areas where KPIs are not improving and develop strategies to drive improvement.
  2. Identify Achievements, Challenges, and Opportunities: Here you and the stakeholders will map the current state of alert quality to business impact, identifying areas where improvement has resulted in better business outcomes and areas where problems are impacting business outcomes.
  3. Incident Policy Review: Using the AQM dashboard, you and the stakeholders will identify the noisiest incident policies.  Once identified, those policies should be evaluated as detailed in step 4 below.
  4. Alert Policy Recommendations: In this step, you and the stakeholders will review the noisiest policies using the following criteria:
    • Do the alerts have any business impact?
    • Are the policies properly configured?
      • Are they telling us something about the resource that needs to be fixed?
      • Are the policies necessary? Do they have business impact?
      • Are the thresholds set properly?
  5. Technical recommendations: Here, you and the stakeholders will review any technical recommendations, including:
    • Are there application / system problems for engineering to review?
    • Are there poorly constructed policies that need to be fixed?
    • Are there instrumentation gaps?

You can use the second session template presentation to keep this part of the AQM process organized.

Improvement process

This is the ongoing phase of the continuous improvement process where you periodically review your accumulated AQM data and make adjustments as needed to alert policies. You should perform this step once a week until your alert volume is acceptable. You can then perform it less frequently.

During this phase you should:

  • Report your KPIs each week to upper management to ensure that the stakeholder teams are appropriately prioritizing the work and to show that progress towards the promised business outcomes are being reached.

  • Record and retain your weekly KPIs  over periods of months to years to establish a baseline and to show the rate of improvement.

You should keep in mind that this is a continuous improvment process, you will continue to collect and evaluate the KPIs over long periods of time to ensure you are meeting your AQM goals.

Value realization

Once the AQM process is established, you will see significant reductions in the volume of alerts while reliability and stability remain the same or improve.  In addition, you should see that your alerts have a clear and unambiguous business impact.  Your AQM KPIs will provide quantifiable proof of these improvements.

Once you are firmly on the path to AQM's goals, consider moving to other use cases within the Uptime, Performance, and Reliability value stream, such as Service Level Management, or Reliability Engineering.  You can also move to other observability maturity value streams, such as Customer Experience.

KPI reference

Following are the descriptions of each KPI as well as sample NRQL queries that will extract them from the New Relic platform.  These KPIs are also included in the AQM dashboard that can be downloaded from the New Relic OMA resource center GitHub repo.

Incident volume

Incident engagement

Additional resources

Want to get your hands dirty before you start implementing this in your account? Check out the alert quality management lab

For more help

If you need more help, check out these support and learning resources:

Create issueEdit page
Copyright © 2021 New Relic Inc.