This guide walks you through improving and optimizing the quality of your alerting. It's part of our series on observability maturity.
Teams suffer from alert fatigue when they experience high alert volumes and alerts that aren't aligned to business impact. As they start to perceive that many alerts are false and unhelpful, they may prioritize easy-to-resolve alerts over others. Also, they may close unresolved incidents so they can stay within their SLA targets.
The result is slower incident response, wider issue scope, and increased severity when truly business-impacting issues occur.
This Alert quality management (AQM) implementation guide focuses on reducing the number of nuisance incidents so that you focus only on those alerts with true business impact. This reduces alert fatigue and ensures that you and your team focus your attention on the right places at the right times.
You're a good candidate for AQM if:
- You have too many alerts.
- You have alerts that stay open for long time periods.
- Your alerts are not relevant.
- Your customers discover your issues before your monitoring tools do.
- You can't see the value of your observability tool(s).
An alert strategy based on measuring business impact will result in faster response times and greater proactive awareness of critical events. An improved alert signal-to-noise ratio reduces confusion and speeds up problem identification and isolation.
The overall goal of this Alert quality management practice is to ensure that fewer, more valuable incidents are created, resulting in:
- Increased uptime and availability
- Reduced MTTR
- Decreased alert volume
- The ability to easily identify alerts that are not valuable, so you can either make them valuable or remove them.
The process described in this guide generates the key performance indicators and metrics that you will use to measure progress towards these goals. The metrics are measured in real time, published in a dashboard, and are used to drive a continuous improvement process that identifies and reduces nuisance alerts and increases user engagement in incident investigation.
Our Alert quality management practice does not encompass anomaly detection or AIOps, which are designed to detect unknown or unexpected modes of failure. The two practices (AQM and ML/AI) work hand in hand: they're not mutually exclusive.
You'll use the AQM process to collect and measure the following KPIs:
Incident volume, which includes:
- Incident count
- Accumulated incident time
- Mean time to close (MTTC)
- Percent under 5 minutes
User engagement, which includes:
- Mean time to investigate (MTTI)
- % of incidents investigated
These KPIs will help you to find the noisiest and least valuable alerts so you can improve their value or eliminate them. You will then use the long-term metric trends to show real business impact to management and stakeholders. Detailed information on these metrics follows.
You should treat incidents (with or without alerts) like a queue of tasks. As with any healthy queue, the number of open incidents should spend most of its time near zero. Each incident should be a trigger for action to resolve the condition. If an alert does not result in action, then you should question the value of the alert condition.
If you see a constant rate of incidents or specific incidents that are "always on," then you should question why. Are you in a constant state of business impact, or do you simply have a large volume of noise? The alert volume KPIs help you to answer those questions and to measure progress towards a healthy state of high quality alerting.
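As a rough illustration of the incident volume KPIs, the sketch below computes them from a list of incident records. The `Incident` shape, the `volume_kpis` name, and the choice to base time-related KPIs on closed incidents only are assumptions made for this example; they are not the webhook's actual payload or the dashboard's exact definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape used only for this sketch.
    policy: str
    opened: datetime
    closed: Optional[datetime] = None  # None means the incident is still open


def volume_kpis(incidents, short_threshold=timedelta(minutes=5)):
    """Compute incident count, accumulated time, MTTC, and percent under threshold.

    Assumption: time-based KPIs use closed incidents only, because an open
    incident has no final duration yet.
    """
    durations = [i.closed - i.opened for i in incidents if i.closed is not None]
    closed_count = len(durations)
    total_minutes = sum(durations, timedelta()).total_seconds() / 60
    short = sum(1 for d in durations if d < short_threshold)
    return {
        "incident_count": len(incidents),
        "accumulated_minutes": total_minutes,
        "mttc_minutes": total_minutes / closed_count if closed_count else None,
        "percent_under_threshold": 100 * short / closed_count if closed_count else None,
    }
```

A high `percent_under_threshold` combined with a high `incident_count` is the kind of pattern that suggests noise rather than sustained business impact.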
You should measure the value of an incident by the amount of attention it receives. Engagement in this context is measured by whether or not an incident has been acknowledged.
The amount of engagement an individual alert receives is a direct measurement of its value. More engagement implies a valuable alert. Less (or zero) engagement implies a nuisance alert that should be modified or disabled.
There is a significant difference between recording an acknowledgment at the moment of incident awareness and recording it at the moment resolution activity begins. If you are using an integration with New Relic alerts, be sure that the "acknowledge" event sent to New Relic is triggered when resolution activity begins, not when the incident is sent to the external incident management tool. For more information on the standard incident management process, see "Incident management process: 5 steps to effective resolution," posted on August 31, 2020 by OnPage Corporation (in reference to ITIL4).
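The engagement KPIs can be sketched in the same spirit. The example below assumes that MTTI is the mean time from incident open to acknowledgment and that an incident counts as investigated when it has an acknowledgment timestamp. The `Incident` shape and `engagement_kpis` name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape used only for this sketch.
    opened: datetime
    acknowledged: Optional[datetime] = None  # None means never acknowledged


def engagement_kpis(incidents):
    """Compute percent of incidents investigated and mean time to investigate."""
    ack_delays = [
        i.acknowledged - i.opened for i in incidents if i.acknowledged is not None
    ]
    investigated = len(ack_delays)
    total_minutes = sum(ack_delays, timedelta()).total_seconds() / 60
    return {
        "percent_investigated": 100 * investigated / len(incidents) if incidents else None,
        "mtti_minutes": total_minutes / investigated if investigated else None,
    }
```

Because the acknowledgment timestamp drives both numbers, these KPIs are only meaningful if acknowledgment is recorded when resolution activity actually begins.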
Before you begin, if you don't have equivalent experience, complete the New Relic University (NRU) Overview Course.
Also, you should have at least a basic understanding of:
- New Relic alert policy and conditions configuration
- New Relic incident notification channel webhook configuration
- NRQL (our query language)
- Our alerting best practices
- New Relic APM and infrastructure monitoring
- How to baseline data in order to determine anomalies versus normal behavior
As with any continuous improvement process, the first step of AQM is to establish the current state of your KPIs. To do so, perform the following tasks:
- Install and configure the incident event webhook
- Install the AQM dashboard
- Perform initial AQM orientation and enablement
- Accumulate AQM data
- Perform second enablement session
The webhook will create New Relic events for each incident as it proceeds through its lifecycle (open, acknowledge, close). To ensure that the AQM process generates accurate and valuable findings, this webhook must be added as a notification channel to every alert policy.
The AQM process requires incident data, not violation data. This is why you will not use the default NrAiIncident event, which provides violation data only. Instead, you will use this webhook to send the required incident data to New Relic.
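To make the lifecycle idea concrete, here is a minimal sketch of how per-state events (open, acknowledge, close) could be folded into one record per incident for analysis. The field names `incident_id`, `state`, and `timestamp` are assumptions for illustration and may not match the attributes the webhook actually reports.

```python
from collections import defaultdict


def fold_lifecycle(events):
    """Collapse lifecycle events into one {state: timestamp} record per incident.

    Each event is a dict with hypothetical keys: incident_id, state, timestamp.
    """
    incidents = defaultdict(dict)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        record = incidents[event["incident_id"]]
        # Keep the earliest timestamp seen for each lifecycle state.
        record.setdefault(event["state"], event["timestamp"])
    return dict(incidents)
```

An incident whose record has an `open` entry but no `acknowledge` entry is one that nobody engaged with, which is exactly what the engagement KPIs are meant to surface.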
To use the webhook, do the following:
- Identify your primary production account and each of your accounts that you will be analyzing with the AQM process.
- Install the incident event webhook into each account that will participate in the AQM process and configure the webhook to report nrAQMIncident events to your primary production account.
- Assign the webhook as a notification channel to every alert policy in each account.
This example shows a webhook notification channel assigned to each alert policy for a New Relic account with multiple sub-accounts.
The webhook, AQM dashboard, and detailed installation instructions can be found in the New Relic OMA resource center on GitHub.
The AQM dashboard is the primary asset that drives the AQM process. Install it into the primary production account you identified in the Install and configure the incident event webhook step by doing the following:
- Download the dashboard definition JSON file from our observability maturity resource center on GitHub.
- Import the definition into your primary production account.
For more details on importing dashboards, see Introduction to dashboards.
During this phase, your incident management team(s) and other stakeholders will learn the goals of the AQM process and the scope of their involvement in it.
The most critical portion of this task is educating your team on the importance of acknowledging incident alerts, because that's how the alert's value is determined. In general, instruct them to follow these guidelines:
- If you look at an alert and decide to take any sort of further investigative action, acknowledge the alert.
- If you typically close an alert without doing anything else, do not acknowledge the alert.
- If the incident alert is always on, do not close or acknowledge it. For further details, see Second enablement session.
You can use the first session template presentation to communicate this material to your stakeholders.
The overall process requires at least two weeks of data before it can proceed. During this time, you should periodically check the following items:
- Confirm that incident alert event data is accumulating.
- Confirm that the webhook is attached to every alert policy.
- Ensure that incident responders are following the alert acknowledgement guidelines.
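The webhook coverage check above can be automated once you have a list of policies and their notification channels. The sketch below assumes you have already exported that mapping by some means; the `policies_missing_webhook` name, the data shape, and the channel name in the usage example are all hypothetical.

```python
def policies_missing_webhook(policies, webhook_name):
    """Return the sorted names of policies that lack the given channel.

    `policies` maps each policy name to a list of its notification channel
    names (a hypothetical export format).
    """
    return sorted(
        name for name, channels in policies.items() if webhook_name not in channels
    )


exported = {
    "Checkout latency": ["pager", "aqm-webhook"],
    "Disk usage": ["email"],
    "API errors": [],
}
missing = policies_missing_webhook(exported, "aqm-webhook")
```

Any policy in `missing` will not contribute incident events, so its alerts would be invisible to the AQM KPIs until the webhook is attached.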
During this phase, you'll introduce incident management teams and other stakeholders to the initial AQM data and the ongoing continuous improvement process you'll be following.
The process consists of five activities:
- Review AQM dashboard and KPI trends: Here you and the stakeholders will look at the AQM KPIs and identify their week-over-week trends. The team should identify areas where KPIs are not improving and develop strategies to drive improvement.
- Identify achievements, challenges, and opportunities: Here you and the stakeholders will map the current state of alert quality to business impact, identifying areas where improvement has resulted in better business outcomes and areas where problems are impacting business outcomes.
- Incident policy review: Using the AQM dashboard, you and the stakeholders will identify the noisiest incident policies. Once identified, those policies should be evaluated as detailed in the alert policy recommendations step below.
- Alert policy recommendations: In this step, you and the stakeholders will review the noisiest policies using the following criteria:
  - Do the alerts have any business impact?
  - Are the policies properly configured?
  - Are they telling us something about the resource that needs to be fixed?
  - Are the policies necessary?
  - Are the thresholds set properly?
- Technical recommendations: Here, you and the stakeholders will review any technical recommendations, including:
  - Are there application or system problems for engineering to review?
  - Are there poorly constructed policies that need to be fixed?
  - Are there instrumentation gaps?
You can use the second session template presentation to keep this part of the AQM process organized.
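For the incident policy review, a simple ranking can help focus the discussion. The sketch below orders policies by incident count and reports the share of each policy's incidents that were never acknowledged. The record keys `policy` and `acknowledged` and the `noisiest_policies` name are assumptions for illustration; the AQM dashboard remains the intended source for this view.

```python
from collections import Counter


def noisiest_policies(incidents, top_n=5):
    """Rank policies by incident count.

    Returns (policy, incident_count, percent_never_acknowledged) tuples.
    Each incident is a dict with hypothetical keys: policy, acknowledged.
    """
    counts = Counter()
    unacknowledged = Counter()
    for incident in incidents:
        counts[incident["policy"]] += 1
        if not incident["acknowledged"]:
            unacknowledged[incident["policy"]] += 1
    return [
        (policy, n, round(100 * unacknowledged[policy] / n, 1))
        for policy, n in counts.most_common(top_n)
    ]
```

Policies that rank high on both volume and unacknowledged share are natural first candidates for the review criteria listed above.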
This is the ongoing phase of the continuous improvement process where you periodically review your accumulated AQM data and make adjustments as needed to alert policies. You should perform this step once a week until your alert volume is acceptable. You can then perform it less frequently.
During this phase you should:
- Report your KPIs each week to upper management to ensure that the stakeholder teams are appropriately prioritizing the work and to show that progress is being made towards the promised business outcomes.
- Record and retain your weekly KPIs over periods of months to years to establish a baseline and to show the rate of improvement.
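One simple way to express the rate of improvement from retained weekly KPIs is the percent change between consecutive weeks. This is a minimal sketch with a hypothetical `week_over_week` helper, not a prescribed reporting format.

```python
def week_over_week(values):
    """Return the percent change between each pair of consecutive weekly values.

    A change is reported as None when the previous week's value is zero,
    because a percent change is undefined in that case.
    """
    changes = []
    for previous, current in zip(values, values[1:]):
        if previous == 0:
            changes.append(None)
        else:
            changes.append(round(100 * (current - previous) / previous, 1))
    return changes
```

For example, weekly incident counts of 200, 150, and 120 give changes of -25.0 and -20.0 percent, which shows a sustained reduction in volume.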
Keep in mind that this is a continuous improvement process. You will continue to collect and evaluate the KPIs over long periods of time to ensure you are meeting your AQM goals.
Once the AQM process is established, you will see significant reductions in the volume of alerts while reliability and stability remain the same or improve. In addition, you should see that your alerts have a clear and unambiguous business impact. Your AQM KPIs will provide quantifiable proof of these improvements.
Once you are firmly on the path to meeting the goals of AQM, consider moving to other use cases within the Uptime, performance, and reliability value stream, such as Service level management or reliability engineering. You can also move to other observability maturity value streams, such as Customer experience.
Following are the descriptions of each KPI as well as sample NRQL queries that will extract them from the New Relic platform. These KPIs are also included in the AQM dashboard that can be downloaded from our observability maturity resource center on GitHub.
Want to get your hands dirty before you start implementing this in your account? Check out the alert quality management lab.