When teams receive too many alerts or too many false alarms, alert fatigue sets in. As either factor increases, that fatigue begins to have serious, negative consequences. Overwhelmed incident responders become used to false alerts, and they prioritize the ones that are easier to resolve quickly over more serious issues. Worse, they often begin to simply close unresolved incidents to stay within response time targets. This means real alerts become lost in the noise while incident response times and severe outage occurrences increase.
To fix alert fatigue and prevent it from occurring in the future, you must improve the quality of your alerts. Adopting a policy of alert quality management (AQM) reduces the number of nuisance incidents so that you focus only on those with true business impact. This reduces alert fatigue and ensures that you and your team focus your attention on the right places at the right times.
You're a good candidate for AQM if:
- You have too many alerts.
- You have alerts that stay open for long time periods.
- You have a lot of alerts that aren't relevant.
- Your customers discover your issues before your monitoring tools do.
Tip
Want to try a hands-on learning approach before you start implementing this in your account? Check out the alert quality management lab.
Why use alert quality management?
When adopting practices based on alert quality management, you'll decrease response time and increase awareness of critical events. As you improve your alert signal-to-noise ratio, you'll lessen confusion and be able to rapidly identify and isolate the root cause of your problems. The goal is to reduce less valuable alerts while creating easier ways to identify when more valuable incidents occur. This results in:
- Increased uptime and availability.
- Reduced mean time to resolution (MTTR).
- Decreased alert volume.
- The ability to easily identify alerts that are not valuable, so you can either make them valuable or remove them.
Using key performance indicators
Using the right key performance indicators (KPIs) helps you to find the noisiest and least valuable alerts so you can improve their value or remove them. You'll use the AQM process to collect and measure incident volume and engagement KPIs, then use them to identify trends to fix issues that create serious problems. Below, you'll find information on all the KPIs, as well as a NRQL query for each one to help you monitor them from anywhere in the New Relic UI.
Incident volume
You should treat incidents (with or without alerts) like a queue of tasks. As with any task queue, the number of pending alerts should always be as close to zero as possible. Each incident should trigger an investigatory or corrective action to resolve the condition. If an alert doesn't result in some sort of action, then you should question the value of the alert condition.
In particular, if you see specific incidents that are frequently triggered, then you should question whether you're in a constant state of meaningful impact or if you simply have a large volume of noise. The incident volume KPIs help you answer those questions and measure progress towards a healthy state of high quality alerting.
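As a rough illustration of the incident volume idea above, the following Python sketch counts incidents per alert condition and reports how many are still open. The `Incident` record and its field names are hypothetical stand-ins for exported incident data, not New Relic's actual event schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape, assumed for illustration only.
    condition: str                     # name of the alert condition that fired
    opened_at: float                   # open time, seconds since epoch
    closed_at: Optional[float] = None  # None while the incident is still open


def incident_volume(incidents):
    """Return (condition, count) pairs, most frequently triggered first."""
    return Counter(i.condition for i in incidents).most_common()


def open_backlog(incidents):
    """Return how many incidents are still open (the 'queue length')."""
    return sum(1 for i in incidents if i.closed_at is None)
```

Conditions at the top of the `incident_volume` ranking are the first candidates to review: either they reflect constant, meaningful impact, or they are noise that should be tuned or removed.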
User engagement
You should measure the value of an incident by the amount of attention it receives. The amount of engagement an individual alert receives is a direct measurement of its value. More engagement implies a valuable alert, while less (or zero) engagement implies an alert may simply be noisy and should be modified or disabled.
There's a significant difference between measuring the moment of incident awareness and acknowledging when resolution activity begins. If you're using an integration with New Relic alerts, be sure that the Acknowledge event sent to New Relic triggers when resolution activity begins, not when the incident is sent to the external incident management tool.
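One way to picture the engagement measurement described above is to compute, per alert condition, the share of incidents that were acknowledged and the average delay before acknowledgment. This is a minimal sketch: the `IncidentRecord` shape and the metric names `ack_rate` and `mean_seconds_to_ack` are assumptions made for illustration, not fields defined by New Relic.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class IncidentRecord:
    # Hypothetical record shape, assumed for illustration only.
    condition: str                           # alert condition that fired
    opened_at: float                         # open time, seconds since epoch
    acknowledged_at: Optional[float] = None  # None if nobody acknowledged it


def engagement_by_condition(records):
    """Summarize acknowledgment rate and mean time to acknowledge per condition."""
    grouped = {}
    for record in records:
        grouped.setdefault(record.condition, []).append(record)

    summary = {}
    for condition, items in grouped.items():
        # Delay between open and acknowledgment, for acknowledged incidents only.
        delays = [
            r.acknowledged_at - r.opened_at
            for r in items
            if r.acknowledged_at is not None
        ]
        summary[condition] = {
            "ack_rate": len(delays) / len(items),
            "mean_seconds_to_ack": mean(delays) if delays else None,
        }
    return summary
```

A condition with a low `ack_rate` is receiving little engagement, which suggests it may be noisy and worth modifying or disabling. These numbers are only meaningful if the acknowledgment timestamp marks the start of resolution activity rather than the handoff to an external tool.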
What's next?
Once you implement the AQM process from the previous doc, you'll see significant reductions in the volume of alerts while maintaining reliability and stability. Your AQM KPIs can provide accurate information on these improvements when you follow the best practices listed above.
Once you've finished implementing AQM, you can also look into improving and managing other aspects of your platform, such as: