Manage your alert quality

When teams receive too many alerts or too many false alarms, alert fatigue begins to occur. As either factor increases, that fatigue begins to have serious, negative consequences. Overwhelmed incident responders become used to false alerts, and the prioritize ones that are easier to resolve quickly instead of more serious issues. Worse, they often begin to simply close unresolved incidents to stay within response time targets. This means real alerts become lost in the noise while incident response times and severe outage occurances increase.

To fix alert fatigue and prevent it from occurring in the future, you must improve the quality of your alerts. Adopting a policy of alert quality management (AQM) focuses on reducing the number of nuisance incidents so that you focus only on with true business impact. This reduces alert fatigue and ensures that you and your team focus your attention on the right places at the right times.

You're a good candidate for AQM if:

You have too many alerts.
You have alerts that stay open for long time periods.
You have a lot of alerts aren't relevant.
Your customers discover your issues before your monitoring tools do.

Tip

Want to try a hands on learning approach before you start implementing this in your account? Check out the alert quality management lab.

Why use alert quality management?

When adopting practices based on alert quality management, you'll decrease response time and increase awareness of critical events. As you improve your alert signal-to-noise ratio, you'll lessen confusion and be able to rapidly identify and isolate the root cause of your problems. The goal is to reduce less valuable alerts while creating easier ways to identify when more valuable incidents occur. This results in:

Increased uptime and availability.
Reduced mean time to resolution (MTTR).
Decreased alert volume.
The ability to easily identify alerts that are not valuable, so you can either make them valuable or remove them.

Using key performance indicators

Using the right key performance indicators (KPIs) helps you to find the noisiest and least valuable alerts so you can improve their value or remove them. You'll use the AQM process to collect and measure incident volume and engagement KPIs, then use them to identify trends to fix issues that create serious problems. Below, you'll find information on all the KPIs, as well as a NRQL query for each one to help you monitor them from anywhere in the New Relic UI.

Incident volume

You should treat incidents (with or without alerts) like a queue of tasks. Just like a queue, the number of alerts should always be as close to zero as possible. Each incident should trigger an investigatory or corrective action to resolve the condition. If an alert doesn't result in some sort of action, then you should question the value of the alert condition.

In particular, if you see specific incidents that are frequently triggered, then you should question whether you're in a constant state of meaningful impact or if you simply have a large volume of noise. The incident volume KPIs help you answer those questions and measure progress towards a healthy state of high quality alerting.

This is the number of incidents generated over a period of time. Typically you should compare the current and previous weeks.

Goal: Reduce the number of low value and nuisance incidents.

Best practices:

Ensure condition settings are intended to detect real business impact.
Ensure condition settings are detecting abnormal behavior.
Use the incident details Acknowledge feature to help measure the value of alerts. See percentage incident acknowledge KPI.
Report AQM KPIs to all stakeholders.
```
FROM NrAiIncident SELECT count(*) AS 'Incident Count' WHERE event = 'open' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```

This is the total sum of minutes that all incidents have accumulated over a period of time. Typically you should compare the current and previous weeks.

Goal: Reduce the total accumulated minutes of incidents.

Best practices:

Don't manually close incidents, as doing so can warp the accuracy of this KPI.
Remove alerts that don't result in any remediation actions from the recipients.
Improve percent investigated and mean-time-to-investigate KPIs by communicating their importance in improving detection and response times.

Report AQM KPIs to all stakeholders.

FROM NrAiIncident SELECT sum(durationSeconds)/60 AS 'Incident Minutes' WHERE event = 'close' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

This is the average duration of incidents within the period of time measured. You want this number to be as low as possible.

Goal: Reduce MTTC

Best practices:

Don't manually close incidents, as doing so can warp the accuracy of this KPI.
Improve reliability engineering skills.

Report AQM KPIs to all stakeholders.

FROM NrAiIncident SELECT average(durationSeconds/60) AS 'Incident MTTC (minutes)' WHERE event = 'close' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

This is the percentage of incidents with a total duration less than five minutes. This can indicate an incident changing state too frequently, which obscures the cause and severity of the incident. This state is known as incident flapping.

Goal: Minimize percentage of incidents with short durations.

Best practices:

Ensure that conditions are detecting legitimate anomalies with meaningful impact on your system.

Understand service level management.

FROM NrAiIncident SELECT percentage(count(*), WHERE durationSeconds <= 5*60) AS '% Under 5min' WHERE event = 'close' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

User engagement

You should measure the value of an incident by the amount of attention it receives. The amount of engagement an individual alert receives is a direct measurement of its value. More engagement implies a valuable alert, while less (or zero) engagement implies an alert may simply be noisy and should be modified or disabled.

There's a significant difference between measuring the moment of incident awareness and acknowledging when resolution activity begins. If you're using an integration with New Relic alerts, be sure that the Acknowledge event sent to New Relic triggers when resolution activity begins, not when the incident is sent to the external incident management tool.

This identifies the percentage of incidents that with a true acknowledgement flag. You should compare the current and previous weeks.

Goal: Increase the percentage of incident engagement.

Best practices:

Ensure that your DevOps team knows when it's appropriate to acknowledge an incident alert, if applicable.
Gamify alert acknowledgement to drive usage.

Discourage mass acknowledgement exercises.

FROM NrAiIssue SELECT filter(count(*), WHERE event='acknowledge')/filter(count(*), WHERE event='create')*100 AS '% Investigated' WHERE priority='CRITICAL' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

This identifies the average time it takes for you to acknowledge an incident. Typically you should compare the current and previous weeks.

Goal: Reduce the mean time to investigate.

Best practices:

Work at building incident responder's confidence in alerts.
Ensure that valuable alerts are acknowledged.

Incentivize response teams to respond quickly to alerts.

FROM NrAiIssue SELECT average(acknowledgeTime - activateTime) / 60000 AS 'Incident MTTI (minutes)' WHERE event = 'acknowledge' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

What's next?

Once you implement the AQM process from the previous doc, you'll see significant reductions in the volume of alerts while maintaining reliability and stability. Your AQM KPIs can provide accurate inforation on these improvements when you follow the best practices as listed above.

Once you've finished implementing AQM, you can also look into improving and managing other aspects of your platform, such as:

Previous step

Learn how to improve your stack with alerts

Tip

Why use alert quality management?

Using key performance indicators

Incident volume

Incident count KPI

Accumulated incident duration KPI

Mean time to close (MTTC) KPI

Percent under 5 minutes KPI

User engagement

Percentage acknowledged KPI

Mean time to investigate (MTTI) KPI

What's next?

Previous step

Manage your alert quality

Tip

Why use alert quality management? .css-21sua1{background:none;border:none;width:0;padding:0;}

Using key performance indicators

Incident volume

Accumulated incident duration KPI

Mean time to close (MTTC) KPI

Percent under 5 minutes KPI

User engagement

Percentage acknowledged KPI

Mean time to investigate (MTTI) KPI

What's next?

Previous step

Why use alert quality management?