Manage your alert quality

When teams receive too many alerts or too many false alarms, alert fatigue sets in. As either factor increases, that fatigue begins to have serious, negative consequences. Overwhelmed alert event responders become used to false alerts, and they prioritize the ones that are easier to resolve quickly instead of more serious issues. Worse, they often begin to simply close unresolved alert events to stay within response time targets. This means real alerts become lost in the noise while alert event response times and severe outage occurrences increase.

To fix alert fatigue and prevent it from occurring in the future, you must improve the quality of your alerts. Adopting a policy of alert quality management (AQM) reduces the number of nuisance alert events so that you focus only on those with true business impact. This reduces alert fatigue and ensures that you and your team focus your attention on the right places at the right times.

You're a good candidate for AQM if:

You have too many alerts.
You have alerts that stay open for long time periods.
You have a lot of alerts that aren't relevant.
Your customers discover your issues before your monitoring tools do.

Tip

Want to try a hands-on learning approach before you start implementing this in your account? Check out the alert quality management course.

Why use alert quality management?

When adopting practices based on alert quality management, you'll decrease response time and increase awareness of critical events. As you improve your alert signal-to-noise ratio, you'll lessen confusion and be able to rapidly identify and isolate the root cause of your problems. The goal is to reduce less valuable alerts while creating easier ways to identify when more valuable alert events occur. This results in:

Increased uptime and availability.
Reduced mean time to resolution (MTTR).
Decreased alert volume.
The ability to easily identify alerts that are not valuable, so you can either make them valuable or remove them.

Using key performance indicators

Using the right key performance indicators (KPIs) helps you to find the noisiest and least valuable alerts so you can improve their value or remove them. You'll use the AQM process to collect and measure alert event volume and engagement KPIs, then use them to identify trends to fix issues that create serious problems. Below, you'll find information on all the KPIs, as well as a NRQL query for each one to help you monitor them from anywhere in the New Relic UI.

Alert event volume

You should treat alert events (with or without alerts) like a queue of tasks. Just like a queue, the number of alerts should always be as close to zero as possible. Each alert event should trigger an investigatory or corrective action to resolve the condition. If an alert doesn't result in some sort of action, then you should question the value of the alert condition.

In particular, if you see specific alert events that are frequently triggered, then you should question whether you're in a constant state of meaningful impact or if you simply have a large volume of noise. The alert event volume KPIs help you answer those questions and measure progress towards a healthy state of high quality alerting.

Alert event count KPI

This is the number of alert events generated over a period of time. Typically you should compare the current and previous weeks.

Goal: Reduce the number of low value and nuisance alert events.

Best practices:

Ensure condition settings are intended to detect real business impact.
Ensure condition settings are detecting abnormal behavior.
Use the Acknowledge feature in the alert event details to help measure the value of alerts. See the percentage acknowledged KPI.
Report AQM KPIs to all stakeholders.
```
FROM NrAiIncident SELECT count(*) AS 'Incident Count' WHERE event = 'open' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```
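
The week-over-week comparison that this query returns can be turned into a single percent change for reporting. The following is a minimal Python sketch, assuming you have already read the current and previous counts from the query result. The function name and the sample counts are hypothetical and are not part of any New Relic API.

```python
from typing import Optional


def week_over_week_change(current: int, previous: int) -> Optional[float]:
    """Return the percent change in alert event count versus the previous week.

    Returns None when the previous week had no alert events, because the
    percent change is undefined in that case.
    """
    if previous == 0:
        return None
    return (current - previous) / previous * 100


# Hypothetical counts read from the current and comparison values of the query.
change = week_over_week_change(current=42, previous=60)
print(f"Alert event count changed by {change:.1f}% week over week")
```

A negative value means fewer alert events than the previous week, which is the direction you want this KPI to move.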

Accumulated alert event duration KPI

This is the total sum of minutes that all alert events have accumulated over a period of time. Typically you should compare the current and previous weeks.

Goal: Reduce the total accumulated minutes of alert events.

Best practices:

Don't manually close alert events, as doing so can warp the accuracy of this KPI.
Remove alerts that don't result in any remediation actions from the recipients.
Improve percent investigated and mean-time-to-investigate KPIs by communicating their importance in improving detection and response times.
Report AQM KPIs to all stakeholders.

```
FROM NrAiIncident SELECT sum(durationSeconds)/60 AS 'Incident Minutes' WHERE event = 'close' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```

Mean time to close (MTTC) KPI

This is the average duration of alert events within the period of time measured. You want this number to be as low as possible.

Goal: Reduce MTTC.

Best practices:

Don't manually close alert events, as doing so can warp the accuracy of this KPI.
Improve reliability engineering skills.
Report AQM KPIs to all stakeholders.

```
FROM NrAiIncident SELECT average(durationSeconds/60) AS 'Incident MTTC (minutes)' WHERE event = 'close' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```

Percent under 5 minutes KPI

This is the percentage of alert events with a total duration of five minutes or less. This can indicate an alert event changing state too frequently, which obscures the cause and severity of the alert event. This state is known as alert event flapping.

Goal: Minimize percentage of alert events with short durations.

Best practices:

Ensure that conditions are detecting legitimate anomalies with meaningful impact on your system.
Understand service level management.

```
FROM NrAiIncident SELECT percentage(count(*), WHERE durationSeconds <= 5*60) AS '% Under 5min' WHERE event = 'close' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```
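
To make the arithmetic behind the three duration-based KPIs concrete, here is a minimal Python sketch that computes accumulated minutes, MTTC, and the percentage of short alert events from a list of closed alert event durations. The `ClosedAlertEvent` record, the `volume_kpis` function, and the sample data are hypothetical illustrations. They are not New Relic types, and the NRQL queries above remain the way to get these numbers from your account.

```python
from dataclasses import dataclass


@dataclass
class ClosedAlertEvent:
    """Hypothetical record for one closed alert event (not a New Relic type)."""
    duration_seconds: float


def volume_kpis(events, short_threshold_seconds=5 * 60):
    """Compute accumulated minutes, MTTC in minutes, and percent of short events."""
    if not events:
        # With no closed alert events there is nothing to average.
        return {"accumulated_minutes": 0.0, "mttc_minutes": 0.0, "percent_short": 0.0}
    durations = [event.duration_seconds for event in events]
    total_minutes = sum(durations) / 60
    # Count events at or under the threshold, matching the <= in the query.
    short_count = sum(1 for d in durations if d <= short_threshold_seconds)
    return {
        "accumulated_minutes": total_minutes,
        "mttc_minutes": total_minutes / len(durations),
        "percent_short": short_count / len(durations) * 100,
    }


# Hypothetical week of closed alert events: 2, 5, 30, and 90 minutes long.
sample = [
    ClosedAlertEvent(120),
    ClosedAlertEvent(300),
    ClosedAlertEvent(1800),
    ClosedAlertEvent(5400),
]
kpis = volume_kpis(sample)
print(kpis)
```

In this sample, half of the alert events closed within five minutes, which would be a prompt to check the related conditions for flapping.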

User engagement

You should measure the value of an alert event by the amount of attention it receives. The amount of engagement an individual alert receives is a direct measurement of its value. More engagement implies a valuable alert, while less (or zero) engagement implies an alert may simply be noisy and should be modified or disabled.

There's a significant difference between measuring the moment of alert event awareness and acknowledging when resolution activity begins. If you're using an integration with New Relic alerts, be sure that the Acknowledge event sent to New Relic triggers when resolution activity begins, not when the alert event is sent to the external alert event management tool.

Percentage acknowledged KPI

This identifies the percentage of alert events that have a true acknowledgement flag. Typically you should compare the current and previous weeks.

Goal: Increase the percentage of alert event engagement.

Best practices:

Ensure that your DevOps team knows when it's appropriate to acknowledge an alert event, if applicable.
Gamify alert acknowledgement to drive usage.
Discourage mass acknowledgement exercises.

```
FROM NrAiIssue SELECT filter(count(*), WHERE event='acknowledge')/filter(count(*), WHERE event='create')*100 AS '% Investigated' WHERE priority='CRITICAL' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```

Mean time to investigate (MTTI) KPI

This identifies the average time it takes for you to acknowledge an alert event. Typically you should compare the current and previous weeks.

Goal: Reduce the mean time to investigate.

Best practices:

Work at building alert event responders' confidence in alerts.
Ensure that valuable alerts are acknowledged.
Incentivize response teams to respond quickly to alerts.

```
FROM NrAiIssue SELECT average(acknowledgeTime - activateTime) / 60000 AS 'Incident MTTI (minutes)' WHERE event = 'acknowledge' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```
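
The two engagement KPIs reduce to simple ratios, which the following minimal Python sketch works through. It assumes each issue carries an activation timestamp and an optional acknowledgement timestamp in epoch milliseconds, which is why the query above divides by 60000 to get minutes. The `IssueRecord` class, the `engagement_kpis` function, and the sample data are hypothetical and are not New Relic types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IssueRecord:
    """Hypothetical issue summary; timestamps are epoch milliseconds."""
    activate_time_ms: int
    acknowledge_time_ms: Optional[int] = None  # None means never acknowledged


def engagement_kpis(issues):
    """Return (percent acknowledged, MTTI in minutes or None)."""
    acknowledged = [i for i in issues if i.acknowledge_time_ms is not None]
    percent = len(acknowledged) / len(issues) * 100 if issues else 0.0
    if not acknowledged:
        # MTTI is undefined when nothing was acknowledged.
        return percent, None
    total_ms = sum(i.acknowledge_time_ms - i.activate_time_ms for i in acknowledged)
    mtti_minutes = total_ms / len(acknowledged) / 60000
    return percent, mtti_minutes


# Hypothetical week: four issues, two acknowledged after 2 and 6 minutes.
sample_issues = [
    IssueRecord(activate_time_ms=0, acknowledge_time_ms=120_000),
    IssueRecord(activate_time_ms=60_000, acknowledge_time_ms=420_000),
    IssueRecord(activate_time_ms=90_000),
    IssueRecord(activate_time_ms=150_000),
]
percent_acknowledged, mtti = engagement_kpis(sample_issues)
print(f"{percent_acknowledged:.0f}% acknowledged, MTTI {mtti:.1f} minutes")
```

Note that MTTI averages only over acknowledged issues, so a low MTTI paired with a low acknowledged percentage still points to weak engagement.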

What's next?

Once you implement the AQM process from the previous doc, you'll see significant reductions in the volume of alerts while maintaining reliability and stability. Your AQM KPIs can provide accurate information on these improvements when you follow the best practices listed above.

Once you've finished implementing AQM, you can also look into improving and managing other aspects of your platform, such as:

Previous step

Learn how to improve your stack with alerts
