Reduce noise with quality alerts

Redundant alerts bury critical incidents under noise. Assessing your existing alerts is a key part of your prioritization strategy, because the quality of your alerts determines how well your teams respond to incidents. If there's too much noise, you risk fatiguing your team with low-priority incidents that have little or no business impact. Incidents that fail to alert, however, lead to outages that affect customer experience.

Objectives

This tutorial assumes you already have active alerts. It offers some recommendations about managing the quality of your alerts and provides some NRQL queries for creating new ones. You will:

Install the alert quality management (AQM) dashboard
Differentiate between a good and bad alert
Review our recommended NRQL strings for creating alerts

Install the AQM dashboard

AQM focuses on reducing the number of nuisance incidents so that your team focuses on incidents with true business impact. You're a good candidate for AQM if:

You have too many alerts.
You have alerts that stay open for long time periods.
Your customers discover your issues before your monitoring tools do.

To begin, install the AQM dashboard via our quickstart:

Go to the Alert Quality Management instant observability page.
Click on the Install now button.
Follow the prompts to choose the account you want to install the dashboard into.
View your dashboard.

We recommend you spend at least two weeks with the AQM dashboard. During that time, the AQM dashboard will collect data about how your teams interact with all of your alerts.

As a general rule, we recommend removing alerts that:

Generate "always-on" incidents that have thousands of minutes or more of cumulative duration.
Have 30% or more of their incidents open for less than 5 minutes.
Have a mean-time-to-close longer than 30 minutes.
Create more than 350 incidents per week.
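
The rules of thumb above can be sketched as a small check over per-condition statistics. This is only an illustration: the ConditionStats fields are hypothetical names rather than AQM dashboard fields, the one-week scope is an assumption, and because the text only says "thousands of minutes", the 1000-minute cutoff is an assumed placeholder.

```python
from dataclasses import dataclass

# Cutoffs from the rules of thumb above. The 1000-minute value is an
# assumption; the source only says "thousands of minutes".
ALWAYS_ON_MINUTES = 1000
SHORT_INCIDENT_SHARE = 0.30
MAX_MEAN_MINUTES_TO_CLOSE = 30
MAX_INCIDENTS_PER_WEEK = 350


@dataclass
class ConditionStats:
    """Hypothetical one-week incident statistics for one alert condition."""
    name: str
    incident_count: int
    cumulative_minutes: float
    under_5_min_count: int
    mean_minutes_to_close: float


def removal_reasons(stats: ConditionStats) -> list[str]:
    """Return the rules of thumb that this condition trips."""
    reasons = []
    if stats.cumulative_minutes >= ALWAYS_ON_MINUTES:
        reasons.append("always-on")
    if stats.incident_count and (
        stats.under_5_min_count / stats.incident_count >= SHORT_INCIDENT_SHARE
    ):
        reasons.append("short-lived")
    if stats.mean_minutes_to_close > MAX_MEAN_MINUTES_TO_CLOSE:
        reasons.append("slow to close")
    if stats.incident_count > MAX_INCIDENTS_PER_WEEK:
        reasons.append("too many incidents")
    return reasons


# Made-up numbers: 400 incidents in a week, half of them under 5 minutes.
flapping = ConditionStats("High CPU", 400, 900.0, 200, 2.25)
print(removal_reasons(flapping))  # ['short-lived', 'too many incidents']
```

A condition that returns an empty list trips none of the four rules. One that returns several reasons is a strong candidate for removal or retuning.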

Create new alerts for peak demand

With your existing policies under review, you may want to create new alerts that are adjusted for peak demand. Creating a good alert depends on the specificity of your settings. Two alerts can share the same alert condition query, for example:

SELECT average(`apm.service.memory.heap.used`) FROM Metric WHERE appName = 'Inventory Service'

While the query itself is a sound basis for an alert condition, how you configure the alert can create redundancy or noise. A bad alert may have too short a window duration, too low a threshold, or no delay or baseline. Additionally, attaching an alert condition to a relatively young data source can create problems, as there's not enough history to detect anomalous behavior.
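
To see why configuration matters more than the query, here is a simplified model of one signal evaluated under two settings. It is a sketch, not a description of how alert conditions are actually evaluated: count_incidents, the heap values, and the "consecutive windows" rule are all assumptions made for illustration.

```python
def count_incidents(values, threshold, windows_required):
    """Count incidents opened when the signal stays above `threshold`
    for `windows_required` consecutive aggregation windows."""
    incidents = 0
    streak = 0
    for value in values:
        if value > threshold:
            streak += 1
            # Open one incident at the moment the streak becomes long enough.
            if streak == windows_required:
                incidents += 1
        else:
            streak = 0
    return incidents


# Made-up average heap usage (MB), one value per aggregation window.
heap_used_mb = [410, 520, 430, 515, 440, 530, 560, 590, 610, 450]

noisy = count_incidents(heap_used_mb, threshold=500, windows_required=1)
tuned = count_incidents(heap_used_mb, threshold=500, windows_required=3)
print(noisy, tuned)  # 3 1
```

Both settings use the same data and the same threshold. Requiring three consecutive violating windows ignores the two brief spikes and reports only the sustained climb.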

If you're ready to create new alerts, here are some recommended queries you can use for your gameday:

Create constrained alerts that target a specific segment of your data, such as a few key customers or a range of data. Use the WHERE clause to define those conditions.

SELECT average(duration) FROM Transaction WHERE account_id IN (91290, 102021, 20230)

SELECT percentile(duration, 95) FROM Transaction WHERE name LIKE 'Controller/checkout/%'

Create alerts when an Nth percentile of your data hits a specified threshold; for example, to maintain SLA targets. Since we evaluate the NRQL query once per aggregation window, percentiles are calculated separately for each window.

SELECT percentile(duration, 95) FROM Transaction

SELECT percentile(databaseDuration, 75) FROM Transaction
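
The per-window behavior described above can be illustrated with a short sketch. It assumes a nearest-rank percentile and fixed 60-second windows. The function names and event data are made up, and the platform's own percentile calculation may differ.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def percentile_per_window(events, window_seconds, pct):
    """Group (timestamp_seconds, duration) pairs into fixed windows and
    compute the percentile of each window independently."""
    buckets = defaultdict(list)
    for timestamp, duration in events:
        buckets[timestamp // window_seconds].append(duration)
    return {
        window * window_seconds: nearest_rank_percentile(durations, pct)
        for window, durations in sorted(buckets.items())
    }


# Made-up transactions: (seconds since start, duration in seconds).
events = [
    (0, 0.2), (10, 0.3), (20, 0.25), (50, 0.4),
    (65, 0.3), (70, 2.5), (90, 0.35), (110, 0.2),
]
print(percentile_per_window(events, window_seconds=60, pct=95))
# {0: 0.4, 60: 2.5}
```

The slow 2.5-second transaction only affects the window it falls in, so a threshold on the 95th percentile would be violated in the second window but not the first.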

Create alerts when your data hits a certain maximum, minimum, or average. This ensures that a duration or response time does not pass a certain threshold.

SELECT max(duration) FROM Transaction

SELECT average(duration) FROM Transaction

Create alerts when a proportion of your data goes above or below a certain threshold.

SELECT percentage(count(*), WHERE duration > 2) FROM Transaction

SELECT percentage(count(*), WHERE http.statusCode = '500') FROM Transaction
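
As a rough illustration of what these proportion queries compute, the sketch below divides the number of matching events by the total and scales to 100. The percentage helper and the sample transactions are hypothetical.

```python
def percentage(events, predicate):
    """Share of events (0 to 100) for which `predicate` is true."""
    if not events:
        return 0.0
    matching = sum(1 for event in events if predicate(event))
    return 100.0 * matching / len(events)


# Made-up transactions with durations in seconds.
transactions = [
    {"duration": 0.4},
    {"duration": 2.6},
    {"duration": 1.1},
    {"duration": 3.2},
]

slow_share = percentage(transactions, lambda t: t["duration"] > 2)
print(slow_share)  # 50.0
```

An alert threshold on this value reads as "open an incident when more than X% of transactions are slow", which stays meaningful as traffic volume changes.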

Create alerts on Apdex, applying your own T-value for certain transactions. For example, get an alert notification when your Apdex for a T-value of 500ms on transactions for production apps goes below 0.8.

SELECT apdex(duration, t:0.5) FROM Transaction WHERE appName LIKE '%prod%'
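
For intuition about the score that this query alerts on, here is a minimal sketch of the standard Apdex formula: responses at or under T are satisfied, responses over T and up to 4T are tolerating and count for half, and the rest are frustrated. The durations are made up, and the handling of errored transactions is omitted.

```python
def apdex(durations, t):
    """Standard Apdex score for response times and a target T, both in seconds."""
    if not durations:
        raise ValueError("apdex needs at least one duration")
    satisfied = sum(1 for d in durations if d <= t)
    tolerating = sum(1 for d in durations if t < d <= 4 * t)
    return (satisfied + tolerating / 2) / len(durations)


# Made-up response times in seconds, scored with T = 0.5 (500 ms).
durations = [0.2, 0.4, 0.5, 0.9, 1.8, 2.5, 0.3, 0.1, 3.0, 0.45]
score = apdex(durations, t=0.5)
print(score)        # 0.7
print(score < 0.8)  # True, so the example threshold of 0.8 would be violated
```

Here six responses are satisfied, two are tolerating, and two are frustrated, giving (6 + 2 / 2) / 10 = 0.7.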
