One of the benefits of implementing service levels is that you can adjust your alert policies and cut notifications down to the issues that actually damage your customer experience and pose a risk to your business.
When you set service level objectives, you can configure alerts that will inform you when your service is consuming the error budget too quickly. These alerts will show you when incidents with a high business impact are occurring. When they are triggered, they should be given a high priority and you should engage the proper teams to start the process of diagnosing the source of the problem.
The idea behind burn rate alerts is the following: the error budget represents how many bad events you can afford over the SLO period. By definition, if you spend the whole error budget at a constant rate over exactly the SLO period, your burn rate is 1. Any burn rate above the tolerable burn rate is not sustainable, because you would burn through the entire error budget before the end of the SLO period. For that reason, you might want to be alerted when the burn rate stays that high for a continued amount of time.
You'll find an option to create alerts in the service level details page.
Click Alert to go to an alerts configuration card. This card is pre-populated following Google's recommendations for SLO budget consumption percentages, specifically for fast-burn alerts. These alerts warn you of a sudden, large change in consumption that, if uncorrected, will exhaust your error budget very soon. The card sets a 2% SLO budget consumption within 1 hour, which means that the service would consume the error budget completely in 50 hours if left unattended.
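The 50-hour figure follows directly from the consumption rate. Here is a minimal Python sketch of that arithmetic; the function name `hours_to_exhaust_budget` is hypothetical and not part of any New Relic API, and it assumes the budget keeps being consumed at a constant rate.

```python
def hours_to_exhaust_budget(consumed_fraction: float, window_hours: float = 1.0) -> float:
    """Hours until the whole error budget is gone, assuming a constant rate.

    consumed_fraction: share of the error budget consumed in one window (0.02 = 2%).
    window_hours: length of that window in hours.
    """
    if consumed_fraction <= 0:
        raise ValueError("consumed_fraction must be positive")
    return window_hours / consumed_fraction


# 2% of the budget consumed in 1 hour, as in the pre-populated fast-burn alert.
print(hours_to_exhaust_budget(0.02))
```

With 2% per hour, the result is 50 hours (100% / 2%).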
Click Review settings to see the alert condition settings and edit them.
If you don't want to follow Google's recommendations for fast-burn alerts, you can set up your own thresholds.
The error budget burn rate indicates how fast the service consumes the error budget, taking into account the whole SLO period. Here's a formula for calculating it:
critical burn rate = (tolerated budget consumption * SLO period [h]) / (evaluation period [h])
- Tolerated budget consumption: how much of the error budget you are willing to consume during the evaluation period.
- SLO period: time window of your SLO (generally, in hours).
- Evaluation period: aggregation window we are taking into consideration (you can use 1 hour on the alert condition aggregation window for simplicity).
To define your alert threshold, multiply the critical burn rate by the error budget (100% minus the SLO target, expressed as a percentage):
threshold = error budget * critical burn rate
Let's see how this works with an example for a 28 day SLO with a 99.9% target.
For a 28 day SLO, Google recommends alerting on a 2% SLO budget consumption in the last hour. That means that if you keep burning the budget at the same rate, you would breach your SLO in 50 hours (resulting from 100% / 2%).
Then we have the following variables:
- SLO target: 99.9% (so the error budget is 0.1%)
- SLO period: 28 days (28 * 24 hours)
- Tolerated budget consumption: 2%
- Evaluation period: 1 hour
critical burn rate per hour = (0.02 * 28 * 24) / 1 = 13.44
threshold = 0.1 * 13.44 = 1.344
This is the value to use as the alert condition threshold: open a violation when the query returns a value above the threshold (in this example, 1.344) at least once in the evaluation period (in this example, 60 minutes).
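As a cross-check, the two formulas can be sketched in a few lines of Python. The function names are hypothetical and only illustrate the arithmetic; the tolerated consumption is passed as a fraction (0.02 for 2%), and the SLO target and resulting threshold are percentages.

```python
def critical_burn_rate(tolerated_consumption: float,
                       slo_period_hours: float,
                       evaluation_period_hours: float = 1.0) -> float:
    """(tolerated budget consumption * SLO period [h]) / evaluation period [h]."""
    return tolerated_consumption * slo_period_hours / evaluation_period_hours


def alert_threshold(slo_target_percent: float, burn_rate: float) -> float:
    """error budget * critical burn rate, with the error budget in percent."""
    error_budget_percent = 100.0 - slo_target_percent
    return error_budget_percent * burn_rate


# 28 day SLO, 99.9% target, 2% tolerated consumption in a 1 hour evaluation period.
burn_rate = critical_burn_rate(0.02, 28 * 24, 1)
threshold = alert_threshold(99.9, burn_rate)
print(round(burn_rate, 2), round(threshold, 3))  # 13.44 1.344
```

Changing the SLO target or period in the call shows how the threshold needs to move when the service level is edited.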
If you edit the SLO target on the Service levels side, remember to edit the target on the alert condition as well.
It's important to tune the additional parameters of this alert condition.
Set the window duration to the evaluation period. Following the previous example, you would set 60 minutes, meaning the alert system would aggregate 60 minutes of data.
The evaluation period supports aggregating up to 2 hours of data.
You can use a 60-second slide-by interval, so that every minute New Relic evaluates the previous 60 minutes of data.
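To build intuition for how a 60-minute window that slides every minute behaves, here is a rough Python simulation. It is only an illustration under stated assumptions, not New Relic's implementation: the per-minute `(bad, total)` counts are made-up traffic, the function name `sliding_bad_percent` is hypothetical, and the 1.344 threshold comes from the 28 day, 99.9% example.

```python
from collections import deque


def sliding_bad_percent(per_minute_counts, window_minutes=60):
    """Yield (minute_index, bad_percent) for every full window, sliding by one minute."""
    window = deque()
    bad_sum = total_sum = 0
    for i, (bad, total) in enumerate(per_minute_counts):
        window.append((bad, total))
        bad_sum += bad
        total_sum += total
        # Drop the oldest minute once the window holds more than window_minutes.
        if len(window) > window_minutes:
            old_bad, old_total = window.popleft()
            bad_sum -= old_bad
            total_sum -= old_total
        if len(window) == window_minutes and total_sum > 0:
            yield i, 100.0 * bad_sum / total_sum


# Hypothetical traffic: 1000 requests per minute; 0.1% bad for 90 minutes,
# then 3% bad for 30 minutes.
counts = [(1, 1000)] * 90 + [(30, 1000)] * 30
THRESHOLD = 1.344  # percent of bad events, from the example above

violations = [minute for minute, pct in sliding_bad_percent(counts) if pct > THRESHOLD]
print(violations[0] if violations else None)
```

In this made-up scenario, the windowed percentage of bad events first crosses the threshold 26 minutes into the degradation, because the 60-minute window still contains mostly healthy minutes before that point.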
Next, connect the condition to the policy that determines how notifications are managed.
Last, you can choose when to automatically close any open violations.
New Relic Alerts can aggregate up to 2 hours of data. Therefore, New Relic doesn't yet provide:
- The ability to alert on SLO compliance over the whole SLO period.
- The ability to alert on the remaining error budget.