Alerts offers NRQL conditions in three threshold types: static, baseline, and outlier. This document explains how the outlier threshold type works, gives some example use cases and NRQL queries, and explains how to create an outlier condition.
What is outlier detection?
In software development and operations, it is common to have a group consisting of members you expect to behave approximately the same. For example: for servers using a load balancer, the traffic to the servers may go up or down, but the traffic for all the servers should remain in a fairly tight grouping.
The NRQL alert outlier detection feature parses the data returned by your faceted NRQL query and:
- Looks for the number of expected groups that you specify
- Looks for outliers (values deviating from a group) based on the sensitivity and time range you set
Additionally, for queries that have more than one group, you can choose to be notified when groups start behaving the same.
For more on the rules and logic behind this calculation, see Outlier detection rules.
Note: this feature does not take into account the past behavior of the monitored values; it looks for outliers only in the currently reported data. For an alert type that takes into account past behavior, see Baseline alerting.
Example use cases
These use cases will help you understand when to use the outlier threshold type. Note that the outlier feature requires a NRQL query with a FACET clause.
- Notify if load-balanced servers have uneven workload
A load balancer divides web traffic approximately evenly across five different servers. You can set a notification to be sent if any server starts getting significantly more or less traffic than the other servers.
SELECT average(cpuPercent) FROM SystemSample WHERE apmApplicationNames = 'MY-APP-NAME' FACET hostname
- Notify if load-balanced application has misbehaving instances
Application instances behind a load balancer should have similar throughput, error rates, and response times. If an instance is in a bad state, or a load balancer is misconfigured, this will not be the case. Detecting one or two bad app instances using aggregate metrics may be difficult if there is not a significant rise in the overall error rate of the application.
You can set a notification for when an app instance’s throughput, error rate, or response time deviates too far from the rest of the group.
SELECT average(duration) FROM Transaction WHERE appName = 'MY-APP-NAME' FACET host
- Notify of changes in different environments
An application is deployed in two different environments, with ten application instances in each. One environment is experimental and gets more errors than the other. But the instances that are in the same environment should get approximately the same number of errors.
You can set a notification for when an instance starts getting more errors than the other instances in the same environment. Also, you can set a notification for when the two environments start to have the same number of errors as each other.
The number of logged-in users for a company is about the same for each of four applications, but varies significantly across the three time zones the company operates in.
You can set a notification for when any application starts getting more or less traffic from a certain time zone than the other applications. Sometimes the traffic from the different time zones is the same, so you would set up the alert condition so that you are not notified when the time zone groups overlap.
For more details on how this feature works, see Outlier rules and logic.
Create an outlier alert condition
To create a NRQL alert that uses outlier detection:
- When creating a condition, under Select a product, select NRQL.
- For Threshold type, select Outlier.
- Create a NRQL query with a FACET clause that returns the values you want to alert on.
- Depending on how the returned values group together, set the Number of expected groups.
- Adjust the deviation from the center of the group(s) and the duration that will trigger a violation.
- Optional: Add a warning threshold and set its deviation.
- Set any remaining available options and save.
Rules and logic
Here are the rules and logic behind how outlier detection works:
- Details about alert condition logic
After the condition is created, the query is run once every harvest cycle and the condition is applied. Unlike baseline alerts, outlier detection uses no historical data in its calculation; it's calculated using the currently collected data.
Alerts will attempt to divide the data returned from the query into the number of groups selected during condition creation.
For each group, the approximate average value is calculated. The allowable deviation you have chosen when creating the condition is centered around that average value. If a member of the group is outside the allowed deviation, it produces a violation.
If Trigger when groups overlap has been selected, Alerts detects a convergence of groups. If the condition is looking for two or more groups, and the returned values cannot be separated into that number of distinct groups, then that will produce a violation. This type of “overlap” event is represented on a chart by group bands touching.
Because this feature does not take past behavior into account, data is never considered to "belong" to a certain group. For example, a value that switches places with another value wouldn't trigger a violation. Additionally, an entire group that moves together also wouldn't trigger a violation.
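The logic described above can be illustrated with a small Python sketch. It is a simplified model, not the actual Alerts implementation: the grouping step here (cutting the sorted values at the widest gaps) is an assumed stand-in, since this document does not say how values are divided into groups, and the function names are hypothetical. It also computes a plain mean, whereas the document only says an approximate average is used.

```python
from statistics import mean


def split_into_groups(values, n_groups):
    """Split {facet: value} into n_groups by cutting at the widest gaps.

    Hypothetical stand-in for the real grouping step, which is not documented here.
    """
    if not values:
        return []
    ordered = sorted(values.items(), key=lambda item: item[1])
    gaps = [(ordered[i + 1][1] - ordered[i][1], i) for i in range(len(ordered) - 1)]
    # Positions after which to cut: the (n_groups - 1) widest gaps.
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[: max(n_groups - 1, 0)])
    groups, start = [], 0
    for cut in cuts:
        groups.append(dict(ordered[start : cut + 1]))
        start = cut + 1
    groups.append(dict(ordered[start:]))
    return groups


def find_outliers(groups, deviation):
    """Return facets whose value is further than `deviation` from their group's mean."""
    outliers = []
    for group in groups:
        center = mean(list(group.values()))
        outliers += [f for f, v in group.items() if abs(v - center) > deviation]
    return outliers


def groups_overlap(groups, deviation):
    """True if the +/- deviation bands of any two neighbouring groups touch."""
    centers = sorted(mean(list(g.values())) for g in groups)
    return any(b - a <= 2 * deviation for a, b in zip(centers, centers[1:]))


# Five load-balanced servers, one expected group, allowed deviation of 10.
cpu = {"web-1": 40, "web-2": 42, "web-3": 41, "web-4": 39, "web-5": 80}
print(find_outliers(split_into_groups(cpu, 1), deviation=10))  # ['web-5']
```

Because only the current values are passed in, two facets that swap values, or a whole group that shifts together, produce the same result as before, which matches the behavior described above.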
- NRQL query rules and limits
The NRQL query must be a faceted query, and can only facet on one attribute. Queries that facet on more than one attribute won't work.
The number of unique values returned must be 500 or fewer. If the query returns more than this number of values, the condition won't be created. If the query starts returning more than this number after the condition has been created, the alert will fail.
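If you collect facet values yourself before creating a condition, a guard like the following can catch the limit early. This is only an illustrative sketch: `check_facet_count` is a hypothetical helper, not part of any New Relic API, and the only fact taken from this document is the limit of 500.

```python
MAX_FACET_VALUES = 500  # limit stated in this document


def check_facet_count(facet_values, limit=MAX_FACET_VALUES):
    """Raise ValueError if there are more unique facet values than allowed."""
    count = len(set(facet_values))
    if count > limit:
        raise ValueError(f"query returned {count} facet values; limit is {limit}")
    return count
```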
- Zero values for unreturned data
When a query returns a set of values, only values that are actually returned are taken into account. If a value is not available for calculation (including if it goes from being collected one harvest cycle to not being collected), it is rendered as a zero and is not considered. In other words, the behavior of unreturned zero values will never trigger violations.
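As a rough illustration of the rule above, the sketch below drops facets that returned nothing in the current harvest cycle before any group calculation happens. Modelling an unreturned value as `None` is an assumption made for this example, as is keeping a genuinely reported 0; the function name is hypothetical.

```python
def reported_values(cycle_result):
    """Keep only facets that actually returned a value this cycle.

    Unreturned values are modelled as None (an assumption for this sketch).
    A genuinely reported 0 is kept, which is also an assumption.
    """
    return {facet: value for facet, value in cycle_result.items() if value is not None}


# web-3 reported in the previous cycle but not in this one.
current = {"web-1": 41, "web-2": 43, "web-3": None}
print(reported_values(current))  # {'web-1': 41, 'web-2': 43}
```

Since `web-3` is excluded rather than counted as 0, it neither shifts the group average nor shows up as an outlier.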