Custom anomaly detection

Custom anomalies give your team the most versatility when detecting unusual behavior in your system. Not only are they flexible and dynamic, but they also let your team alert on any entity or signal and adjust and optimize your thresholds. Custom anomalies are built using the same advanced tuning settings as static alerting, so you can ensure your team sees only the incidents that are important to you.

You can also enrich your custom anomaly detection configuration with additional metadata to provide further context and add custom incident descriptions that can provide additional instructions to your on-call engineers.

Configure custom anomaly thresholds

You can create custom anomaly thresholds from an alert condition. Here are some tips for setting anomaly thresholds:

  • Set the anomaly direction to monitor behavior that occurs either above or below the anomaly.
  • Use the slider bar to adjust the Critical threshold sensitivity, represented in the preview chart by the light gray area around the anomaly. The tighter the band around the anomaly, the more sensitive it is and the more incidents it will generate.
  • You can create a Warning threshold (the darker gray area around the anomaly).
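
As a rough intuition for the sensitivity slider, here's a minimal Python sketch (with hypothetical values, not the product's implementation) of why a tighter band around the anomaly generates more incidents:

```python
# Illustrative only: how a tighter threshold band around the predicted
# value flags more data points. All names and numbers are hypothetical.

signal = [100, 103, 97, 120, 99, 78, 101]          # observed values
predicted = [100, 100, 100, 100, 100, 100, 100]    # anomaly prediction
stddev = 10.0  # estimated standard deviation of the signal

def count_breaches(threshold_stddevs):
    """Count data points outside the band of +/- N standard deviations."""
    return sum(
        1 for obs, pred in zip(signal, predicted)
        if abs(obs - pred) > threshold_stddevs * stddev
    )

# A sensitive (tight) band flags more points than a loose one.
print(count_breaches(1.0))  # tight band  -> 2
print(count_breaches(2.0))  # loose band  -> 1
```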

Follow these steps to create your custom anomaly:

  1. Go to Alerts & AI and select Alert conditions (policies) in the left navigation panel.

  2. Click + New alert condition.

  3. Define your signal and click Next.

  4. Adjust the signal behavior and click Next.

  5. Click the Try anomaly thresholds link.

  6. After selecting 'anomaly' as your threshold type, you'll need to configure the settings for one or more thresholds. Anomaly detection predicts what the next data point will be based on prior activity. The threshold value for custom anomaly detection is the number of standard deviations your signal value is away from the predicted value.

    To configure the threshold, you'll need to:

    • Set the 'threshold direction' to upper, lower, or both. This means that we'll only create an incident if the signal value (the output of the query) is above the predicted value, below the predicted value, or either, respectively.

    • Choose how many of the data points during the specified time period must be outside the threshold: for at least or at least once in. Selecting for at least means that ALL of your signal's data points must be outside the threshold for the specified time period before an incident is opened; the inverse must be true to close the incident. The at least once in option means that an incident opens as soon as any one of your signal's data points is outside the threshold. With this option, the time duration isn't relevant for opening an incident, but it is for closing one: all of your signal's data points must be within the threshold for the specified period of time.

    • Set the 'threshold duration'. This is the time period mentioned above: how long the signal value must remain outside the threshold before an incident is opened. Conversely, it's also how long a signal must stay within the threshold for an incident to close.

    • Set the 'threshold level'. For custom anomaly detection, this is the number of standard deviations the signal's data point is from the predicted value.
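
Putting these settings together, here's a hypothetical Python sketch of the evaluation logic described above (the names and numbers are illustrative; the real evaluation happens inside the alerting pipeline):

```python
# Hypothetical sketch of the threshold settings described above,
# not the actual alerting implementation.

def breaches(observed, predicted, stddev, level, direction):
    """True if a single data point violates the threshold."""
    d = (observed - predicted) / stddev  # deviation in standard deviations
    if direction == "upper":
        return d > level
    if direction == "lower":
        return d < -level
    return abs(d) > level  # "both"

def opens_incident(window, predicted, stddev, level, direction, mode):
    """Evaluate one threshold-duration window of data points.

    mode="for_at_least": ALL points in the window must breach.
    mode="at_least_once_in": ANY single breaching point opens an incident.
    """
    flags = [breaches(v, predicted, stddev, level, direction) for v in window]
    return all(flags) if mode == "for_at_least" else any(flags)

window = [130, 128, 95, 131]  # observed values over the threshold duration
print(opens_incident(window, predicted=100, stddev=10, level=2,
                     direction="upper", mode="at_least_once_in"))  # True
print(opens_incident(window, predicted=100, stddev=10, level=2,
                     direction="upper", mode="for_at_least"))      # False
```

The second call is False because one data point (95) stays inside the band, so the "for at least" requirement that every point breach is not met.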

  7. Click Next.

  8. Add the details of the alert condition and click Save condition.

Setting thresholds for multi-signal conditions (faceted queries)

Depending on how you defined your query in step 1, the alert condition may be monitoring many signals, not just one. When working with NRQL, these queries use the FACET clause. One alert condition can monitor a maximum of 5,000 signals. Each signal is monitored and evaluated individually, but the threshold settings you specify apply consistently to all of them. The preview chart shows a maximum of 500 signals, and the predicted signal and threshold bands aren't shown when more than one signal appears on the chart. To see that information while determining the ideal threshold value, select one of the time series signals from the legend to filter the chart down to a single time series.
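
For intuition, here's an illustrative Python sketch (hypothetical facet names and data) of one set of threshold settings being applied independently to each faceted signal:

```python
# Illustrative only: the same threshold, shared by every faceted signal,
# with each signal evaluated on its own. Names and data are hypothetical.

from collections import defaultdict

# Data points as (facet_value, observed, predicted) tuples, e.g. per host.
points = [
    ("host-a", 105, 100), ("host-a", 98, 100),
    ("host-b", 160, 100), ("host-b", 155, 100),
]

stddev, level = 10.0, 2.0  # one threshold configuration for all signals

by_signal = defaultdict(list)
for facet, observed, predicted in points:
    by_signal[facet].append(abs(observed - predicted) > level * stddev)

# Each facet is evaluated separately; here with "at least once in" semantics.
open_incidents = {facet: any(flags) for facet, flags in by_signal.items()}
print(open_incidents)  # {'host-a': False, 'host-b': True}
```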

Anomaly direction: select upper or lower ranges

You can choose whether you want the condition to look for behavior that goes above the predicted value ("upper"), below the predicted value ("lower"), or either above or below. You choose these with the prediction direction selector.

Example use cases for this:

  • You might use the Upper setting for a data source like error rate, because you generally are only concerned if it goes up, and aren't concerned if it goes down.
  • You might use the Lower setting for a data source like throughput, because sudden upward fluctuations are quite common, but a large sudden downswing would indicate a problem.

Here are examples of how large fluctuations in your data would be treated under the different anomaly direction settings. The red areas represent incidents.
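
Under stated assumptions (hypothetical values and a simplified band check, not the actual algorithm), the two use cases above can be sketched as:

```python
# Toy illustration of the direction settings: an upward error-rate spike
# trips an "upper" condition, while a sudden throughput drop trips a
# "lower" one. All values are hypothetical.

def breach(observed, predicted, band, direction):
    """band is the width of the gray threshold area around the prediction."""
    if direction == "upper":
        return observed > predicted + band
    if direction == "lower":
        return observed < predicted - band
    return abs(observed - predicted) > band  # "both"

# Error rate: only upward movement matters.
print(breach(observed=9.0, predicted=2.0, band=3.0, direction="upper"))   # True
print(breach(observed=0.1, predicted=2.0, band=3.0, direction="upper"))   # False

# Throughput: a large sudden downswing is the problem.
print(breach(observed=40.0, predicted=500.0, band=200.0, direction="lower"))   # True
print(breach(observed=900.0, predicted=500.0, band=200.0, direction="lower"))  # False
```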

Rules governing calculation of the predicted value

The algorithm for calculating the prediction is mathematically complex. Here are some of the major rules governing its predictive abilities:

  • Age of data: On initial creation, the prediction is calculated using between 1 and 4 weeks of data, depending on data availability and prediction type. After its creation, the algorithm takes into account ongoing data fluctuations over a long time period, although greater weight is given to more recent data. For data that has only existed for a short time, the predicted value will likely fluctuate a good deal and not be very accurate, because there isn't enough data to determine its usual values and behavior. The more history the data has, the more accurate the prediction and thresholds will become.
  • Consistency of data: For metric values that remain in a consistent range or that trend slowly and steadily, their more predictable behavior means that their thresholds will become tighter around the prediction. Data that is more varied and unpredictable will have looser (wider) thresholds.
  • Regular fluctuations: For shorter-than-one-week cyclical fluctuations (such as weekly Wednesday 1pm deployments or nightly reports), the prediction algorithm looks for these cyclical fluctuations and attempts to adjust to them.
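
The production algorithm is far more involved than anything shown here, but the idea of weighting recent data more heavily can be illustrated with a toy exponentially weighted moving average (all names and values are hypothetical, not the actual prediction method):

```python
# Toy sketch only: recent data is weighted more heavily than older data.
# The real prediction algorithm is mathematically more complex.

def predict_next(history, alpha=0.3):
    """Predict the next data point; higher alpha weights recent values more."""
    if not history:
        raise ValueError("need at least one data point")
    estimate = history[0]
    for value in history[1:]:
        estimate = alpha * value + (1 - alpha) * estimate
    return estimate

# A steady signal yields a stable, accurate prediction...
print(round(predict_next([100, 100, 100, 100]), 1))  # 100.0
# ...while a recent jump pulls the prediction toward the new level.
print(round(predict_next([100, 100, 100, 160]), 1))  # 118.0
```

This also mirrors the "age of data" rule above: with only a few data points, a single outlier moves the estimate a lot, while a longer history makes the prediction steadier.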