Cumulative metrics (OTel and Prometheus)

If you report cumulative metrics through our OpenTelemetry integration or our Prometheus remote write integration, it helps to understand how New Relic handles that data (for example, how we convert it to delta measurements). Knowing this makes it easier to interpret your New Relic UI views, query your data, and diagnose data-reporting issues.

Cumulative and delta metrics explained

When collecting metric data from an application, it's important to consider how that data was measured when deciding how to use and interpret it at query time. The metric type is one important factor, because certain New Relic aggregation functions work with some types and not others. Another important factor is the metric's temporality.

The two temporalities are delta and cumulative. Delta indicates that measurements are reset between reporting intervals. Cumulative indicates there is no reset and the measurements are accumulated. Prometheus is a common example of a cumulative metrics collector (Prometheus docs on data types), and OpenTelemetry also defines ways to collect cumulative metrics (OpenTelemetry docs on temporality).
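
To make the distinction concrete, here is a small Python sketch that represents the same counter in both temporalities. The sample values and the function name to_cumulative are made up for this illustration.

```python
from itertools import accumulate


def to_cumulative(deltas):
    """Return the running total of per-interval (delta) measurements."""
    return list(accumulate(deltas))


# Requests observed in four consecutive reporting intervals (hypothetical values).
# Delta temporality: each measurement starts again from zero.
delta_points = [5, 7, 0, 8]

# Cumulative temporality: the same counter is never reset between intervals.
cumulative_points = to_cumulative(delta_points)
print(cumulative_points)  # [5, 12, 12, 20]
```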

New Relic supports both Prometheus and OpenTelemetry cumulative data and performs delta conversion at ingest, so that the data aligns with other metrics on our platform and is easier to query with NRQL. Cumulative counters are stored as a New Relic cumulativeCount, and cumulative histograms are stored as a New Relic distribution.

Prometheus remote write support

For more information, see Prometheus remote write integration.

OpenTelemetry support

For more on OpenTelemetry support, see OpenTelemetry metrics: Best practices.

Cumulative to delta conversion details

At a high level, delta conversion is performed by taking two data points that are consecutive in event time and computing the difference between them. In practice, it is rarely that simple. Here are some common scenarios we anticipate and how we handle them.

Resets

If data for a time series suddenly decreases in value, we treat this as a reset and will emit that new measurement as its own delta value (in other words, as if it were preceded by a 0 measurement).
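
The reset rule can be sketched in a few lines of Python. This is a minimal illustration, not New Relic's actual implementation. The function name is hypothetical, and the sketch assumes that the first point of a series only establishes a baseline and produces no delta.

```python
def cumulative_to_delta(values):
    """Convert cumulative values (ordered by event time) into deltas.

    A decrease is treated as a reset: the new value is emitted as its own
    delta, as if it had been preceded by a 0 measurement.
    """
    deltas = []
    previous = None
    for value in values:
        if previous is None:
            pass  # Assumption: the first point is a baseline only.
        elif value < previous:
            deltas.append(value)  # Reset: same as value - 0.
        else:
            deltas.append(value - previous)
        previous = value
    return deltas


# The drop from 22 to 3 is handled as a reset.
print(cumulative_to_delta([10, 15, 22, 3, 9]))  # [5, 7, 3, 6]
```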

OpenTelemetry also defines situations where a decrease in value is unexpected, and we do our best to detect these cases and notify you via New Relic integration errors (see troubleshooting below).

Reordering data

Many things can cause data points to arrive at New Relic out of order. For this reason, we buffer data points and reorder them if we detect an unexpected gap in the reporting time series. Gaps are inferred from an expected reporting interval, which is determined by the rate at which we receive data for a given time series. Buffering is bounded, and eventually we will consider a data point "too late for resequencing". In this case, a delta is computed across the detected gap and processing of the time series continues.
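
One way to picture bounded resequencing is the Python sketch below. It is an illustration under stated assumptions, not New Relic's implementation: the class name ReorderBuffer, the fixed interval parameter (the text says the real interval is inferred from the data rate), the max_buffered bound, and the choice to discard a point that arrives after its slot has been passed are all hypothetical.

```python
import heapq


class ReorderBuffer:
    """Release points in event-time order, waiting a bounded time for gaps."""

    def __init__(self, interval, max_buffered):
        self.interval = interval          # expected seconds between points
        self.max_buffered = max_buffered  # how many points may be held back
        self.pending = []                 # min-heap of (timestamp, value)
        self.last_released = None         # timestamp of the last released point

    def add(self, timestamp, value):
        """Accept one point and return the points now ready for processing."""
        if self.last_released is not None and timestamp <= self.last_released:
            return []  # Assumption: too late for resequencing, so discard.
        heapq.heappush(self.pending, (timestamp, value))
        released = []
        while self.pending:
            next_ts = self.pending[0][0]
            in_sequence = (
                self.last_released is None
                or next_ts <= self.last_released + self.interval
            )
            if not in_sequence and len(self.pending) <= self.max_buffered:
                break  # Gap detected: keep buffering in case it fills.
            # Either the point is in sequence, or the buffer is full and
            # processing continues across the gap.
            released.append(heapq.heappop(self.pending))
            self.last_released = next_ts
        return released


buffer = ReorderBuffer(interval=60, max_buffered=2)
print(buffer.add(0, 10))    # [(0, 10)]
print(buffer.add(120, 30))  # [] (the point for t=60 is missing, so wait)
print(buffer.add(60, 20))   # [(60, 20), (120, 30)]
```

If the missing point never arrives, the buffer fills up and the held points are released anyway, which corresponds to computing a delta across the gap.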

Stale data

Because delta conversion is a stateful operation, we must account for time series that stop reporting and eventually drop their state. If a time series has not reported any new data points for 5 minutes, we flush the state we have, including computing deltas across any buffered gaps. If a data point arrives later, it is treated as the beginning of that time series, which effectively loses the delta between the last data point before the flush and the first data point after the flush. For this reason, metric reporting intervals should be less than 5 minutes to get the benefit of delta conversion.
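
The effect of the 5-minute flush can be sketched as follows. This is a simplified Python illustration, not New Relic's implementation. The class and method names are hypothetical, the clock is supplied by the caller, and the sketch assumes that the first point after a flush produces no delta.

```python
STALE_AFTER_SECONDS = 5 * 60  # From the text: state is flushed after 5 minutes.


class DeltaConverter:
    """Keep the last cumulative value per series, dropping stale state."""

    def __init__(self):
        self._state = {}  # series name -> (last_seen, last_cumulative)

    def observe(self, series, now, cumulative):
        """Return the delta for this point, or None at the start of a series."""
        self._flush_stale(now)
        previous = self._state.get(series)
        self._state[series] = (now, cumulative)
        if previous is None:
            return None  # Assumption: a new series only sets a baseline.
        last_value = previous[1]
        if cumulative < last_value:
            return cumulative  # Decrease: treated as a reset.
        return cumulative - last_value

    def _flush_stale(self, now):
        stale = [
            name
            for name, (seen, _) in self._state.items()
            if now - seen >= STALE_AFTER_SECONDS
        ]
        for name in stale:
            del self._state[name]


converter = DeltaConverter()
print(converter.observe("requests", 0, 100))    # None (start of the series)
print(converter.observe("requests", 60, 130))   # 30
print(converter.observe("requests", 600, 190))  # None (state was flushed)
```

In the last call, the increase from 130 to 190 is lost because the series was silent for more than 5 minutes.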

Special note about cumulative sums

Because of delta conversion, calling sum() on this data treats it as a sequence of delta values, even though your application recorded and sent it as a sequence of monotonically increasing values. There is no need to compute any derivative()!

When converting a sum to a delta, New Relic also emits the cumulative value alongside the delta so that you can still query the latest cumulative value. To access the cumulative value, you can use getField() (see How to query metrics for examples).

Note that data points are plotted at their associated timestamps, which mark the start of an interval. Cumulative values, however, are associated with the endTimestamp of the data point, so you may need to account for the width of a data point when interpreting cumulative queries.
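
The relationship between the two timestamps can be illustrated with a short Python sketch. The ConvertedPoint type and its field names are hypothetical and are not New Relic's storage schema. Resets are ignored here for brevity.

```python
from dataclasses import dataclass


@dataclass
class ConvertedPoint:
    timestamp: int      # start of the interval; where the point is plotted
    end_timestamp: int  # end of the interval; when `cumulative` was observed
    delta: float
    cumulative: float


def convert(samples):
    """Pair up consecutive (event_time, cumulative_value) samples."""
    points = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        points.append(ConvertedPoint(t0, t1, v1 - v0, v1))
    return points


points = convert([(0, 100), (60, 130), (120, 175)])
for point in points:
    print(point)
```

The second point is plotted at timestamp 60, but its cumulative value of 175 was observed at 120, one interval width later.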

Troubleshooting

In some cases, we will report a New Relic integration error as a result of the cumulative to delta conversion process. Here are some common reasons.

Translation errors

Delta conversion assumes that two data points consecutive in event time will have monotonically increasing cumulative values. The only time this assumption is expected to break is when the process being monitored is restarted. If monotonicity is broken for any other reason, we still treat this as a reset, but we attempt to notify you by generating a New Relic integration error event in your account. This is possible for OpenTelemetry data but not for Prometheus data, because OpenTelemetry includes more information that can be used to detect such situations.

The most common cause of an unexpected break in monotonicity is the client-side application hitting cardinality limits and dropping data to relieve memory pressure. In certain cases, this acts as an unexpected reset and can result in an unexpected decrease in the values sent to New Relic. We recommend that you look for instances of this in your OTLP logs:

Instrument % has exceeded the maximum allowed accumulations (2000)

The OpenTelemetry SDK allows you to configure its cardinality limits. OpenTelemetry also provides a way to reduce cardinality on the client side using Views, which is the recommended way to fix these issues. Another option is to export your OTLP metrics using delta temporality, which can help reduce memory use.

Cardinality limits

During translation, we also loosely enforce metric cardinality limits, based on your metric entitlements, as a system protection. Rather than being enforced per day, as with rollups, this limit applies to the number of concurrent time series being tracked. Once there are too many concurrent unique metric time series, we will drop new incoming time series until an old one ages out (see Stale data).
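
A limit on concurrent time series, as opposed to a per-day limit, can be pictured with the following Python sketch. It is an illustration only: the class name SeriesLimiter, the limit of 2, and the use of the 5-minute staleness window for aging out are assumptions, and real limits depend on your entitlements.

```python
class SeriesLimiter:
    """Admit points only while the number of tracked series is under a limit."""

    def __init__(self, max_series, stale_after=300):
        self.max_series = max_series
        self.stale_after = stale_after  # seconds of silence before aging out
        self.last_seen = {}             # series name -> time of last point

    def admit(self, series, now):
        """Return True if the point is accepted, False if it is dropped."""
        stale = [
            name
            for name, seen in self.last_seen.items()
            if now - seen >= self.stale_after
        ]
        for name in stale:
            del self.last_seen[name]  # Old series ages out and frees a slot.
        is_new = series not in self.last_seen
        if is_new and len(self.last_seen) >= self.max_series:
            return False  # Too many concurrent series: drop the new one.
        self.last_seen[series] = now
        return True


limiter = SeriesLimiter(max_series=2)
print(limiter.admit("a", 0))    # True
print(limiter.admit("b", 10))   # True
print(limiter.admit("c", 20))   # False (limit reached)
print(limiter.admit("a", 200))  # True (already tracked)
print(limiter.admit("c", 310))  # True ("b" has aged out)
```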

Cumulative metric resets

Cumulative metric resets typically occur when the service or application reporting them restarts. When you query a metric that has reset, it may appear as though the metric has decreased in value; however, this is expected behavior, as described in the OpenTelemetry metric specification. To differentiate between normal metric resets and a problematic metric that is decreasing unexpectedly, check your account for any New Relic integration errors. If no integration errors are reported in relation to metrics with decreasing values, it is likely that the application reporting the cumulative metrics is restarting and resetting the metric value.