New Relic events: limits and sampling

This document will explain:

  • Why event-reporting limits are necessary
  • How sampling works
  • How to work with and understand sampled data

Difference between events and metrics

This document is about limits on event data and how those limits lead to sampling. First, it may help you to understand the differences between these two types of data:

  • Metrics: aggregated measurements over time. Examples: average response time over a one-minute time range, throughput over time, CPU utilization over time.
  • Events: discrete events that happen at a specific moment in time. Examples: a log or error being reported, or a configuration change. Some events are aggregated over time to form metrics (for example: a count of errors over time).

These two types of data have different uses: metrics are useful for recognizing patterns over time in your system, and events are useful for drilling down and getting detail about the causes of those patterns. Because metrics are aggregated over time, they’re useful for spotting trends and changes in system behavior. For metrics that represent an aggregated count of events (like an HTTP response time metric), the individual events give you granular detail about what happened, and let you facet by attributes that have high cardinality (like account or user ID).

Why sampling of events is necessary

Our APM agents and Mobile agents have limits on how many events can be reported per harvest cycle. This is necessary because if there were no limit, a very large number of events being sent could have performance impacts on your application or on New Relic. When the limit is reached, the agents begin sampling events. Different agents have different limits, but the goal is to give a representative sample from across the harvest cycle.

Also, agents may sample if they can't connect to New Relic. When an agent can’t connect to New Relic, it continues to store data locally. But it must restrict the size of the payload that is eventually sent. For this reason, it samples events during the disconnected period. The longer it is disconnected, the more it will sample.

The impact of sampling

One result of sampling is that there can be a discrepancy between unsampled metric data and sampled event data. Examples of this:

  • An APM chart showing unsampled metric data may show you higher throughput than an equivalent NRQL query of sampled event data. For more about the difference between our metric timeslice data and event data, see Data types.
  • A non-New Relic monitoring service may show different results from New Relic.

Events that are capped and subject to sampling include:

For APM: see a way to compensate for sampling when querying data.

Change how sampling occurs

Before attempting to change how sampling occurs, please read these caveats and recommendations:

  • Reporting more events will result in the agent using more memory.
  • There will usually be a way for you to get the data you need without raising an agent's event-reporting limit.
  • The payload size limit is 1MB (compressed), so the number of events may still be affected by that limit. To determine if events are being dropped, see the agent log for a 413 HTTP status message.

Here are some ways to impact sampling:

  • Most agents have configuration options for changing the limit on sampled transactions (examples: Java agent’s max_samples_stored or the Android Mobile agent’s setMaxEventPoolSize).
  • If it’s important to you that a specific app activity not be sampled, you can use the Event API.
  • You could deploy your application across a greater number of instances. Because the limits are per-agent, more agents will mean a larger event reservoir.

APM: Compensating for sampling

When querying APM-reported events, you can compensate for sampling by using EXTRAPOLATE, which will give you an approximation of what unsampled data looks like.

For more help

Recommendations for learning more: