New Relic offers distributed tracing for monitoring and analyzing the behavior of modern distributed systems. This document gives an overview of how our distributed tracing feature works. For an introduction to the feature and its benefits, see Introduction to distributed tracing.
Trace propagation using headers
When distributed tracing is enabled, the New Relic agent adds an HTTP or messaging header named newrelic to a service's outbound requests. The header contains metadata that helps us link the spans together later: the trace ID, the span ID, the New Relic account ID, and sampling information.
When that service calls other New Relic-monitored services that also have distributed tracing enabled, each of them propagates the header in turn, until the request has completed.
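To illustrate the propagation idea, here is a minimal sketch in Python. It is not the agent's implementation or the actual header format: the JSON encoding, the field names (trace_id, span_id, parent_span_id, account_id, sampled), and all helper functions are assumptions chosen for illustration. Only the header name newrelic and the kinds of metadata it carries come from the description above.

```python
import json
import secrets

HEADER_NAME = "newrelic"  # header name from the text; the payload format below is hypothetical


def new_id(nbytes=8):
    """Return a random hex identifier (illustrative, not the agent's ID scheme)."""
    return secrets.token_hex(nbytes)


def start_trace(account_id, sampled):
    """Create trace context at the first monitored service in the trace."""
    return {
        "trace_id": new_id(16),
        "span_id": new_id(),
        "account_id": account_id,
        "sampled": sampled,
    }


def inject(context, headers):
    """Return a copy of the outbound headers with the trace context added."""
    outbound = dict(headers)
    outbound[HEADER_NAME] = json.dumps(context)
    return outbound


def extract(headers):
    """Read the trace context from inbound headers, or None if the header is missing."""
    raw = headers.get(HEADER_NAME)
    return json.loads(raw) if raw is not None else None


def continue_trace(inbound_headers):
    """Keep the trace ID and sampling decision, and start a new span for this service."""
    parent = extract(inbound_headers)
    if parent is None:
        return None  # the header never arrived, so the trace cannot be linked
    child = dict(parent)
    child["parent_span_id"] = parent["span_id"]
    child["span_id"] = new_id()
    return child
```

In this sketch, a downstream service would call continue_trace on its inbound headers and then inject the result into its own outbound requests, so the trace ID and the sampling decision travel with the request from service to service.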
Some services communicate through proxies or other intermediaries that don't automatically propagate the trace header. In that case, the intermediary must be configured to pass the header along. For more on this and other reasons a trace may have missing data, see the troubleshooting documentation.
Alignment with industry standards
Distributed tracing offerings from all vendors (including us) currently use proprietary solutions for propagating trace context. We are involved in the W3C Distributed Tracing working group, which is working toward a standard header name and content. This industry standard will allow distributed tracing to work consistently across vendor implementations. We will adopt this standard once it is approved.
Data collection and storage
A distributed trace generates both Span and Transaction event types:
- Span event: A span represents a duration of time in a service or service function. Span events are created by New Relic agents based on several sampling rules. To see attributes for this event type, see Span attributes.
- Transaction event: If an entity is monitored by a New Relic agent, a request to that entity generates a single Transaction event. Transaction data allows distributed trace data to be tied to other APM features, like custom attributes. To see attributes for this event type, see Transaction attributes.
Spans and transactions that have been selected for inclusion in a trace are combined to display complete traces on the Distributed tracing UI page. All events, including spans not displayed in the UI and transactions not part of a trace, can be queried via New Relic Insights.
For more on how distributed trace data is displayed in the New Relic UI, see Understand distributed trace data.
Sampling of trace data
Distributed tracing reports and processes a large amount of data. A typical trace will contain many events, each with its own set of attributes, and there will usually be many traces reported per minute for each New Relic agent.
For this reason, we use sampling to capture a representative sample of system activity. This sampled data ideally represents the characteristics of the larger data population.
More detail on the sampling process:
- Adaptive sampling process
New Relic APM agents use adaptive sampling to capture a representative sample of system activity. The following is an explanation of how adaptive sampling works. (AWS Lambda monitoring uses a different sampling process.)
For the first service in a distributed trace, 10 requests per minute are chosen to be sampled by default. The throughput to that service determines how frequently requests are sampled. This is explained in more detail below.
The first New Relic-monitored service in a distributed trace is called the trace origin. The trace origin chooses requests at random to be traced. That decision propagates to the downstream services touched by that request. When the request has completed, all of the spans created by the services touched by that request are available in the New Relic UI as a complete end-to-end trace.
APM agents have a limit on the number of transactions collected per minute (this varies by agent) and a limit on the number of spans collected per minute (1,000 per agent instance). To adhere to these limits, the trace origin samples 10 traces per minute by default.
An APM agent spreads out the collection of these 10 traces over a minute in order to get a representative sample over time. The exact sampling rate depends on the number of transactions in the previous minute. The rate responds to changes in transaction throughput, going up or down.
For example: if the previous minute had 100 transactions, the agent would anticipate a similar number of transactions and select 1 out of every 10 transactions to be traced.
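The arithmetic in this example can be sketched in Python. This is a simplified model, not the agent's algorithm: the class name AdaptiveSampler, its methods, the hard cap once the target is reached, and the choice to sample every request when there is little history are all assumptions made for illustration. The target of 10 traces per minute comes from the text above.

```python
import random


class AdaptiveSampler:
    """Toy model: aim for `target` sampled requests per minute at the trace origin."""

    def __init__(self, target=10, rng=None):
        self.target = target
        self.rng = rng or random.Random()
        self.prev_count = 0  # transactions seen in the previous minute
        self.seen = 0        # transactions seen so far this minute
        self.sampled = 0     # transactions sampled so far this minute

    def sampling_ratio(self):
        """Probability of sampling a request, based on last minute's throughput."""
        if self.prev_count <= self.target:
            return 1.0  # assumption: with little or no history, sample everything
        return self.target / self.prev_count

    def should_sample(self):
        """Decide at random whether the current request is traced."""
        self.seen += 1
        if self.sampled >= self.target:
            return False  # assumption: stop once this minute's target is met
        if self.rng.random() < self.sampling_ratio():
            self.sampled += 1
            return True
        return False

    def end_of_minute(self):
        """Roll the window: this minute's throughput sets the next minute's ratio."""
        self.prev_count = self.seen
        self.seen = 0
        self.sampled = 0
```

With 100 transactions in the previous minute, sampling_ratio() returns 0.1, which corresponds to selecting roughly 1 out of every 10 transactions. If throughput doubles to 200, the ratio for the following minute drops to 0.05.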
- Agent span limits and sampling
A New Relic agent instance has a limit of 1,000 spans per minute. The agent attempts to keep all spans that are marked to be sampled as part of a distributed trace.
In many distributed systems, the average microservice may generate 10 to 20 spans per request. In those cases, the agent span limit can accommodate all spans chosen, and that service will have full detail in a trace.
However, some requests generate many spans, and the agent span limit will be reached. As a result, some traces will not have full detail for that service. One solution is to use custom instrumentation so that the New Relic agent reports less activity and therefore fewer spans.
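A quick way to see when the span limit matters is to multiply the number of sampled requests per minute by the number of spans each request generates. The sketch below does that in Python. The function name span_budget and the figure of 150 spans per request are hypothetical, and the model assumes the service handles only the default 10 sampled requests per minute. The limits of 1,000 spans and 10 traces per minute come from the text above.

```python
SPAN_LIMIT_PER_MINUTE = 1000    # per agent instance, from the text
DEFAULT_TRACES_PER_MINUTE = 10  # default at the trace origin, from the text


def span_budget(spans_per_request,
                traces_per_minute=DEFAULT_TRACES_PER_MINUTE,
                span_limit=SPAN_LIMIT_PER_MINUTE):
    """Return (spans_wanted, spans_kept, spans_dropped) for one minute."""
    wanted = spans_per_request * traces_per_minute
    kept = min(wanted, span_limit)
    return wanted, kept, wanted - kept


for spans in (10, 20, 150):
    wanted, kept, dropped = span_budget(spans)
    print(f"{spans:>3} spans/request: wanted={wanted}, kept={kept}, dropped={dropped}")
```

At 10 or 20 spans per request, the 100 or 200 spans wanted fit well within the limit and nothing is dropped. At 150 spans per request, 1,500 spans are wanted and 500 of them exceed the limit, so some traces lose detail for that service.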