This document contains technical details about how New Relic distributed tracing works, including how trace data is structured and stored.
New Relic trace structure
Understanding the structure of a distributed trace can help you:
A distributed trace has a tree-like structure, with "child" spans that refer to one "parent" span. This diagram shows some important span relationships in a trace:
This diagram shows several important concepts:
- Trace root. The first service or process in a trace is referred to as the root service or process.
- Process boundaries. A process represents the execution of a logical piece of code. Examples of a process include a backend service or Lambda function. Spans within a process are categorized as one of the following:
- Entry span: the first span in a process.
- Exit span: a span is considered an exit span if it a) is the parent of an entry span, or b) has db. attributes and therefore represents an external call.
- In-process span: a span that represents an internal method call or function and that is not an exit or entry span.
- Client spans. A client span represents a call to another entity or external dependency. Currently, there are two client span types:
- Datastore. If a client span has any attributes prefixed with db. (like db.statement), it's categorized as a datastore span.
- External. If a client span has any attributes prefixed with http. (like http.url) or has a child span in another process, it's categorized as an external span. This is a general category for any external calls that are not datastore queries.
- Trace duration. A trace's total duration is determined by the length of time from the start of the earliest span to the completion of the last span.
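The span categories and the trace duration rule described above can be sketched in a few lines of Python. This is a minimal illustration, not New Relic's implementation: the `Span` class, its field names (`parent_id`, `process`, `start_ms`, `duration_ms`), and the helper functions are all hypothetical, and the `db.` and `http.` prefixes are assumed from the `db.statement` and `http.url` examples.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    # Hypothetical, simplified span record used only for this sketch.
    guid: str
    parent_id: Optional[str]   # None for the first span in the trace
    process: str               # label of the process that produced the span
    start_ms: int
    duration_ms: int
    attributes: Dict[str, str] = field(default_factory=dict)


def index_spans(spans):
    """Build lookup tables: span by ID, and children grouped by parent ID."""
    by_id = {s.guid: s for s in spans}
    children = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s)
    return by_id, children


def has_prefix(span, prefix):
    return any(key.startswith(prefix) for key in span.attributes)


def calls_other_process(span, children):
    return any(c.process != span.process for c in children.get(span.guid, []))


def span_role(span, by_id, children):
    """Return 'entry', 'exit', or 'in-process'."""
    parent = by_id.get(span.parent_id) if span.parent_id is not None else None
    # Entry span: no parent in the same process, so it is first in its process.
    if parent is None or parent.process != span.process:
        return "entry"
    # Exit span: parent of an entry span, or carries db. attributes.
    if calls_other_process(span, children) or has_prefix(span, "db."):
        return "exit"
    return "in-process"


def client_type(span, children):
    """Return 'datastore', 'external', or None if this is not a client span."""
    if has_prefix(span, "db."):
        return "datastore"
    if has_prefix(span, "http.") or calls_other_process(span, children):
        return "external"
    return None


def trace_duration_ms(spans):
    """Start of the earliest span to completion of the last span."""
    earliest_start = min(s.start_ms for s in spans)
    latest_end = max(s.start_ms + s.duration_ms for s in spans)
    return latest_end - earliest_start


spans = [
    Span("a", None, "web", 0, 120),
    Span("b", "a", "web", 10, 30, {"db.statement": "SELECT 1"}),
    Span("c", "a", "web", 50, 60, {"http.url": "https://example.com/api"}),
    Span("d", "c", "api", 55, 40),
    Span("e", "d", "api", 60, 10),
]
by_id, children = index_spans(spans)
for s in spans:
    print(s.guid, span_role(s, by_id, children), client_type(s, children))
print("duration:", trace_duration_ms(spans))
```

In this sample trace, span a is the trace root and the entry span of the web process, b is a datastore exit span, c is an external exit span because its child d runs in another process, d is the entry span of the api process, and e is an in-process span. The sketch checks the entry rule first, so a span that is both first in its process and a caller of another process is reported as an entry span.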
How trace data is stored
Understanding how New Relic stores trace data can help you query your trace data.
New Relic saves trace data as:
- Events. Event data associated with traces includes:
- Span event: A span represents an operation that is part of a distributed trace. The operations that a span can represent include datastore queries, calls to other services, and method-level timing. For example, in an HTTP service, a span is created at the beginning of an HTTP request and completed when the HTTP server returns a response. Span attributes contain important information about the operation that the span represents (such as duration and host data), including trace state details (such as traceId and guid). For span-related data, see span attributes.
- Transaction event: If an entity in a trace is monitored by a New Relic agent, a request to that entity generates a single Transaction event. Transactions allow trace data to be tied to other New Relic features. For transaction-related data, see transaction attributes.
- Contextual metadata. In addition to events, New Relic stores metadata that shows calculations about a trace and the relationships between its spans. To query this data, use the NerdGraph GraphiQL explorer at https://api.newrelic.com/graphiql.
How trace context is passed between applications
When distributed tracing is enabled, the New Relic agent adds an HTTP or messaging header named newrelic to a service's outbound requests. The header contains information that helps us link the spans together later: metadata such as the trace ID, the span ID, the New Relic account ID, and sampling information.
When that service calls other New Relic-monitored services that also have distributed tracing enabled, each of them propagates the header in turn, until the request has completed.
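The propagation step can be sketched as a pair of inject and extract helpers. Only the header name newrelic and the kinds of data it carries come from the text above. The payload layout (base64-encoded JSON), the field names, and the functions `inject_trace_context` and `extract_trace_context` are hypothetical and chosen for illustration; they are not the agent's actual wire format or API.

```python
import base64
import json

HEADER_NAME = "newrelic"  # header name taken from the text above


def inject_trace_context(headers, trace_id, span_id, account_id, sampled):
    """Return a copy of the outbound headers with trace context attached.

    The JSON-in-base64 encoding is an assumption made for this sketch.
    """
    payload = {
        "trace_id": trace_id,
        "span_id": span_id,
        "account_id": account_id,
        "sampled": sampled,
    }
    encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    outbound = dict(headers)
    outbound[HEADER_NAME] = encoded
    return outbound


def extract_trace_context(headers):
    """Return the trace context from inbound headers, or None if absent."""
    raw = headers.get(HEADER_NAME)
    if raw is None:
        # For example, an intermediary dropped the header.
        return None
    return json.loads(base64.b64decode(raw))


# Upstream service attaches context to its outbound request.
outbound = inject_trace_context({"Accept": "*/*"}, "trace-1", "span-1", "12345", True)

# Downstream service reads it and would create its spans under the same trace,
# using the received span ID as the parent and honoring the sampling decision.
context = extract_trace_context(outbound)
print(context["trace_id"], context["span_id"], context["sampled"])
```

When the header is missing on the receiving side, `extract_trace_context` returns None, which corresponds to the case where the downstream service cannot link its spans to the upstream trace.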
Some services may communicate through proxies or other intermediaries that don't automatically propagate the trace header. Those intermediaries must be configured to pass the header through. For more on that and other reasons a trace may have missing data, see the troubleshooting documentation.
Distributed tracing offerings from all vendors (including New Relic) currently use proprietary solutions for propagating trace context. We are involved in the W3C Distributed Tracing working group, which is working toward a standard header name and content. This industry standard will allow distributed tracing to work consistently across vendor implementations. We will adopt this standard once it is approved.
Sampling and limits
Distributed tracing requires the reporting and processing of a large amount of data. For this reason, we have limits on data reporting and use sampling to capture a representative sample of activity. The sampled data ideally represents the characteristics of the larger data population.
Some details on limits and sampling:
- Adaptive sampling process
New Relic APM agents use adaptive sampling to capture a representative sample of system activity. The following is an explanation of how adaptive sampling works. (AWS Lambda monitoring uses a different sampling process.)
For the first service in a distributed trace, 10 requests per minute are chosen to be sampled by default. The throughput to that service is used to adjust how frequently requests are sampled. This is explained in more detail below.
The first New Relic-monitored service in a distributed trace is called the trace origin. The trace origin chooses requests at random to be traced. That decision propagates to the downstream services touched by that request. When the request has completed, all of the spans created by the services touched by that request are available in the New Relic UI as a complete end-to-end trace.
APM agents have a limit on the number of transactions collected per minute (this varies by agent) and a limit on the number of spans collected per minute (1,000 per agent instance). To adhere to these limits, the default number of traces at the trace origin is 10 traces per minute.
An APM agent spreads out the collection of these 10 traces over a minute in order to get a representative sample over time. The exact sampling rate depends on the number of transactions in the previous minute. The rate responds to changes in transaction throughput, going up or down.
For example: if the previous minute had 100 transactions, the agent would anticipate a similar number of transactions and select 1 out of every 10 transactions to be traced.
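The arithmetic in this example can be written out as a short sketch. It assumes the sampling rate is simply the 10-trace target divided by the previous minute's transaction count. The agents' actual algorithm is not specified here, and the function names are hypothetical.

```python
import random

TARGET_TRACES_PER_MINUTE = 10  # default stated in the text above


def sampling_probability(previous_minute_transactions,
                         target=TARGET_TRACES_PER_MINUTE):
    """Estimate the fraction of requests to trace in the coming minute."""
    # With throughput at or below the target, every request can be traced.
    if previous_minute_transactions <= target:
        return 1.0
    return target / previous_minute_transactions


def should_sample(probability, rng=random.random):
    """Make a random per-request decision at the trace origin."""
    return rng() < probability


print(sampling_probability(100))   # 100 transactions in the previous minute
print(sampling_probability(1000))  # higher throughput lowers the rate
```

With 100 transactions in the previous minute the probability is 0.1, matching the 1-in-10 selection above. If throughput rises to 1,000 transactions, the same target yields 0.01, which is how the rate goes up or down with throughput.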
- Agent span limits and sampling
A New Relic agent instance has a limit of 1,000 spans per minute. The agent attempts to keep all spans that are marked to be sampled as part of a distributed trace.
In many distributed systems, the average microservice may generate 10 to 20 spans per request. In those cases, the agent span limit can accommodate all spans chosen, and that service will have full detail in a trace.
However, some requests to services will generate many spans, and the agent span limit will be reached. As a result, some traces will not have full detail for that service. One solution is to use custom instrumentation so that the agent reports less activity and therefore fewer spans.
- Trace rate limiting
If the above sampling methods still result in too much trace data, New Relic may limit incoming data by sampling traces after they're received. Because this decision is made at the trace level, it avoids fragmenting traces (accepting only part of a trace).
This process works similarly to adaptive sampling. The spans received in a minute are totaled. If too many spans are received, fewer spans may be accepted in the following minute, in order to achieve a "floating average" throughput rate.
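One way to picture a trace-level decision driven by the previous minute's span count is the sketch below. It is an assumption-laden illustration, not New Relic's server-side logic: the span limit value, the ratio formula, and the hash-based decision are all hypothetical. Hashing the trace ID is used here only because it gives every span in a trace the same answer, which is the property the text describes.

```python
import hashlib


def acceptance_ratio(spans_last_minute, span_limit_per_minute):
    """Fraction of traces to accept in the following minute."""
    if spans_last_minute <= span_limit_per_minute:
        return 1.0
    return span_limit_per_minute / spans_last_minute


def accept_trace(trace_id, ratio):
    """Decide per trace, so a trace is kept or dropped as a whole."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the trace ID to a stable value in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < ratio


# Hypothetical numbers: twice the allowed span volume arrived last minute.
ratio = acceptance_ratio(spans_last_minute=2000, span_limit_per_minute=1000)
print(ratio)

# Every span carrying the same trace ID gets the same decision.
print(accept_trace("trace-1", ratio) == accept_trace("trace-1", ratio))
```

Because the decision depends only on the trace ID and the current ratio, spans from the same trace that arrive at different times within the minute are all accepted or all rejected.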