Technical distributed tracing details

Here are some technical details about how New Relic distributed tracing works:

How trace sampling works
How we structure trace data
How we store trace data
How trace context is passed between applications

Trace sampling

How your traces are sampled will depend on your setup and the New Relic tracing tool you're using. For example, you may be using a third-party telemetry service (like OpenTelemetry) to implement sampling of traces before your data gets to us. Or, if you're using Infinite Tracing, you'd probably send us all your trace data and rely on our sampling.

We have a few sampling strategies available:

Head-based sampling (standard distributed tracing)
Tail-based sampling (Infinite Tracing)
No sampling

Head-based sampling (standard distributed tracing)

With the exception of our Infinite Tracing feature, most of our tracing tools use a head-based sampling approach. This applies filters to individual spans before all spans in a trace arrive, which means decisions about whether to accept spans are made at the beginning (the "head") of the filtering process. We use this sampling strategy to capture a representative sample of activity while avoiding storage and performance issues.

Once the first span in a trace arrives, a session is opened and maintained for 90 seconds. With each subsequent arrival of a new span for that trace, the expiration time is reset to 90 seconds. Traces that have not received a span within the last 90 seconds will automatically close. The trace summary and span data are only written when a trace is closed.
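The session behavior described above can be sketched as a small idle-timeout store. This is a minimal illustration, not New Relic's implementation: the names `TraceSession`, `SessionStore`, `add_span`, and `close_expired` are hypothetical, and time is passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass, field

SESSION_TTL_SECONDS = 90  # idle window described in the text


@dataclass
class TraceSession:
    trace_id: str
    expires_at: float
    spans: list = field(default_factory=list)


class SessionStore:
    """Hypothetical store that keeps a trace open until it has been idle for `ttl` seconds."""

    def __init__(self, ttl=SESSION_TTL_SECONDS):
        self.ttl = ttl
        self.open = {}  # trace_id -> TraceSession

    def add_span(self, trace_id, span, now):
        # The first span opens the session; every span resets the expiration time.
        session = self.open.get(trace_id)
        if session is None:
            session = TraceSession(trace_id, now + self.ttl)
            self.open[trace_id] = session
        session.spans.append(span)
        session.expires_at = now + self.ttl

    def close_expired(self, now):
        # Sessions with no span inside the idle window are closed and returned,
        # which is the point where a trace would be written.
        closed = [s for s in self.open.values() if s.expires_at <= now]
        for session in closed:
            del self.open[session.trace_id]
        return closed
```

In this sketch, a span arriving at second 60 pushes the expiration from second 90 to second 150. The same pattern applies with a different `ttl` value wherever a shorter idle window is used.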

Here are some details about how head-based sampling is implemented in our standard distributed tracing tools:

Our language agents use adaptive sampling to capture a representative sample of system activity. The following is an explanation of how adaptive sampling works.

The throughput to the first service in a trace is used to adjust how frequently requests are sampled. This is explained in more detail below, and you can also consult the documentation for your APM agent.

The first service we monitor in a distributed trace is called the trace origin. The trace origin chooses requests at random to be traced. That decision propagates to the downstream services touched by that request. When the request has completed, the spans generated by these requests are reported to New Relic and made available in the UI as a complete trace (though agent span limits described below may result in fragmented traces).

The trace origin service samples 10 traces per minute. It attempts to spread out the collection of these 10 traces over a minute in order to get a representative sample over time. The exact sampling rate depends on the number of transactions in the previous minute, and adapts to changes in transaction throughput.

For example, if there were 100 transactions in the previous minute, the agent would anticipate a similar number of transactions and select 1 out of every 10 transactions to be sampled over the next minute.
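The arithmetic in this example can be written out as a short sketch. It is a simplified model under stated assumptions: the function names are hypothetical, the target of 10 traces per minute comes from the text above, and real agents may use a more elaborate algorithm than a single ratio.

```python
import random


def sampling_probability(previous_minute_count, target_per_minute=10):
    """Probability of sampling a request, based on last minute's throughput."""
    if previous_minute_count <= target_per_minute:
        # Low throughput: every request can be sampled without exceeding the target.
        return 1.0
    return target_per_minute / previous_minute_count


def should_sample(previous_minute_count, target_per_minute=10, rng=random.random):
    """Random decision made at the trace origin; `rng` is injectable for testing."""
    return rng() < sampling_probability(previous_minute_count, target_per_minute)
```

With 100 transactions in the previous minute, `sampling_probability(100)` is 0.1, which matches the "1 out of every 10" figure above.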

APM agents have a limit on the number of spans collected per minute, with a default limit of 2000 spans collected per minute per agent instance (for how to adjust this, see APM agent configuration documentation). If an agent generates more spans than its configured limit in a minute, some of the spans will be dropped, resulting in fragmented traces in the UI. Traces are assigned a random priority when they are selected for sampling, so if multiple agents need to drop spans they can attempt to keep higher-priority traces intact by first dropping spans from lower-priority traces.
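One way to picture priority-based dropping is the sketch below. It is an assumption-laden illustration rather than the agents' actual mechanism: the function name `enforce_span_limit` and the dictionary fields `trace_id` and `priority` are hypothetical, and the whole minute's spans are assumed to be available at once.

```python
def enforce_span_limit(spans, limit=2000):
    """Keep at most `limit` spans, preferring spans from higher-priority traces.

    Each span is a dict with hypothetical keys "trace_id" and "priority".
    All spans in one trace share the same priority, so sorting by priority
    tends to keep whole traces together and drop low-priority traces first.
    """
    if len(spans) <= limit:
        return list(spans)
    # sorted() is stable, so spans with equal priority keep their arrival order.
    ranked = sorted(spans, key=lambda span: span["priority"], reverse=True)
    return ranked[:limit]
```

Because every agent in a trace sees the same priority value, agents that apply a rule like this independently would tend to keep the same traces intact.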

Tail-based sampling (Infinite Tracing)

Our Infinite Tracing feature uses a tail-based sampling approach. "Tail-based sampling" means that trace-retention decisions are made at the tail end of processing, after all the spans in a trace have arrived.

With Infinite Tracing, you can send us 100% of your trace data from your application or third-party telemetry service, and Infinite Tracing will figure out which trace data is most important. You can also configure the sampling to ensure that the traces important to you are retained.

Important

Because Infinite Tracing can collect and forward more trace data from your application or third-party telemetry service, you may find your egress costs increase as a result. We recommend that you keep an eye on those costs as you roll out Infinite Tracing to ensure this solution is right for you.

For Infinite Tracing, agents or integrations send 100% of instrumented spans to a trace observer. The trace observer is a distributed tracing service residing in a cluster of services on AWS called New Relic Edge.

Tip

Only your spans go to the trace observer. All other data, such as metrics, custom events, and transaction traces, is sent to New Relic by the normal route and is subject to local sampling.

You configure a unique trace observer endpoint for the AWS region you want to send data to. Because tracing is a cross-account feature, our default implementation is to only allow one trace observer per region, per account family (to request more, talk to your account representative). The endpoint represents a trace observer for a particular workload. For example, all spans from a single trace (request) must go to that endpoint.

Here are two architectural diagrams: one showing how data flows if you use agents and another if you use New Relic integrations like OpenTelemetry exporters:

The trace observer holds traces open while spans for that trace arrive. Once the first span in a trace arrives, a session is kept open for 10 seconds. Each time a new span for that trace arrives, the expiration time is reset to 10 seconds. Traces that haven't seen a span arrive within the last 10 seconds will automatically expire.

By default, each trace observer offers traces to three samplers: one looking for duration outliers, one looking for traces with errors, and one trying to randomly sample across all trace types. Each sampler keeps a target percentage of traces that match their criteria.

Here are details about each sampler:

Sampler	Matching criteria	Target percent
Duration	Traces with an outlier duration, using two algorithms: Gaussian (assumes a normal distribution and a threshold at the 99th percentile) and Eccentricity (assumes no distribution and a threshold based on cluster)	100%
Error	Traces having at least one span with an error	100%
Random	All traces	1% (This is configurable. See Infinite Tracing: Random trace filter)

If a trace matches a sampler's criteria, the sampler looks at the trace’s shape. A trace’s shape is the unique combination of the root span’s entity name and span name. This is a simple way to separate traces using the entry point of the request.

Once the shape is determined, the sampler makes a decision to keep or reject the trace based on its target sampling percent. If it’s 100%, the trace is automatically kept. If it’s anything less, the probability the sampler keeps a given trace is determined by the target percent. For example, the default target percent is 1 for random traces, so 1% of those traces are kept. If you prefer, you can change the random filter percentage.

Because the trace observer uses percentages of throughput, the number of traces selected will vary with that throughput.
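The keep-or-reject decision can be sketched as follows. This is a simplified model, not the trace observer's code: the names `trace_shape`, `keep_trace`, `SAMPLERS`, and `decide` are hypothetical, the trace is a plain dict, and outlier detection is assumed to have happened already (the flag `is_duration_outlier` stands in for the Gaussian and Eccentricity algorithms).

```python
import random


def trace_shape(root_entity_name, root_span_name):
    """A trace's shape: the combination of the root span's entity name and span name."""
    return (root_entity_name, root_span_name)


def keep_trace(target_percent, rng=random.random):
    """Keep everything at 100%; otherwise keep with probability target_percent / 100."""
    if target_percent >= 100:
        return True
    return rng() * 100 < target_percent


# (sampler name, matching criteria, target percent) using the default targets in the table.
SAMPLERS = [
    ("duration", lambda trace: trace["is_duration_outlier"], 100.0),
    ("error", lambda trace: trace["has_error"], 100.0),
    ("random", lambda trace: True, 1.0),
]


def decide(trace, rng=random.random):
    """Return the name of the first sampler that keeps the trace, or None if all reject it."""
    for name, matches, target_percent in SAMPLERS:
        if matches(trace) and keep_trace(target_percent, rng):
            return name
    return None
```

Since the random sampler keeps a percentage rather than a fixed count, the number of traces it retains in this sketch rises and falls with throughput, as described above.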

No sampling

Some of our tools don't use sampling. Sampling details for these tools:

Browser monitoring distributed tracing and mobile monitoring report all spans.

Our language agents are often used in conjunction with browser monitoring and mobile monitoring, and our language agents use sampling. This means that there will likely be many more browser and mobile spans than backend spans, which can result in browser and mobile app spans disconnected from backend spans. For tips on querying for traces that contain front and backend spans, see Find browser span data.

Trace limits

Our data processing systems include internal limits to protect our infrastructure from unexpected data surges. These limits apply to the Trace API, agent spans, browser spans, mobile spans, and Lambda spans. This protective layer maintains the integrity of our platform and ensures a dependable and consistent experience for all our customers. We adjust these limits as needed based on various conditions, and we set them with a forward-looking approach: as our users and data grow, we expand our infrastructure capacity and raise the limits. To check if you are hitting these limits, you can refer to the Limits UI.

How trace data is structured

Understanding the structure of a distributed trace can help you:

Understand how traces are displayed in our UI
Query trace data

A distributed trace has a tree-like structure, with "child" spans that refer to one "parent" span. This diagram shows some important span relationships in a trace:

This diagram shows how spans in a distributed trace relate to each other.

This diagram shows several important concepts:

Trace root. The first service or process in a trace is referred to as the root service or process.
Process boundaries. A process represents the execution of a logical piece of code. Examples of a process include a backend service or Lambda function. Spans within a process are categorized as one of the following:
- Entry span: the first span in a process.
- Exit span: a span is considered an exit span if it a) is the parent of an entry span, or b) has http. or db. attributes and therefore represents an external call.
- In-process span: a span that represents an internal method call or function and that is not an exit or entry span.
Client spans. A client span represents a call to another entity or external dependency. Currently, there are two client span types:
- Datastore. If a client span has attributes prefixed with db. (like db.statement), it's categorized as a datastore span.
- External. If a client span has attributes prefixed with http. (like http.url) or has a child span in another process, it's categorized as an external span. This is a general category for any external calls that aren't datastore queries. If an external span contains http.url or net.peer.name, it's indexed on the External services page.
Trace duration. A trace's total duration is determined by the length of time from the start of the earliest span to the completion of the last span.
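The client span categories and the trace duration rule can be expressed in a few lines. This is a sketch under assumptions: the function names and the span fields `start` and `duration` are hypothetical, and checking db. attributes before http. attributes is an arbitrary choice, since the text does not say which wins when a span has both.

```python
def client_span_type(attributes, has_child_in_other_process=False):
    """Categorize a client span from its attribute names.

    Returns "datastore", "external", or None if the span is not a client span.
    """
    if any(key.startswith("db.") for key in attributes):
        return "datastore"
    if any(key.startswith("http.") for key in attributes) or has_child_in_other_process:
        return "external"
    return None


def trace_duration(spans):
    """Time from the start of the earliest span to the completion of the last span.

    Each span is a dict with hypothetical keys "start" and "duration" in the same unit.
    """
    earliest_start = min(span["start"] for span in spans)
    latest_end = max(span["start"] + span["duration"] for span in spans)
    return latest_end - earliest_start
```

For example, two spans starting at 0 and 2 with durations 5 and 10 give a trace duration of 12, because the second span finishes last.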

You can query span relationship data with the NerdGraph GraphiQL explorer at api.newrelic.com/graphiql.

How trace data is stored

Understanding how we store trace data can help you query your trace data.

We save trace data as:

Span: A span represents operations that are part of a distributed trace. The operations that a span can represent include browser-side interactions, datastore queries, calls to other services, method-level timing, and Lambda functions. One example: in an HTTP service, a span is created at the start of an HTTP request and completed when the HTTP server returns a response. Span attributes contain important information about that operation (such as duration, host data, etc.), including trace-relationship details (such as traceId, guid). For span-related data, see span attributes.
Transaction: If an entity in a trace is monitored by an agent, a request to that entity generates a single Transaction event. Transactions allow trace data to be tied to other New Relic features. For transaction-related data, see transaction attributes.
Contextual metadata. We store metadata that shows calculations about a trace and the relationships between its spans. To query this data, use the NerdGraph GraphiQL explorer.

How trace context is passed between applications

We support the W3C Trace Context standard, which makes it easier to trace transactions across networks and services. When you enable distributed tracing, New Relic agents add HTTP headers to a service's outbound requests. HTTP headers act like passports on an international trip: They identify your software traces and carry important information as they travel through various networks, processes, and security systems.

The headers also contain information that helps us link the spans together later: metadata like the trace ID, span ID, the New Relic account ID, and sampling information. See the table below for more details on the header:

Item	Description
`accountId`	This is your New Relic account ID. However, only those on your account and New Relic Admins can associate this ID with your account information in any way.
`appId`	This is the application ID of the application generating the trace header. Much like `accountId`, this identifier is not going to provide any information unless you're a user on the account.
`guid`	With Distributed Tracing, each segment of work in a trace is represented by a `span`, and each span has a `guid` attribute. The `guid` of the last span within the process is sent with the outgoing request so that the first segment of work in the receiving service can add this `guid` as the `parentId` attribute which connects data within the trace.
Parent type	The source of the trace header, as in mobile, browser, Ruby app, etc. This becomes the `parent.type` attribute on the transaction triggered by the request this header is attached to.
Priority	A randomly generated priority ranking value that helps determine which data is sampled when sampling limits are reached. This is a float value set by the first New Relic agent that’s part of the request so all data in the trace will have the same priority value.
Sampled	A boolean value that tells the agent if traced data should be collected for the request. This is also added as an attribute on any span and transaction data collected.
Timestamp	Unix timestamp in milliseconds when the payload was created.
`traceId`	The unique ID (a randomly generated string) used to identify a single request as it crosses inter- and intra-process boundaries. This ID allows the linking of spans in a distributed trace. It is also added as an attribute on the span and transaction data.
`transactionId`	The unique identifier for the transaction event.
Trusted account key	This is a key that helps identify any other accounts associated with your account. If a trace crosses multiple sub-accounts, the key lets us confirm that any data included in the trace comes from a trusted source, and it tells us which users should have access to the data.
Version and data key	This identifies major/minor versions, so if an agent receives a trace header from a version with breaking changes from the one it is on, it can reject that header and report the rejection and reason.
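To make the role of `guid` concrete, here is a minimal sketch of how a receiving service might use the header fields. It models the table's items as a plain data class and does not represent the header's wire format; `TraceHeader`, `start_entry_span`, and the snake_case field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceHeader:
    """Hypothetical in-memory view of the items listed in the table above."""
    account_id: str
    app_id: str
    guid: str            # guid of the last span in the calling process
    parent_type: str
    priority: float
    sampled: bool
    timestamp_ms: int
    trace_id: str
    transaction_id: str
    trusted_account_key: str


def start_entry_span(header, new_guid):
    """Build the first span of the receiving process from an inbound header."""
    return {
        "guid": new_guid,
        "parentId": header.guid,      # links this span to the caller's span
        "traceId": header.trace_id,   # the same trace ID is carried across processes
        "priority": header.priority,  # set once by the first agent, reused downstream
        "sampled": header.sampled,
    }
```

The point of the sketch is the `parentId` assignment: the caller's `guid` becomes the parent reference that connects spans across the process boundary.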

This header information is passed along each span of a trace, unless the progress is stopped by something like middleware or agents that don't recognize the header format (see Figure 1).

Diagram of a failed trace with proprietary headers.

Figure 1

To address the problem of header propagation, we support the W3C Trace Context specification that requires two standardized headers. Our latest W3C New Relic agents send and receive these two required headers, and by default, they also send and receive the header of the prior New Relic agent:

W3C (traceparent): The primary header that identifies the entire trace (trace ID) and the calling service (span ID).
W3C (tracestate): A required header that carries vendor-specific information and tracks where a trace has been.
New Relic (newrelic): The original, proprietary header that is still sent to maintain backward compatibility with prior New Relic agents.

This combination of three headers allows traces to be propagated across services instrumented with these types of agents:

W3C New Relic agents
Non-W3C New Relic agents
W3C Trace Context-compatible agents
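As an illustration of what the traceparent header carries, the sketch below parses a version 00 value as defined by the W3C Trace Context specification: four hyphen-separated lowercase hex fields for version, trace ID, parent (span) ID, and flags. The function name `parse_traceparent` is hypothetical, and the sketch only handles the version 00 layout rather than any future versions.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Parse a W3C traceparent header value; return None if it is invalid."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The specification forbids version "ff" and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }
```

A value such as `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` parses into a trace ID, the caller's span ID, and a sampled flag of true.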

Important

If your requests only touch W3C Trace Context-compatible agents, you can opt to turn off the New Relic header. See the agent configuration documentation for details about turning off the newrelic header.

The scenarios below show various types of successful header propagation.

Trace sampling