• /
  • Log in
  • Free account

Analyze distributed systems

In a monolithic application, a simple stack trace can contain enough diagnostic data to determine the root cause of a code defect. But cloud computing and microservices have blurred the lines between software and infrastructure; in modern architectures, requests are distributed across many smaller services—often with ephemeral lifespans—hosted in both on-premise and cloud environments. Spotting code defects becomes much more complex. APM's distributed tracing automatically helps teams troubleshoot such distributed systems.

Distributed traces are just one component of a well-monitored system. You need a holistic view of your distributed system, especially when tracking the root cause of a defect, as there are volumes of data to evaluate and understand. When managing a microservices environment, it's critical that you have the capability to spot bottlenecks and problem spans quickly so that you don't compromise your mean-time-to-resolution (MTTR) or end-user experience.

At New Relic, we understand these challenges inherently. In our journey, we've transitioned from a Ruby monolith to a multi-language distributed environment built on more than 300 microservices, for which we average 50 code deploys a day. Such challenges inform both how we've built and how we monitor New Relic.

Use the New Relic platform to translate your data into relevant insights, so you can collaborate around a common framework to build context and quickly optimize and troubleshoot your complex, distributed systems.

1. Identify high-priority areas to monitor

Distributed systems are complex. Before engaging with the New Relic platform, we recommend that you identify the most critical areas of your systems to observe, and focus on instrumenting those high-priority areas first. Google's SRE handbook suggests monitoring the "four golden signals": traffic, latency, errors, and saturation, as shown in the following dashboard:


Dashboards: Visualize key areas to monitor with dashboards.

Too often, teams monitor what is either 1) easy to measure or 2) interpretable. Avoid this fallacy. When making choices about what to monitor, involve product managers and other stakeholders from your organization. Your goal is to monitor what matters to your business, not to overload your teams with noise.

2. Instrument to get the visibility you require

Once you've identified your key priorities, instrument the appropriate parts of your system with the New Relic platform.

3. Create dashboards

Use dashboards to get an overview of your entire system and baseline performance, so you can better understand how the components work together. Insights provides a single framework for aligning disparate teams around relevant data. Approach this first build as your launching point, giving your teams something to react to as you begin to ascribe context to your system. You should continue to build and iterate your dashboards as you analyze, troubleshoot, and optimize your distributed system.

Even after you have a basic understanding of your system, use dashboards to inform the decisions you make going forward. Dashboards provide central repository of truth, allowing all stakeholders to build context about the health of your system. Here is an example of a query that displays the slowest applications.

SELECT percentile(duration, 99) as 'Slowest duration' FROM Transaction FACET name

Using this query you can focus on the slowest application to dig deeper:


After running the query, dashboards displays the slowest durations.

4. Dig deeper with distributed tracing

After you have basic instrumentation and contextual dashboards in place, you can begin to dig deeper to troubleshoot or optimize your system. A differentiated feature in New Relic One's Global Distributed Tracing is that you can come in with vague context for what you're investigating and search across all accounts and traces to get down to the actual user interactions that have those attributes somewhere in their trace. You no longer must begin with the application or specific entity to search for traces that include that application.

For example, start with an analysis of your complex service calls, using the distributed tracing UI to:

  • View a scatter plot chart showing the frequency, duration, and other facets of your distributed traces
  • Group traces by root entry, service, service entry, or traces with errors
  • View a trace list
  • Filter specific traces to meet certain parameters

You'll likely discover that you want to take the analysis one step further by annotating your traces with information that adds context to your troubleshooting, like User ID. You can do this in New Relic using custom attributes.


one.newrelic.com > APM > Distributed tracing: use the distributed tracing UI to monitor and analyze modern distributed systems.

5. Annotate message queues with distributed tracing payload APIs

To see connections between services in some environments—for example, in a system that relies heavily on queues—you may need to do some manual instrumentation using the distributed tracing payload APIs to ensure you're propagating the payload. This gives agents the necessary context to create spans with the right correlation; you'll see end-to-end traces for all linked services, including those that cross through the queue.

6. Annotate and tag traces with custom attributes

We recommend that you use custom attributes to decorate events with additional information for better tracing. For example, by adding key-value pairs, you can attach a user ID to trace a specific user through the call stack and review failing requests to determine if that user is having an unusually poor experience.

We recommend adding custom attributes based on your use case; for example, if your instrumenting an order management system, you could add an order number custom attribute to your traces.

To add custom attributes, you must first enable them for your agent, and then make an API call to record the attribute.


For more agent-specific information on collecting custom attributes, see Collect custom attributes

sqs.sendMessage(params, function(err, data) {
if (err) {
res.send("Error: "+ err);
} else {
res.send("Success! Message ID: "+ data.MessageId);
newrelic.addCustomAttribute("Message ID", data.MessageId)

7. Leverage Synthetics to get a high-level view of system health

In complex, distributed systems, you need to track and monitor many signals. Sometimes it may be that no one signal is concerning, yet your whole system is behaving anomalously. To get a complete picture, it's critical to analyze symptomatic data in tandem with system-level data. Synthetics allows you to interact with the entire system as an external user would, giving your teams high-level checks for performance and user experience. These external checks help you understand if the entire system is doing what you want regardless of what specific signals may indicate.

For more help

If you need more help, check out these support and learning resources:

Create issueEdit page
Copyright © 2021 New Relic Inc.