Analyze distributed systems

In a monolithic application, a simple stack trace can contain enough diagnostic data to determine the root cause of a code defect. But cloud computing and microservices have blurred the lines between software and infrastructure. In modern architectures, requests are distributed across many smaller services, often with ephemeral lifespans, hosted in both on-premises and cloud environments. Spotting code defects becomes much more complex. New Relic APM's distributed tracing helps teams troubleshoot such distributed systems.

Distributed traces are just one component of a well-monitored system. You need a holistic view of your distributed system, especially when tracking the root cause of a defect, as there are volumes of data to evaluate and understand. When managing a microservices environment, it's critical that you have the capability to spot bottlenecks and problem spans quickly so that you don't compromise your mean-time-to-resolution (MTTR) or end-user experience.

At New Relic, we understand these challenges inherently. In our journey, we've transitioned from a Ruby monolith to a multi-language distributed environment built on more than 300 microservices, for which we average 50 code deploys a day. Such challenges inform both how we've built and how we monitor New Relic.

Use the New Relic platform to translate your data into relevant insights, so you can collaborate around a common framework to build context and quickly optimize and troubleshoot your complex, distributed systems.

1. Identify high-priority areas to monitor

Distributed systems are complex. Before engaging with the New Relic platform, we recommend that you identify the most critical areas of your systems to observe, and focus on instrumenting those high-priority areas first. Google's SRE handbook suggests monitoring the "four golden signals": traffic, latency, errors, and saturation, as shown in the following dashboard:

This image shows an Insights dashboard of KPIs.
Insights: Visualize key areas to monitor with New Relic Insights.
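
As a rough illustration of what the four golden signals measure, the following sketch derives them from a handful of request records. The record shape, the `goldenSignals` helper, and the sample values are all hypothetical; in practice your agents and dashboards collect and aggregate this data for you.

```javascript
// Hypothetical sample data: one record per request observed in the window.
const requests = [
  { durationMs: 120, status: 200 },
  { durationMs: 95, status: 200 },
  { durationMs: 480, status: 500 },
  { durationMs: 150, status: 200 },
];
const windowSeconds = 2;
// Hypothetical CPU utilization samples (0 to 1) taken during the same window.
const utilizationSamples = [0.5, 0.7, 0.9];

function goldenSignals(requests, windowSeconds, utilizationSamples) {
  const count = requests.length;
  const totalMs = requests.reduce((sum, r) => sum + r.durationMs, 0);
  const errors = requests.filter((r) => r.status >= 500).length;
  return {
    // Traffic: how much demand the service is handling.
    trafficPerSecond: count / windowSeconds,
    // Latency: how long requests take (a plain average here).
    averageLatencyMs: count > 0 ? totalMs / count : 0,
    // Errors: the share of requests that failed.
    errorRate: count > 0 ? errors / count : 0,
    // Saturation: how "full" the most constrained resource is.
    saturation: utilizationSamples.length > 0 ? Math.max(...utilizationSamples) : 0,
  };
}

const signals = goldenSignals(requests, windowSeconds, utilizationSamples);
console.log(signals);
```

An average is used for latency only to keep the sketch short; percentiles usually describe user-facing latency better.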

Too often, teams monitor what is either 1) easy to measure or 2) interpretable. Avoid this fallacy. When making choices about what to monitor, involve product managers and other stakeholders from your organization. Your goal is to monitor what matters to your business, not to overload your teams with noise.

2. Instrument to get the visibility you require

Once you've identified your key priorities, instrument the appropriate parts of your system with the New Relic platform.

3. Create Insights dashboards

Use New Relic Insights dashboards to get an overview of your entire system and baseline performance, so you can better understand how the components work together. Insights provides a single framework for aligning disparate teams around relevant data. Approach this first build as your launching point, giving your teams something to react to as you begin to ascribe context to your system. You should continue to build and iterate your dashboards as you analyze, troubleshoot, and optimize your distributed system.

Even after you have a basic understanding of your system, use Insights dashboards to inform the decisions you make going forward. Insights dashboards provide a central repository of truth, allowing all stakeholders to build context about the health of your system. Here is an example of an Insights query that displays the slowest applications.

SELECT percentile(duration, 99) as 'Slowest duration' FROM Transaction FACET name

Using this query you can focus on the slowest application to dig deeper:

This image shows slow durations of controllers in Insights.
Insights: After running the query, Insights displays the slowest durations.
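
To make the query's logic concrete, here is a minimal sketch of the same idea in plain JavaScript: group durations by name, take the 99th percentile of each group, and rank the groups from slowest to fastest. The `percentile` and `slowestByName` helpers and the sample records are hypothetical, and the nearest-rank method used here is only one way to compute a percentile, so New Relic's results may differ slightly.

```javascript
// Nearest-rank percentile over an ascending list of numbers.
function percentile(sortedValues, p) {
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Group durations by name (like FACET name) and rank by the p-th percentile.
function slowestByName(transactions, p) {
  const groups = new Map();
  for (const { name, duration } of transactions) {
    if (!groups.has(name)) {
      groups.set(name, []);
    }
    groups.get(name).push(duration);
  }
  return [...groups.entries()]
    .map(([name, durations]) => ({
      name,
      value: percentile(durations.sort((a, b) => a - b), p),
    }))
    .sort((a, b) => b.value - a.value);
}

// Hypothetical sample records (duration in seconds).
const sample = [
  { name: "Controller/orders/show", duration: 0.21 },
  { name: "Controller/orders/show", duration: 1.9 },
  { name: "Controller/users/index", duration: 0.05 },
  { name: "Controller/users/index", duration: 0.08 },
];

const ranked = slowestByName(sample, 99);
console.log(ranked);
```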

4. Dig deeper with distributed tracing

After you have basic instrumentation and contextual dashboards in place, you can begin to dig deeper to troubleshoot or optimize your system. For example, start with an analysis of your complex service calls, using the distributed tracing UI to:

  • View a scatter plot chart showing the frequency, duration, and other facets of your distributed traces

  • Group traces by root entry, service, service entry, or traces with errors

  • View a trace list

  • Filter specific traces to meet certain parameters

You'll likely discover that you want to take the analysis one step further by annotating your traces with information that adds context to your troubleshooting, like User ID. You can do this in New Relic using custom attributes.

New Relic APM > Monitoring > Distributed tracing: use the distributed tracing UI to monitor and analyze modern distributed systems.

5. Annotate message queues with distributed tracing payload APIs

To see connections between services in some environments—for example, in a system that relies heavily on queues—you may need to do some manual instrumentation using the distributed tracing payload APIs to ensure you're propagating the payload. This gives agents the necessary context to create spans with the right correlation; you'll see end-to-end traces for all linked services, including those that cross through the queue.

Example annotated message queue

For example:

  1. Start a transaction.

  2. Create the distributed tracing payload on the producer service within a New Relic transaction.

  3. As part of the message, add the distributed tracing payload.

     // Load the agent before any other module
     var newrelic = require('newrelic');
     var AWS = require('aws-sdk');

     newrelic.setTransactionName("Send Message");
     // Create the SQS service object
     var sqs = new AWS.SQS({apiVersion: '2012-11-05'});
     // Create the distributed tracing payload within the current transaction
     var transactionHandle = newrelic.getTransaction();
     var payload = transactionHandle.createDistributedTracePayload();
     var jsonPayload = payload.text();

     // Add the jsonPayload as a MessageAttribute
     var params = {
       DelaySeconds: 10,
       MessageAttributes: {
         "Testing": {
           DataType: "String",
           StringValue: "123"
         },
         "TraceContext": {
           DataType: "String",
           StringValue: jsonPayload
         }
       },
       MessageBody: "Testing 123",
       QueueUrl: "https://sqs.us-west-2.amazonaws.com/408155283954/sqs-testing"
     };

  4. Within the context of a transaction, receive the payload on the consumer service.

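     // params must include MessageAttributeNames (for example, ["TraceContext"])
     // so that SQS returns the attribute with the message
     sqs.receiveMessage(params, function(err, data) {
       if (err) {
         console.log("Receive Error", err);
       } else if (data && data.Messages && data.Messages.length > 0) {
         // Read the payload the producer attached as a message attribute
         var attributes = data.Messages[0].MessageAttributes;
         var traceContext = attributes && attributes.TraceContext
           ? attributes.TraceContext.StringValue
           : "";
         console.log("trace context:", traceContext);
         if (traceContext) {
           // Link this transaction to the producer's trace
           var transactionHandle = newrelic.getTransaction();
           transactionHandle.acceptDistributedTracePayload(traceContext);
         }
       }
     });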
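
To see why the payload has to travel with the message, the following sketch simulates the producer and consumer with an in-memory array standing in for the queue. It does not use the New Relic agent or SQS: `createTracePayload`, `produce`, `consume`, and the ID values are hypothetical stand-ins that only illustrate how the consumer's span ends up correlated with the producer's trace.

```javascript
// In-memory stand-in for a queue such as SQS.
const queue = [];

// Hypothetical helper: serialize the context the producer passes along.
function createTracePayload(traceId, parentSpanId) {
  return JSON.stringify({ traceId, parentSpanId });
}

// Producer: attach the payload as a message attribute.
function produce(body, traceId, spanId) {
  queue.push({
    MessageBody: body,
    MessageAttributes: {
      TraceContext: {
        DataType: "String",
        StringValue: createTracePayload(traceId, spanId),
      },
    },
  });
}

// Consumer: accept the payload so the new span joins the producer's trace.
function consume() {
  const message = queue.shift();
  if (!message) {
    return null;
  }
  const attribute = message.MessageAttributes.TraceContext;
  const context =
    attribute && attribute.StringValue ? JSON.parse(attribute.StringValue) : null;
  return {
    // Without a payload, the consumer would start a new, unlinked trace.
    traceId: context ? context.traceId : "new-trace",
    parentSpanId: context ? context.parentSpanId : null,
    spanId: "consumer-span",
    body: message.MessageBody,
  };
}

produce("Testing 123", "trace-1", "producer-span");
const consumerSpan = consume();
console.log(consumerSpan);
```

Because the consumer reads the trace ID and parent span ID from the message, both services report spans under the same trace, which is what lets the trace cross the queue.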

6. Annotate and tag traces with custom attributes

We recommend that you use custom attributes to decorate events with additional information for better tracing. For example, by adding key-value pairs, you can attach a user ID to trace a specific user through the call stack and review failing requests to determine if that user is having an unusually poor experience.

We recommend adding custom attributes based on your use case; for example, if you're instrumenting an order management system, you could add an order number custom attribute to your traces.

To add custom attributes, you must first enable them for your agent, and then make an API call to record the attribute.

For more agent-specific information on collecting custom attributes, see Collect custom attributes.

 sqs.sendMessage(params, function(err, data) {
   if (err) {
     res.send("Error: "+ err);
   } else {
     res.send("Success! Message ID: "+ data.MessageId);
     // Record the message ID as a custom attribute on the current transaction
     newrelic.addCustomAttribute("Message ID", data.MessageId);
   }
 });
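
The user ID and order number cases mentioned above could be sketched as follows. The `tagOrderRequest` helper, the attribute names, and the request shape are hypothetical. The agent is passed in as a parameter so the sketch runs with a stub; in a real service you would pass the object returned by `require('newrelic')`, whose `addCustomAttribute(key, value)` call is the same one used in the example above.

```javascript
// Attach business context to the current transaction.
// `agent` is assumed to expose addCustomAttribute(key, value).
function tagOrderRequest(agent, request) {
  agent.addCustomAttribute("userId", request.userId);
  agent.addCustomAttribute("orderNumber", request.orderNumber);
}

// Stub agent that records attributes, used here only so the sketch runs
// without the New Relic agent installed.
const recorded = {};
const stubAgent = {
  addCustomAttribute(key, value) {
    recorded[key] = value;
  },
};

// Hypothetical request data.
tagOrderRequest(stubAgent, { userId: "user-42", orderNumber: "A-1001" });
console.log(recorded);
```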

7. Leverage Synthetics to get a high-level view of system health

In complex, distributed systems, you need to track and monitor many signals. Sometimes it may be that no one signal is concerning, yet your whole system is behaving anomalously. To get a complete picture, it's critical to analyze symptomatic data in tandem with system-level data. New Relic Synthetics allows you to interact with the entire system as an external user would, giving your teams high-level checks for performance and user experience. These external checks help you understand if the entire system is doing what you want regardless of what specific signals may indicate.

For more help

For more tips and best practices on distributed tracing and custom attributes, see the following: