This guide is an introduction to improving your skill at diagnosing customer-impacting issues. You'll be able to recover from application performance issues more quickly by following the procedures in this guide.
This guide is part of our series on observability maturity.
Here are some requirements and some recommendations for using this guide:
- New Relic observability coverage:
  - Required: Service level management
  - Recommended: Some experience with New Relic APM, distributed tracing, NRQL querying, and the log management UI
- Recommended: you've read these guides:
The value to business is:
- Reduce the number of business-disrupting incidents
- Reduce the time required to resolve issues (MTTR)
- Reduce the operational cost of incidents
The value to IT operations and SRE is:
- Improve time to understand and resolve
Gartner estimates that the average cost of IT downtime is $5,600 per minute. The cumulative cost of business-impacting incidents is determined by factors like time-to-know, frequency, time-to-repair, revenue impact, and the number of engineers triaging and resolving the incidents. Simply put, you want fewer business-impacting incidents, shorter incident duration, and faster diagnostics, with fewer people needed to resolve performance impacts.
Ultimately, the business goal is to maximize uptime and minimize downtime where the cost of downtime is:
Downtime minutes x Cost per minute = Downtime cost
Downtime is driven by the number of business-disrupting incidents and their duration. Cost-of-downtime includes many factors but the most directly measurable are operational cost and lost revenue.
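As a rough illustration of the formula above, the arithmetic can be sketched in Python. The function name and the incident durations are hypothetical; the per-minute figure is the Gartner estimate quoted earlier, and your own cost per minute will differ.

```python
def downtime_cost(downtime_minutes: float, cost_per_minute: float) -> float:
    """Downtime minutes x cost per minute = downtime cost."""
    if downtime_minutes < 0 or cost_per_minute < 0:
        raise ValueError("inputs must be non-negative")
    return downtime_minutes * cost_per_minute


# Hypothetical month: three incidents lasting 12, 45, and 8 minutes.
incident_minutes = [12, 45, 8]
GARTNER_COST_PER_MINUTE = 5_600  # average estimate cited in this guide

total_cost = downtime_cost(sum(incident_minutes), GARTNER_COST_PER_MINUTE)
print(f"Estimated downtime cost: ${total_cost:,.0f}")
```

Because the cost is a simple product, reducing either the number of incidents or their duration lowers the total in direct proportion.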
The business must drive a reduction in the following:
- number of business-disrupting incidents
- operational cost of incidents
The required operational outcome is to maintain product-tier service level objective compliance. You do this by diagnosing degraded service levels, communicating your diagnosis, and performing rapid resolution. But unexpected degradations and incidents always happen and you need to respond quickly and effectively.
In other guides in this series, we focus on improving time to know. In our Alert quality management guide, we focus on reactive ways to improve time to know, and in our Service level management guide we focus on proactive methods.
In the guide you're reading now, we focus on improving time to understand and time to resolve.
There are many metrics discussed and debated in the world of “incident management” and SRE theory; however, most agree that it is important to focus on a small set of key performance indicators.
The KPIs below are the most common indicators used by successful SRE and incident management practices.
How the customer perceives the performance of your product is critical to understanding how to measure urgency and priority. Understanding the customer's perspective also helps you understand how the business views the problem and which workflows are required to support the impacted capabilities. Once you understand the perception of the customer and the business, you can better understand what might be impacting the reliability of those capabilities.
Ultimately, observability from the customer perspective is the first step to becoming proactive and proficient in reliability engineering.
There are two primary experiences that impact an end-user's perception of the performance of your digital product and its capabilities. The terms below are from the customer's perspective using common customer language.
Availability is also known as connectivity, uptime, or reachability. It's also often conflated with success (non-errors).
An end-user may state that they cannot access a required capability, such as login, browse, search, view inventory. Or they may simply state that the entire service is unavailable. This is a symptom of either the inability to connect to a service or a service returning an error.
Traditionally "availability" or "uptime" was measured in a binary "UP/DOWN" methodology by measuring the ability to connect to a service. The traditional method has a critical gap in that it only measures when an entire service becomes completely unavailable. This classic measure of reliability results in significant observability gaps, difficult diagnostics, and the end-users being significantly impacted before you can react.
Availability is measured by both "the ability to reach a service", also known as "uptime", AND "the ability of the service to return the expected response," (in other words, "non-error"). New Relic's observability maturity framework distinguishes the two by input performance (connectivity) and output performance (success and latency of the response).
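To make the two-part definition concrete, here is a minimal sketch that scores a batch of requests on both connectivity and success. The `Request` record and its field names are assumptions made for illustration; they are not a New Relic data model.

```python
from dataclasses import dataclass


@dataclass
class Request:
    connected: bool  # input: did the request reach the service?
    succeeded: bool  # output: did the service return a non-error response?


def availability(requests: list[Request]) -> dict[str, float]:
    """Score connectivity alone versus connectivity plus success."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests to evaluate")
    reached = sum(r.connected for r in requests)
    ok = sum(r.connected and r.succeeded for r in requests)
    return {
        "connectivity": reached / total,  # the traditional "uptime" view
        "availability": ok / total,       # reachable AND non-error
    }


# Hypothetical sample: 95 good requests, 3 errors, 2 failed connections.
sample = (
    [Request(True, True)] * 95
    + [Request(True, False)] * 3
    + [Request(False, False)] * 2
)
print(availability(sample))
```

In this sample a connectivity-only check reports 98%, while the combined measure reports 95%, which is closer to what end users actually experienced.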
Performance is also known as latency or response time.
An end user may state that the service is too slow.
For both the IT and business leaders, the term "performance" can encompass an array of issues. In New Relic's service level management, "slowness" is measured in both the "output" and "client" categories. However, the majority of slowness problems occur due to an output issue, stemming from what are traditionally called the "backend services."
The root cause of a problem is not the same as the problem. Likewise, fixing the problem does not usually mean you've fixed the root cause of the problem. It's very important to make this distinction.
When looking for a performance issue, first attempt to find the source of the problem by asking, "What changed?" The component or behavior that changed is usually not the root cause, but it is the problem you need to resolve first. Resolving the root cause is important but usually requires a post-incident retrospective and long-term problem management.
For example, service levels for your login capability suddenly drop. You immediately find that traffic is much higher than usual. You trace the performance issue to an open TCP connection limit that's causing a much larger TCP connection queue. You resolve the issue right away by deploying a TCP limit increase and some extra server instances. You've solved the problem for the short term, but the root cause could be anything from improper capacity planning to missed communication from marketing or a failed load balancer configuration change.
This distinction is also made in ITIL/ITSM Incident management versus Problem management. Root causes are discussed in post-incident talks then resolved in longer-term problem management processes.
The first rule is to quickly establish the problem statement. There are plenty of guides on building problem statements, but a simple and effective one is best. A well-formed problem statement will do the following:
- Describe what changed for the end-user, customer, or consumer. What could the end-user or consumer do before that they cannot do now? This is the customer's perception of the problem.
- Describe the expected behavior of the product capability. This is a technical description of the desired result of a fix.
- Describe the current behavior of the product capability. This is a technical assessment of the problem.
Avoid any assumptions in your problem statement. Stick to the facts.
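One way to keep problem statements consistent is to capture the three parts in a small structure. The sketch below is illustrative only: the class, its field names, and the example wording are assumptions, not a template prescribed by this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemStatement:
    customer_change: str    # what the end user could do before but cannot do now
    expected_behavior: str  # how the product capability should behave
    current_behavior: str   # how the product capability behaves right now

    def summary(self) -> str:
        """Render the statement as three short, fact-only lines."""
        return (
            f"Customer impact: {self.customer_change}\n"
            f"Expected: {self.expected_behavior}\n"
            f"Current: {self.current_behavior}"
        )


# Hypothetical statement for a degraded login capability.
statement = ProblemStatement(
    customer_change="Users can no longer log in during peak traffic.",
    expected_behavior="Login requests complete successfully within the service level objective.",
    current_behavior="Login requests time out.",
)
print(statement.summary())
```

Note that none of the fields holds a suspected cause, which keeps assumptions out of the statement.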
The "source" is the component or code that is closest to the problem causing the symptoms.
Think of many small water hoses connected through many junctions, splitters, and valves. You're alerted that your water output service level is degraded. You trace the problem from the water output through the components until you determine which junction, split, valve, or hose is causing the problem. You discover one of the valves is shorted. That valve is the source. The short is the problem (direct cause).
This is the same concept for diagnosing complex technology stacks. If your login capability is limited (output), you must trace things back to the component that's causing that limit and fix it. It could be the API software (service boundary), the middleware services, the database, resource constraints, a third party service, or something else.
In IT there are three primary breakpoint sources: output, input, and client. Measuring these categories is covered in our service level management guide. To understand how to use them in diagnostics, keep reading.
Once you're close to the source of the problem, determine what changed. This will help you quickly determine how to immediately resolve the problem in the short term. In the example in Step 2, the change was that the valve was no longer functioning due to degraded hardware causing a short.
Examples of common changes in IT are:
- Throughput (traffic)
- Code (deployments)
- Resources (hardware allocation)
- Upstream or downstream dependency changes
- Data volume
For other common examples of performance-impacting problems, see the problem matrix below.
There are three primary performance categories that jump-start your diagnostic journey. Understanding these health data points will significantly reduce the time it takes to understand where the source of the problem is.
This requires: APM
Output performance is the ability of your internal technology stack to deliver the expected responses (output) to the end-user. This is traditionally referred to as the "back-end" services.
In the great majority of scenarios, output performance is measured simply by the speed of the response and the quality of the response (in other words, it's error free). Remember the user perspective described above. The end-user will state that the service is either slow, not working, or inaccessible.
The most common problem is the ability to respond to end-user requests in a timely and successful manner.
This is easily identified by a latency anomaly or error anomaly on the services that support the troubled product capability.
This requires: Synthetics
Input performance is simply the ability of your services to receive a request from the client. This is not the same as the client's ability to send a request.
Your output performance, your back-end services, could be exceeding expected performance levels. However, something between the client and your services is breaking the request-response lifecycle. This could be anything between the client and your services.
This requires: Browser monitoring and/or mobile monitoring
Client performance is the ability of a browser and/or mobile application to both make requests and render responses. Browser and/or mobile are easily identified as the source of the problem once both output (back-end) and input performance (synthetics) have been quickly ruled out.
Output and input performance are relatively easy to rule out (or rule in). Because of the depth of input and output diagnostics, browser and mobile diagnostics will be covered in a future advanced diagnostics guide.
The problem matrix is a cheat-sheet of common problems categorized by the three health data points.
The problem sources are arranged by how common they are, with the most common being in the top row and on the left. A more detailed breakdown is listed below. Service level management done well will help you rule out two out of three of these data points quickly.
This table is a problem matrix sorted by health data point:
| Data point | Platform capability | Common problem sources |
| --- | --- | --- |
| Output | APM, infra, logs, NPM | Application, data sources, hardware config change, infrastructure, internal networking, third party provider (AWS, GCP) |
| Input | Synthetic, logs | External routing (CDN, gateways, etc.), internal routing, things on the internet (ISP, etc.) |
| Client | Browser, mobile | Browser or mobile code |
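The rule-out order implied by the matrix can be sketched as a small triage helper. The function name and its boolean inputs are hypothetical; in practice they would come from your output and input service levels.

```python
from enum import Enum


class Category(Enum):
    OUTPUT = "output"
    INPUT = "input"
    CLIENT = "client"


# Platform capabilities per data point, taken from the problem matrix above.
CAPABILITIES = {
    Category.OUTPUT: ["APM", "infra", "logs", "NPM"],
    Category.INPUT: ["Synthetic", "logs"],
    Category.CLIENT: ["Browser", "mobile"],
}


def triage(output_degraded: bool, input_degraded: bool) -> Category:
    """Pick the data point to investigate first.

    Output is checked first because it is the most common source; client is
    what remains once output and input have both been ruled out.
    """
    if output_degraded:
        return Category.OUTPUT
    if input_degraded:
        return Category.INPUT
    return Category.CLIENT


category = triage(output_degraded=False, input_degraded=True)
print(category.value, "->", ", ".join(CAPABILITIES[category]))
```

This is only a starting point for where to look; compounded problems can degrade more than one data point at once.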
Problems tend to be compounded but the goal is to "find the source" and then determine "what changed" in order to quickly restore service levels.
For example, a significant increase in requests due to a recent deployment of a new product causes unacceptable response times. The source was discovered in the login middleware service. The problem is that TCP queue times jumped.
Here's a breakdown of this situation:
- Category: output performance
- Source: login middleware
- Problem: TCP queue times from additional request load
- Solution: increased TCP connection limit and scaled resources
- Root cause: insufficient capacity planning and quality assurance testing on downstream service impacting login middleware
Here's another example situation:
- There was a sudden increase in 500 Gateway errors on login...
- The login API response times increased to the point where timeouts began...
- The timeouts were traced to the database connections in the middleware layer...
- Transaction traces revealed a significant increase in the number of database queries per login request...
- A deployment marker was found for a deployment that happened right before the problem.
Here's a breakdown of this situation:
- Category: Output performance degradation leading to input performance failure
- Source: middleware service calling database
- Problem: 10x increased database queries after code deployment
- Solution: deployment rollback
- Root cause: insufficient quality assurance testing
The table below is a problem matrix sorted by source.
Internal networking and routing
There's plenty of great information, including free online classes, on identifying patterns. The process itself is rather simple, and it's a powerful diagnostic skill.
The key to identifying patterns and anomalies in performance data is that you don't need to know how the service should be performing: You only need to determine if the recent behavior has changed.
The examples provided in this section use response time or latency as the metric but you can apply the same analysis to almost any dataset, such as errors, throughput, and queue depths.
Below is an example of a seemingly volatile response time chart (7 days) in APM. Looking closer, you can see that the behavior of the response time is repetitive. In other words, there's no radical change in the behavior across a 7-day period. The spikes are repetitive and not unusual compared to the rest of the timeline.
In fact, if you change the view of the data from average over time to percentiles over time, it becomes even more clear how "regular" the changes in response time are.
This chart shows an application response time that seems to have increased in an unusual way compared to recent behavior.
This can be confirmed by using the week-over-week comparison.
We see that the pattern has changed and that it appears to be worse than last week.
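The same week-over-week check can be sketched outside the UI. This is a minimal illustration, not how New Relic computes its comparison: the function names, the sample values, and the 25% threshold are all assumptions. It compares medians so that regular, repetitive spikes do not register as a change.

```python
from statistics import median


def week_over_week_change(this_week: list[float], last_week: list[float]) -> float:
    """Relative change in median response time versus the previous week."""
    if not this_week or not last_week:
        raise ValueError("both weeks need at least one data point")
    baseline = median(last_week)
    if baseline == 0:
        raise ValueError("baseline median must be non-zero")
    return (median(this_week) - baseline) / baseline


def has_pattern_changed(
    this_week: list[float], last_week: list[float], threshold: float = 0.25
) -> bool:
    """Flag a change when the median shifts by more than the threshold."""
    return abs(week_over_week_change(this_week, last_week)) > threshold


# Hypothetical response times in milliseconds.
last_week = [120, 180, 125, 175, 122, 178]  # volatile but repetitive
steady = [121, 179, 124, 176, 123, 177]     # same pattern as last week
degraded = [150, 230, 160, 240, 170, 250]   # pattern has shifted upward

print(has_pattern_changed(steady, last_week))
print(has_pattern_changed(degraded, last_week))
```

The check never needs to know what a "good" response time is; it only asks whether recent behavior differs from the earlier period.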
Once you've identified a negative change in recent behavior patterns you can then trace that behavior to its relative source.
This is where you refer back to the problem matrix. In most cases the problem is output performance in latency or errors that are impacting the user experience. The question is whether the endpoint API application (the first application that receives the request) is causing the problem or whether it's a dependency behind that endpoint.
Find the application causing the latency or errors experienced by the end-user. This doesn't mean the application code is the problem, but finding that application leads us close to the source. You can then rule out components starting with code, database, configuration, resources, and similar.
Today we can use distributed tracing, but first we must identify the source transaction. This means we must look at APM transaction analysis to identify whether one, some, or all of the transactions are affected by latency.
Once the impacted transactions are identified, you'd search for distributed traces of that transaction. Those distributed traces will give you a more complete and holistic view of the end-to-end services within that transaction group. Distributed tracing will help you more quickly identify where the latency is or where the errors are occurring.
Here's a summary of the steps described above:
- Examine applications supporting the affected capabilities using APM.
- Identify if the problem is latency or errors.
- Trace the latency or error to the source application.
- Find the problem.
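The "trace the latency to the source application" step can be sketched with a simplified trace. The `Span` structure and the service names below are hypothetical and do not reflect New Relic's trace format. The sketch also assumes child spans run sequentially, so a span's own time is its duration minus the time spent in its direct children.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float


def self_time_by_service(spans: list[Span]) -> dict[str, float]:
    """Sum each service's own time, excluding time spent in child spans."""
    child_time: dict[str, float] = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] += span.duration_ms

    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        own = span.duration_ms - child_time.get(span.span_id, 0.0)
        totals[span.service] += max(own, 0.0)  # clamp in case children overlap
    return dict(totals)


def likely_source(spans: list[Span]) -> str:
    """Return the service that contributes the most self time to the trace."""
    totals = self_time_by_service(spans)
    return max(totals, key=totals.get)


# Hypothetical trace of a slow login request.
trace = [
    Span("a", None, "login-api", 900.0),
    Span("b", "a", "login-middleware", 850.0),
    Span("c", "b", "database", 100.0),
]
print(self_time_by_service(trace))
print(likely_source(trace))
```

Here the endpoint application shows the full 900 ms, but most of that time is spent in the middleware, which is where you would look next for what changed.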
The recap below summarizes all the steps to help you start your reliability diagnostic journey:
- Use service level management to measure product level health data points on capabilities.
- Use operational key performance indicators to measure success.
- Use performance patterns to help identify the abnormal behavior in errors, latency or throughput.
- Define clear problem statements.
- Find the source application affecting latency or errors.
- Determine the change that is impacting latency or errors. That change is the direct cause, also known as the "problem."
Now go forth and be a great SRE!