This guide walks you through how to improve and optimize the quality of your service level management. It's part of our series on observability maturity.
Service level management is the practice of standardizing data into a universal language that can be communicated easily to all stakeholders. IT does not usually speak business and business does not usually speak IT, so an observability language barrier must be resolved first in order to improve reliability.
This need for a universal language to articulate reliability is what has re-popularized service level management. Service level management is known best in the practice of Uptime, performance and reliability; however, service level management also applies to the other practices of Customer experience, Innovation and growth, and Operational efficiency (learn more about these practices).
This implementation guide will teach you the practice of service level management in the context of the Uptime, performance and reliability practice.
The required business outcome in the practice of reliability is to reduce the number of business-impacting incidents, their duration, and the number of people involved in those incidents.
- Reduce the number of business-disrupting incidents
- Reduce mean time to resolution (MTTR)
- Reduce average people engaged (FTEs) per severe incident
The required operational outcome of service level management within a reliability practice is to communicate digital product health successfully. Operational success is measured by what percentage of critical product applications are covered by standard service levels and percentage of adoption by primary stakeholders. This is achieved by staying focused on what is important to the stakeholders, standardization, ensuring simplicity, a sprinkle of consulting, and proving the effectiveness of service level management.
Here are some terms it may help you to understand before using this guide:
Key performance indicators
What is service health?
Health is the sum of the performance of all your services measured by the final response, connectivity, and renderability at the client. The reality is that your service is a "digital product" and that product is only as good as how well it's received by your users. Your technology, as complex as it can be, is a closed system mostly unseen by the internet except for your external APIs.
The most critical health KPIs are gathered by answering the following questions:
- How fast and successfully can we deliver responses to the user?
- Can users connect to our service?
- Can our client apps render content fast and successfully?
- Can we process critical data fast and successfully?
What is not service health?
Traditionally IT was able to correlate health from hardware data such as CPU, RAM, Disk, Network, etc. However today's infrastructure technology compensates for much of that, especially load balancing and orchestrated infrastructure like Kubernetes. Today these "metrics" are considered diagnostic data and are just a portion of the failure points to look at when application performance issues are detected.
Service health is not exclusively infrastructure data, nor is it exclusively end-user client performance data. Interestingly, many look first to real user monitoring (RUM) or end-to-end transaction data (distributed tracing) to identify health but are left with many more questions.
Learn how to react (reactive) before you pro-act (proactive)
It's common to debate reactive versus proactive practices when establishing initial health data. For example: if I know the hardware performance of our infrastructure, I can predict failure. There was a time when that statement was true, back when monolithic architecture was dominant. It's not entirely true today, because distributed systems don't have a linear (1:1) relationship between hardware performance and the output performance of the applications that sit on that hardware.
In order to truly be proactive you must first establish real input and output datapoints. You must first know what to measure before you can react. You must first learn how to react before you can pro-act (be proactive). Keep it simple and build your skills incrementally. This guide will show you the fastest path to proactive behaviors.
The reality is that starting at your application layer, closest to the customer but before the client device, is the fastest path to observing health data from the customer's perspective.
Primary health datapoints and their KPIs
You'll learn to establish and measure service level objectives for output performance and input performance in this guide as defined below.
Client performance is covered in our implementation guide on Customer experience: Quality foundation.
Data quality is not covered in this guide because each use case varies greatly depending on the data inputs, outputs, and desired results.
It's strongly recommended that you first accomplish the Output performance and Input performance steps before proceeding to the Client performance step. Output and input SLOs are very easy to create, and having output and input health data first offers a much greater return on your investment. In addition to being easy to establish, input and output datapoints will provide you with a remediation path much sooner in your reliability journey.
- Highly recommended: our free interactive online course on this service levels guide
- Build basic skills with our dashboards and NRQL
- Review the service level management UI
Establishing an output performance SLI
Here's an overview of the steps for establishing an output performance SLI:
- Identify your service
- Identify your service boundary
- Establish your baseline
- Create your service level
We'll now describe these steps in more detail.
1. Identify your service
The following is assumed:
- Your primary applications are instrumented with New Relic APM agents.
- Your application names follow a familiar naming convention, as outlined in our Service characterization guide.
- You can find your application in our entity explorer.
In the entity explorer, find your application (entity type of Services - APM) and select it. You should see the overview screen below. Don't click Service levels yet.
2. Identify your service boundary
The goal here is to first ensure you're measuring the output of your service. While each dependency of that application plays a part in response times and success rates, the final and total response time and success rate are most easily measured at the point where the request is received and responded to.
In the screenshot below you are responsible for all applications that support order processing. You selected #2 (Order-Composer) to start, clicked Service maps, and discovered that Order-Composer is really a dependency; therefore, you will need to select #1 (Order-Processing) in order to establish a true health service level.
Your team may only be accountable for the dependency, Order-Composer. If that's the case, then your own service level on Order-Composer is perfectly acceptable for your own self-monitoring of performance. Be sure to tag your own non-customer-facing service levels as customer-facing:false to allow for better filtering in health reports. Also, consider collaborating with the owners of the customer-facing endpoint (#1 Order-Processing) in your observability journey in order to establish true output performance, an input connectivity service level, and client service levels.
3. Establish your baseline
Establishing a baseline is a critical step to accelerating adoption and implementation of service levels. Determining what a service's design specifications are, or should have been, is far more challenging. A baseline instead measures the current performance of a service; then, through the service level reports, you'll know whether you're holding that baseline or degrading.
You can create a baseline for virtually any dataset; however, there are different formulas and recommendations for different use cases. For example, you should use the average for some datasets, percentiles for others, and max for others.
When starting service levels you should start with output performance of your applications. For this we use response times (latency) and percentage of non-errors (success).
How much history should be considered? Not much, in fact. You're establishing a reliability health metric, and seasonality and peak usage are not an excuse for poor performance. Also, the more history you include in your measurement, the more likely you are to include different codebases from prior releases. Previous deployments, no matter how small, could skew your results.
The recommended history is one to two weeks of performance data to establish a fair baseline.
Here's an example NRQL query that represents the recommended target for a 7-day service level objective for latency:
FROM Transaction SELECT percentile(duration, 95) AS 'Latency Baseline SLI' WHERE appName='Order-Processing' SINCE 1 WEEK AGO
For a success (error-free) baseline, try the following query. Be sure to replace Order-Processing with your own application name.
FROM Transaction SELECT percentage(count(*), WHERE error is false) AS 'Success Baseline SLI' WHERE appName='Order-Processing' SINCE 1 WEEK AGO
4. Create your service level
The New Relic platform will automatically calculate recommended baselines for you.
Note: If you don't see the Add a service level button, check with your New Relic administrator about your permissions.
The "Identify your service" section above shows you how to find your application APM data. In the screenshot in that section, you'll see #2, labeled "Service levels." Find your application APM data and click Service levels. You should see the view below.
Click Add baseline service level objectives and almost instantly you will have both your Latency SLI and Success SLI and their respective objectives created for you.
You can view and change all the settings by clicking the three dot icon in the upper-right corner of each SLO scorecard.
Note: It will take approximately 10 minutes for data to populate the SLO scores. This is because we use the events-to-metrics service for data longevity and query performance. It takes a moment for the conversion to take place and begin to populate the data.
Establishing an input performance SLI
An overview of the process of establishing an input performance SLI:
- Create your synthetic check.
- Create your service level indicator.
Below are more details on these steps.
1. Create your synthetic check
The most common input performance service level is often referred to as "connectivity" or "uptime." This is a simple check against a health API endpoint or simply loading a URL. Both of these can be done easily by using our synthetic monitoring service. Please refer to Add simple browser monitor and Add scripted API test to learn how to begin reporting data.
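Once the monitor is reporting, you can preview its success rate directly from the synthetic data before creating the indicator. Here's a minimal NRQL sketch; the monitor name Order-Processing-Health is a hypothetical example, so substitute your own:

```sql
// Baseline connectivity success rate from synthetic check results
// (the monitor name is a hypothetical example)
FROM SyntheticCheck SELECT percentage(count(*), WHERE result = 'SUCCESS') AS 'Connectivity Baseline SLI' WHERE monitorName = 'Order-Processing-Health' SINCE 1 WEEK AGO
```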
2. Create your service level indicator
After completing that first step, you should have data reporting. Now you'll use the service level management service to create an input indicator and objective.
From the entity explorer, select Service levels on the right, and then click + Add a service level indicator.
Note: If you don't see the Add a service level button, check with your New Relic administrator about your permissions.
Next, filter your entity types to Synthetic monitors. See the screenshot below.
- Find your synthetic monitor in that list and click it. This will enable the Continue button in the left panel. Click Continue.
- You'll see a button for the recommended settings for a Success service level (shown below). Click it.
- Make appropriate changes to the tags, title and description as needed.
- Click Save.
Establishing a capability SLI
This is where you will really accelerate adoption of service levels!
You do not need to have intimate knowledge of an application or service in order to complete this task. You simply need to know where the consumer-facing API (service boundary) is, and follow the steps below.
This is a major step in your observability maturity journey. Having service levels on critical business capabilities, like login or authorize payment, will rapidly close the language barrier between IT and business. Service level scores on capabilities also provide you with a more precise remediation path when their service levels begin to degrade. For example, if the login service level begins to degrade, you'll know to look at identity management dependencies and workflows starting at the consumer-facing API.
Note: in this task you're building on the skills you learned in the "Establishing an output SLI" section.
Here's an overview of this process:
- Assess application capabilities.
- Baseline a capability.
- Create your capability service level.
These steps will be explained in more detail below.
Identify the service boundary application as described in the Establishing an output SLI section above.
Run the following NRQL query to identify baselines on the most frequently used transactions. Be sure to replace Order-Processing with the application name you identified.
FROM Transaction SELECT count(*), percentile(duration, 95) WHERE appName='Order-Processing' FACET name SINCE 1 WEEK AGO
You should see something similar to the screenshot below.
You'll see that the first transaction appears to have something to do with "purchase." You can now create a "purchase" capability service level.
Note: Even if you're not sure that this transaction represents the purchase capability, this exercise makes a great example to show the application team and your leadership the value of capability service levels. Remember, your goal here is to start a conversation with the stakeholders by showing the art-of-the-possible.
Add WHERE name='Controller/Sinatra//purchase' to the end of your query, replacing Controller/Sinatra//purchase with your transaction name. Run the query to make sure it works. You should now see only the one transaction in your result. Copy this query and the DURATION (95%) result into a notepad. You'll need both in a moment.
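As a sketch, using the hypothetical Order-Processing application and the purchase transaction from the screenshot, the refined baseline query would look like this (substitute your own application and transaction names):

```sql
// Baseline the single transaction behind the "purchase" capability
// (application and transaction names are examples from this walkthrough)
FROM Transaction SELECT count(*), percentile(duration, 95) WHERE appName='Order-Processing' AND name='Controller/Sinatra//purchase' SINCE 1 WEEK AGO
```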
Create a new service level in the platform. Starting a new service level is described in Establishing an input performance SLI.
In this case you want to find your application (APM entity type) in the list so you can retain the metadata (tags) through the entity GUID. Instead of Synthetic monitors as in the section above, select APM in the entity filter dropdown.
Select the "Latency" guided workflow so the good and valid queries are auto-populated for you.
Use your notepad to copy just the WHERE name condition. Add this condition to both queries in the service level, preceded with an AND, as underlined in the screenshot below.
Simply adjust the duration < 1.78 portion of the second query to match the DURATION (95%) result from your original baseline, copied to your notepad.
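Putting this together, the two SLI queries would end up looking something like the sketch below. The application name, transaction name, and the 1.78-second threshold are all example values from this walkthrough; use your own baseline values.

```sql
// Valid events: all requests for the "purchase" capability
FROM Transaction WHERE appName='Order-Processing' AND name='Controller/Sinatra//purchase'

// Good events: purchase requests faster than the baseline 95th percentile
FROM Transaction WHERE duration < 1.78 AND appName='Order-Processing' AND name='Controller/Sinatra//purchase'
```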
Proceed to name this service level, update the description, and save the service level.
It's recommended to set up a few of these capability service levels and present to the application team and your leadership for feedback.
Alert quality management
Alert quality management is another observability maturity practice that marries really well with service level management. The value of putting alert quality data side-by-side with service level data is that you can see whether alert policies are aligned with real impact or are just creating noise. You'll be able to identify good alerts, missing alerts, and purely noisy alerts.
You can do this by creating a custom dashboard with an SLI compliance query side-by-side with an alerting quality query.
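As a sketch, such a dashboard could pair an SLI compliance widget with an incident-count widget. The query below uses the NrAiIncident event to count incidents opened per alert policy, which you can then eyeball against your SLI compliance over the same period:

```sql
// Incidents opened per alert policy, to compare against SLI compliance
FROM NrAiIncident SELECT count(*) AS 'Incidents opened' WHERE event = 'open' FACET policyName SINCE 1 WEEK AGO
```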
If you haven't already, check out our Alert quality management guide.
Adoption and continuous improvement
Improvement of service levels and reliability requires adoption of the practice by all the stakeholders of the service. This includes, but is not limited to, engineering management, product management, and executive management. The primary goal is to quickly demonstrate the power and value of service levels to stakeholders in order to start a meaningful discussion on what really matters to those stakeholders. The steps in this guide will get you those meaningful discussions very quickly.
A proven method, with a high rate of adoption, is to first establish output performance and input performance service levels for one digital product and its critical capabilities. This usually involves one overall output and input service level for each endpoint application (usually one or two), and then approximately 4-7 output performance service levels for assumed critical capabilities measured at the endpoint transaction.
With this method, you deliberately skip surveying each stakeholder about what should and shouldn't be measured. Surveys usually result in long wait times, lots of questions, frustration, lack of demonstrated value, and suboptimal answers. Remember, start with baselines and key transactions as "capabilities."
Freely make assumptions about what these endpoints are, and about which endpoint transactions make up which capabilities, as demonstrated above. Accuracy is not the key at first. What is key to a successful kick-off is demonstrating the ability to easily measure and communicate health. That initial demonstration will show the value in investing more time to refine what is and isn't measured in primary service levels.
Don't wait. The sooner you provide that demonstration and the more complete that demonstration is, the sooner you will achieve broader adoption and begin the reliability improvement process in collaboration with all the stakeholders!
Once you have established what works (and what doesn't) for your stakeholders, you can then begin to design SLM at scale with automation. You can start learning about automating service level management by studying the New Relic Terraform library.
As stated in the "Desired outcomes" section above, the proverbial bottom-line result is to reduce the cost of business-impacting incidents.
However, service levels can also help quantify both estimated revenue loss during incidents as well as estimated revenue at risk for subscription-based businesses.
Revenue loss can be easily estimated for revenue generated by transaction, such as online retail, as well as penalties paid if your business has service level agreement contracts with penalties built-in.
Revenue at risk is for subscription-based (SaaS) business models where each customer has a monthly or annual subscription value. You can easily estimate the number of customers impacted and their subscription revenue by period to calculate "revenue at risk." Note: subscription businesses can also have penalties within a service level agreement contract, which should be included as stated below.
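For example, if your transactions carry a customer identifier, you can estimate how many distinct customers experienced errors during an incident window. In the sketch below, userId is an assumed custom attribute (not captured by default; you'd add it via agent custom attributes), and the application name is the hypothetical example from this guide:

```sql
// Distinct customers who hit errors in the last day
// (userId is an assumed custom attribute, not a default one)
FROM TransactionError SELECT uniqueCount(userId) AS 'Customers impacted' WHERE appName='Order-Processing' SINCE 1 DAY AGO
```

Multiply that count by the average subscription value per period to estimate revenue at risk.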
Quantifying the direct cost of service level agreement breaches
Determine the cost of previous breaches. For example, online retail businesses know the estimated revenue loss per minute during service loss (downtime). Legal can tell you the penalty costs of service level agreement (SLA) contract breaches. Both losses can be easily estimated in real-time using New Relic data on service level breaches.
Quantifying the revenue opportunity costs of service level breaches
Determine the three variables below.
- (A) number of breaches that trigger penalties or revenue loss
- (B) average duration of breaches
- (C) average penalty or revenue loss per minute/hour
Multiply those three variables (A × B × C) to calculate the total revenue opportunity to recover. For example, 4 breaches × 30 minutes average duration × $500 lost per minute = $60,000 of recoverable revenue opportunity.
Quantifying revenue leakage
Determine the two variables below.
- (A) Total revenue (per period)
- (B) Total penalty payments made to customers (per the same period as A)
Divide B by A to calculate your revenue leakage rate (%). For example, $50,000 in penalties against $10,000,000 in revenue is a 0.5% leakage rate.
Tracking your service level objectives
Service levels are a practice, just like testing, alerting, game days, and other recurring practices. You could think of them as a scientific instrument that helps you to measure the "health" of your systems. But like all tools, service levels require calibration.
Include the service level practice in your team's process. The following recommendations were distilled from the experience of using service levels within teams at New Relic; adjust them for your specific team requirements:
- Do a periodic review of your service levels, and pay close attention to:
  - Do the SLIs reflect incidents and pages?
  - What's your error budget for the week?
    - If it's too low, investigate what caused the drop, using the Analyze feature to find the bad events behind it.
    - If it's 100%, make sure your indicator is correct and the SLO is aggressive enough; being at 100% indicates the SLO is too safe.
  - What trend do you observe over various time periods (1d/7d/28d)?
- Keep an eye on SLIs during game days. SLIs should reflect the impact, just like your alerts do.
- When you see an error budget drop in production, evaluate why it didn't happen in staging.
The next step in our observability maturity practice is to add in customer experience service levels measured at the client browser or mobile device. Again, it's important you first prove value as described in the improvement process above. Remember: observability is a journey, and maturity takes time, practice, and patience.
To proceed on your journey, see: