Set proactive alerts and align teams, tools, and processes for incident response

The term alerting often carries negative connotations: many developers associate alerting with errors, mistakes, and ongoing issues. However, developers who are proactive about alerting know they don't have to stare at dashboards all day, because effective alerts tell them when to check in.

Well-defined alerts help you understand the health of your systems, so you can respond to performance problems before they affect your customers. Further, a focused alerts policy helps you pinpoint any degradation that could impact performance in your application or environment. With proactive alerting, you'll decrease user-reported incidents, and your teams will spend less time firefighting and more time deploying significant changes to your product.

After you define the right alerts, proper incident orchestration aligns your teams, tools, and processes to prepare for incidents and outages in your software. The goal is to provide your teams a predictable framework and process to:

  • Maximize efficiency in communication and effort.

  • Minimize the overall impact to your business.

Prerequisites

This tutorial assumes you have:

1. Define required alerting policies based on SLOs

A service level objective (SLO) is an agreed-upon way to measure the performance of your service. An SLO defines a target value for a specified quantitative measure, called a service level indicator (SLI). Examples of service level indicators include average response time, response time percentiles, and application availability. SLOs then set target values for those SLIs, such as:

  • Average response time should be less than 200 ms.

  • 95% of requests should be completed within 250 ms.

  • Availability of the service should be 99.99%.

These SLOs can also be logically grouped together to provide an overall boolean indicator of whether the service is meeting expectations or not (for example, 95% of requests completed within 250 ms AND availability is 99.99%), which can be helpful for alerting purposes.

Use these SLIs as key performance indicators (KPIs) so your team and organization can measure your service and ensure it's meeting customer expectations. By breaking down the quantitative performance metrics that are required of your services, you can identify the kinds of alerts you should set for each. For instance, you could set an alert to notify you if web transaction times go above half a second, or if the error rate goes higher than 0.20%.
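As a hedged illustration, the following NRQL queries sketch how you might measure two of those SLIs for an APM-instrumented application. The app name 'MY-APP-NAME' is a placeholder, as elsewhere in this tutorial:

  • 95th-percentile response time:

    SELECT percentile(duration, 95) FROM Transaction WHERE appName = 'MY-APP-NAME' SINCE 1 day ago

  • Error rate:

    SELECT percentage(count(*), WHERE error IS true) FROM Transaction WHERE appName = 'MY-APP-NAME' SINCE 1 day ago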

However, not every SLO needs to become an alert. A strong alerting strategy takes SLOs and creates a set of simple, actionable alerts. New Relic often finds that our most mature customers set fewer alerts in general and focus those alerts on a core set of metrics that indicate when their customer experience is truly degraded. As a result, New Relic customers often use Apdex as part of their alerting strategy to align alerts with signs of degraded user satisfaction.
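For example, assuming your application reports Transaction events and an Apdex T value of 0.5 seconds (both placeholders for your own settings), a query like the following approximates the Apdex signal you might alert on:

    SELECT apdex(duration, t: 0.5) FROM Transaction WHERE appName = 'MY-APP-NAME' SINCE 1 hour ago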

As you design your alerting strategy, keep this question in mind: “If the customer isn’t impacted, is it worth waking someone up?”

Example questions and KPI solutions

For a simple framework of where to set alerts, use the following questions and the recommended metrics and KPIs:

  • Are we open for business? Use New Relic Browser and New Relic APM to alert on site availability.

  • How's our underlying infrastructure? Set KPIs for key hardware, network, and storage components.

  • How's the health of our application? Track metrics for JVM performance, queuing, caching, and similar dependencies.

  • How's the overall quality of our application? Use an Apdex score to quickly assess an application's quality.

  • How are our customers doing? Consider real end-user metrics (Browser or APM), synthetic users (Synthetics), and Apdex scores.

  • How's our overall business doing? Focus on key transactions within an application, and tie them to expected business outcomes to illustrate the correlation between application performance and business performance.

2. Set specific alerts for performance, correctness, throughput, availability, and dependencies

With New Relic you can set alerts on your instrumented applications, end-user experience, infrastructure, databases, and more. New Relic will alert you if your site's availability dips or if your error rate spikes above acceptable levels, as defined by your SLOs. You can set warning thresholds to monitor issues that may be approaching a critical severity but don't yet warrant a pager notification.

Setting thresholds for when alerts should notify teams can be challenging. Thresholds that are too tight will create alert fatigue, while thresholds that are too loose will lead to unhappy customers.

Baseline alerts allow you to set dynamic thresholds based on historical performance. Use baselines to tune your alert to the right threshold. For example, an alert in APM can notify incident response teams if web transaction times deviate from historical performance for a specified period of time.

Screenshot: alerts.newrelic.com > (selected alert policy) > (selected alert condition) > Define thresholds
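If you prefer NRQL alert conditions, the signal behind a comparable baseline alert could be a simple query like the sketch below (the app name is a placeholder); the baseline threshold you configure is then applied on top of this signal:

    SELECT average(duration) FROM Transaction WHERE appName = 'MY-APP-NAME'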

You can set this same kind of alert in Browser to catch sub-optimal performance. In the following example, we've set both a warning and a critical threshold for throughput:

Screenshot: alerts.newrelic.com > (selected alert policy) > (selected alert condition) > Define thresholds
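The throughput signal in this Browser example could likewise be expressed as a NRQL query over PageView events (again, the app name is a placeholder):

    SELECT count(*) FROM PageView WHERE appName = 'MY-APP-NAME' TIMESERIES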

As you develop smaller, independent services running on increasingly ephemeral architectures, your environments become significantly more complex. Visibility into outliers can be an important tool for understanding likely performance issues. You should set alerts to automatically fire when you have an outlier, which can indicate misbehaving hosts, load balancers, or apps.

For example, a load balancer divides web traffic across five different servers. You can set an alert based on a NRQL query and receive a notification if any server starts getting significantly more or less traffic than the other servers. Here’s the graph:

Screenshot: alerts.newrelic.com > (selected alert policy) > (selected alert condition) > Define thresholds

And here's a sample NRQL query:

SELECT average(cpuPercent) FROM SystemSample WHERE apmApplicationNames = 'MY-APP-NAME' FACET hostname

Now you have static, baseline, and outlier alerts in place, which together provide comprehensive awareness of your ecosystem.

Refer to the New Relic Alerts documentation for more details about optimizing your alerts.

3. Identify groups to alert, set broadcasting methods, and assign first responders to team dashboards

Alerting without the proper broadcasting methods leaves you vulnerable. Your alerting strategy should include a notification channel to ensure the appropriate teams are notified if your application or architecture encounters issues. New Relic has many notification integrations, but we recommend that you start simple and add more complexity later.

We recommend that you first send alerts to a group chat channel (for example, using Slack or HipChat). Evaluate these alerts in real time for several weeks to understand which alerts are indicative of important or problematic issues. These are the types of alerts that warrant waking someone up.

Screenshot: alerts.newrelic.com > (selected alert policy) > Notification channels

Recommendation: For each team dashboard, make sure:

  • It has an owner who assumes responsibility for the health of the applications and features it monitors.

  • There is no ambiguity about who is responsible for attending to and resolving an alert condition.

This policy will vary between organizations depending on size, structure, and culture.

For example, some teams may prefer to assign dashboards and alerts based on de facto feature or application ownership. Other teams may prefer to adopt an on-call rotation (often referred to as pager duty). In on-call rotations, designated team members handle all first-line incident responses, and they resolve or delegate responsibilities based on predetermined incident thresholds.

4. Determine incident thresholds for alert conditions

For each of your applications:

  1. Identify the thresholds for what is officially considered an incident.

  2. Make sure each set of threshold criteria is context-dependent.

  3. Document incident evaluation and known remediation procedures in runbooks.

  4. Include links to your runbooks when you define conditions and thresholds for your alert policies.

For instance, a certain alert condition may be dismissible during low-traffic periods but require active remediation during peak hours.

5. Ensure alerts have auditable notification channels

Make sure that communications during critical incidents take place in easily accessible and highly visible channels. A group chat room dedicated to incident communication is usually a great choice. This allows all stakeholders to participate or observe and provides a chronology of notifications, key decisions, and actions for postmortem analysis.

Use any of the available notification channels in New Relic Alerts. For example, to set up a notification channel in Slack:

  1. Make sure your organization has completed New Relic's integration requirements with Slack.

  2. In the Slack app, select the dropdown in the top-left corner of the app, and select Customize Slack.

  3. Click Configure apps.

  4. From the list of app integrations, select New Relic.

  5. Expand the instructions for New Relic Alerts, and follow the steps to configure notifications from New Relic.

6. Automate triage and resolution steps

Automation of simple or repeatable incident response tasks will increase efficiency and minimize the impact of incidents. With proper automation in place, you can disable or isolate faulty application components as soon as an alert threshold is reached, rather than after a notification has been issued.

For example, a team managing an application for a digital media company wants to be able to remove commenting abilities from the website if the system has errors. In this case, they could:

  1. Add an endpoint to their front-end web application that will toggle a feature flag enabling or disabling the UI components associated with posting comments on an article.

  2. Create an alert policy with a threshold set on the sustained error rate in the commenting service.

  3. Assign a webhook notification channel that will send a POST request to this endpoint, as well as to the standard team notification channels.

In this scenario, errors in the commenting system will trigger the webhook and remove the commenting UI from the website. Users can still use core functionality of the site without seeing errors generated by the commenting service. The application will maintain a stable but degraded state, allowing the team to focus on recovery without the pressure of preventing users from accessing the site.
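As a rough sketch of step 2, assuming the commenting service reports Transaction events under a hypothetical name 'comment-service', the alert condition's signal could be an error-rate query like the following, with a static threshold (for example, above 5% for at least 5 minutes) chosen to match your SLOs:

    SELECT percentage(count(*), WHERE error IS true) FROM Transaction WHERE appName = 'comment-service'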

You can also use webhooks to create issues and action items in ticketing systems that have REST APIs, such as Zendesk. Use New Relic Alerts to create a webhook notification channel, and customize the webhook payload as needed.

New Relic also provides integrations for common ticketing systems. You can use any of these integrations to file tickets from New Relic APM.

7. Establish a postmortem process

After the incident has been resolved, key stakeholders and participants must capture accurate and thorough documentation of the incident.

At a minimum, we recommend that the postmortem documentation includes:

  • A root cause analysis

  • A chronology and summary of remediation steps and their result, whether they were successful or not

  • A measure of the impact to the business in terms of user experience and financial losses, if possible

  • Recommendations for system or feature improvements to prevent a recurrence

  • Recommendations for process and communication improvements

Store postmortem reports in a highly visible, searchable repository, such as a shared drive folder or wiki. Culturally, it's essential that this process focuses on constructive learning and improvement rather than punishment or blame.

Example postmortem report

Here is a brief example of a postmortem report:


Date

March 1, 2018

Executive summary

From approximately 1:45PM until 2:30PM, users could not add items to their carts, which prevented any checkouts from occurring during the incident period.

Root cause

We determined that a change was made to the CSS rules on the product detail page that effectively disabled the Add to cart button.

Timeline

  • 1:50PM: The "Successful checkouts < 10 for 5 minutes" alert triggered; assigned to Alice.

  • 1:55PM: After reviewing the ecommerce team dashboard, Alice determined that the threshold was breached immediately following a deploy by Bob. She notified him of the incident.

  • 2:00PM: Alice and Bob began troubleshooting. Attempts at recreating the issue in production were successful.

  • 2:20PM: Bob determined that his change to the CSS on the product detail page disabled the Add to cart button. He deployed a hotfix.

  • 2:30PM: Functionality was restored and the incident was resolved.

Impact

No checkouts were completed for the duration of the incident. Our typical revenue for a Thursday during this timeframe is $30,000.

Recommendations

We have been discussing implementing New Relic Synthetics for a while now. If we had a Synthetics check on the checkout process, this issue would have been detected immediately. We should also implement more thorough unit tests in the front-end web app.

8. Fine-tune alerts and thresholds

As you use New Relic to optimize your application and infrastructure performance, tighten your New Relic Alerts policy conditions to keep pace with your improved performance.

When rolling out new code or modifications that could negatively impact performance over a period of time, loosen your threshold conditions to allow for these short-term changes. For instance, we recommend using pre-established baselines and thresholds to increase efficiency during high-impact times for your business, such as annual events and major releases. Fine-tuning gives you the flexibility you need to increase efficiencies and extend your notification channels.

As noted earlier, we recommend you start with a group chat service when you first establish notifications. Once you've identified other tools you'd like to integrate with, set up a notification channel to maintain your momentum. Tools such as xMatters and PagerDuty provide popular integrations, but don't overlook simple methods, such as webhooks. The goal is to continuously improve your alerting scheme.

Regularly review your alerts to confirm that they're still firing as expected and remain relevant to your customer satisfaction metrics. Use the New Relic platform to create dashboards centered around alerts and incidents for the most common policy conditions and violations.

Screenshot: insights.newrelic.com > All dashboards > (selected dashboard)

Use the New Relic query language and the Insights query API to create your dashboards. For detailed instructions, check out Sending alerts data to Insights.

The dashboard above was created using the following NRQL queries. We recommend recreating them as needed for your own alerting dashboards.

  • Incidents by condition:

    SELECT count(*) FROM ALERT_NAME WHERE current_state = 'open' FACET condition_name SINCE 1 week ago
    
  • Incidents by policy:

    SELECT count(*) FROM ALERT_NAME WHERE current_state = 'open' FACET policy_name SINCE 60 MINUTES AGO TIMESERIES

  • Alert trends over time:

    SELECT count(*) FROM ALERT_NAME WHERE current_state IS NOT NULL FACET policy_name SINCE 1 week ago TIMESERIES

  • Incident details:

    SELECT timestamp, incident_id, policy_name, condition_name, details, severity FROM ALERT_NAME SINCE 1 week ago LIMIT 40

Visualizing this data provides a resource you can share with others to refine the alerts and thresholds you're using.

Expert tip

In addition to instrumenting and measuring application and infrastructure metrics, mature DevOps organizations often measure and optimize the efficiency of incident response processes. For example, you can use webhooks to send alert events to New Relic Insights. This allows you to supplement your team dashboards with New Relic Alerts data.

For more help

Recommendations for learning more: