Incident orchestration: Align teams, tools, processes

Incident orchestration is the alignment of teams, tools, and processes to prepare for incidents and outages in your software. The goal is to give your teams a predictable framework and process to:

  • Maximize efficiency in communication and effort.
  • Minimize the overall impact to your business.

Prerequisites

Before starting this tutorial, be sure to complete the prerequisite New Relic tutorials.

1. Assign first responders to team dashboards

Recommendation: For each team dashboard, make sure:

  • It has an owner who assumes responsibility for the health of the applications and features it monitors.
  • There is no ambiguity about who is responsible for attending to and resolving an alert condition.

This policy will vary between organizations depending on size, structure, and culture.

For example, some teams may prefer to assign dashboards and alerts based on de facto feature or application ownership. Other teams may prefer an on-call rotation (often referred to as "pager duty"), in which designated team members handle all first-line incident response and resolve or delegate responsibilities based on predetermined incident thresholds.

2. Determine incident thresholds for alert conditions

For each of your applications:

  1. Identify the thresholds for what is officially considered an incident.
  2. As you create alert policies with New Relic Alerts, make sure each set of threshold criteria is context-dependent.
  3. Document incident evaluation and known remediation procedures in runbooks.
  4. Include links to your runbooks when you define conditions and thresholds for your alert policies.

For instance, a certain alert condition may be dismissible during low-traffic periods but require active remediation during peak hours.
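To make step 4 concrete, here is a sketch of attaching a runbook URL to an alert condition through New Relic's REST (v2) Alerts API. The API key, policy ID, application ID, and runbook URL below are hypothetical placeholders, and the condition shown is one plausible shape rather than a prescription:

    # Sketch: create an APM error-percentage condition that links to a
    # runbook, using New Relic's REST (v2) Alerts API. All IDs, keys,
    # and URLs are hypothetical placeholders.
    import requests

    API_KEY = "YOUR_NEW_RELIC_API_KEY"   # hypothetical admin API key
    POLICY_ID = 123456                   # hypothetical alert policy ID

    condition = {
        "condition": {
            "type": "apm_app_metric",
            "name": "Commenting service error rate",
            "enabled": True,
            "entities": ["987654"],      # hypothetical APM application ID
            "metric": "error_percentage",
            "runbook_url": "https://wiki.example.com/runbooks/commenting-errors",
            "terms": [{
                "duration": "5",         # threshold must hold for 5 minutes
                "operator": "above",
                "threshold": "5",        # error rate above 5%
                "time_function": "all",
                "priority": "critical",
            }],
        }
    }

    response = requests.post(
        f"https://api.newrelic.com/v2/alerts_conditions/policies/{POLICY_ID}.json",
        headers={"X-Api-Key": API_KEY, "Content-Type": "application/json"},
        json=condition,
    )
    response.raise_for_status()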

3. Ensure alerts have auditable notification channels

Recommendation: Make sure that communications during critical incidents take place in easily accessible and highly visible channels. A group chat room dedicated to incident communication is usually a great choice. This allows all stakeholders to participate or observe and provides a chronology of notifications, key decisions, and actions for postmortem analysis.

Use any of the available notification channels in New Relic Alerts. For example, to set up a notification channel in Slack:

  1. Make sure your organization has completed New Relic's integration requirements with Slack.
  2. In the Slack app, open the dropdown in the top-left corner, and select Customize Slack.
  3. Click Configure apps.
  4. From the list of app integrations, select New Relic.
  5. Expand the instructions for New Relic Alerts, and follow the steps to configure notifications from New Relic.
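If you manage alerting configuration as code, the same REST (v2) API can create notification channels directly. A minimal sketch, assuming you already have a Slack incoming-webhook URL (the API key, webhook URL, and channel name are placeholders):

    # Sketch: create a Slack notification channel through New Relic's
    # REST (v2) API instead of the UI. The webhook URL and channel name
    # are hypothetical placeholders.
    import requests

    API_KEY = "YOUR_NEW_RELIC_API_KEY"

    channel = {
        "channel": {
            "name": "incident-response",
            "type": "slack",
            "configuration": {
                "url": "https://hooks.slack.com/services/T000/B000/XXXX",
                "channel": "#incident-response",
            },
        }
    }

    response = requests.post(
        "https://api.newrelic.com/v2/alerts_channels.json",
        headers={"X-Api-Key": API_KEY, "Content-Type": "application/json"},
        json=channel,
    )
    response.raise_for_status()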

4. Automate triage and resolution steps

Automating simple or repeatable incident response tasks increases efficiency and minimizes the impact of incidents. With proper automation in place, you can disable or isolate faulty application components as soon as an alert threshold is reached, rather than after a notification has been issued.

For example, a team managing an application for a digital media company wants to be able to remove commenting abilities from the website if the system has errors. In this case, they could:

  1. Add an endpoint to their front-end web application that will toggle a feature flag enabling or disabling the UI components associated with posting comments on an article.
  2. Create an alert policy with a threshold set on the sustained error rate in the commenting service.
  3. Assign a webhook notification channel that will send a POST request to this endpoint, as well as to the standard team notification channels.

In this scenario, errors in the commenting system trigger the webhook and remove the commenting UI from the website. Users can still use the site's core functionality without seeing errors generated by the commenting service, and the application maintains a stable but degraded state, allowing the team to focus on recovery without the added pressure of a full outage.
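Here is a minimal sketch of the toggle endpoint from step 1, written as a small Flask app. The route, flag store, and shared secret are all hypothetical; the handler keys off the current_state field ("open", "acknowledged", or "closed") that New Relic includes in its webhook payloads:

    # Sketch: endpoint that disables the commenting UI while a New Relic
    # incident is open and re-enables it when the incident closes.
    # The flag store and secret token are hypothetical stand-ins.
    from flask import Flask, abort, request

    app = Flask(__name__)
    SHARED_SECRET = "replace-me"                 # hypothetical auth token
    feature_flags = {"comments_enabled": True}   # stand-in for a real flag store

    @app.route("/internal/flags/comments", methods=["POST"])
    def toggle_comments():
        if request.headers.get("X-Auth-Token") != SHARED_SECRET:
            abort(401)
        payload = request.get_json(force=True)
        # New Relic webhook payloads report current_state as
        # "open", "acknowledged", or "closed".
        state = payload.get("current_state", "open")
        feature_flags["comments_enabled"] = (state == "closed")
        return {"comments_enabled": feature_flags["comments_enabled"]}

    if __name__ == "__main__":
        app.run(port=8080)

The front end would then consult comments_enabled before rendering the commenting components, so the flag flips automatically as incidents open and close.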

You can also use webhooks to create issues and action items in ticketing systems that have REST APIs, such as Zendesk. Use New Relic Alerts to create a webhook notification channel, and customize the webhook payload as needed.
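As a sketch, a small relay function could translate the alert payload into a Zendesk ticket through Zendesk's tickets API; the subdomain, credentials, and field mapping below are placeholders:

    # Sketch: file a Zendesk ticket from a New Relic alert webhook
    # payload. Subdomain, email, and API token are hypothetical.
    import requests

    ZENDESK_SUBDOMAIN = "example"
    ZENDESK_AUTH = ("alerts@example.com/token", "ZENDESK_API_TOKEN")

    def file_ticket(alert_payload: dict) -> None:
        """Create a ticket from the fields New Relic sends by default."""
        ticket = {
            "ticket": {
                "subject": f"[{alert_payload.get('current_state', 'open')}] "
                           f"{alert_payload.get('condition_name', 'New Relic alert')}",
                "comment": {"body": alert_payload.get("details", "")},
                "priority": "urgent",
            }
        }
        response = requests.post(
            f"https://{ZENDESK_SUBDOMAIN}.zendesk.com/api/v2/tickets.json",
            auth=ZENDESK_AUTH,
            json=ticket,
        )
        response.raise_for_status()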

New Relic also provides integrations for common ticketing systems, which you can use to file tickets directly from New Relic APM.

5. Establish a postmortem process

After the incident has been resolved, key stakeholders and participants must capture accurate and thorough documentation of the incident. Recommendation: At a minimum, make sure the documentation includes:

  • A root cause analysis
  • A chronology and summary of remediation steps and their results, successful or not
  • A measure of the impact to the business in terms of user experience and financial losses, if possible
  • Recommendations for system or feature improvements to prevent a recurrence
  • Recommendations for process and communication improvements

Store postmortem reports in a highly visible, searchable repository, such as a shared drive folder or wiki. Culturally, it is essential that this process focuses on constructive learning and improvement rather than punishment or blame.

Here is a brief example of a postmortem report:

Example postmortem report:

  • Date: March 1, 2018
  • Executive summary: From approximately 1:45PM until 2:30PM, users could not add items to their carts, which prevented any checkouts from occurring during the incident period.
  • Root cause: We determined that a change was made to the CSS rules on the product detail page that effectively disabled the Add to cart button.
  • Timeline:
      • 1:50PM: The "Successful checkouts < 10 for 5 minutes" alert triggered; assigned to Alice.
      • 1:55PM: After reviewing the ecommerce team dashboard, Alice determined that the threshold was breached immediately following a deploy by Bob. She notified him of the incident.
      • 2:00PM: Alice and Bob began troubleshooting. Attempts to recreate the issue in production were successful.
      • 2:20PM: Bob determined that his change to the CSS on the product detail page had disabled the Add to cart button. He deployed a hotfix.
      • 2:30PM: Functionality was restored and the incident was resolved.
  • Impact: No checkouts were completed for the duration of the incident. Our typical revenue for a Thursday during this timeframe is $30,000.
  • Recommendations: We have been discussing implementing New Relic Synthetics for a while now. A Synthetics check on the checkout process would have detected this issue immediately. We should also implement more thorough unit tests in the front-end web app.

Expert tip

In addition to instrumenting and measuring application and infrastructure metrics, mature DevOps organizations often measure and optimize the efficiency of incident response processes. For example, you can use webhooks to send alert events to New Relic Insights. This allows you to supplement your team dashboards with New Relic Alerts data.
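A sketch of that pattern: a webhook receiver forwards each alert notification to the Insights insert API as a custom event. The account ID, insert key, and the AlertNotification event type are all assumptions:

    # Sketch: forward a New Relic alert webhook payload to Insights as a
    # custom event so alert history can be charted on team dashboards.
    # Account ID, insert key, and event type are hypothetical.
    import requests

    ACCOUNT_ID = "1234567"
    INSERT_KEY = "YOUR_INSIGHTS_INSERT_KEY"

    def record_alert_event(alert_payload: dict) -> None:
        event = {
            "eventType": "AlertNotification",     # assumed custom event type
            "incidentId": alert_payload.get("incident_id"),
            "policyName": alert_payload.get("policy_name"),
            "conditionName": alert_payload.get("condition_name"),
            "state": alert_payload.get("current_state"),
        }
        response = requests.post(
            f"https://insights-collector.newrelic.com/v1/accounts/{ACCOUNT_ID}/events",
            headers={"X-Insert-Key": INSERT_KEY, "Content-Type": "application/json"},
            json=[event],
        )
        response.raise_for_status()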
