Incident orchestration is the alignment of teams, tools, and processes to prepare for incidents and outages in your software. The goal is to give your teams a predictable framework and process to:
- Maximize efficiency in communication and effort.
- Minimize the overall impact to your business.
Before starting this tutorial, be sure to complete these New Relic tutorials:
Recommendation: For each team dashboard, make sure:
- It has an owner who assumes responsibility for the health of the applications and features it monitors.
- There is no ambiguity about who is responsible for attending to and resolving an alert condition.
This policy will vary between organizations depending on size, structure, and culture.
For example, some teams may prefer to assign dashboards and alerts based on de facto feature or application ownership. Other teams may prefer to adopt an on-call rotation (often referred to as "pager duty"). In an on-call rotation, designated team members handle all first-line incident response, and they resolve or delegate incidents based on predetermined incident thresholds.
For each of your applications:
- Identify the thresholds for what is officially considered an incident.
- As you create alert policies with New Relic Alerts, make sure each set of threshold criteria is context-dependent.
- Document incident evaluation and known remediation procedures in runbooks.
- Include links to your runbooks when you define conditions and thresholds for your alert policies.
As an example of context-dependent criteria, the same alert condition may be dismissible during low-traffic periods but require active remediation during peak hours.
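To make the idea of context-dependent thresholds concrete, here is a minimal sketch in Python. It is not how New Relic Alerts evaluates conditions; it only illustrates a team deciding that the same error rate counts as an incident during peak hours but not overnight. The peak window, the threshold values, and the function name `is_incident` are all hypothetical.

```python
from datetime import time

# Hypothetical values chosen for illustration only.
PEAK_START = time(9, 0)     # assumed start of peak traffic
PEAK_END = time(21, 0)      # assumed end of peak traffic
PEAK_THRESHOLD = 0.02       # 2% error rate is an incident during peak hours
OFF_PEAK_THRESHOLD = 0.10   # 10% error rate is tolerated before acting off-peak


def is_incident(error_rate, at):
    """Return True if error_rate crosses the threshold for the time of day."""
    in_peak = PEAK_START <= at < PEAK_END
    threshold = PEAK_THRESHOLD if in_peak else OFF_PEAK_THRESHOLD
    return error_rate >= threshold
```

With these assumed values, a 5% error rate at noon would be treated as an incident, while the same rate at 03:00 would not. Recording this kind of rule in the runbook keeps the decision consistent across responders.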
Recommendation: Make sure that communications during critical incidents take place in easily accessible and highly visible channels. A group chat room dedicated to incident communication is usually a great choice. This allows all stakeholders to participate or observe and provides a chronology of notifications, key decisions, and actions for postmortem analysis.
Use any of the available notification channels in New Relic Alerts. For example, to set up a notification channel in Slack:
- Make sure your organization has completed New Relic's integration requirements with Slack.
- In the Slack app, select the dropdown in the top-left corner of the app, and select Customize Slack.
- Click Configure apps.
- From the list of app integrations, select New Relic.
- Expand the instructions for New Relic Alerts, and follow the steps to configure notifications from New Relic.
Automation of simple or repeatable incident response tasks will increase efficiency and minimize the impact of incidents. With proper automation in place, you can disable or isolate faulty application components as soon as an alert threshold is reached, rather than after a notification has been issued.
For example, a team managing an application for a digital media company wants to be able to remove commenting abilities from the website if the system has errors. In this case, they could:
- Add an endpoint to their front-end web application that will toggle a feature flag enabling or disabling the UI components associated with posting comments on an article.
- Create an alert policy with a threshold set on the sustained error rate in the commenting service.
- Assign a webhook notification channel that sends a POST request to this endpoint, in addition to the standard team notification channels.
In this scenario, errors in the commenting system will trigger the webhook and remove the commenting UI from the website. Users can still use the core functionality of the site without seeing errors generated by the commenting service. The application will maintain a stable but degraded state, allowing the team to focus on recovery without the pressure of an outage that keeps users from accessing the site.
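The feature-flag endpoint described above could look something like the following sketch, written with Python's standard library only. Several details are assumptions rather than part of the tutorial: the endpoint path `/flags/comments`, the in-memory `FEATURE_FLAGS` store, and the idea that the webhook payload carries a `current_state` field with values such as `open` and `closed`. Adjust these to match the payload your webhook channel actually sends.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory flag store. A real application would more likely
# use a shared store or a feature-flag service.
FEATURE_FLAGS = {"comments_enabled": True}


def apply_alert_state(payload, flags):
    """Disable comments while an incident is open; re-enable when it closes."""
    state = payload.get("current_state")  # assumed field name in the payload
    if state == "open":
        flags["comments_enabled"] = False
    elif state == "closed":
        flags["comments_enabled"] = True
    # Any other state (for example "acknowledged") leaves the flag unchanged.
    return flags


class CommentFlagHandler(BaseHTTPRequestHandler):
    """Handles the webhook POST. A real endpoint would also need authentication."""

    def do_POST(self):
        if self.path != "/flags/comments":  # hypothetical endpoint path
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:
            self.send_error(400, "Invalid JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "Expected a JSON object")
            return
        apply_alert_state(payload, FEATURE_FLAGS)
        body = json.dumps(FEATURE_FLAGS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the endpoint until interrupted."""
    HTTPServer(("", port), CommentFlagHandler).serve_forever()
```

To try it locally, call `serve()` and send a POST with a JSON body such as `{"current_state": "open"}` to `/flags/comments`. The front end would then read `comments_enabled` to decide whether to render the commenting UI.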
You can also use webhooks to create issues and action items in ticketing systems that have REST APIs, such as Zendesk. Use New Relic Alerts to create a webhook notification channel, and customize the webhook payload as needed.
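As a rough illustration of customizing a payload for a ticketing system, the sketch below maps alert fields to a ticket request body. The alert field names (`condition_name`, `policy_name`, `severity`, `details`, `incident_url`, `runbook_url`) and the severity values are assumptions about what your webhook payload contains, and the `ticket` / `subject` / `comment` / `priority` shape is assumed from Zendesk's Tickets API; check both against the current documentation before relying on them. The function only builds the body and makes no network call.

```python
# Assumed mapping from alert severity to ticket priority.
PRIORITY_BY_SEVERITY = {"CRITICAL": "urgent", "WARNING": "normal"}


def build_ticket(alert):
    """Build a ticket request body from a dict of alert fields."""
    severity = alert.get("severity", "UNKNOWN")
    condition = alert.get("condition_name", "Unnamed condition")
    policy = alert.get("policy_name", "Unnamed policy")

    # Collect the details responders need, including the runbook link if set.
    lines = [alert.get("details", "No details provided.")]
    if alert.get("incident_url"):
        lines.append("Incident: " + alert["incident_url"])
    if alert.get("runbook_url"):
        lines.append("Runbook: " + alert["runbook_url"])

    return {
        "ticket": {
            "subject": f"[{severity}] {condition} ({policy})",
            "comment": {"body": "\n".join(lines)},
            "priority": PRIORITY_BY_SEVERITY.get(severity, "normal"),
        }
    }
```

Including the runbook link in the ticket body carries the documented remediation steps along with the action item.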