Incident Learning: Retrospectives

Equip teams to learn from incidents and stop recycling problems. You may have heard the term blameless post-mortems; here’s how and why to get there. Create a process for learning from problems that enables the organization to improve existing KPIs and incident response patterns and to adapt when new challenges surface. The goal is to learn first, then fix.

Yes, you have had an outage, but you now have everyone's attention. Take advantage of it.

Prerequisites

Before starting this tutorial, be sure to complete the earlier New Relic tutorials in this series.

1. Establish a Post-Mortem Process

The goal is to identify technical, organizational, and process follow-up actions after each notable incident.

We've found more success by focusing on what happened, rather than who did what. Ask questions like:

  • How were we notified about the problem?
  • Could we have discovered the problem sooner?
  • Was the information we needed to resolve the incident easily accessible?
  • Where did we have humans doing work that computers should have done?

Read more at How and Why to Hold Blameless Retrospectives. Ideally, this process should be the same across teams.

Create a document that includes:

  • The triggering event.
  • Contributing factors that led to the incident.
  • A chronology and summary of remediation steps and their results (be sure to include what went right; these are actions and processes you don't want to lose in the future).
  • A measure of the impact to the business in terms of user experience and financial losses, if possible.
  • Recommendations for system or feature improvements to prevent a recurrence.
  • Recommendations for process and communication improvements.
  • Owners of post-retro actions.
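
If you want to treat these reports as structured data rather than free-form documents, a record type along these lines can keep the fields consistent across teams. This is a minimal sketch, assuming Python; the field names are illustrative, not a New Relic schema.

```python
# A minimal sketch of a retro report as structured data (field names are illustrative).
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class RemediationStep:
    timestamp: datetime
    action: str
    result: str          # capture what went right as well as what failed


@dataclass
class RetroReport:
    incident_date: datetime
    triggering_event: str
    contributing_factors: list[str]
    timeline: list[RemediationStep]
    business_impact: str                  # user experience and, if possible, financial losses
    system_recommendations: list[str]     # changes to prevent a recurrence
    process_recommendations: list[str]    # communication and process improvements
    action_owners: dict[str, str] = field(default_factory=dict)  # follow-up action -> owner
```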

No root cause?

Why is root cause not included in this postmortem report? Root cause implies that it's both possible and useful to find a single cause for an outcome in a complex system. The notion is seductive because we want simple, actionable explanations, but modern systems are too complex for single-cause answers. It is also harmful: it pushes us down one narrow path of options instead of exploring the other paths available, some of which might be far more important and impactful if we broadened our thinking.

Example postmortem report

  • Date: March 1, 2018
  • Executive summary: From approximately 1:45PM until 2:30PM, users could not add items to their carts, which prevented any checkouts from occurring during the incident period.
  • Contributing factors: A change to the CSS rules on the product detail page effectively disabled the Add to cart button.
  • Timeline:
    • 1:50PM: The "Successful checkouts < 10 for 5 minutes" alert triggered and was assigned to Alice.
    • 1:55PM: After reviewing the e-commerce team dashboard, Alice determined that the threshold was breached immediately after a deploy by Bob. She notified him of the incident.
    • 2:00PM: Alice and Bob began troubleshooting and were able to reproduce the issue in production.
    • 2:20PM: Bob determined that his change to the CSS on the product detail page had disabled the Add to cart button. He deployed a hotfix.
    • 2:30PM: Functionality was restored and the incident was resolved.
  • Impact: No checkouts were completed for the duration of the incident. Our typical revenue for a Thursday during this timeframe is $30,000.
  • Recommendations: We have been discussing implementing New Relic Synthetics for a while now. If we had had a Synthetics check on the checkout process, this issue would have been detected immediately. We should also implement more thorough unit tests in the front-end web app.
  • Next steps: Alice will implement Synthetics checks next sprint. Ravi's team will investigate creating more unit tests.

At first, your retros will generate obvious follow-ups: fixing permissions and access, adding missing instrumentation, or tuning certain alerts. Over time, your retros will reveal larger follow-ups, which we call interventions. Before you rush to implement an intervention (such as switching from GUIDs to integers), take time to map out the impact of the change. A major change may ensure the previous incident does not recur, but what other problems are you introducing or risking? Not every incident requires preventative action; it is reasonable to accept a certain level of risk and mitigate the impact instead.

2. Don't Over-Research

Establish guidelines for which incidents require in-depth analysis so the overhead doesn't outweigh the returns. If you had a hardware failure and replaced the hardware, you are probably done. If an incident has so much complexity (or emotional heat) that you may never get to clear follow-up actions, you can choose not to do a formal review. Instead, hold a meeting with an open discussion of events. Take notes, but you don't need to create a retro report.
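
If it helps to make that guideline explicit, you can encode it as a simple check. The sketch below is illustrative only; the severity scale and cutoffs are assumptions you would replace with your own.

```python
# Illustrative guideline: decide whether an incident warrants a full written retro.
# The severity scale (1 = most severe) and the cutoffs below are assumptions, not a standard.
def needs_full_retro(severity: int, duration_minutes: int,
                     customer_facing: bool, novel_failure: bool) -> bool:
    if novel_failure:                        # new failure modes are where the learning is
        return True
    if customer_facing and severity <= 2:    # high-severity incidents with user impact
        return True
    return duration_minutes >= 60            # long incidents usually hide process gaps


# A 20-minute, non-customer-facing hardware swap probably only needs meeting notes.
print(needs_full_retro(severity=3, duration_minutes=20,
                       customer_facing=False, novel_failure=False))  # False
```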

3. Tune Your Monitoring

Audit monitoring and alerting after major incidents to tune KPIs and thresholds, improve time to detection, and reduce alert noise. Review the set proactive alerts pattern. Groom your alerts by:

  1. Measuring the frequency of pages on your teams (see the query sketch after this list).
  2. Using New Relic Alerts incident rollup strategies.
  3. Maintaining policy hygiene. Little things, like using a consistent naming structure, make a big difference.
  5. Leveraging custom instrumentation and alerting on your critical KPIs.
  5. Understanding and using baseline conditions.
  6. Creating runbooks for your conditions. Runbooks for alerts should include:
    • A description of why the alert was created.
    • What the alert is monitoring.
    • What that alert indicates about the state of your system.
    • Initial steps for an on-call engineer to begin triage.
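
For the first item, measuring page frequency, one approach is to query your alert incident data through New Relic's NerdGraph API and facet by policy. This is a rough sketch, assuming a user API key, an account ID, and the NrAiIncident event type; adjust the NRQL to whatever incident data your account actually records.

```python
# Rough sketch: count alert incidents per policy for the last week via NerdGraph NRQL.
# The API key, account ID, and NrAiIncident event type are assumptions; adjust as needed.
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"
API_KEY = "YOUR_USER_KEY"        # placeholder
ACCOUNT_ID = 1234567             # placeholder

NRQL = "SELECT count(*) FROM NrAiIncident FACET policyName SINCE 1 week ago"

QUERY = """
query($accountId: Int!, $nrql: Nrql!) {
  actor {
    account(id: $accountId) {
      nrql(query: $nrql) {
        results
      }
    }
  }
}
"""

response = requests.post(
    NERDGRAPH_URL,
    headers={"API-Key": API_KEY, "Content-Type": "application/json"},
    json={"query": QUERY, "variables": {"accountId": ACCOUNT_ID, "nrql": NRQL}},
    timeout=30,
)
response.raise_for_status()

# Print one row per alert policy with its incident count for the week.
results = response.json()["data"]["actor"]["account"]["nrql"]["results"]
for row in results:
    print(row)
```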

Learn more about grooming and tuning your alerts.

4. Create an Incident Repository

Create a centralized, searchable repository of incident post-mortem documents and other incident artifacts, providing the organization with access to lessons learned. Only a small number of people in your org will probably read these reports directly, but once you are consistent about creating them, you can begin to automate the use of this information.
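
How you build the repository matters less than making it searchable. As a minimal, self-contained sketch (assuming retro reports stored as markdown files and a SQLite build that includes FTS5), you could index and search them like this:

```python
# Minimal sketch: index retro reports (markdown files) into a searchable SQLite FTS5 store.
# Assumes reports live in a retros/ directory and your SQLite build includes FTS5.
import pathlib
import sqlite3

DB_PATH = "incidents.db"


def build_index(retro_dir: str) -> None:
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS retros USING fts5(path, body)")
    conn.execute("DELETE FROM retros")  # rebuild from scratch on every run
    for path in sorted(pathlib.Path(retro_dir).glob("*.md")):
        conn.execute("INSERT INTO retros (path, body) VALUES (?, ?)",
                     (str(path), path.read_text(encoding="utf-8")))
    conn.commit()
    conn.close()


def search(term: str) -> list[tuple[str, str]]:
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        "SELECT path, snippet(retros, 1, '[', ']', '…', 12) "
        "FROM retros WHERE retros MATCH ?", (term,)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    build_index("retros/")
    for path, snippet in search("checkout"):
        print(path, snippet)
```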

Outcomes

As you continue to harvest information from your incidents, you should begin to see better SLO compliance, fewer high-severity incidents, improved developer satisfaction, and lower employee turnover.

For more help

Explore the New Relic Platform.