Incident Response Strategy for Software Teams

Summary

In software, incident response is the process and strategy by which teams respond to alarms and system outages. An alarm signaling that something isn’t working as expected can point to a problem serious enough to take down the entire system. Just as often, we find a system misbehaving without knowing why or what caused it. Responding well to these incidents is how teams build more resilient software for the future.

Despite the many benefits of a healthy incident response strategy, engineering teams often treat it as an afterthought. I have seen this again and again. Poor incident response management tends to leave engineers stressed and overworked, unable to remediate issues quickly, and never able to resolve them for the long term. In my experience, though, two non-technical, easy-to-deploy solutions can make a world of difference.

Postmortem Agenda

When to use: After an incident, in a postmortem meeting that all relevant stakeholders attend.

How to use:

  • Choose a moderator - someone who is familiar with the systems in question but was not deeply involved in the incident

  • Create a blameless atmosphere and expectations

  • Follow the agenda, timeboxing topics if necessary

  • Stick to the facts

  • Walk away with actionable items that can prevent the incident from happening again or make it less severe

Template:

  • Date of outage (or range)

  • Timeline of Events

  • Facts

  • Reason (i.e. why was this change made in the first place?)

  • Actions - actions taken to resolve the issue

  • Preventions and solutions

  • Metrics

    • Time to recovery

    • Recovery point (i.e. data loss if any)

    • Impact (e.g. revenue, customers)
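
The template above can also be kept as a small structured record, which makes a metric like time to recovery easy to compute consistently. The following is only a minimal sketch in Python: the class name, field names, and sample timestamps are all hypothetical and simply mirror the template's headings.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class PostmortemRecord:
    """Hypothetical record whose fields mirror the postmortem template."""

    outage_start: datetime
    outage_end: datetime
    # Each timeline entry is (timestamp, what happened).
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)
    facts: List[str] = field(default_factory=list)
    reason: str = ""
    actions: List[str] = field(default_factory=list)
    preventions: List[str] = field(default_factory=list)
    recovery_point: str = ""  # description of any data loss
    impact: str = ""          # e.g. revenue or customers affected

    def time_to_recovery(self) -> timedelta:
        """Time from the start of the outage until service was restored."""
        return self.outage_end - self.outage_start

    def sorted_timeline(self) -> List[Tuple[datetime, str]]:
        """Timeline entries in chronological order."""
        return sorted(self.timeline, key=lambda entry: entry[0])


# Made-up sample values, for illustration only.
record = PostmortemRecord(
    outage_start=datetime(2024, 1, 15, 2, 10),
    outage_end=datetime(2024, 1, 15, 2, 55),
    timeline=[
        (datetime(2024, 1, 15, 2, 40), "Rollback started"),
        (datetime(2024, 1, 15, 2, 12), "Alarm fired"),
    ],
)
print(record.time_to_recovery())  # 0:45:00
```

Sorting the timeline by timestamp lets attendees add events in whatever order they remember them while the meeting still reviews them chronologically.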

Alarm Runbook Templates

What it is: An alarm runbook details how someone should respond when a specific alarm goes off. The on-call engineer who sees the alarm in the middle of the night may not have the experience to resolve the underlying issue, in which case they need to know what to try and how to escalate if additional help is needed.

When to use: Every alarm should have a runbook attached to it.

How to use:

  • Create a runbook for every individual alarm

  • Have a mechanism that delivers or links to the runbook when the alarm goes off
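
One possible way to wire up both points is a lookup from alarm name to runbook location, plus a check that flags alarms with no runbook. This is a rough sketch in Python under stated assumptions: the base URL, alarm names, file names, and function names are all hypothetical, and a real setup would more likely use the alerting tool's own annotation or link fields.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical location where runbooks are published.
RUNBOOK_BASE_URL = "https://example.com/runbooks"

# Hypothetical alarm names mapped to their runbook files.
RUNBOOKS: Dict[str, str] = {
    "api-high-error-rate": "api-high-error-rate.md",
    "db-disk-space-low": "db-disk-space-low.md",
}


def runbook_url(alarm_name: str) -> Optional[str]:
    """Return the runbook link for an alarm, or None if it has none."""
    filename = RUNBOOKS.get(alarm_name)
    if filename is None:
        return None
    return f"{RUNBOOK_BASE_URL}/{filename}"


def format_alert(alarm_name: str) -> str:
    """Build the alert text, including the runbook link when one exists."""
    url = runbook_url(alarm_name)
    if url is None:
        return f"ALARM {alarm_name}: no runbook found, escalate for help"
    return f"ALARM {alarm_name}: runbook at {url}"


def alarms_missing_runbooks(alarm_names: Iterable[str]) -> List[str]:
    """List the alarms that do not yet have a runbook attached."""
    return [name for name in alarm_names if name not in RUNBOOKS]


print(format_alert("db-disk-space-low"))
print(alarms_missing_runbooks(["api-high-error-rate", "queue-backlog"]))
```

Running a check like `alarms_missing_runbooks` whenever alarms are added or changed is one way to catch an alarm that ships without a runbook.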

Template:

Markdown linked on GitHub