Incident Playbook
- David Peček
- Jan 26, 2020
Updated: Sep 12, 2020

When a major problem arises in your organization, what process do you follow to handle it? This guide is intended as a reference for quick triage steps and for the communications to send out during the event. It is important to have a practiced and coordinated response to each incident. This instills confidence that your operations team can resolve customer-facing crises quickly.
Practice incident response, teach incident roles, and involve only the people you need.
Team Roles
The operations team that handles incidents should be clearly defined beforehand, and those involved should know their roles. This way, when called to action, they know immediately what to do. If you have shift work, you may need to designate several people for each role, depending on who is working when. Note that these roles follow Google's approach to incident response.
Communications lead. Designate someone to use the status page notification system to write up what is wrong, assess the impact, and then notify customers and the company.
Technical lead / subject matter expert. This should be a person or persons who are technically competent in most areas of the company and who can act as the subject matter expert to quickly triage and diagnose the cause of the incident.
Manager. This role is intended more as a coordinator who ensures everyone has what they need and who communicates with people outside of this team for additional resourcing.
Person who escalated. The individual who raised the escalation should be present and available at the beginning of the process to ensure the team is clear on what the issue is. Their presence also signals that they understand how important this is.
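As a rough illustration, the roles above can be kept in a small per-shift roster so that gaps are visible before an incident happens. This is only a sketch: the shift names, the people, and the `missing_roles` and `assemble_team` helpers are all hypothetical and not part of any particular tool.

```python
from enum import Enum


class Role(Enum):
    COMMUNICATIONS_LEAD = "communications lead"
    TECHNICAL_LEAD = "technical lead"
    MANAGER = "manager"
    ESCALATOR = "person who escalated"


# Roles that must be staffed on every shift. The escalator is only
# known once an incident is raised, so it is not part of the roster.
STANDING_ROLES = (Role.COMMUNICATIONS_LEAD, Role.TECHNICAL_LEAD, Role.MANAGER)

# Hypothetical roster: shift name -> role -> person designated for it.
ROSTER = {
    "day": {
        Role.COMMUNICATIONS_LEAD: "alice",
        Role.TECHNICAL_LEAD: "bob",
        Role.MANAGER: "carol",
    },
    "night": {
        Role.COMMUNICATIONS_LEAD: "dan",
        Role.TECHNICAL_LEAD: "erin",
    },
}


def missing_roles(roster):
    """Return {shift: [unstaffed standing roles]} for shifts with gaps."""
    gaps = {}
    for shift, assignments in roster.items():
        missing = [role for role in STANDING_ROLES if not assignments.get(role)]
        if missing:
            gaps[shift] = missing
    return gaps


def assemble_team(roster, shift, escalated_by):
    """Build the incident team for a shift, adding the person who escalated."""
    team = dict(roster[shift])
    team[Role.ESCALATOR] = escalated_by
    return team
```

Running `missing_roles(ROSTER)` on this sample data reports that the night shift has no manager, which is the kind of gap you want to find during practice rather than during an outage.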
When to Bring in Others
I would steer clear of the dog pile approach to emergency management. The more people you have involved in an incident, the more the process and the path to an outcome can become muddled. It's best to stick with the core roles listed above. If needed, the incident manager can reach out to others in the company for assistance. Train employees to know that when this team assembles during an incident, they should stop what they are doing and help out if called upon.
Play by Play
These steps are intended as a guideline to run through each time, so that you handle each aspect of the incident correctly and don't have to worry about forgetting anything.
Assess the situation. The manager should meet with the person who escalated the issue to review and understand the data from the escalation and quantify the impact. In this meeting they should figure out what kind of incident this is: major or regular.
Engage the team. Based on your SLAs, for a major incident the manager should engage the rest of the team immediately. For a regular incident, schedule a meeting to go over the problem within the initial SLA requirements.
Communicate. Now it is time for the communications lead to send out notice to the company once the problem and its impact have been clearly defined.
Restore or correct. The technical lead should now do what is needed to restore the outage or correct the data. If this is not possible, write up the needed development ticket in this step and hand it off to the development teams.
Communicate. At this point the communications lead should send out another notice to the company stating either that the situation has been resolved or what development steps are required to correct it.
Post mortem. Once the incident is handled and while the details are still fresh in everyone's minds, schedule a meeting within a day to review the details and decide on next steps to avoid a repeat.
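The steps above can be sketched as a checklist generator, which may be handy if you want to post the run-through into a chat channel or ticket when an incident opens. This is a minimal sketch under assumptions: the `Step` type and `build_checklist` function are hypothetical names, and the two flags simply mirror the two decisions described in the steps (major versus regular, and whether the technical lead can restore service directly).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    owner: str
    action: str


def build_checklist(major, can_restore):
    """Return the ordered play-by-play steps for one incident."""
    # A major incident pulls the team in right away; a regular one
    # gets a meeting inside the initial SLA window.
    engage = (
        "engage the rest of the team immediately"
        if major
        else "schedule a meeting within the initial SLA requirements"
    )
    # If a direct fix is not possible, the work becomes a ticket.
    fix = (
        "restore the outage or correct the data"
        if can_restore
        else "write up the development ticket and hand it off"
    )
    follow_up = (
        "notify the company that the situation is resolved"
        if can_restore
        else "notify the company of the development steps required"
    )
    return [
        Step("Assess", "manager", "review the escalation and quantify the impact"),
        Step("Engage", "manager", engage),
        Step("Communicate", "communications lead", "send the initial notice"),
        Step("Restore or correct", "technical lead", fix),
        Step("Communicate", "communications lead", follow_up),
        Step("Post mortem", "manager", "schedule a review meeting within a day"),
    ]


if __name__ == "__main__":
    for number, step in enumerate(build_checklist(major=True, can_restore=False), 1):
        print(f"{number}. {step.name} ({step.owner}): {step.action}")
```

Assigning the post mortem to the manager is a choice made here for illustration; the steps themselves do not say who schedules it.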