Blameless Post-Mortems: How to Quickly Move on From IT Incidents

“prepare for a future where we’re as stupid as we were today.” -The DevOps Handbook

Like most startups, Cherre wants to push the boundaries of what the company can achieve together in software development. Working according to DevOps best practices has always helped us move toward our organizational aims, though the process has not always been smooth.

As part of our ongoing DevOps work, we continually look for ways to better assess and formalize our operations. One result was the decision to adopt blameless post-mortems to help us analyze development incidents. Thankfully, this is an anticipatory move rather than a reactive one, which is not always the case. As the saying goes, “Hope for the best, prepare for the worst.”

Following IT incidents, blameless post-mortems help teams understand why issues happen and how to better prepare systems for the future. Any action, viewed from different perspectives, can be judged a success or a failure; the verdict often depends entirely on the outcome. Concentrating on the action itself and treating the person behind it as a perpetrator won’t help teams discover how the incident happened. Such a response only alienates the team member involved, makes them feel judged, and leaves them less inclined to examine the influences behind their decisions. If anything, it will push them to get out of the meeting as quickly as possible.

Rather than scrambling after an incident to set up guidelines and debate how a post-mortem should be led, Cherre has outlined a set of best-practice guidelines to follow should anything happen. These guidelines support DevOps cultural practices and encourage everyone to attend post-mortem meetings without the fear of finger-pointing looming over them.

The Benefits of Blameless Post-Mortems:
Post-mortems are the ultimate tool for learning and growing from IT incidents.

  • By focusing on the timeline, teams can reconstruct the past as closely as possible to determine where the system failed. The open, welcoming, ‘blameless’ atmosphere makes team members more willing to recall the incident as fully and accurately as they can.
  • Make post-mortems open to everyone; they are, after all, intended to be learning events for the organization as well as the team involved.
  • Each post-mortem should open with a reminder that the meeting is not an attempt to apportion blame, but to help the organization take steps to prevent the issue from recurring. As Benjamin Franklin stated, “An ounce of prevention is worth a pound of cure.”
  • Written timelines and minutes of steps taken will provide future records to help others learn from previous incidents. Furthermore, such documentation will aid in tracking the actions assigned and the steps that have been taken to prevent further issues.
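
The written timeline and tracked action items described above could be kept in a small structured record. The following is only a minimal sketch in Python, not Cherre’s actual tooling: the names `PostMortem`, `TimelineEntry`, `ActionItem`, `record`, and `open_actions` are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One observed event: what happened and when, without naming a culprit."""
    at: datetime
    description: str


@dataclass
class ActionItem:
    """A follow-up step assigned during the post-mortem (hypothetical shape)."""
    description: str
    owner: str
    done: bool = False


@dataclass
class PostMortem:
    """A written record of one incident: its timeline and follow-up actions."""
    title: str
    timeline: list = field(default_factory=list)
    actions: list = field(default_factory=list)

    def record(self, at: datetime, description: str) -> None:
        # Entries may be recalled out of order during the meeting,
        # so keep the timeline sorted chronologically after each addition.
        self.timeline.append(TimelineEntry(at, description))
        self.timeline.sort(key=lambda entry: entry.at)

    def open_actions(self) -> list:
        # Action items that still need follow-up.
        return [action for action in self.actions if not action.done]
```

Sorting on every insert keeps the record chronological even when attendees recall events out of order, and `open_actions` gives a quick view of which preventative steps are still outstanding.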

We’ve decided to adopt this practice in order to instill better procedures that encourage learning and improve the working culture for our team. Rather than waiting for an issue to occur, Cherre advises setting up a formalized process in advance for learning from incidents. Check out the attached PDF to see Cherre’s Post-Mortem Cheat Sheet.