Why Most Incident Postmortems Miss the Mark and How to Fix Them
This article covers three common pitfalls in day-to-day incident postmortems: overlooking minor failures, confusing root causes with triggers, and weak improvement actions. It offers practical steps, such as the 5 Whys method and two essential corrective actions, to genuinely reduce production outages.
Last year a lengthy article titled “The Essence and Underlying Logic of Fault Postmortems” explained the theory behind postmortems; today we focus on three pitfalls that frequently appear in everyday work, especially as online services face increasing instability.
Pitfall 1: Not Only Severe Incidents Deserve a Postmortem
The core logic of postmortems is “prevent small issues from becoming big,” echoing the well‑known accident triangle (Heinrich's law): behind every severe accident there are roughly 29 minor ones and 300 trivial ones. Teams often conduct thorough postmortems only for high‑profile incidents that attract attention, while routine small failures are ignored or reviewed superficially.
Postmortems should be performed for every incident, just as a Go player replays every game to improve. For small incidents they can be lightweight and internal to the team, and need not be presented to management. For a concise template, the Google SRE postmortem checklist is recommended.
Pitfall 2: Root Cause Analysis Must Reach the Real Root
The most effective technique is the 5 Whys: keep asking “why” until you reach the deepest cause that can actually be fixed. It is crucial to separate the true root cause from the trigger. For example, a traffic surge (the trigger) overloads the database, which slows the microservices and freezes the app; the root cause is the database’s inadequate performance under load, not the traffic spike itself.
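As a rough illustration, a 5 Whys analysis can be recorded as an ordered chain that keeps the trigger separate from the answers to each "why". The sketch below is only one possible way to model it; the `WhyChain` class and its method names are hypothetical, and the example answers simply restate the traffic-surge scenario above.

```python
from dataclasses import dataclass, field


@dataclass
class WhyChain:
    """One 5 Whys record: the trigger is stored apart from the causes."""
    symptom: str                      # what users observed
    trigger: str                      # the event that set the incident off
    causes: list[str] = field(default_factory=list)  # answers to each "why"

    def ask_why(self, answer: str) -> "WhyChain":
        """Record the answer to the next 'why' and allow chaining."""
        self.causes.append(answer)
        return self

    def root_cause(self) -> str:
        """Return the deepest recorded cause; the trigger never qualifies."""
        if not self.causes:
            raise ValueError("no 'why' answered yet; a trigger is not a root cause")
        return self.causes[-1]


# Hypothetical record of the example incident from the text.
chain = WhyChain(symptom="app freezes", trigger="traffic surge")
chain.ask_why("microservices respond slowly")
chain.ask_why("the database is overloaded")
chain.ask_why("the database does not perform well enough under load")

print(f"trigger:    {chain.trigger}")
print(f"root cause: {chain.root_cause()}")
```

Keeping `trigger` as its own field makes it harder to close the analysis with "traffic spiked" as the answer: the record is only complete once at least one deeper cause has been written down.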
Pitfall 3: Improvement Actions Are Inadequate
Typical improvement plans mix short‑term fixes for rapid recovery with long‑term projects that often stall. Two actions are essential:
If the issue cannot be fully eliminated, define the fastest way to mitigate future occurrences.
Define how to completely prevent the same problem from happening again.
Additional measures can be deprioritized.
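The two essential actions above can be turned into a simple completeness check on a postmortem's action list. This is a minimal sketch under stated assumptions: the `ActionKind` and `Action` types, the `missing_essentials` helper, and the sample actions are all hypothetical, and the sketch treats both action types as mandatory for every postmortem.

```python
from dataclasses import dataclass
from enum import Enum


class ActionKind(Enum):
    MITIGATE = "mitigate"  # fastest way to recover if the issue recurs
    PREVENT = "prevent"    # stops the same problem from happening again
    OTHER = "other"        # additional measure, can be deprioritized


@dataclass
class Action:
    kind: ActionKind
    description: str


def missing_essentials(actions: list[Action]) -> set[ActionKind]:
    """Return the essential action kinds that the plan does not cover."""
    required = {ActionKind.MITIGATE, ActionKind.PREVENT}
    present = {action.kind for action in actions}
    return required - present


# Hypothetical plan for the database-overload example.
plan = [
    Action(ActionKind.MITIGATE, "document a runbook for shedding load quickly"),
    Action(ActionKind.PREVENT, "load-test the database and fix the bottleneck"),
    Action(ActionKind.OTHER, "tidy up dashboards"),
]

gaps = missing_essentials(plan)
print("plan is complete" if not gaps else f"missing: {sorted(k.value for k in gaps)}")
```

A check like this could run when a postmortem is submitted for review, so that a plan consisting only of long-term projects is flagged before it stalls.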
References: Accident Triangle, 5 Whys, Google postmortem checklist.
Consistent, serious postmortems are one of the most important methods for reducing online failures.
Tech Architecture Stories
Internet tech practitioner sharing insights on business architecture, technology, and a lifelong love of tech.
