Mastering Fault Postmortems: Proven Methods to Boost System Reliability
This article explains the essence, purpose, and step‑by‑step process of fault postmortems—including preparation, root‑cause analysis, improvement actions, and decision making—while covering PDCA and GRIA methodologies, industry examples, MTTR/MTBF metrics, and practical templates for lasting reliability.
Origin and Business Use of Postmortems
The term comes from the game of Go, where a postmortem means replaying a finished game to evaluate each move. In business and finance, it describes a detailed review after a project or transaction to identify what went well and what needs to improve.
Companies such as Lenovo, Google, Amazon, Toyota, and GE have adopted postmortems in their management practices.
Methodologies
PDCA (Plan‑Do‑Check‑Act)
PDCA is a four‑stage continuous‑improvement cycle popularized by Dr. W. Edwards Deming, building on earlier work by Walter Shewhart.
Plan: Define goals, understand the current state, collect data, and design solutions.
Do: Execute the plan on a small scale to test its effectiveness.
Check: Measure results against objectives and evaluate the execution.
Act: If successful, roll out broadly; otherwise, return to the planning stage.
The PDCA cycle emphasizes iterative improvement and disciplined problem solving.
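Read as a loop, the four stages repeat until the check passes. The following is only a minimal sketch: the function `pdca`, its `plan` and `pilot` callables, and the example numbers are illustrative assumptions, not part of any standard PDCA definition.

```python
from typing import Callable, Tuple


def pdca(plan: Callable[[int], float],
         pilot: Callable[[float], float],
         target: float,
         max_cycles: int = 5) -> Tuple[int, float]:
    """Repeat Plan-Do-Check-Act until the pilot result meets the target.

    Returns (cycles_used, adopted_setting). Lower results are better here.
    """
    for cycle in range(1, max_cycles + 1):
        setting = plan(cycle)        # Plan: design a candidate change
        result = pilot(setting)      # Do: try it on a small scale
        if result <= target:         # Check: measure against the objective
            return cycle, setting    # Act: success, roll out broadly
        # Act: not good enough, go back to planning with what was learned
    raise RuntimeError("target not met within max_cycles")


# Hypothetical example: shorten an alert evaluation window (seconds) until
# the detection delay observed in a pilot drops to 10 seconds or less.
cycles, window = pdca(
    plan=lambda cycle: 60 / cycle,        # try 60s, 30s, 20s, ...
    pilot=lambda window: window * 0.5,    # assumed: delay is half the window
    target=10,
)
print(cycles, window)  # 3 20.0
```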
GRIA (Goal‑Result‑Analysis‑Insight)
GRIA is a systematic reflection framework:
Goal: State the specific objective.
Result: Compare actual outcomes with the goal.
Analysis: Dig into why the gap exists.
Insight: Derive learnings and actionable recommendations.
Combining PDCA and GRIA
Many teams merge PDCA and GRIA to (1) extract valuable lessons and (2) plan more efficient future actions.
Fault Postmortem in Different Domains
Aviation Safety
The aircraft “black box” (actually bright orange) consists of a flight‑data recorder and a cockpit‑voice recorder. Systematic postmortems of crashes are credited with cutting accident rates from a reported 12% in the 1960s to less than 0.001% today, making commercial aviation one of the safest modes of transport.
Industrial Accident Theory: Accident Triangle
The accident triangle (Heinrich’s triangle) proposes that severe accidents, minor incidents, and near‑misses occur in a roughly stable ratio, so reducing minor incidents and near‑misses should also reduce severe accidents.
Software‑Engineering Fault Management
Faults directly affect service availability. Key reliability metrics include:
MTTF – Mean Time To Failure
MTBF – Mean Time Between Failures
MTTR – Mean Time To Recovery
MTTI – Mean Time To Identify
MTTK – Mean Time To Know (locate)
MTTV – Mean Time To Verify
MTTM – Mean Time To Mitigate
Availability can be expressed as: Availability = 1 – (MTTR / (MTTR + MTBF))
Improving reliability involves increasing MTBF (preventing faults) and decreasing MTTR (faster recovery).
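A short worked example makes the trade‑off concrete. This is a sketch under assumed numbers (a 720‑hour MTBF and a 1‑hour MTTR are invented for illustration); the function name `availability` is likewise hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = 1 - MTTR / (MTTR + MTBF), i.e. MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return 1 - mttr_hours / (mttr_hours + mtbf_hours)


# Assumed baseline: one fault roughly every 30 days, one hour to recover.
baseline = availability(mtbf_hours=720, mttr_hours=1)
# Two ways to improve: recover twice as fast, or fail half as often.
faster_recovery = availability(mtbf_hours=720, mttr_hours=0.5)
fewer_faults = availability(mtbf_hours=1440, mttr_hours=1)

print(f"baseline:        {baseline:.4%}")         # 99.8613%
print(f"faster recovery: {faster_recovery:.4%}")  # 99.9306%
print(f"fewer faults:    {fewer_faults:.4%}")     # 99.9306%
```

With these numbers, halving MTTR and doubling MTBF give the same availability, because only the ratio of the two enters the formula. Which lever is cheaper to pull depends on the system.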
Goals of a Fault Postmortem
Trace the root cause and understand the full incident timeline.
Identify concrete improvement actions to prevent recurrence.
Accelerate detection, localization, mitigation, and recovery.
Raise awareness and share learnings across the organization.
Postmortem Culture
Postmortems are routine and not reserved for major incidents only.
Leadership must participate to foster a blameless environment.
Teams should maintain a respectful attitude toward production systems.
Culture is built gradually and must be passed on deliberately as staff turn over.
Truth‑seeking is essential; avoiding cover‑ups makes the process valuable.
Responsibility Assignment
Assigning blame can hinder learning. Focus instead on whether the incident served as a warning and led to improvement, regardless of personal fault. Where responsibility does need to be classified, two kinds are distinguished:
Subjective responsibility (e.g., ignoring processes, releasing without testing).
Objective responsibility (e.g., extreme load spikes, architectural limits).
Postmortem Process
Preparation
Define impact scope with quantitative data (business metrics, user feedback, availability impact, revenue loss) and construct a complete timeline covering time, place, people, event, cause, and result.
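Once the timeline has timestamps for its key events, the per‑phase durations fall out directly. The sketch below assumes one possible set of phase boundaries (teams define these differently); the class `IncidentTimeline`, its field names, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class IncidentTimeline:
    started: datetime     # fault begins affecting users
    detected: datetime    # alert fires or a user reports the problem
    located: datetime     # the faulty component or change is pinpointed
    mitigated: datetime   # user impact is stopped (e.g. rollback, failover)
    recovered: datetime   # service fully restored and verified

    def phases(self) -> Dict[str, timedelta]:
        """Split the incident into the phases that the MTTx metrics average over."""
        return {
            "time_to_identify": self.detected - self.started,
            "time_to_know": self.located - self.detected,
            "time_to_mitigate": self.mitigated - self.located,
            "time_to_recover": self.recovered - self.started,  # end to end
        }


# Invented sample incident.
timeline = IncidentTimeline(
    started=datetime(2024, 1, 1, 10, 0),
    detected=datetime(2024, 1, 1, 10, 5),
    located=datetime(2024, 1, 1, 10, 25),
    mitigated=datetime(2024, 1, 1, 10, 32),
    recovered=datetime(2024, 1, 1, 10, 50),
)
for name, duration in timeline.phases().items():
    print(f"{name}: {duration.total_seconds() / 60:.0f} min")
```

Averaging each phase over many incidents gives the corresponding MTTx value and shows which phase dominates recovery time.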
Root‑Cause Analysis
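Common techniques:
5‑Why analysis.
Tracing the origin.
Identifying trigger factors.
Considering both human and technical causes.
Avoiding single‑point root causes.
Example (Toyota): a machine stopped because a fuse blew under overload. Asking “why” repeatedly traced the overload to poorly lubricated bearings, then to a lubrication pump that was not pumping enough, then to a worn pump shaft, and finally to the root cause: no filter was installed, so metal shavings got into the pump.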
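A 5‑Why chain can be recorded as an ordered list of question and answer pairs, which keeps the reasoning reviewable in the postmortem document. The sketch below uses the commonly retold version of the Toyota machine example; the `Why` type and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Why:
    question: str
    answer: str


def root_cause(chain: List[Why]) -> str:
    """The last answer in the chain is the candidate root cause."""
    if not chain:
        raise ValueError("a 5-Why chain needs at least one step")
    return chain[-1].answer


def render(chain: List[Why]) -> str:
    """Format the chain as numbered lines for a postmortem document."""
    return "\n".join(
        f"{i}. {step.question} -> {step.answer}"
        for i, step in enumerate(chain, start=1)
    )


chain = [
    Why("Why did the machine stop?", "A fuse blew because of an overload."),
    Why("Why was there an overload?", "The bearings were poorly lubricated."),
    Why("Why were they poorly lubricated?", "The pump was not pumping enough."),
    Why("Why was it not pumping enough?", "The pump shaft was worn."),
    Why("Why was the shaft worn?", "No filter was installed, so metal shavings got in."),
]
print(render(chain))
print("Root cause:", root_cause(chain))
```

Five is a convention, not a rule: stop when the answer points to something the team can actually change, and branch the chain when a step has more than one contributing cause.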
Improvement Measures
Improvement actions should follow the SMART criteria (Specific, Measurable, Achievable, Relevant, Time‑bound) and target:
Preventing recurrence (increase MTBF).
Reducing MTTR by speeding up detection (lower MTTI), localization (lower MTTK), and mitigation (lower MTTM).
Embedding automated safeguards to reduce human error.
Recording actions in a tracking system (e.g., JIRA) for regular follow‑up.
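Parts of the SMART check can be automated before an action item is accepted. The sketch below covers only the mechanically checkable parts (a metric, an owner, a due date, a tracking ticket); whether an item is specific, achievable, and relevant still needs human review. The `ActionItem` fields and the ticket key `OPS-123` are hypothetical and do not correspond to any real tracker's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None     # person or team accountable
    metric: Optional[str] = None    # how completion or effect is measured
    due: Optional[date] = None      # deadline
    ticket: Optional[str] = None    # key in the tracking system


def smart_gaps(item: ActionItem) -> List[str]:
    """Return the checkable SMART gaps; an empty list means none were found."""
    gaps = []
    if not item.metric:
        gaps.append("measurable: no metric defined")
    if not item.owner:
        gaps.append("ownership: no owner assigned")
    if not item.due:
        gaps.append("time-bound: no due date")
    if not item.ticket:
        gaps.append("tracking: no ticket recorded")
    return gaps


vague = ActionItem(title="Improve monitoring")
concrete = ActionItem(
    title="Add disk-usage alert at 80% on database hosts",
    owner="storage-team",
    metric="alert fires during a staging drill",
    due=date(2024, 3, 1),
    ticket="OPS-123",
)
print(smart_gaps(vague))     # four gaps
print(smart_gaps(concrete))  # []
```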
Decision & Follow‑Up
Decisions are reached in a blameless postmortem meeting and documented with clear consensus on actions, owners, and deadlines. Follow‑up includes periodic review of action completion rates and trend analysis of incident frequency, MTTR, and MTBF.
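The follow‑up numbers can be computed from a simple incident log. This is a sketch under stated assumptions: each incident is a (fault start, full recovery) pair, MTBF is taken as the mean uptime between one recovery and the next fault, and all names and sample dates are invented.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (fault start, full recovery)


def mean(deltas: List[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from fault start to full recovery."""
    if not incidents:
        raise ValueError("need at least one incident")
    return mean([end - start for start, end in incidents])


def mtbf(incidents: List[Incident]) -> timedelta:
    """Mean uptime between one recovery and the next fault."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents)
    return mean([nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])])


def completion_rate(done: int, total: int) -> float:
    """Share of postmortem action items that have been closed."""
    return done / total if total else 1.0


incidents = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 1, 0)),
    (datetime(2024, 1, 11, 1, 0), datetime(2024, 1, 11, 3, 0)),
    (datetime(2024, 1, 31, 3, 0), datetime(2024, 1, 31, 3, 30)),
]
print("MTTR:", mttr(incidents))              # 1:10:00
print("MTBF:", mtbf(incidents))              # 15 days, 0:00:00
print("Actions closed:", completion_rate(3, 4))  # 0.75
```

Running the same calculation per quarter gives the trend lines for the periodic review.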
Long‑Term Knowledge Accumulation
Postmortem documents should be indexed in a searchable knowledge base (wiki) to create an incident‑response handbook and enable regular analysis of improvement effectiveness.
Google Postmortem Template
Google’s SRE workbook provides a concise postmortem template with five sections: data collection, root‑cause analysis (using 5‑Whys), good practices, action items, and a review that checks for blame‑free language.
Key Takeaways
Fault postmortems examine each MTTx stage of an incident (identify, locate, mitigate, recover) and drive improvements.
Four core objectives: root‑cause tracing, improvement, prevention, and learning.
Four phases: preparation, root‑cause analysis, improvement actions, and decision making.
Adopt PDCA and GRIA, use SMART for actions, and maintain a blameless culture.
Tech Architecture Stories
Internet tech practitioner sharing insights on business architecture, technology, and a lifelong love of tech.
