Mastering Fault Postmortems: Proven Methods to Boost System Reliability
This comprehensive guide explains the origins, methodologies, and practical steps of fault postmortems—including PDCA, GRIA, aviation safety lessons, industrial accident theory, and software reliability metrics—to help teams systematically investigate incidents, derive actionable improvements, and continuously enhance system availability.
Origin and Application in Enterprise Management
The postmortem review (fupan) originated as a term in the game of Go, where players replay a completed game to assess each move. Business and finance later adopted the concept to analyze projects or transactions after completion and identify strengths and areas for improvement. Lenovo first introduced the practice into corporate culture in China, and companies such as Google, Amazon, Toyota, and GE now use structured postmortem processes.
Methodology
PDCA
PDCA (Plan‑Do‑Check‑Act), also known as the Deming Cycle, is a four‑stage iterative model for continuous improvement.
Plan : Define goals, understand the current state, collect data, and devise solutions.
Do : Execute the plan in a controlled environment, often as a small‑scale test.
Check : Gather data and evaluate whether objectives were met.
Act : If successful, roll out broadly; otherwise, return to planning.
PDCA aims to improve processes, products, or services through disciplined, repeatable cycles.
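The four stages can be sketched as a loop. This is a minimal illustration rather than a prescribed implementation: `run_pdca`, the candidate plan names, and the `trial` callback are all hypothetical, and the sketch assumes a single numeric metric where lower is better.

```python
from typing import Callable, Optional


def run_pdca(
    target: float,
    candidates: list[str],
    trial: Callable[[str], float],
) -> tuple[Optional[str], list[tuple[str, float, bool]]]:
    """Run one PDCA cycle per candidate plan until the target is met.

    Assumes a single metric where lower is better (e.g. minutes to recover).
    """
    history = []
    for plan in candidates:            # Plan: choose the next candidate solution
        measured = trial(plan)         # Do: run it as a small-scale test
        goal_met = measured <= target  # Check: compare the data with the goal
        history.append((plan, measured, goal_met))
        if goal_met:                   # Act: success, so roll out broadly
            return plan, history
        # Act: otherwise return to planning with the next candidate
    return None, history


# Hypothetical trial results, in minutes, for two candidate changes.
fake_results = {"add-runbook": 45.0, "auto-rollback": 20.0}
adopted, history = run_pdca(30.0, list(fake_results), fake_results.__getitem__)
print(adopted)  # auto-rollback
```

Keeping the full `history` rather than only the winning plan preserves the data from failed cycles, which is what the next Plan stage needs.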
GRIA
GRIA (Goal‑Result‑Analysis‑Insight, more commonly abbreviated GRAI) provides a systematic reflection framework.
Goal : Clarify specific objectives for the project or task.
Result : Assess actual outcomes versus expectations.
Analysis : Investigate reasons for gaps, using data, process review, and decision analysis.
Insight : Derive learnings and actionable improvements.
GRIA helps teams continuously refine their goals and processes.
Combining PDCA and GRIA
Many teams combine the two: GRIA structures the retrospective analysis and extracts actionable lessons, while PDCA turns those lessons into concrete actions and verifies them through repeated cycles.
Fault Postmortem
Aviation Safety and Postmortem
The aircraft “black box” (the flight data recorder and cockpit voice recorder) is central to aviation incident analysis. Systematic investigation of recorder data after each accident has contributed to the dramatic fall in accident rates since the 1960s.
The Airbus 2021 Commercial Aviation Accident Statistics Report shows a clear decline in fatal accident rates per million flight hours since the 1960s.
Industrial Accident Triangle
The accident triangle (Heinrich’s triangle) illustrates that reducing minor incidents lowers the likelihood of severe accidents, forming a foundational safety theory.
Software Engineering Fault Management
Faults impact service availability; key metrics include MTTR, MTBF, MTTI, MTTK, MTTF, MTTV, and MTTM. An availability target translates directly into allowable downtime: 99.9% uptime permits roughly 43 minutes of downtime in a 30-day month.
Definitions
MTTR – Mean Time to Recovery
MTBF – Mean Time Between Failures
MTTI – Mean Time to Identify
MTTK – Mean Time to Know
MTTF – Mean Time to Fix (elsewhere, MTTF often stands for Mean Time To Failure)
MTTV – Mean Time to Verify
MTTM – Mean Time to Mitigate
MTTR = MTTI + MTTK + MTTF + MTTV
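The arithmetic behind these metrics is easy to script. The sketch below is illustrative only: the function names, the per-incident field names (`identify`, `know`, `fix`, `verify`), and the sample durations are assumptions, not part of any standard tool. It converts an availability target into a downtime budget and applies the MTTR decomposition given above.

```python
from statistics import mean

MINUTES_PER_DAY = 24 * 60
PHASES = ("identify", "know", "fix", "verify")  # MTTI, MTTK, MTTF, MTTV


def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Allowable downtime, in minutes, for a target over a window of `days`."""
    return (1 - availability_pct / 100) * days * MINUTES_PER_DAY


def mttr_breakdown(incidents: list[dict]) -> dict:
    """Mean minutes per phase, plus MTTR as the sum of the four phase means."""
    means = {phase: mean(i[phase] for i in incidents) for phase in PHASES}
    means["mttr"] = sum(means[phase] for phase in PHASES)
    return means


# Hypothetical incidents with per-phase durations in minutes.
incidents = [
    {"identify": 5, "know": 10, "fix": 20, "verify": 5},
    {"identify": 15, "know": 20, "fix": 40, "verify": 5},
]
breakdown = mttr_breakdown(incidents)
print(round(downtime_budget_minutes(99.9), 1))  # 43.2
print(breakdown["mttr"])                        # 60
```

With a 43.2-minute monthly budget and a 60-minute MTTR, a single average incident would already exceed a 99.9% target, which is why the per-phase breakdown matters: it shows which phase to shorten first.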
Long‑Term Knowledge Accumulation and Summary
Postmortem documents should be indexed in a knowledge base (e.g., wiki) and regularly reviewed to track improvement‑action completion rates, MTTR trends, and recurring fault patterns.
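A periodic review of the knowledge base could be automated along these lines. This is a sketch under stated assumptions: the record layout (`root_cause`, `actions`, `done`) and the sample entries are hypothetical, and a real wiki or tracker would need its own export step.

```python
from collections import Counter


def action_completion_rate(postmortems: list[dict]) -> float:
    """Fraction of improvement actions marked done across all postmortems."""
    actions = [a for pm in postmortems for a in pm["actions"]]
    if not actions:
        return 0.0
    return sum(1 for a in actions if a["done"]) / len(actions)


def recurring_causes(postmortems: list[dict], min_count: int = 2) -> dict:
    """Root-cause labels that appear in at least `min_count` postmortems."""
    counts = Counter(pm["root_cause"] for pm in postmortems)
    return {cause: n for cause, n in counts.items() if n >= min_count}


# Hypothetical knowledge-base entries.
postmortems = [
    {"root_cause": "config-change", "actions": [{"done": True}, {"done": True}]},
    {"root_cause": "capacity", "actions": [{"done": False}]},
    {"root_cause": "config-change", "actions": [{"done": True}]},
]
print(action_completion_rate(postmortems))  # 0.75
print(recurring_causes(postmortems))        # {'config-change': 2}
```

Grouping by a root-cause label only works if the labels are assigned consistently, so a small fixed vocabulary of causes is worth agreeing on before indexing.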
Google Fault Postmortem Template Analysis
Google’s SRE Workbook provides a concise five‑section postmortem template covering fault data collection, root‑cause analysis (using 5 Whys), good practices, action items, and a review to avoid blame‑oriented language.
Key Takeaways
Fault postmortems aim to trace root causes, implement improvements, prevent recurrence, and foster learning.
Effective postmortems follow a four‑stage process: preparation, root‑cause analysis, improvement actions, and resolution.
Metrics such as MTTR, MTBF, and MTTx guide measurement of reliability gains.
Run blameless, data‑driven meetings and track every action item to completion in a tool such as Jira.
Tech Architecture Stories
Internet tech practitioner sharing insights on business architecture, technology, and a lifelong love of tech.
