Mastering Fault Postmortems: Proven Methods to Boost System Reliability

This article explains the essence, purpose, and step‑by‑step process of fault postmortems—including preparation, root‑cause analysis, improvement actions, and decision making—while covering PDCA and GRIA methodologies, industry examples, MTTR/MTBF metrics, and practical templates for lasting reliability.

Tech Architecture Stories

Aug 7, 2023

Origin and Business Use of Postmortems

The idea comes from the game of Go, where players replay a finished game to evaluate each move. In business and finance, a postmortem is a detailed review after a project or transaction that identifies what worked and what needs improvement.

Companies such as Lenovo, Google, Amazon, Toyota, and GE have adopted postmortems in their management practices.

Methodologies

PDCA (Plan‑Do‑Check‑Act)

PDCA is a four‑stage continuous‑improvement cycle popularized by Dr. W. Edwards Deming, building on earlier work by Walter Shewhart.

Plan : Define goals, understand the current state, collect data, and design solutions.

Do : Execute the plan on a small scale to test its effectiveness.

Check : Measure results against objectives and evaluate the execution.

Act : If successful, roll out broadly; otherwise, return to the planning stage.

The PDCA cycle emphasizes iterative improvement and disciplined problem solving.

GRIA (Goal‑Result‑Analysis‑Insight)

GRIA is a systematic reflection framework:

Goal : State the specific objective.

Result : Compare actual outcomes with the goal.

Analysis : Dig into why the gap exists.

Insight : Derive learnings and actionable recommendations.

Combining PDCA and GRIA

Many teams merge PDCA and GRIA to (1) extract valuable lessons and (2) plan more efficient future actions.

Fault Postmortem in Different Domains

Aviation Safety

The aircraft “black box” (actually bright orange) consists of a flight‑data recorder and a cockpit‑voice recorder. Systematic postmortems of crashes have reduced accident rates from 12% in the 1960s to less than 0.001% today, making aviation the safest transport mode.

Industrial Accident Theory: Accident Triangle

The accident triangle (Heinrich’s triangle) holds that each severe accident sits on top of a much larger base of minor incidents and near‑misses, so reducing the minor events also lowers the number of severe ones.

Software‑Engineering Fault Management

Faults directly affect service availability. Key reliability metrics include:

MTTF – Mean Time To Failure

MTBF – Mean Time Between Failures

MTTR – Mean Time To Recovery

MTTI – Mean Time To Identify

MTTK – Mean Time To Know (locate)

MTTV – Mean Time To Verify

MTTM – Mean Time To Mitigate

Availability can be expressed as: Availability = 1 – (MTTR / (MTTR + MTBF))

Improving reliability involves increasing MTBF (preventing faults) and decreasing MTTR (faster recovery).
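The availability formula above can be checked with a few lines of Python. This is a minimal sketch: the function name and the sample figures (720, 1440, 2, and 1 hours) are illustrative assumptions, not values from the article.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction, using 1 - MTTR / (MTTR + MTBF)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return 1 - mttr_hours / (mttr_hours + mtbf_hours)


# Hypothetical service: one fault every 720 hours, 2 hours to recover.
baseline = availability(mtbf_hours=720, mttr_hours=2)
# Lever 1: recover twice as fast (lower MTTR).
faster_recovery = availability(mtbf_hours=720, mttr_hours=1)
# Lever 2: fail half as often (higher MTBF).
fewer_faults = availability(mtbf_hours=1440, mttr_hours=2)

for label, value in [
    ("baseline", baseline),
    ("faster recovery", faster_recovery),
    ("fewer faults", fewer_faults),
]:
    print(f"{label}: {value:.4%}")
```

Under this formula, halving MTTR and doubling MTBF give the same availability, which is one reason postmortems look at recovery speed as closely as at prevention.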

Goals of a Fault Postmortem

Trace the root cause and understand the full incident timeline.

Identify concrete improvement actions to prevent recurrence.

Accelerate detection, localization, mitigation, and recovery.

Raise awareness and share learnings across the organization.

Postmortem Culture

Postmortems are routine, not reserved only for major incidents.

Leadership must participate to foster a blameless environment.

Teams should maintain a respectful attitude toward production systems.

Culture is built gradually and must be passed on as staff turn over.

Truth‑seeking is essential; avoiding cover‑ups makes the process valuable.

Responsibility Assignment

Assigning blame can hinder learning. Judge a postmortem by whether the incident led to warnings and improvements, not by who was at fault. Where responsibility has to be recorded, distinguish two kinds:

Subjective responsibility (e.g., ignoring processes, releasing without testing).

Objective responsibility (e.g., extreme load spikes, architectural limits).

Postmortem Process

Preparation

Define impact scope with quantitative data (business metrics, user feedback, availability impact, revenue loss) and construct a complete timeline covering time, place, people, event, cause, and result.
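One way to keep the timeline usable for later analysis is to record it as structured entries rather than free text. The sketch below is an assumption about how that could look: the `TimelineEntry` fields, the milestone labels, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Milestone names are hypothetical labels, listed in the order assumed here.
MILESTONES = ["started", "identified", "located", "mitigated", "recovered"]


@dataclass(frozen=True)
class TimelineEntry:
    time: datetime
    actor: str           # who acted or observed
    event: str           # what happened
    milestone: str = ""  # optional label from MILESTONES


def stage_durations(entries: list[TimelineEntry]) -> dict[str, timedelta]:
    """Return the time between consecutive milestones present in the timeline."""
    first_seen: dict[str, datetime] = {}
    for entry in sorted(entries, key=lambda e: e.time):
        if entry.milestone in MILESTONES and entry.milestone not in first_seen:
            first_seen[entry.milestone] = entry.time
    present = [m for m in MILESTONES if m in first_seen]
    return {
        f"{a}->{b}": first_seen[b] - first_seen[a]
        for a, b in zip(present, present[1:])
    }


# Illustrative entries only; not taken from a real incident.
timeline = [
    TimelineEntry(datetime(2023, 8, 1, 10, 0), "deploy bot", "release rolled out", "started"),
    TimelineEntry(datetime(2023, 8, 1, 10, 12), "on-call", "cause traced to config change", "located"),
    TimelineEntry(datetime(2023, 8, 1, 10, 5), "monitoring", "error-rate alert fired", "identified"),
    TimelineEntry(datetime(2023, 8, 1, 10, 20), "on-call", "release rolled back", "mitigated"),
    TimelineEntry(datetime(2023, 8, 1, 10, 45), "on-call", "metrics back to baseline", "recovered"),
]

durations = stage_durations(timeline)
for stage, spent in durations.items():
    print(stage, spent)
```

Entries can be logged in any order because the function sorts them by time. The per-stage durations are the single-incident inputs to the MTTI, MTTK, and MTTM averages, under the assumption that each metric is measured from the previous milestone.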

Root‑Cause Analysis

Common techniques:

5‑Why analysis

Tracing the origin

Identifying trigger factors

Considering both human and technical causes

Avoiding single‑point root causes

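Example (Toyota): a machine stopped because a fuse blew under overload. Asking "why" repeatedly traced the overload to a poorly lubricated bearing, then to a weak lubrication pump, then to a worn pump shaft, and finally to a filter (strainer) that had never been installed, which let metal scrap into the pump. Replacing the fuse alone would have left that root cause in place.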
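A 5‑Why chain can be written down as question and answer pairs so that the reasoning is reviewable. The sketch below uses the widely quoted Toyota machine example as sample data; the function name and the question wording are assumptions made for illustration.

```python
def build_why_chain(symptom: str, answers: list[str]) -> list[tuple[str, str]]:
    """Pair each answer with the question it responds to, starting from the symptom."""
    chain = []
    current = symptom
    for answer in answers:
        chain.append((f"Why did this happen: {current}?", answer))
        current = answer  # the next "why" asks about the previous answer
    return chain


chain = build_why_chain(
    "the machine stopped",
    [
        "the fuse blew because of an overload",
        "the bearing was not lubricated enough",
        "the lubrication pump was not pumping enough",
        "the pump shaft was worn",
        "no strainer was attached, so metal scrap got in",
    ],
)

for depth, (question, answer) in enumerate(chain, start=1):
    print(f"{depth}. {question}\n   Because {answer}.")

# The last answer is the candidate root cause; a review should still ask
# whether other contributing causes exist before accepting it.
root_cause = chain[-1][1]
```

Five is a convention rather than a rule: the chain stops when the answer points at something the team can actually change.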

Improvement Measures

Improvement actions should follow the SMART criteria (Specific, Measurable, Achievable, Relevant, Time‑bound) and target:

Preventing recurrence (increase MTBF).

Reducing MTTR by speeding up detection (lower MTTI), localization (lower MTTK), and mitigation (lower MTTM).

Embedding automated safeguards to reduce human error.

Recording actions in a tracking system (e.g., JIRA) for regular follow‑up.
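Before an action item goes into the tracking system, a simple check can flag the parts of SMART that are mechanically verifiable. This is a sketch under stated assumptions: the `ActionItem` fields and the mapping from fields to SMART letters are hypothetical, and "Achievable" is left to human review because it cannot be checked from the record alone.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Reliability metrics an action item may target, taken from the list in this article.
TARGET_METRICS = {"MTBF", "MTTI", "MTTK", "MTTM", "MTTR"}


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None           # Specific: one accountable person or team
    success_metric: Optional[str] = None  # Measurable: how completion is judged
    target: Optional[str] = None          # Relevant: which metric it should move
    due: Optional[date] = None            # Time-bound: deadline


def smart_gaps(item: ActionItem) -> list[str]:
    """Return the SMART criteria this item visibly fails; empty means no gaps found."""
    gaps = []
    if not item.owner:
        gaps.append("specific: no owner")
    if not item.success_metric:
        gaps.append("measurable: no success metric")
    if item.target not in TARGET_METRICS:
        gaps.append("relevant: no reliability metric targeted")
    if item.due is None:
        gaps.append("time-bound: no due date")
    return gaps


vague = ActionItem(title="Improve monitoring")
concrete = ActionItem(
    title="Alert on checkout error rate",
    owner="payments on-call",
    success_metric="alert fires within 2 minutes of error rate above 1%",
    target="MTTI",
    due=date(2023, 9, 1),
)

print(smart_gaps(vague))
print(smart_gaps(concrete))
```

A check like this does not make an action good, but it stops items such as "improve monitoring" from entering the tracker without an owner, a measure, or a date.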

Decision & Follow‑Up

Postmortem decisions are documented in a blameless meeting with clear consensus on actions, responsibilities, and timelines. Follow‑up includes periodic review of action completion rates and trend analysis of incident frequency, MTTR, and MTBF.
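The follow‑up figures mentioned above can be computed from basic incident records. The sketch below assumes each incident is a (start, recovered) pair inside a reporting window and that incidents do not overlap; it treats MTBF as mean uptime per failure so that it stays consistent with the availability formula used earlier. All names and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def mttr_and_mtbf(
    incidents: list[tuple[datetime, datetime]],
    window_start: datetime,
    window_end: datetime,
) -> tuple[timedelta, timedelta]:
    """Return (MTTR, MTBF) for non-overlapping incidents inside the window."""
    if not incidents:
        raise ValueError("at least one incident is required")
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return downtime / len(incidents), uptime / len(incidents)


def completion_rate(done_flags: list[bool]) -> float:
    """Return the share of action items marked done (0.0 when there are none)."""
    return sum(done_flags) / len(done_flags) if done_flags else 0.0


# Illustrative 30-day window with two incidents lasting 1 hour and 3 hours.
window_start = datetime(2023, 7, 1)
window_end = datetime(2023, 7, 31)
incidents = [
    (datetime(2023, 7, 5, 9, 0), datetime(2023, 7, 5, 10, 0)),
    (datetime(2023, 7, 20, 14, 0), datetime(2023, 7, 20, 17, 0)),
]

mttr, mtbf = mttr_and_mtbf(incidents, window_start, window_end)
rate = completion_rate([True, True, False, True])
print(f"MTTR={mttr}, MTBF={mtbf}, action completion={rate:.0%}")
```

Running the same calculation for each review period gives the trend lines for incident frequency, MTTR, and MTBF that the follow‑up review compares.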

Long‑Term Knowledge Accumulation

Postmortem documents should be indexed in a searchable knowledge base (wiki) to create an incident‑response handbook and enable regular analysis of improvement effectiveness.

Google Postmortem Template

Google’s SRE workbook provides a concise postmortem template with five sections: data collection, root‑cause analysis (using 5‑Whys), good practices, action items, and a review that checks for blame‑free language.

Key Takeaways

Fault postmortems examine each MTTx stage (identify, locate, mitigate, recover) and drive improvements to it.

Four core objectives: tracing the root cause, defining improvement actions, speeding up detection and recovery, and sharing what was learned.

Four phases: preparation, root‑cause analysis, improvement actions, and decision making.

Adopt PDCA and GRIA, use SMART for actions, and maintain a blameless culture.
