
Guidelines for Incident Postmortem and Fault Review

This guideline advocates a dialectical view of failures, rewards rapid recovery with low-severity classification, and lays out a structured postmortem process (background overview, impact-scope alignment, timeline replay, deep root-cause analysis, SMART improvement actions, responsibility assignment, and PDCA-validated closure) to strengthen system resilience, team anti-fragility, and knowledge sharing.


In complex systems, failures are inevitable. Viewed dialectically, not every failure is purely negative: some reveal hidden risks or demonstrate effective emergency response, and these positive aspects should be highlighted.

We should adopt a dialectical view of failures and avoid a culture of fear around them. Our fault-management policy encourages rapid recovery (quickly resolved incidents receive a low-severity classification) and permits exemptions for failures caused by planned drills, while emphasizing that occasional system outages are tolerable but human errors must be taken seriously.

The three purposes of a postmortem:

Identify the root cause, improve the system, and provide a reference for other teams.

Find ways to reduce the probability of recurrence, i.e., increase MTBF (Mean Time Between Failures).

Find ways to accelerate business recovery, i.e., shorten MTTR (Mean Time To Recovery).
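These two levers meet in the standard steady-state availability approximation: a system alternates between working (MTBF) and recovering (MTTR), so availability is roughly MTBF / (MTBF + MTTR). A minimal sketch with illustrative numbers:

```python
# Steady-state availability approximation: the fraction of time a system
# is up when it alternates between working (MTBF) and recovering (MTTR).
# All numbers below are illustrative, not from the article.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is available."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(f"{availability(720, 2):.5f}")   # 0.99723: one 2h outage per ~30 days
print(f"{availability(1440, 2):.5f}")  # 0.99861: doubled MTBF (fewer incidents)
print(f"{availability(720, 1):.5f}")   # 0.99861: halved MTTR (faster recovery)
```

Doubling MTBF and halving MTTR yield the same gain here, which is why the guideline treats prevention and recovery speed as equally important.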

Both the system and the organization must be highly available. A resilient architecture should be complemented by an anti‑fragile team. During postmortem, we should examine not only technical issues but also tools, processes, and management.

Overall postmortem process

1. Fault background overview: Describe when the incident occurred, which services were affected, and the business impact (e.g., order failures, user complaints).

2. Align fault impact scope: Detail the time range, affected business lines, services, order volume, user impact, and any financial loss.

3. Fault timeline replay: Decompose MTTR into MTTI, MTTK, MTTF, and MTTV; a worked sketch follows the checkpoint summary below.

Mean Time To Identify (MTTI): time from incident start to emergency response.

Mean Time To Know (MTTK): time from response to fault localization.

Mean Time To Fix (MTTF): time from localization to recovery.

Mean Time To Verify (MTTV): time from recovery to confirmation.

Key timeline checkpoints include fault introduction, business metric changes, alert generation, on‑call response, anomaly detection, critical operations, and verification of recovery.
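As a minimal sketch of this decomposition, assuming hypothetical checkpoint names and timestamps pulled from alerting and incident-management records, the four phases partition the total outage exactly:

```python
# Decompose total recovery time into MTTI, MTTK, MTTF, MTTV from the
# timeline checkpoints. Event names and timestamps are hypothetical.
from datetime import datetime

FMT = "%H:%M"
t = {
    "incident_start":    datetime.strptime("10:00", FMT),  # fault introduced
    "response_start":    datetime.strptime("10:08", FMT),  # on-call engaged
    "cause_localized":   datetime.strptime("10:25", FMT),  # fault located
    "service_recovered": datetime.strptime("10:40", FMT),  # fix takes effect
    "recovery_verified": datetime.strptime("10:45", FMT),  # metrics confirmed
}

mtti = t["response_start"]    - t["incident_start"]
mttk = t["cause_localized"]   - t["response_start"]
mttf = t["service_recovered"] - t["cause_localized"]
mttv = t["recovery_verified"] - t["service_recovered"]

# The four phases sum exactly to the end-to-end recovery time (MTTR).
assert mtti + mttk + mttf + mttv == t["recovery_verified"] - t["incident_start"]
print(f"MTTI={mtti}  MTTK={mttk}  MTTF={mttf}  MTTV={mttv}")
```

Replaying the timeline this way makes it obvious which phase dominated; a large MTTK, for example, points at observability gaps rather than slow fixes.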

4. Deep root-cause analysis: Distinguish between direct (trigger) causes and fundamental causes. Use the 5-Why method to drill down, e.g., an ES node jitter leading to latency spikes and upstream failures.
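To make the drill-down concrete, here is a hypothetical 5-Why chain for the ES example; every answer below is invented for illustration, and in a real review each one must be backed by a verified fact:

```python
# Illustrative 5-Why chain for the ES latency example. Each "why" is
# answered by the next deeper cause until a fixable fundamental cause
# (usually a process or management gap) is reached.
five_whys = [
    ("Why did upstream requests fail?",
     "Search latency spiked past the callers' timeouts."),
    ("Why did search latency spike?",
     "An ES data node experienced load jitter."),
    ("Why did one node's jitter affect most queries?",
     "Hot shards were concentrated on that node."),
    ("Why were hot shards concentrated there?",
     "Shard allocation was never rebalanced after the last scale-out."),
    ("Why was the rebalance missed?",
     "No checklist or automation covers post-scaling rebalancing."),
]
for depth, (question, answer) in enumerate(five_whys, start=1):
    print(f"Why #{depth}: {question}\n  -> {answer}")
# The trigger is the node jitter; the fundamental cause is the process gap.
```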

5. Improvement items summary: After confirming the timeline and root causes, derive actionable improvement items. Consider both system-level fixes and process enhancements such as monitoring, on-call policies, tooling, emergency plans, architecture, and procedural rules.

Use the SMART principle to formulate improvements (a tracking sketch follows the PDCA note after this list):

S – Specific: concrete, implementable actions.

M – Measurable: assessable outcomes (e.g., through drills).

A – Attainable: feasible within current technical constraints.

R – Relevant: linked to the incident’s findings.

T – Time‑bound: clear deadline for completion.

All improvements must be driven to a closed loop via the PDCA cycle (Plan → Do → Check → Act) and validated through fault-drill exercises.
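As a minimal sketch of how SMART fields and PDCA status might be tracked together (the field names, incident ID, and example item are all hypothetical, not a prescribed schema):

```python
# One improvement item carrying its SMART fields plus a PDCA status.
from dataclasses import dataclass
from datetime import date
from enum import Enum

class PDCA(Enum):
    PLAN = "plan"
    DO = "do"
    CHECK = "check"   # validated, e.g., by a fault drill
    ACT = "act"       # standardized and closed

@dataclass
class ImprovementItem:
    specific: str            # S: concrete, implementable action
    measurable: str          # M: how the outcome is assessed
    attainable: bool         # A: feasible under current constraints
    relevant_incident: str   # R: the incident it traces back to
    deadline: date           # T: clear completion date
    owner: str
    status: PDCA = PDCA.PLAN

item = ImprovementItem(
    specific="Add per-node hot-shard alerting for the ES cluster",
    measurable="Passes a shard-jitter fault drill without paging storms",
    attainable=True,
    relevant_incident="INC-0000 (hypothetical)",
    deadline=date(2025, 3, 1),
    owner="search-platform team",
)
```

Modeling the PDCA state explicitly makes the closed loop auditable: an item that never reaches ACT is, by definition, not done.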

6. Fault responsibility: Responsibility is assigned first to the team, then to individuals. Emphasize that most failures have managerial or process roots, not just personal mistakes.

Encourage rapid mitigation and active participation, give positive feedback to contributors, and hold the team that introduced a third-party component responsible for its failures.

Red lines and “military rules” are non‑negotiable baselines derived from past incidents. Repeated mistakes must be taken seriously, with analysis of why prior improvements were not enforced.

A postmortem is not the end of the incident; the process concludes only after improvement items are verified. Documentation should be reused for onboarding and knowledge sharing to avoid repeat failures.
