
Mastering Incident Reviews: The Three Golden Questions for Real Improvement

This article explains how focusing on three key questions during incident post‑mortems, balancing business speed with system stability, and establishing clear SLOs can turn failures into actionable improvements and better fault‑tolerance strategies.

G7 EasyFlow Tech Circle

Dec 27, 2019


The Three Golden Questions: Focusing on Improvement

Incident reviews often turn into blame‑shifting sessions because their purpose is misunderstood. The goal is not to find out who is at fault but to work out how to improve. Repeatedly asking three questions keeps the discussion centered on improvement:

What is the true root cause of the incident?

What actions can we take to prevent a similar incident next time?

How can we shorten the incident duration and restore services faster?

Keep returning to these questions until concrete improvement measures have been identified; doing so keeps the discussion focused.

Business vs. Stability: Fault Tolerance

From an SRE or infrastructure perspective, stability is non‑negotiable. Business teams, however, prioritize rapid development and revenue. Two real cases illustrate the trade‑off. The first is a fast‑growing e‑commerce platform that had to weigh detailed fault analysis against business momentum. The second is an advertising service that tolerates high fault rates as long as revenue is unaffected.

The conclusion is that business growth often takes precedence, but mechanisms should still exist to manage faults in a systematic, measurable way.

Why SLOs Are Needed

An SLO creates a shared standard for stability and fault rates, aligning expectations between providers and customers. Without a common benchmark, each side falls back on its own assumptions, such as an unrealistic expectation of 100% stability, and the two end up with mismatched perceptions of an incident's impact.
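One common way to make such a shared standard concrete is to translate an SLO target into an error budget: the amount of unavailability both sides agree to tolerate over a window. The sketch below is illustrative only. The function name `error_budget_minutes`, the 30‑day window, and the 99.9% target are assumptions chosen for the example, not figures from the cases in this article.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the minutes of unavailability an SLO target allows in a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    The 30-day default window is an assumption for illustration.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")

    total_minutes = window_days * 24 * 60
    # Whatever the target does not cover is the budget for failure.
    return (1.0 - slo_target) * total_minutes


if __name__ == "__main__":
    # A hypothetical 99.9% target over 30 days.
    print(f"{error_budget_minutes(0.999):.1f} minutes")  # prints "43.2 minutes"
    # A 100% target leaves no budget at all, which is why it is unrealistic.
    print(f"{error_budget_minutes(1.0):.1f} minutes")    # prints "0.0 minutes"
```

Expressed this way, "how stable is stable enough" becomes a number that provider and customer can both check against an incident's duration.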

For example, a regional fiber cut affected 70‑80% of video streams for a live‑streaming service, while the cloud provider measured only a 2‑3% global impact for the same incident. Neither number is wrong; each side simply measured against a different baseline. This discrepancy highlights the need for a unified SLO.
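The gap between the two figures comes down to the denominator each side uses. The sketch below uses made‑up stream counts, picked only so the ratios fall inside the ranges quoted above; the `StreamCounts` type and `impact_ratio` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamCounts:
    """Streams in scope and how many of them the incident affected."""
    total: int
    affected: int


def impact_ratio(counts: StreamCounts) -> float:
    """Fraction of in-scope streams that were affected."""
    if counts.total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= counts.affected <= counts.total:
        raise ValueError("affected must be between 0 and total")
    return counts.affected / counts.total


# Illustrative numbers only (assumed, not taken from the real incident).
# The same 750 broken streams are measured against two different scopes.
customer = StreamCounts(total=1_000, affected=750)    # one customer's streams
provider = StreamCounts(total=30_000, affected=750)   # all streams on the platform

if __name__ == "__main__":
    print(f"customer view: {impact_ratio(customer):.1%}")  # prints "customer view: 75.0%"
    print(f"provider view: {impact_ratio(provider):.1%}")  # prints "provider view: 2.5%"
```

An SLO agreed between the two parties would fix which scope counts, so both sides report the same number for the same outage.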

Establishing SLOs requires cloud providers to consider customer‑specific goals, which can be challenging due to diverse requirements, potential pressure on provider resources, and ROI concerns.

In practice, providers often stick to SLA commitments, treating SLOs as optional value‑added services that incur additional cost when demanded.

Additional Insights

Google’s Customer Reliability Engineering (CRE), a customer‑facing practice that grew out of SRE, is an example of a service that bridges the gap between SLA and SLO by providing reliability engineering tailored to individual customers.