Mastering Incident Reviews: The Three Golden Questions for Real Improvement
This article explains how focusing on three key questions during incident post‑mortems, balancing business speed with system stability, and establishing clear SLOs can turn failures into actionable improvements and better fault‑tolerance strategies.
The Three Golden Questions: Focusing on Improvement
Incident reviews often turn into blame-shifting sessions because their purpose is misunderstood: the goal is not to find who is at fault but to work out how to improve. Repeating three questions keeps the discussion centered on improvement:
1. What is the true root cause of the incident?
2. What actions can we take to prevent a similar incident next time?
3. How can we shorten the incident duration and restore services faster?
Returning to these questions until concrete improvement measures have been identified keeps the discussion focused.
Business vs. Stability: Fault Tolerance
From an SRE or infrastructure perspective, stability is non‑negotiable, but business teams prioritize rapid development and revenue. Two real cases illustrate this trade‑off: a fast‑growing e‑commerce platform weighing detailed fault analysis against business momentum, and an advertising service that tolerates high fault rates as long as revenue is unaffected.
The conclusion is that business growth often takes precedence, but mechanisms should still exist to manage faults systematically.
Why SLOs Are Needed
An SLO creates a shared standard for stability and fault rates, aligning expectations between providers and customers. Without a common benchmark, each side falls back on its own assumptions, some of them unrealistic (such as “100% stability”), and the two end up with mismatched perceptions of an incident's impact.
For example, a regional fiber cut affected 70‑80% of video streams for a live‑streaming service, while the cloud provider measured only a 2‑3% global impact. This discrepancy highlights the need for a unified SLO.
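The gap between the two figures comes down to which denominator is used. The sketch below is only an illustration, not the provider's or the customer's actual measurement method: the `StreamSample` schema, the customer and region names, and all stream counts are hypothetical, chosen so that the same incident yields a small global ratio and a large customer-scoped ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamSample:
    """One stream observed during the incident window (hypothetical schema)."""
    customer: str
    region: str
    degraded: bool


def impact_ratio(samples, customer=None):
    """Return the fraction of degraded streams, optionally scoped to one customer."""
    scoped = [s for s in samples if customer is None or s.customer == customer]
    if not scoped:
        return 0.0
    return sum(s.degraded for s in scoped) / len(scoped)


def build_samples():
    """Synthetic data: made-up counts that only mirror the shape of the example."""
    samples = []
    # A live-streaming customer concentrated in the region with the fiber cut.
    samples += [StreamSample("live-streaming", "region-a", True)] * 75
    samples += [StreamSample("live-streaming", "region-b", False)] * 25
    # All other customers of the provider, located in unaffected regions.
    samples += [StreamSample("others", "region-b", False)] * 2900
    return samples


samples = build_samples()
global_ratio = impact_ratio(samples)
customer_ratio = impact_ratio(samples, customer="live-streaming")
print(f"global impact:   {global_ratio:.1%}")    # 75 of 3000 streams
print(f"customer impact: {customer_ratio:.1%}")  # 75 of 100 streams
```

With these made-up counts the provider-wide view reports 2.5% while the customer-scoped view reports 75.0%. An SLO agreed between both sides would fix in advance which scope counts.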
Establishing SLOs requires cloud providers to consider customer‑specific goals, which can be challenging due to diverse requirements, potential pressure on provider resources, and ROI concerns.
In practice, providers often stick to SLA commitments, treating SLOs as optional value‑added services that incur additional cost when demanded.
Additional Insights
Google’s CRE (Customer Reliability Engineering), an offering that followed its SRE practice, exemplifies a service that bridges the gap between SLA and SLO by providing tailored reliability engineering for customers.
G7 EasyFlow Tech Circle
