Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability
This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals as measured by MTBF and MTTR, details how SLIs/SLOs and the VALET selection method guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.
What is SRE?
When first encountering SRE, many think it is a full‑stack role at Google that can solve many problems alone. In practice, SRE addresses many issues—capacity planning, chaos testing, rate limiting, circuit breaking, effective monitoring—but no single person can handle them all.
Therefore, testing, development, operations, and other roles must cooperate, which makes SRE a systematic methodology rather than a single position.
What are SRE’s goals?
Improve stability
The goal of an SRE system is to improve stability. Two key indicators are used: Mean Time Between Failures (MTBF) and Mean Time To Recovery (MTTR). Raising MTBF and lowering MTTR together indicate higher stability.
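The relationship between these two indicators and availability can be sketched as follows. The function name and sample values are illustrative, not from the article; the formula itself is the standard steady-state availability expression.

```python
# Illustrative only: steady-state availability expressed via MTBF and MTTR.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Raising MTBF or lowering MTTR both push availability toward 1.
print(f"{availability(720, 1):.4%}")    # MTBF 720 h, MTTR 1 h
print(f"{availability(720, 0.5):.4%}")  # halving MTTR improves availability
```

This is why the two sub-goals below attack both sides: lengthening the time between failures and shortening the time spent recovering from them.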
Sub‑goals
MTBF can be split into a pre-failure and a post-failure phase; MTTR can be divided into four indicators corresponding to the four stages of handling a fault. The following diagrams illustrate these phases and the corresponding work.
With these clear phase divisions, work can be targeted: for example, applying a design-for-failure approach (rate limiting, degradation, circuit breaking) in the pre-MTBF stage, and conducting post-mortem analysis in the post-MTBF stage.
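As one concrete instance of the design-for-failure approach mentioned above, here is a minimal circuit-breaker sketch. The class name, thresholds, and structure are illustrative assumptions, not the article's implementation.

```python
import time

# A minimal circuit-breaker sketch for the pre-MTBF, design-for-failure stage.
# The threshold and timeout values are illustrative assumptions.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        # While open, fail fast until the reset timeout elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit and resets the count
        return result
```

Failing fast while the downstream dependency is unhealthy limits the blast radius of a fault, which is exactly the pre-MTBF goal: pushing the next failure further away.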
Fundamental fault detection: SLI and SLO
To decide whether a system is at fault, SRE uses Service Level Indicators (SLIs) and Service Level Objectives (SLOs). A good SLI must be able to indicate stability and must correlate strongly with user experience.
VALET selection method
The VALET method classifies candidate SLIs into five categories (Volume, Availability, Latency, Errors, and Tickets), helping practitioners pick appropriate metrics.
What is an SLO?
An SLO combines an SLI, a target value, and a time window, e.g., "90% of requests have latency ≤ 80 ms within a one-hour window". Multiple SLOs can be combined with a logical AND to express overall availability.
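The example SLO above can be sketched in a few lines. The function names and sample window are illustrative assumptions.

```python
# A sketch of the SLO from the example above: within one window, at least
# 90 % of requests must have latency <= 80 ms. Names are illustrative.
def latency_slo_met(latencies_ms, threshold_ms=80.0, target=0.90):
    ok = sum(1 for lat in latencies_ms if lat <= threshold_ms)
    return ok / len(latencies_ms) >= target

# Several SLOs combine with a logical AND into one availability statement.
def availability_met(*slo_results):
    return all(slo_results)

window = [12, 35, 79, 80, 81, 40, 20, 95, 60, 70]  # ms, one sample window
print(latency_slo_met(window))  # 8 of 10 requests are <= 80 ms -> False
```

The AND combination matters: the system only counts as available when every SLO in the set holds simultaneously.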
Time vs. request dimensions
The time dimension evaluates how long an SLI stays beyond its threshold; the request dimension evaluates the proportion of successful requests over a period.
For example, a per-minute request success rate staying below 95% for 10 minutes is an anomaly in the time dimension, while a full-day success rate below 95% is an anomaly in the request dimension.
Quantifying work with Error Budget
After SLOs are defined, the error budget quantifies how many failures are allowed before an SLO is breached. It can be visualized as a burn-down chart and used to prioritize reliability work.
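The arithmetic behind an error budget is simple; the SLO value and request volume below are illustrative assumptions, not figures from the article.

```python
# Illustrative error-budget arithmetic: with a 99.9 % request-success SLO
# and an expected 1,000,000 requests in the four-week window, the budget
# is the number of failures allowed before the SLO is breached.
slo = 0.999
expected_requests = 1_000_000
error_budget = round((1 - slo) * expected_requests)
print(error_budget)  # 1000 failed requests allowed in the window

def remaining_budget(failures_so_far: int) -> int:
    """Burn-down view: how much of the budget is left."""
    return error_budget - failures_so_far

print(remaining_budget(250))  # 750 left after 250 failures
```

Plotting `remaining_budget` over the window is exactly the burn-down chart: a steep drop signals that stability work should take priority over feature work.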
Consuming the error budget
Stability burn‑down chart
Display the remaining budget over a four‑week cycle; adjust the budget for special scenarios.
Incident grading
Use the percentage of consumed budget to assign incident severity (P0‑P4). The following diagram shows a typical grading table.
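Since the grading table itself is not reproduced here, the thresholds below are illustrative assumptions only; the mapping shape (budget share consumed, mapped to P0 through P4) follows the description above.

```python
# Map the share of the error budget an incident consumes to a severity
# level, P0 (most severe) .. P4. The cutoffs are illustrative assumptions;
# the article's actual grading table is not reproduced here.
def incident_severity(budget_consumed_fraction: float) -> str:
    thresholds = [(0.30, "P0"), (0.20, "P1"), (0.10, "P2"), (0.05, "P3")]
    for cutoff, level in thresholds:
        if budget_consumed_fraction >= cutoff:
            return level
    return "P4"

print(incident_severity(0.25))  # consumed 25 % of the budget -> "P1"
```

Grading by budget consumption rather than raw error counts keeps severity proportional to the actual damage done to the stability target.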
Stability consensus mechanism
When the remaining budget is low, teams should act cautiously, possibly halting non‑essential changes; when the budget is ample, minor issues can be tolerated.
Budget‑based alerts
Alerting based on error-budget consumption naturally consolidates alerts and focuses attention on incidents that truly affect stability, reducing noise and improving response efficiency.
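One common way to implement budget-based alerting is a burn-rate check: alert on how fast the budget is being consumed rather than on individual errors. The window and threshold values below are illustrative assumptions.

```python
# A sketch of budget-based alerting via burn rate. The burn-rate threshold
# and the sample numbers are illustrative assumptions.
def burn_rate(errors_in_window: int, requests_in_window: int, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget_fraction = 1 - slo  # error ratio allowed under the SLO
    observed = errors_in_window / requests_in_window
    return observed / budget_fraction

def should_alert(errors: int, requests: int, slo: float, threshold: float = 10.0) -> bool:
    # A burn rate of 10 means the whole budget would be gone in 1/10
    # of the SLO window if the current error rate continued.
    return burn_rate(errors, requests, slo) >= threshold

print(should_alert(120, 10_000, slo=0.999))  # 1.2 % errors vs 0.1 % budget -> True
```

An error spike that barely dents the budget never fires, which is precisely the alert convergence the paragraph above describes.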
For brevity, the article ends here.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, and regularly publishes widely read original technical articles. We focus on operations transformation, accompanying you throughout your operations career as we grow together.