Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability
This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals as measured by MTBF and MTTR, details how SLIs/SLOs and the VALET selection method guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.
What is SRE?
When first encountering SRE, many think it is a full‑stack role at Google that can solve many problems alone. In practice, SRE addresses many issues—capacity planning, chaos testing, rate limiting, circuit breaking, effective monitoring—but no single person can handle them all.
Therefore, testing, development, operations, and other roles must cooperate, which makes SRE a systematic methodology rather than a single position.
What are SRE’s goals?
Improve stability
The goal of an SRE system is to improve stability. Two key indicators are used: Mean Time Between Failures (MTBF) and Mean Time To Recovery (MTTR). Raising MTBF and lowering MTTR together indicate higher stability.
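The relationship between these two indicators and availability can be sketched as follows. The function name and sample values are illustrative, not from the article; the formula itself is the standard steady-state availability expression.

```python
# Illustrative only: steady-state availability expressed via MTBF and MTTR.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Raising MTBF or lowering MTTR both push availability toward 1.
print(f"{availability(720, 1):.4%}")    # MTBF 720 h, MTTR 1 h
print(f"{availability(720, 0.5):.4%}")  # halving MTTR improves availability
```

This is why the two sub-goals below attack both sides: lengthening the time between failures and shortening the time spent recovering from them.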
Sub‑goals
MTBF can be split into a pre-failure and a post-failure phase; MTTR can be divided into four indicators corresponding to the four stages of handling a fault. The following diagrams illustrate these phases and the corresponding work.
With these clear phase divisions, work can be targeted: for example, applying a design-for-failure approach (rate limiting, degradation, circuit breaking) in the pre-MTBF stage, and conducting post-mortem analysis in the post-MTBF stage.
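As one concrete instance of the design-for-failure approach mentioned above, here is a minimal circuit-breaker sketch. The class name, thresholds, and structure are illustrative assumptions, not the article's implementation.

```python
import time

# A minimal circuit-breaker sketch for the pre-MTBF, design-for-failure stage.
# The threshold and timeout values are illustrative assumptions.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        # While open, fail fast until the reset timeout elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit and resets the count
        return result
```

Failing fast while the downstream dependency is unhealthy limits the blast radius of a fault, which is exactly the pre-MTBF goal: pushing the next failure further away.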
Fundamental fault detection: SLI and SLO
To decide whether a system is at fault, SRE uses Service Level Indicators (SLIs) and Service Level Objectives (SLOs). A good SLI must be able to indicate stability and must correlate strongly with user experience.
VALET selection method
The VALET method classifies candidate SLIs into five categories (Volume, Availability, Latency, Errors, and Tickets), helping practitioners pick appropriate metrics.
What is an SLO?
An SLO combines an SLI, a target value, and a time window, e.g., "90% of requests have latency ≤ 80 ms within a one-hour window". Multiple SLOs can be combined with a logical AND to express overall availability.
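The example SLO above can be sketched in a few lines. The function names and sample window are illustrative assumptions.

```python
# A sketch of the SLO from the example above: within one window, at least
# 90 % of requests must have latency <= 80 ms. Names are illustrative.
def latency_slo_met(latencies_ms, threshold_ms=80.0, target=0.90):
    ok = sum(1 for lat in latencies_ms if lat <= threshold_ms)
    return ok / len(latencies_ms) >= target

# Several SLOs combine with a logical AND into one availability statement.
def availability_met(*slo_results):
    return all(slo_results)

window = [12, 35, 79, 80, 81, 40, 20, 95, 60, 70]  # ms, one sample window
print(latency_slo_met(window))  # 8 of 10 requests are <= 80 ms -> False
```

The AND combination matters: the system only counts as available when every SLO in the set holds simultaneously.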
Time vs. request dimensions
The time dimension evaluates how long an SLI stays beyond its threshold; the request dimension evaluates the proportion of successful requests over a period.
For example, a per-minute request success rate staying below 95% for 10 minutes is an anomaly in the time dimension, while a full-day success rate below 95% is an anomaly in the request dimension.
Quantifying work with Error Budget
After SLOs are defined, the error budget quantifies how many failures are allowed before an SLO is breached. It can be visualized as a burn-down chart and used to prioritize reliability work.
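The arithmetic behind an error budget is simple; the SLO value and request volume below are illustrative assumptions, not figures from the article.

```python
# Illustrative error-budget arithmetic: with a 99.9 % request-success SLO
# and an expected 1,000,000 requests in the four-week window, the budget
# is the number of failures allowed before the SLO is breached.
slo = 0.999
expected_requests = 1_000_000
error_budget = round((1 - slo) * expected_requests)
print(error_budget)  # 1000 failed requests allowed in the window

def remaining_budget(failures_so_far: int) -> int:
    """Burn-down view: how much of the budget is left."""
    return error_budget - failures_so_far

print(remaining_budget(250))  # 750 left after 250 failures
```

Plotting `remaining_budget` over the window is exactly the burn-down chart: a steep drop signals that stability work should take priority over feature work.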
Consuming the error budget
Stability burn‑down chart
Display the remaining budget over a four‑week cycle; adjust the budget for special scenarios.
Incident grading
Use the percentage of consumed budget to assign incident severity (P0‑P4). The following diagram shows a typical grading table.
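Since the grading table itself is not reproduced here, the thresholds below are illustrative assumptions only; the mapping shape (budget share consumed, mapped to P0 through P4) follows the description above.

```python
# Map the share of the error budget an incident consumes to a severity
# level, P0 (most severe) .. P4. The cutoffs are illustrative assumptions;
# the article's actual grading table is not reproduced here.
def incident_severity(budget_consumed_fraction: float) -> str:
    thresholds = [(0.30, "P0"), (0.20, "P1"), (0.10, "P2"), (0.05, "P3")]
    for cutoff, level in thresholds:
        if budget_consumed_fraction >= cutoff:
            return level
    return "P4"

print(incident_severity(0.25))  # consumed 25 % of the budget -> "P1"
```

Grading by budget consumption rather than raw error counts keeps severity proportional to the actual damage done to the stability target.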
Stability consensus mechanism
When the remaining budget is low, teams should act cautiously, possibly halting non‑essential changes; when the budget is ample, minor issues can be tolerated.
Budget‑based alerts
Alerting based on error-budget consumption naturally consolidates alerts and focuses attention on incidents that truly affect stability, reducing noise and improving response efficiency.
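One common way to implement budget-based alerting is a burn-rate check: alert on how fast the budget is being consumed rather than on individual errors. The window and threshold values below are illustrative assumptions.

```python
# A sketch of budget-based alerting via burn rate. The burn-rate threshold
# and the sample numbers are illustrative assumptions.
def burn_rate(errors_in_window: int, requests_in_window: int, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget_fraction = 1 - slo  # error ratio allowed under the SLO
    observed = errors_in_window / requests_in_window
    return observed / budget_fraction

def should_alert(errors: int, requests: int, slo: float, threshold: float = 10.0) -> bool:
    # A burn rate of 10 means the whole budget would be gone in 1/10
    # of the SLO window if the current error rate continued.
    return burn_rate(errors, requests, slo) >= threshold

print(should_alert(120, 10_000, slo=0.999))  # 1.2 % errors vs 0.1 % budget -> True
```

An error spike that barely dents the budget never fires, which is precisely the alert convergence the paragraph above describes.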
For brevity, the article ends here.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, and regularly publishes widely read original technical articles. We focus on operations transformation, accompanying you throughout your operations career as we grow together.