High‑Availability Architecture and Reliability Practices from a Former Google SRE

The article shares a former Google SRE’s insights on building high‑availability systems, explaining key factors such as MTBF and MTTR, redundancy strategies like N+2, change‑management practices, and practical tips for reliability engineering and operations.

High Availability Architecture

Dec 13, 2015

1. Two Major Factors Determining Availability

The author emphasizes that availability is fundamentally about Service Level Agreements (SLA) and introduces the concepts of Mean Time Between Failures (MTBF) and Mean Time To Recover (MTTR) as the two quantitative factors that drive a service’s uptime.
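The summary does not spell out how the two factors combine. A minimal sketch, assuming the standard steady-state relationship (availability = MTBF / (MTBF + MTTR)), is shown below. The function names and the sample figures are illustrative and do not come from the original talk.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up, given mean time between
    failures and mean time to recover (both in hours)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


if __name__ == "__main__":
    # Hypothetical service: fails on average every 999 hours,
    # and takes 1 hour to recover each time.
    a = availability(mtbf_hours=999, mttr_hours=1)
    print(f"availability: {a:.4%}")
    print(f"downtime per year: {downtime_minutes_per_year(a):.1f} minutes")
```

The formula makes the two levers explicit: uptime improves either by failing less often (raising MTBF) or by recovering faster (lowering MTTR). Halving MTTR cuts expected downtime roughly in half even if the failure rate stays the same.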

2. High‑Availability Solutions

Practical approaches include increasing redundancy (e.g., deploying N+2 instances instead of N+1), ensuring instances are independent and equally capable, and implementing sophisticated traffic‑control mechanisms that can isolate, quarantine, or block problematic requests based on business‑level attributes.
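The N+2 rule can be illustrated with a small capacity calculation. This is a sketch under stated assumptions: instances are independent and equally capable (as the article requires), N is the number of instances needed to serve peak load, and the two spares cover one instance out for planned work plus one unplanned failure. The function names and traffic numbers are hypothetical.

```python
import math


def required_instances(peak_qps: float, per_instance_qps: float, spare: int = 2) -> int:
    """Instances to deploy: N to carry peak load, plus `spare` extras."""
    if peak_qps <= 0 or per_instance_qps <= 0:
        raise ValueError("load and capacity must be positive")
    n = math.ceil(peak_qps / per_instance_qps)
    return n + spare


def survives(deployed: int, unavailable: int, peak_qps: float, per_instance_qps: float) -> bool:
    """True if the remaining instances can still serve peak load."""
    return (deployed - unavailable) * per_instance_qps >= peak_qps


if __name__ == "__main__":
    # Hypothetical service: 1000 QPS at peak, 100 QPS per instance.
    deployed = required_instances(peak_qps=1000, per_instance_qps=100)
    print("deploy", deployed, "instances")
    # One instance in a rolling upgrade and one crashed at the same time.
    print("survives two losses:", survives(deployed, 2, 1000, 100))
```

With only N+1, the same scenario (an upgrade in progress plus one crash) would leave the service below peak capacity, which is the case the extra spare is meant to cover.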

Change‑management techniques are highlighted: thorough offline testing, staged gray‑release deployments with carefully selected canary users, and mandatory rollback support built into the release process.
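The staged release with mandatory rollback can be sketched as a simple control loop. This is an illustration only, not the tooling described in the talk: the `deploy`, `healthy`, and `rollback` callables and the stage percentages are hypothetical placeholders for whatever a real release system provides.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    healthy: Callable[[int], bool],
    rollback: Callable[[], None],
) -> bool:
    """Push a release through increasing traffic percentages.

    After each stage the health check runs; on the first failure the
    release is rolled back and the rollout stops. Returns True only if
    every stage passed.
    """
    for percent in stages:
        deploy(percent)           # expose the new version to `percent`% of users
        if not healthy(percent):  # e.g. compare canary error rate to baseline
            rollback()            # rollback is part of the process, not an afterthought
            return False
    return True


if __name__ == "__main__":
    log = []
    ok = staged_rollout(
        stages=[1, 10, 50, 100],
        deploy=lambda p: log.append(f"deploy to {p}%"),
        healthy=lambda p: p < 50,  # simulate a regression that appears at 50%
        rollback=lambda: log.append("rollback"),
    )
    print(ok, log)
```

The point of the structure is that the rollback path is exercised by the same code that performs the release, so a release that cannot be rolled back cannot be started.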

3. Seven‑Level Availability Maturity Model

The article presents a tiered chart ranging from Level 1 (crash with data loss) to Level 7 (failover with negligible user impact), describing the technical and operational investments required to progress through each level.

Q & A

Answers cover topics such as monitoring and SLA measurement, low‑level OS and hardware caching concerns for “crash without data loss,” the importance of error budgets, tools used for managing millions of servers, and concrete strategies for handling problematic traffic and safe rollbacks.
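The error budget mentioned in the Q & A can be made concrete with a short calculation. This sketch assumes a request-based SLO, where the budget is the share of requests allowed to fail (1 minus the SLO target). The function names and the numbers are hypothetical and are not taken from the original answers.

```python
def allowed_failures(slo: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("SLO must be a fraction between 0 and 1")
    return (1.0 - slo) * total_requests


def budget_consumed(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return failed_requests / allowed_failures(slo, total_requests)


if __name__ == "__main__":
    # Hypothetical window: 99.9% SLO, 1,000,000 requests, 400 failures.
    spent = budget_consumed(slo=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"error budget consumed: {spent:.0%}")
    if spent >= 1.0:
        print("budget exhausted: pause risky releases")
```

A team that has spent most of its budget has a quantitative reason to slow down releases, while a team with budget to spare can afford to ship changes faster.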

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

High Availability Architecture

Official account for High Availability Architecture.
