How SRE Designs Highly Available Software Systems at Scale

This article presents Google SRE expert Ramón Medrano Llamas’s comprehensive guide on designing, operating, and maintaining large‑scale, highly available software systems, covering SRE fundamentals, daily workflows, scalability strategies, fault‑tolerant architecture, monitoring, and operational best practices.

21CTO

Nov 15, 2019

Introduction

With the rise of Cloud Native, SRE, and DevOps, the concepts of high availability, scalability, and efficient operations for large‑scale software systems are now encapsulated by the term "Reliability Engineering".

Part 1 – What Is Site Reliability Engineering (SRE)?

SRE combines software engineering and systems administration, and the role is often seen as another name for DevOps engineer. Google defines SREs as software engineers with a unique mission: ensuring continuous, uninterrupted service by blending proactive and reactive engineering.

Who we are: software engineers with a unique mission.

What we do: provide continuous, uninterrupted service.

How we do it: combine proactive and reactive engineering.

Why we do it: users expect Google to be always available, fast, and correct.

Part 2 – A Day in the Life of an SRE

A typical day includes reviewing code and design documents, reading emails, attending video meetings, brainstorming, designing, coding, daily stand‑ups, technical discussions, monitoring dashboards, writing roadmaps, and handling incidents.

Check code, changes, and design docs.

Read emails and join video calls.

Brainstorm, analyze, design, and code.

Attend daily stand‑up.

Participate in technical discussions.

Watch system dashboards.

Write roadmaps and plans.

Chat, eat, play, and laugh.

Part 3 – Designing for Scale

To serve billions of requests, a system aims to serve 99.9…% of queries rather than 100%, because the cost of reliability grows exponentially as the availability target rises. Meeting that target takes massive numbers of servers, along with storage and network capacity.

Serve 99.XXX% of queries, not 100%.

Reliability cost rises exponentially with higher availability.

Potential failures: memory, CPU, disks, NICs, power, cables, even excavators.
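The gap between availability targets is easier to see as allowed downtime. The sketch below converts a target percentage into minutes of downtime per year. The function name and the list of targets are illustrative choices, not taken from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down and still meet the target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative targets: each extra "nine" cuts the budget by a factor of ten.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Each additional nine leaves a tenth of the previous downtime budget, which is one way to read the claim that reliability cost rises steeply with the target.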

Scaling a simple web service from a single server to support 100 million users requires handling 10 million requests per second at peak, translating to roughly 2 million IOPS and thousands of servers.
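A back-of-the-envelope version of this estimate can be sketched as follows. The peak load of 10 million requests per second comes from the text. The per-server throughput of 5,000 requests per second and the two spare machines are hypothetical values chosen only to show the arithmetic.

```python
def servers_needed(peak_rps: int, per_server_rps: int, spare: int = 2) -> int:
    """Servers required to carry peak load, plus a few spares for failures.

    per_server_rps and spare are assumptions; measure real values by load testing.
    """
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    base = -(-peak_rps // per_server_rps)  # integer ceiling division
    return base + spare


if __name__ == "__main__":
    PEAK_RPS = 10_000_000   # from the article
    PER_SERVER_RPS = 5_000  # hypothetical capacity of one server
    print(servers_needed(PEAK_RPS, PER_SERVER_RPS))
```

With these assumed inputs the result lands in the low thousands, consistent with the order of magnitude quoted above.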

Part 4 – Dissecting Large‑Scale Web Applications

Replication and mesh topologies improve availability and scalability. Abstracting groups of servers into boxes simplifies architecture diagrams, which can then be layered to show inter‑server communication.
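Why replication improves availability can be shown with a short calculation. The sketch below assumes replicas fail independently, which is optimistic: shared power, network, or software bugs cause correlated failures in practice. The function name and the 99% figure are illustrative.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` identical copies is up.

    Assumes independent failures; correlated failures make the real
    figure lower than this estimate.
    """
    if not 0 <= single <= 1:
        raise ValueError("single must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    # Illustrative: each replica is available 99% of the time.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```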

Part 5 – Designing for Failure

Failures can arise from hardware, the network, power, software bugs, or human error. Redundancy (dual power supplies, RAID, redundant switches and databases) mitigates many of these failures, but at higher cost.

Failure sources: machines, switches, PDUs, routers, fiber, power stations, software bugs, human error, and attacks.

Effective fault handling includes traffic shifting, repairing faulty machines, and using automated rollbacks for software defects.
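Traffic shifting can be illustrated with a small load balancer that skips backends marked unhealthy. This is a minimal sketch with hypothetical class names. A production balancer would add active health checks, weights, and connection draining.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Backend:
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Round-robin over backends, routing around unhealthy ones."""

    def __init__(self, backends: Iterable[Backend]) -> None:
        self.backends: List[Backend] = list(backends)
        self._next = 0

    def pick(self) -> Backend:
        """Return the next healthy backend, or raise if none is healthy."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend.healthy:
                return backend
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    pool = [Backend("a"), Backend("b"), Backend("c")]
    balancer = RoundRobinBalancer(pool)
    pool[1].healthy = False  # simulate a failed machine awaiting repair
    print([balancer.pick().name for _ in range(4)])
```

Once the faulty machine is repaired, setting `healthy` back to `True` returns it to the rotation without any other change.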

Part 6 – Operational Best Practices

Change management must avoid downtime; deployments should be incremental, starting with a single machine, then a rack, a cluster, a region, and finally globally. Disaster recovery focuses on having a recovery plan rather than just backups.

Deploy small changes first, then scale up.

Roll back when issues arise.

Maintain documentation and configuration databases.
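The incremental deployment order described above can be sketched as a loop that widens the rollout one stage at a time and rolls back on the first failed health check. The stage names come from the text. The `deploy`, `is_healthy`, and `rollback` callbacks are hypothetical placeholders for whatever tooling a team actually uses.

```python
from typing import Callable, List

# Stage order taken from the article: smallest blast radius first.
STAGES: List[str] = ["single machine", "rack", "cluster", "region", "global"]


def staged_rollout(
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy stage by stage; on a failed check, undo every stage touched.

    Returns True if all stages passed, False if the change was rolled back.
    """
    touched: List[str] = []
    for stage in STAGES:
        deploy(stage)
        touched.append(stage)
        if not is_healthy(stage):
            # Undo in reverse order, starting with the stage that failed.
            for done in reversed(touched):
                rollback(done)
            return False
    return True


if __name__ == "__main__":
    ok = staged_rollout(
        deploy=lambda s: print(f"deploying to {s}"),
        is_healthy=lambda s: s != "cluster",  # simulate a failure at cluster scope
        rollback=lambda s: print(f"rolling back {s}"),
    )
    print("rollout succeeded" if ok else "rollout rolled back")
```

Stopping at the first unhealthy stage keeps the impact of a bad change limited to the machines already touched.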
