
What Does an SRE Do? A Practical Guide to Site Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE): its origins at Google, why hiring is difficult, and the three‑layer model of infrastructure, platform, and business SRE. It then covers deployment practices, on‑call work, SLI/SLO management, capacity planning, user documentation, and career advice for aspiring SREs.

Programmer DD

Nov 16, 2021


SRE (Site Reliability Engineering) was first introduced by Google as a way to solve operational problems with software, emphasizing standardization, automation, scalability, and high availability. The role bridges the gap between rapid development and stable operations.

Hiring SREs is difficult because the position requires either developers with operational experience or operators with strong coding skills, and many candidates mistakenly view it as low‑level ops work.

Three‑Layer Model of SRE Work

Infrastructure: Manage hardware, networks, and basic IaaS‑style services (e.g., server procurement, CMDB, OS versioning, monitoring agents, and network services such as DNS, NAT, firewalls, load balancers, CDN, and certificate management).

Platform: Provide reusable middleware and PaaS‑style services (e.g., RPC, private cloud, queues like Kafka/RabbitMQ, distributed cron jobs, caches, API gateways, object storage, NoSQL databases, and internal developer tools such as GitLab, CI/CD, image registries, and other DevOps utilities).

Business SRE: Own the reliability of specific services, and participate in architecture design, capacity planning, load testing, on‑call duties, and incident response.

Deployment Practices

Deployments are split into Day 1 (initial launch) and Day 2+ (continuous updates). Reliable deployments require traceable changes, often managed via GitOps, and a careful change‑control process that supports rollbacks and gray releases (gradual rollouts to a subset of traffic).
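A gray release with rollback can be sketched as a loop over traffic stages. This is a minimal illustration, not the article's own tooling: the function name `gray_release`, the stage percentages, and the two callbacks `set_traffic` and `is_healthy` are hypothetical stand-ins for whatever a real load balancer and monitoring system expose.

```python
from typing import Callable, Sequence


def gray_release(
    new_version: str,
    set_traffic: Callable[[str, int], None],  # hypothetical: route N% of traffic to a version
    is_healthy: Callable[[str], bool],        # hypothetical: health check for a version
    stages: Sequence[int] = (1, 10, 50, 100),  # assumed rollout percentages
) -> bool:
    """Shift traffic to new_version stage by stage.

    Returns True if the rollout reached the final stage, or False if a
    health check failed and traffic was rolled back to 0%.
    """
    for percent in stages:
        set_traffic(new_version, percent)
        if not is_healthy(new_version):
            # Rollback: send all traffic back to the previous version.
            set_traffic(new_version, 0)
            return False
    return True


if __name__ == "__main__":
    # Record every traffic change so the rollout stays traceable.
    audit_log = []
    ok = gray_release(
        "v2",
        set_traffic=lambda version, pct: audit_log.append((version, pct)),
        is_healthy=lambda version: True,
    )
    print(ok, audit_log)
```

Passing the traffic and health functions in as arguments keeps the rollout logic testable without touching real infrastructure, and the recorded sequence of traffic changes doubles as a change trail.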

On‑Call and Incident Management

On‑call involves monitoring alerts, verifying that an alert reflects a real incident, and applying SOPs (Standard Operating Procedures) to restore service quickly. Immediate mitigation usually takes priority; exhaustive root‑cause analysis can wait until service is restored.

SLI/SLO Definition

Service Level Indicators (SLI) and Service Level Objectives (SLO) must be clearly defined (e.g., availability thresholds, measurement windows, granularity) and monitored. The error budget, the amount of unreliability the SLO still permits, guides release decisions and operational actions.
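As a worked example of how an error budget might feed a release decision, the sketch below computes a request‑based availability SLI and the share of budget left. The names `SLO`, `availability_sli`, `error_budget_remaining`, and `can_release`, and the 10% release threshold, are assumptions for illustration; real teams choose their own SLI definitions and policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """An availability objective (hypothetical shape for this sketch)."""
    target: float     # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int  # window over which the request counts are collected


def availability_sli(good: int, total: int) -> float:
    """SLI: fraction of requests that succeeded in the window."""
    if total == 0:
        return 1.0  # no traffic, so nothing failed
    return good / total


def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0.0 < slo.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - slo.target) * total
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


def can_release(slo: SLO, good: int, total: int, min_budget: float = 0.1) -> bool:
    """Assumed policy: allow releases only while at least min_budget remains."""
    return error_budget_remaining(slo, good, total) >= min_budget


if __name__ == "__main__":
    slo = SLO(target=0.999, window_days=30)
    print(error_budget_remaining(slo, good=999_500, total=1_000_000))
    print(can_release(slo, good=998_000, total=1_000_000))
```

With a 99.9% target over 1,000,000 requests, roughly 1,000 failures are allowed, so 500 observed failures leave about half the budget, while 2,000 failures overspend it and block releases under this policy.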

Capacity Planning

Capacity planning requires modeling system resources and predicting growth, acknowledging that exact business expansion rates are hard to forecast.
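Because growth is hard to forecast, one common approach is to model several growth scenarios instead of a single number. The sketch below assumes compound monthly growth, a fixed per‑instance capacity, and a target utilization that leaves headroom; all function names, parameters, and figures are hypothetical.

```python
import math
from typing import Dict


def projected_load(current_peak: float, monthly_growth: float, months: int) -> float:
    """Peak load after `months` of compound growth (0.05 means 5% per month)."""
    return current_peak * (1.0 + monthly_growth) ** months


def instances_needed(
    peak_load: float,
    per_instance_capacity: float,
    target_utilization: float = 0.6,  # assumed headroom: run instances at 60% at peak
) -> int:
    """Instances required to serve peak_load without exceeding the target utilization."""
    usable_capacity = per_instance_capacity * target_utilization
    return math.ceil(peak_load / usable_capacity)


def plan(
    current_peak: float,
    per_instance_capacity: float,
    months: int,
    growth_scenarios: Dict[str, float],
    target_utilization: float = 0.6,
) -> Dict[str, int]:
    """Instance counts per named growth scenario."""
    return {
        name: instances_needed(
            projected_load(current_peak, growth, months),
            per_instance_capacity,
            target_utilization,
        )
        for name, growth in growth_scenarios.items()
    }


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 requests/s today, 100 requests/s per instance.
    scenarios = {"pessimistic": 0.02, "expected": 0.05, "optimistic": 0.10}
    print(plan(1000, 100, months=12, growth_scenarios=scenarios))
```

Comparing the scenarios side by side shows how sensitive the instance count is to the growth assumption, which is usually more useful than a single point estimate.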

User Support and Documentation

Effective documentation enables users to self‑serve, reducing repetitive support queries; documentation should be clear, example‑driven, and regularly updated.

Career Guidance

Transitioning to SRE requires a solid grounding in programming, system design, operating systems, and networking. Interview topics overlap with backend development and may include tools such as Kubernetes and monitoring systems. Continuous learning and practical project experience are essential.

