What Does an SRE Do? A Practical Guide to Site Reliability Engineering
This article explains what Site Reliability Engineering (SRE) is, where it came from at Google, why SREs are hard to hire, and how SRE work divides into three layers: infrastructure, platform, and business SRE. It then covers day‑to‑day responsibilities, on‑call practice, SLI/SLO management, and capacity planning, and closes with career advice for aspiring SREs.
Site Reliability Engineering (SRE) originated at Google as a way to solve operations problems with software engineering, with an emphasis on standardization, automation, scalability, and high availability. The role bridges the gap between fast‑moving development and stable operations.
Hiring SREs is difficult because the position requires either developers with operational experience or operators with strong coding skills, and many candidates mistakenly view it as low‑level ops work.
Three‑Layer Model of SRE Work
Infrastructure: Manage hardware, networks, and basic IaaS‑style services (e.g., server procurement, CMDB, OS versioning, monitoring agents, and network services such as DNS, NAT, firewalls, load balancers, CDN, and certificate management).
Platform: Provide reusable middleware and PaaS‑style services (e.g., RPC, private cloud, queues like Kafka/RabbitMQ, distributed cron jobs, caches, API gateways, object storage, NoSQL databases, and internal developer tools such as GitLab, CI/CD, image registries, and other dev‑ops utilities).
Business SRE: Own the reliability of specific services, and take part in architecture design, capacity planning, load testing, on‑call duties, and incident response.
Deployment Practices
Deployment work splits into Day 1 (the initial launch) and Day 2+ (ongoing updates). Reliable deployment requires every change to be traceable, which is often achieved with GitOps, and a change‑control process that supports rollback and gray (staged, canary‑style) releases.
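A staged (gray) rollout with a rollback decision can be sketched in a few lines. This is a minimal illustration and not a real deployment tool: the function name run_gray_release, the stage percentages, the observe_error_rate callback, and the 1% error threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    traffic_percent: int
    error_rate: float


def run_gray_release(stages, observe_error_rate, max_error_rate=0.01):
    """Walk through traffic stages and stop at the first unhealthy one.

    stages: iterable of traffic percentages, e.g. [1, 10, 50, 100]
    observe_error_rate: hypothetical callback returning the error rate
        measured while that share of traffic hits the new version
    Returns (completed, history); completed=False means "roll back".
    """
    history = []
    for percent in stages:
        rate = observe_error_rate(percent)
        history.append(StageResult(percent, rate))
        if rate > max_error_rate:
            # Stop widening the rollout; the caller should roll back.
            return False, history
    return True, history


# Usage with made-up measurements: the 50% stage breaches the threshold.
measured = {1: 0.001, 10: 0.002, 50: 0.05, 100: 0.0}
completed, history = run_gray_release([1, 10, 50, 100], measured.get)
print(completed, [s.traffic_percent for s in history])
```

The history list doubles as a small audit trail of which stages ran and what was observed, which fits the traceability requirement described above.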
On‑Call and Incident Management
On‑call work means watching alerts, verifying that an alert reflects a real incident, diagnosing the cause, and applying SOPs (Standard Operating Procedures) to restore service quickly. During an incident, immediate mitigation usually takes priority over exhaustive root‑cause analysis, which can follow once service is restored.
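One common way to separate real incidents from transient blips is to page only after several consecutive failed health checks. The sketch below assumes that approach; the function name should_page and the default of three consecutive failures are illustrative, not taken from the article.

```python
def should_page(check_results, required_failures=3):
    """Decide whether to page the on-call engineer.

    check_results: list of booleans, oldest first, True meaning healthy.
    Pages only when the most recent `required_failures` checks all failed,
    so a single flaky probe does not wake anyone up.
    """
    if len(check_results) < required_failures:
        return False
    recent = check_results[-required_failures:]
    return not any(recent)


# A sustained failure pages; an isolated failure does not.
print(should_page([True, False, False, False]))
print(should_page([False, True, False, False]))
```

The trade-off is detection delay: a higher required_failures value reduces false pages but lengthens the time before a real outage is escalated.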
SLI/SLO Definition
Service Level Indicators (SLIs) and Service Level Objectives (SLOs) must be clearly defined (for example, the availability threshold, the measurement window, and the granularity of measurement) and continuously monitored. The error budget, meaning the amount of unreliability the SLO still permits, guides release decisions and operational actions.
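As a worked example of the arithmetic, a 99.9% availability SLO over a 30‑day window allows 0.1% of 43,200 minutes, or 43.2 minutes of downtime. The sketch below applies the same idea to request counts. It assumes a request‑based availability SLI; the function name error_budget and the rule "release only while budget remains" are illustrative simplifications.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute a request-based SLI and the remaining error budget.

    slo_target: target success ratio, e.g. 0.999 for 99.9%
    Returns a dict with the measured SLI, the number of failures the SLO
    allows in this window, how many are left, and a simple release gate.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = 1 - failed_requests / total_requests
    allowed_failures = (1 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        "can_release": remaining > 0,
    }


# One million requests in the window, 400 of them failed.
report = error_budget(0.999, 1_000_000, 400)
print(round(report["allowed_failures"]), round(report["remaining"]), report["can_release"])
```

With these numbers roughly 1,000 failures are allowed and about 600 remain, so the gate stays open; at 1,200 failures the budget would be overspent and the gate would close.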
Capacity Planning
Capacity planning requires modeling system resources and predicting growth, acknowledging that exact business expansion rates are hard to forecast.
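Because the growth rate is uncertain, it can help to compute the runway under several assumed rates instead of a single forecast. The sketch below assumes simple compound monthly growth and a fixed safety headroom; the function name months_until_exhausted, the 20% headroom, and the sample numbers are hypothetical.

```python
import math


def months_until_exhausted(current_load, capacity, monthly_growth, headroom=0.2):
    """Estimate months until load reaches usable capacity.

    Assumes load grows by a constant factor (1 + monthly_growth) per month.
    usable capacity = capacity * (1 - headroom).
    Returns 0 if already at the limit, None if load is not growing.
    """
    usable = capacity * (1 - headroom)
    if current_load >= usable:
        return 0
    if monthly_growth <= 0:
        return None
    # Solve current_load * (1 + g)^n >= usable for the smallest integer n.
    return math.ceil(math.log(usable / current_load) / math.log(1 + monthly_growth))


# Compare pessimistic, expected, and optimistic growth assumptions.
for growth in (0.05, 0.10, 0.20):
    print(growth, months_until_exhausted(4000, 10000, growth))
```

With 4,000 QPS today and 10,000 QPS of capacity, 10% monthly growth reaches the 8,000 QPS usable limit in the eighth month; the spread across the three rates shows how sensitive the plan is to the growth assumption.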
User Support and Documentation
Effective documentation enables users to self‑serve, reducing repetitive support queries; documentation should be clear, example‑driven, and regularly updated.
Career Guidance
Moving into SRE requires a solid grounding in programming, system design, operating systems, and networking. Interview topics overlap with backend development and may also cover tools such as Kubernetes and monitoring systems. Continuous learning and hands‑on project experience are essential.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
