What Exactly Is SRE? A Deep Dive into Roles, Responsibilities, and Best Practices
This article explains what Site Reliability Engineering (SRE) is and outlines the three main layers of SRE work: Infrastructure, Platform, and Business. It also covers hiring challenges and daily duties such as deployment, on‑call, SLI/SLO management, capacity planning, and user support, and closes with practical interview and career advice.
What Is SRE?
SRE is a concept originally proposed by Google that applies software engineering to operations problems, with a focus on standardisation, automation, scalability, and high availability. It aims to ease the tension between shipping changes quickly and keeping operations stable.
Hiring Challenges
Finding suitable SREs is difficult because the role requires either experienced operations engineers with development skills or developers with operations experience. Many candidates mistakenly think it is a low‑level "operations" job.
Typical SRE Variations
Different companies label SRE differently; some use traditional operations titles, while others have specialised roles such as financial‑security SRE or Game Streaming SRE at Microsoft.
Three Main SRE Layers
Infrastructure: hardware, networking, and IaaS‑style resources, including the CMDB, OS version management, baseline software (NTP, monitoring agents), login methods, the observability stack, and network services (DNS, NAT, firewalls, load balancers, CDN).
Platform: middleware and shared services such as queues, caches, RPC, cronjobs, object storage, and databases, plus internal developer tools such as GitLab, CI/CD, image registries, and monitoring tools.
Business: maintains services and applications, participates in architecture design, provides technical support, and handles on‑call, capacity planning, and user support.
Infrastructure SRE Details
Server procurement, budgeting, CMDB management.
Provision reliable deployment environments (VMs or bare metal).
Maintain OS and kernel versions.
Manage baseline software (NTP, monitoring agents).
Provide login, permission management, command auditing.
Operate observability infrastructure (monitoring, logging, tracing).
Maintain network infrastructure (connectivity, NAT, DNS, firewalls, L4/L7 load balancers, CDN).
Platform SRE Details
RPC services that handle service discovery and inter‑service calls.
Private cloud services.
Queue services (Kafka, RabbitMQ).
Distributed cronjob services.
Caching services.
Gateway/reverse‑proxy configuration.
Object storage (e.g., S3).
Databases and search engines (Elasticsearch, MongoDB, etc.).
Internal developer tools (self‑hosted GitLab, CI/CD pipelines, Harbor, distributed compilation, Sentry).
Big‑data and offline computation services.
Business SRE Details
With Platform SRE support, developers can focus on code without worrying about deployment. Business SREs understand how requests flow through a service and what it depends on, design degradation strategies, take part in architecture decisions, and provide technical assistance.
System design (circuit breaking, degradation, scaling).
Load testing and capacity understanding.
Capacity planning.
On‑call responsibilities.
Certificate management.
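The circuit breaking and degradation item in the list above can be illustrated with a minimal sketch. This is not from the original article: the class name CircuitBreaker, its parameters (failure_threshold, reset_timeout, clock), and the fallback callable are all hypothetical names chosen for illustration, and a production breaker would also need thread safety and metrics.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency entirely and serve the degraded result.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The fallback is where a degradation strategy lives: for example, returning cached or default data instead of calling a struggling dependency.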
On‑Call Practices
On‑call means ensuring services run smoothly: receive alerts, verify issues, locate root causes, and resolve problems. Alerts must be tuned to avoid noise, and monitoring dashboards should evolve with the business.
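One common way to reduce alert noise is to page only when a condition persists across several consecutive evaluations, rather than on a single spike. The sketch below assumes a hypothetical helper named should_page and a list of recent error-rate samples; neither comes from the original article, and real monitoring systems usually provide this as a built-in rule option.

```python
def should_page(samples, threshold, min_consecutive):
    """Return True only if the last `min_consecutive` samples all exceed `threshold`.

    samples: recent metric values, oldest first (e.g. error rate per minute).
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False  # not enough data yet to call it a sustained problem
    return all(value > threshold for value in samples[-min_consecutive:])
```

With a threshold of 0.05 and min_consecutive of 2, a single spike followed by a healthy sample does not page, while two bad samples in a row do.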
SLI/SLO Management
Defining and monitoring Service Level Indicators (SLIs) and Service Level Objectives (SLOs) requires a clear definition of availability, an agreed calculation window, and an error budget, as well as a procedure for what happens when the budget is exhausted.
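The relationship between an availability SLI, an SLO target, and the error budget can be shown with a small calculation. This is a sketch under assumptions: it uses a request-based definition of availability (successful requests divided by total requests), and the function names error_budget and allowed_downtime_minutes are hypothetical, not taken from the article.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (sli, allowed_failures, budget_remaining) for one calculation window.

    slo_target: availability objective as a fraction, e.g. 0.999.
    budget_remaining: fraction of the error budget still unspent (negative if overspent).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = (1.0 - slo_target) * total_requests
    budget_remaining = 1.0 - failed_requests / allowed_failures
    return sli, allowed_failures, budget_remaining


def allowed_downtime_minutes(slo_target, window_days):
    """Time-based view of the same budget: minutes of full outage the SLO tolerates."""
    return (1.0 - slo_target) * window_days * 24 * 60
```

For example, a 99.9% target over a 30-day window tolerates about 43.2 minutes of full outage. A negative budget_remaining is the signal to apply whatever exhaustion procedure the team has agreed on.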
Post‑Incident Review
Post‑mortems should document timelines, actions, root‑cause analysis, and lessons learned while anonymising individuals. Action items should be meaningful, not merely bureaucratic.
Capacity Planning
Capacity planning is complex; it involves modeling system resources to anticipate growth and handle peak events.
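A very simple capacity model, given here only as a sketch, projects peak traffic forward and divides it by the load one instance can safely carry. All names and parameters (instances_needed, monthly_growth, target_utilisation, redundancy) are assumptions for illustration. The per-instance figure would come from load testing, and real models typically track several resources (CPU, memory, I/O, connections) rather than requests per second alone.

```python
import math


def instances_needed(peak_qps, per_instance_qps, monthly_growth, months_ahead,
                     target_utilisation=0.6, redundancy=1):
    """Estimate how many instances are needed to serve projected peak traffic.

    peak_qps: current observed peak requests per second.
    per_instance_qps: maximum sustainable QPS of one instance, from load testing.
    monthly_growth: expected compound growth per month, e.g. 0.05 for 5%.
    target_utilisation: fraction of per-instance capacity to actually use (headroom).
    redundancy: extra instances kept to survive failures or deployments.
    """
    if per_instance_qps <= 0 or not 0.0 < target_utilisation <= 1.0:
        raise ValueError("per_instance_qps and target_utilisation must be positive")
    projected_peak = peak_qps * (1.0 + monthly_growth) ** months_ahead
    usable_per_instance = per_instance_qps * target_utilisation
    return math.ceil(projected_peak / usable_per_instance) + redundancy
```

Peak events can be handled in the same model by passing the expected event peak as peak_qps with zero growth.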
User Support
Effective documentation reduces repetitive support requests; keep docs up to date and written so that users can help themselves.
Career Advice
Transitioning to SRE requires solid programming, system design, operating system, and networking knowledge.
Interview topics overlap with backend development; expect questions on Kubernetes, monitoring, and automation tools.
Writing code is essential; SREs are expected to code at the level of professional backend engineers.
Choosing company size: small firms offer broad exposure; large firms provide deep specialization.
Assess a company's health by its SRE‑to‑machine ratio and by whether the SRE team appears over‑staffed or under‑staffed.
Overall, SRE is a multifaceted discipline that blends development and operations to ensure reliable, scalable services.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.