What Exactly Is SRE? A Deep Dive into Roles, Responsibilities, and Best Practices

This article explains what Site Reliability Engineering (SRE) is and outlines the three main layers of SRE work: Infrastructure, Platform, and Business. It also covers hiring challenges and daily duties such as deployment, on-call, SLI/SLO management, capacity planning, and user support, and offers practical interview and career advice.

Efficient Ops

Apr 8, 2024

What Is SRE?

SRE is a discipline originally proposed by Google that applies software engineering to operations problems, focusing on standardisation, automation, scalability, and high availability. It aims to ease the tension between rapid development and stable operations.

Hiring Challenges

Finding suitable SREs is difficult because the role requires either experienced operations engineers with development skills or developers with operations experience. Many candidates mistakenly think it is a low‑level "operations" job.

Typical SRE Variations

Companies use the SRE label differently: some apply it to traditional operations roles, while others define specialised roles, such as a financial-security SRE or the Game Streaming SRE at Microsoft.

Three Main SRE Layers

Infrastructure: hardware, networking, IaaS-style resources, CMDB, OS version management, baseline software (NTP, monitoring agents), login methods, the observability stack, and network services (DNS, NAT, firewalls, load balancers, CDN).

Platform: middleware services (queues, caches, RPC, cronjobs, object storage, databases) and internal developer tools such as GitLab, CI/CD, image registries, and monitoring tools.

Business: maintains services and applications, participates in architecture design, provides technical support, and handles on-call, capacity planning, and user support.

Infrastructure SRE Details

Server procurement, budgeting, CMDB management.

Provision reliable deployment environments (VMs or bare metal).

Maintain OS and kernel versions.

Manage baseline software (NTP, monitoring agents).

Provide login, permission management, command auditing.

Operate observability infrastructure (monitoring, logging, tracing).

Maintain network infrastructure (connectivity, NAT, DNS, firewalls, L4/L7 load balancers, CDN).

Platform SRE Details

RPC services for service discovery and calls.

Private cloud services.

Queue services (Kafka, RabbitMQ).

Distributed cronjob services.

Caching services.

Gateway/reverse‑proxy configuration.

Object storage (e.g., S3).

Various databases (Elasticsearch, MongoDB, etc.).

Internal developer tools (self‑hosted GitLab, CI/CD pipelines, Harbor, distributed compilation, Sentry).

Big‑data and offline computation services.

Business SRE Details

With Platform SRE support, developers can focus on code without worrying about deployment. Business SREs understand how requests flow through a service and what it depends on, design degradation strategies, take part in architecture design, and provide technical assistance.

System design (circuit breaking, degradation, scaling).

Load testing to understand how much traffic the service can handle.

Capacity planning.

On‑call responsibilities.

Certificate management.
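
The circuit breaking and degradation items in the list above can be illustrated with a minimal sketch. It assumes a simple policy: open the circuit after a fixed number of consecutive failures, reject calls during a cool-down period, then allow one trial call. The names `CircuitBreaker`, `CircuitOpenError`, and `fetch_recommendations` are hypothetical, and the class is not thread-safe; production systems usually rely on an established library or a service mesh instead.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a trial call
        self._clock = clock               # injectable so tests can control time
        self._failures = 0
        self._opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency temporarily disabled")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure streak.
        self._failures = 0
        self._opened_at = None
        return result


def fetch_recommendations():
    # Hypothetical dependency that is currently failing.
    raise TimeoutError("backend too slow")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for _ in range(3):
    try:
        items = breaker.call(fetch_recommendations)
    except (TimeoutError, CircuitOpenError) as exc:
        items = []  # degrade: serve the page without recommendations
        print(type(exc).__name__)
```

In this run the first two calls reach the failing dependency, and the third is rejected immediately by the open circuit, so the caller falls back to an empty result without waiting on a timeout.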

On‑Call Practices

On-call means keeping services running smoothly: receive alerts, confirm that the issue is real, locate the root cause, and resolve the problem. Alerts must be tuned to avoid noise, and monitoring dashboards should evolve with the business.
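
One common way to tune out alert noise is to fire only when a check has failed several times in a row, so that a single blip does not page anyone. The sketch below assumes that policy; `DebouncedAlerter`, the check name `api-latency`, and the threshold of three are hypothetical choices, not something the article prescribes.

```python
from collections import defaultdict


class DebouncedAlerter:
    """Fire an alert only after a check has failed several times in a row."""

    def __init__(self, failures_to_fire: int = 3):
        self.failures_to_fire = failures_to_fire
        self._streaks = defaultdict(int)  # check name -> consecutive failures

    def observe(self, check: str, healthy: bool) -> bool:
        """Record one check result; return True when an alert should be sent."""
        if healthy:
            self._streaks[check] = 0  # any success resets the streak
            return False
        self._streaks[check] += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self._streaks[check] == self.failures_to_fire


alerter = DebouncedAlerter(failures_to_fire=3)
results = [True, False, True, False, False, False, False]
fired = [alerter.observe("api-latency", ok) for ok in results]
print(fired)
```

The isolated failure in the second position never pages, and the sustained failure pages once rather than on every check.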

SLI/SLO Management

Defining and monitoring Service Level Indicators (SLIs) and Service Level Objectives (SLOs) requires clear definitions of availability, calculation windows, and error budgets, as well as procedures for when a budget is exhausted.
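
As a rough illustration of how an error budget follows from an SLO, the sketch below assumes a request-based availability SLI (successful requests divided by total requests) measured over one fixed window. The 99.9% target, the request counts, and the names `SloWindow`, `availability`, and `error_budget_remaining` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one SLO calculation window (e.g. 30 days)."""
    total_requests: int
    failed_requests: int


def availability(window: SloWindow) -> float:
    """SLI: fraction of requests in the window that succeeded."""
    if window.total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, <= 0.0 is exhausted."""
    allowed_failures = (1 - slo_target) * window.total_requests
    if allowed_failures == 0:
        return 0.0 if window.failed_requests else 1.0
    return 1 - window.failed_requests / allowed_failures


window = SloWindow(total_requests=1_000_000, failed_requests=400)
print(f"availability: {availability(window):.4%}")
print(f"budget left:  {error_budget_remaining(window, slo_target=0.999):.1%}")
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 400 failures consume about 40% of the budget. A negative result means the budget is exhausted, which is the point at which the agreed procedure (for example, pausing risky releases) would apply.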

Post‑Incident Review

Post‑mortems should document timelines, actions, root‑cause analysis, and lessons learned while anonymising individuals. Action items should be meaningful, not merely bureaucratic.

Capacity Planning

Capacity planning is complex; it involves modelling how the system consumes resources so that it can absorb growth and handle peak events.
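
A very simplified capacity model can make this concrete. The sketch below assumes traffic grows at a fixed compound monthly rate, that per-instance throughput comes from a load test, and that instances should run below a target utilisation to leave headroom for peaks. All numbers and function names are hypothetical; real models also account for memory, storage, dependencies, and failure domains.

```python
import math


def projected_peak(current_peak_qps: float, monthly_growth: float, months: int) -> float:
    """Compound the current peak forward by a fixed monthly growth rate."""
    return current_peak_qps * (1 + monthly_growth) ** months


def instances_needed(peak_qps: float, qps_per_instance: float,
                     target_utilisation: float = 0.6) -> int:
    """Instances required to serve peak_qps while staying under the target utilisation."""
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    usable_qps = qps_per_instance * target_utilisation
    return math.ceil(peak_qps / usable_qps)


# Hypothetical inputs: 12,000 QPS peak today, 5% monthly growth,
# 500 QPS per instance measured in a load test.
for months in (0, 6, 12):
    peak = projected_peak(12_000, 0.05, months)
    print(f"+{months:2d} months: peak ~{peak:,.0f} QPS, "
          f"{instances_needed(peak, 500)} instances")
```

Running the projection for several horizons shows how quickly the instance count grows and how far ahead procurement or quota requests need to be made.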

User Support

Effective documentation reduces repetitive support requests; keep docs up to date and written so that users can solve problems on their own.

Career Advice

Transitioning to SRE requires solid programming, system design, OS, and networking knowledge.

Interview topics overlap with backend development; expect questions on Kubernetes, monitoring, and automation tools.

Writing code is essential, and SREs are expected to code at the level of professional backend engineers.

Choosing company size: small firms offer broad exposure; large firms provide deep specialisation.

Assess a company's health by its SRE-to-machine ratio and by whether the SRE team is over-staffed or under-staffed.

Overall, SRE is a multifaceted discipline that blends development and operations to ensure reliable, scalable services.
