
A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, and how its responsibilities differ across companies. It divides the work into Infrastructure, Platform, and Business SRE, and covers deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Architect

Apr 16, 2022


SRE (Site Reliability Engineering) originated at Google as a way to resolve the conflict between rapid development and stable operations. It applies software engineering to operational problems, with a focus on standardisation, automation, scalability, and high availability.

Hiring SREs is challenging because the role requires either developers with operations experience or ops engineers with strong coding skills; many candidates mistakenly view it as low‑level sysadmin work.

Different companies interpret SRE differently – some keep the traditional title, others create specialised roles such as financial‑security SRE or game‑streaming SRE.

The work can be categorised into three layers:

Infrastructure SRE: manages hardware, networking, data‑center resources, the CMDB, OS versions, basic software (NTP, monitoring agents), login methods, the observability stack, and network services (NAT, DNS, firewalls, load balancers, CDN, certificates).

Platform SRE: builds and maintains shared services on top of the infrastructure, such as RPC, private cloud, message queues (Kafka, RabbitMQ), cron jobs, caches, gateways, object storage (e.g., S3), data stores (SQL, NoSQL, Elasticsearch, MongoDB), and internal developer tools (GitLab, CI/CD, Harbor, distributed compilation, Sentry).

Business SRE: works closely with product teams, designs fault‑tolerant systems, conducts load testing and capacity planning, provides on‑call support, and helps with incident response and service reliability.

Deployment is split into Day 1 (initial launch) and Day 2+ (continuous updates, scaling, configuration changes). Reliable deployment requires traceability (e.g., GitOps) and careful change management.
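One way to picture the traceability requirement is an append-only change log in which every Day 2+ change records which commit was applied, by whom, and when. The sketch below is a hypothetical illustration and not part of any GitOps tool: the `Change` record and the `changes_before` helper are assumed names, and the one-hour lookback is an arbitrary default.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Change:
    """One applied change (hypothetical record layout)."""
    service: str
    commit: str          # revision that was rolled out
    author: str
    applied_at: datetime


def changes_before(
    log: Iterable[Change],
    incident_at: datetime,
    lookback: timedelta = timedelta(hours=1),
) -> List[Change]:
    """Return changes applied within `lookback` before the incident, newest first."""
    start = incident_at - lookback
    hits = [c for c in log if start <= c.applied_at <= incident_at]
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)


if __name__ == "__main__":
    log = [
        Change("api", "a1", "alice", datetime(2022, 4, 16, 9, 0)),
        Change("api", "b2", "bob", datetime(2022, 4, 16, 9, 50)),
        Change("db", "c3", "carol", datetime(2022, 4, 16, 10, 30)),
    ]
    for change in changes_before(log, datetime(2022, 4, 16, 10, 0)):
        print(change.applied_at, change.service, change.commit, change.author)
```

With every change recorded this way, the first question during an incident ("what changed recently?") becomes a lookup rather than an investigation.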

On‑call duties involve receiving alerts, validating real incidents, diagnosing root causes, and applying standard operating procedures (SOPs) to restore service, while continuously refining alert rules and monitoring dashboards.
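Part of refining alert rules is separating real incidents from transient blips. A common approach, sketched here under assumptions, is to page only after a check has failed several times in a row. The class name `ConsecutiveFailureGate` and the threshold of 3 are illustrative choices, not something the article prescribes.

```python
from collections import defaultdict


class ConsecutiveFailureGate:
    """Page only after `threshold` consecutive failed evaluations of a check."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self._streak = defaultdict(int)  # check name -> current failure streak

    def observe(self, check_name: str, healthy: bool) -> bool:
        """Record one evaluation; return True only when a page should fire.

        The page fires once, at the moment the streak reaches the threshold,
        so a long outage does not re-page on every evaluation.
        """
        if healthy:
            self._streak[check_name] = 0
            return False
        self._streak[check_name] += 1
        return self._streak[check_name] == self.threshold


if __name__ == "__main__":
    gate = ConsecutiveFailureGate(threshold=3)
    samples = [False, False, True, False, False, False, False]
    print([gate.observe("api-health", ok) for ok in samples])
    # prints [False, False, False, False, False, True, False]
```

The trade-off is detection delay: a higher threshold suppresses more noise but pages later, so the value should be tuned against how quickly the service must be restored.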

SLI/SLO definitions must specify the measurement granularity, the aggregation period, and how the error budget is handled; well‑defined SLOs are essential for setting realistic availability expectations.
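As a rough illustration of error-budget arithmetic, the sketch below assumes a request-based availability SLI (good requests divided by total requests) measured over a fixed window. The `SLO` record, the function names, and the convention that a window with no traffic counts as fully available are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    target: float      # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int   # aggregation period the target applies to


def availability_sli(good: int, total: int) -> float:
    """Request-based SLI: fraction of requests that were good."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good / total


def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_bad = (1.0 - slo.target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target (or zero traffic) leaves no budget to spend.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def allowed_downtime_minutes(slo: SLO) -> float:
    """Time-based view of the same budget: minutes of full outage per window."""
    return (1.0 - slo.target) * slo.window_days * 24 * 60


if __name__ == "__main__":
    slo = SLO(target=0.999, window_days=30)
    print(f"SLI: {availability_sli(999_500, 1_000_000):.4%}")
    print(f"Budget left: {error_budget_remaining(slo, 999_500, 1_000_000):.1%}")
    print(f"Allowed downtime: {allowed_downtime_minutes(slo):.1f} min")
```

A 99.9% target over 30 days allows roughly 43 minutes of full outage. The same target over one day allows under 90 seconds, which is why the aggregation period has to be stated alongside the number.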

Post‑mortems should document timelines, actions, root‑cause analysis, and lessons learned without blaming individuals, and should be shared openly when possible.

Capacity planning requires modelling resource needs based on business growth, even though exact demand is hard to predict.
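A minimal capacity model, assuming compound monthly traffic growth and a fixed per-instance throughput measured by load testing, might look like the following. The function names, the 10% growth rate, the 200 QPS per instance, and the 30% headroom are illustrative assumptions; real demand rarely follows a smooth curve, so the result is a planning baseline rather than a forecast.

```python
import math


def projected_peak(current_peak_qps: float, monthly_growth: float, months: int) -> float:
    """Project peak QPS forward, assuming compound monthly growth."""
    if months < 0:
        raise ValueError("months must be non-negative")
    return current_peak_qps * (1.0 + monthly_growth) ** months


def instances_needed(peak_qps: float, qps_per_instance: float, headroom: float = 0.3) -> int:
    """Instances required so each one runs at no more than (1 - headroom) of its capacity."""
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = qps_per_instance * (1.0 - headroom)
    return math.ceil(peak_qps / usable_qps)


if __name__ == "__main__":
    peak = projected_peak(current_peak_qps=1000, monthly_growth=0.10, months=6)
    print(f"Projected peak in 6 months: {peak:.0f} QPS")
    print(f"Instances needed: {instances_needed(peak, qps_per_instance=200)}")
```

The headroom term is what absorbs prediction error and instance failures; rerunning the model as actual traffic data arrives keeps the plan honest.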

User support includes maintaining up‑to‑date documentation to reduce repetitive queries and providing technical assistance.

Career advice covers skill requirements (coding, system design, OS/network knowledge), interview topics (similar to backend development), choosing between large and small companies, and evaluating a company's SRE team size and culture.




