
A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, and how its responsibilities differ across companies. It breaks the work into Infrastructure, Platform, and Business SRE, and covers deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Architect

SRE (Site Reliability Engineering) originated at Google to resolve the conflict between rapid development and stable operations. It applies software engineering to operational problems, with a focus on standardisation, automation, scalability, and high availability.

Hiring SREs is challenging because the role requires either developers with operations experience or ops engineers with strong coding skills; many candidates mistakenly view it as low‑level sysadmin work.

Different companies interpret SRE differently – some keep the traditional title, others create specialised roles such as financial‑security SRE or game‑streaming SRE.

The work can be categorised into three layers:

Infrastructure SRE: manages hardware, networking, data‑center resources, the CMDB, OS versions, basic software (NTP, monitoring agents), login methods, the observability stack, and network services (NAT, DNS, firewalls, load balancers, CDN, certificates).

Platform SRE: builds and maintains shared services on top of the infrastructure, such as RPC, private cloud, message queues (Kafka, RabbitMQ), cron jobs, caches, gateways, object storage (S3), databases (SQL, NoSQL, Elasticsearch, MongoDB), and internal developer tools (GitLab, CI/CD, Harbor, distributed compilation, Sentry).

Business SRE: works closely with product teams, designs fault‑tolerant systems, runs load tests, plans capacity, provides on‑call support, and helps with incident response and service reliability.

Deployment is split into Day 1 (initial launch) and Day 2+ (continuous updates, scaling, configuration changes). Reliable deployment requires traceability (e.g., GitOps) and careful change management.
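
As a minimal sketch of what traceability can mean in practice, the Python below records each applied change against the Git commit it was built from and answers the question "which commit was live at a given time?". The `Change` record, the `kind` values, and the `version_at` helper are hypothetical names chosen for illustration; they are not part of any GitOps tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class Change:
    """One applied change, tied to the Git commit it came from (hypothetical schema)."""
    service: str
    commit: str           # Git commit SHA the deployment was built from
    kind: str             # "day1" (initial launch) or "day2" (update, scaling, config)
    applied_at: datetime


def version_at(history: Iterable[Change], service: str, when: datetime) -> Optional[str]:
    """Return the commit that was live for `service` at time `when`, or None."""
    live = None
    for change in sorted(history, key=lambda c: c.applied_at):
        if change.service == service and change.applied_at <= when:
            live = change.commit
    return live


history = [
    Change("checkout", "a1b2c3d", "day1", datetime(2024, 1, 10, tzinfo=timezone.utc)),
    Change("checkout", "e4f5a6b", "day2", datetime(2024, 2, 1, tzinfo=timezone.utc)),
]
```

With this history, asking for the `checkout` version on 20 January 2024 returns the Day 1 commit, and any date before the launch returns `None`. The point of the sketch is that every running state can be traced back to a reviewed commit.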

On‑call duties involve receiving alerts, validating real incidents, diagnosing root causes, and applying standard operating procedures (SOPs) to restore service, while continuously refining alert rules and monitoring dashboards.
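
The validation step can be sketched roughly as follows, assuming an alert counts as a real incident only when the metric stays above its threshold for several consecutive samples. The alert names, the SOP texts, and the `is_sustained` and `triage` helpers are all hypothetical and exist only to illustrate the flow from alert to SOP.

```python
from typing import Optional, Sequence

# Hypothetical mapping from alert name to the SOP that handles it.
SOPS = {
    "high_error_rate": "Roll back the latest release, then check dependencies.",
    "disk_almost_full": "Rotate logs, then expand the volume.",
}


def is_sustained(samples: Sequence[float], threshold: float,
                 min_consecutive: int = 3) -> bool:
    """True if the most recent `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(v > threshold for v in recent)


def triage(alert_name: str, samples: Sequence[float],
           threshold: float) -> Optional[str]:
    """Return the SOP to follow, or None if the alert looks like a transient spike."""
    if not is_sustained(samples, threshold):
        return None
    return SOPS.get(alert_name, "No SOP yet: escalate and write one afterwards.")
```

A single spike such as `[0.01, 0.20, 0.01]` against a threshold of `0.05` yields `None`, while three consecutive breaches return the matching SOP. Tuning `min_consecutive` is one simple way to refine an alert rule that pages too often.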

Service level indicators (SLIs) and service level objectives (SLOs) must be defined with measurement granularity, aggregation period, and error‑budget handling in mind; they are essential for setting realistic availability expectations.
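
To make the error‑budget idea concrete, here is a minimal sketch assuming a request‑based availability SLI aggregated over a single window. The `SloWindow` type and the function names are illustrative assumptions, as is the choice to treat a window with no traffic as fully available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts aggregated over one SLO window (for example, 30 days)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: SloWindow) -> float:
    """Request-based availability SLI: fraction of requests that succeeded."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = (1 - slo_target) * window.total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% target or an empty window leaves no budget to spend
    return 1 - window.failed_requests / allowed_failures


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Time-based view of the same budget: minutes of full outage the SLO tolerates."""
    return (1 - slo_target) * window_days * 24 * 60
```

For example, a 99.9% target over one million requests allows about 1,000 failures, so 400 failures leave roughly 60% of the budget. Expressed as time, the same target over 30 days tolerates about 43.2 minutes of full outage. Changing the aggregation period changes both numbers, which is why the window must be part of the SLO definition.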

Post‑mortems should document timelines, actions, root‑cause analysis, and lessons learned without blaming individuals, and should be shared openly when possible.

Capacity planning requires modelling resource needs based on business growth, even though exact demand is hard to predict.
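
A rough sketch of such a model is shown below, assuming compound monthly traffic growth and a fixed per‑instance throughput measured by load testing. The function names, the growth model, and the default 30% headroom are assumptions for illustration, not figures from the article.

```python
import math


def projected_peak_qps(current_peak_qps: float, monthly_growth: float,
                       months: int) -> float:
    """Project peak QPS forward, assuming compound monthly growth."""
    return current_peak_qps * (1 + monthly_growth) ** months


def instances_needed(peak_qps: float, qps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve `peak_qps` while keeping `headroom` spare capacity."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = qps_per_instance * (1 - headroom)
    return math.ceil(peak_qps / usable_qps)


# Example: 2,000 QPS today, growing 10% per month, planned six months ahead.
future_qps = projected_peak_qps(2000, 0.10, 6)
fleet_size = instances_needed(future_qps, qps_per_instance=500)
```

In this example the projected peak is about 3,543 QPS, which needs 11 instances at 500 QPS each with 30% headroom. Because real demand is hard to predict, the growth rate is best treated as a range and the calculation rerun for optimistic and pessimistic cases.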

User support includes providing technical assistance and maintaining up‑to‑date documentation to reduce repetitive queries.

Career advice covers skill requirements (coding, system design, OS/network knowledge), interview topics (similar to backend development), choosing between large and small companies, and evaluating a company's SRE team size and culture.
