
Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why the role exists, and why hiring for it is hard. It breaks the role into three layers (Infrastructure, Platform, and Business) and covers their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

IT Architects Alliance

Many people ask what an SRE actually does. This post gives an overview of the role: what SRE is, its origins at Google, and its purpose of applying software engineering to operations problems through standardization, automation, scalability, and high availability.

SRE hiring is difficult because the role demands cross‑domain experience: companies typically look for developers with operations experience or ops engineers with software skills, and many simply rename traditional ops positions as SRE.

Different companies define SRE differently; examples include Ant Financial’s two SRE types (stability and financial‑security) and Netflix’s small Core SRE team supporting a global service.

We can categorize SRE work into three layers:

Infrastructure: hardware, networking, and IaaS‑like tasks (e.g., server procurement, CMDB, OS version management, basic software installation, login/permission management, the observability stack, and network services such as DNS, NAT, firewalls, load balancers, CDN, and certificate management).

Platform: providing middleware and "as‑a‑service" components (e.g., RPC, private cloud, queues like Kafka/RabbitMQ, cron services, cache, gateway/reverse proxy, object storage, data stores such as Elasticsearch and MongoDB, CI/CD systems, SCM, image registries, and internal developer tools).

Business SRE: maintaining applications, participating in system design (circuit breakers, degradation, scaling), load testing, capacity planning, on‑call duties, and supporting developers.

Deployment is split into Day 1 (the initial launch) and Day 2+ (continuous updates, scaling, rollback, and gray‑release/canary testing), with an emphasis on traceability, often achieved via GitOps.
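The gray‑release idea above can be sketched as a simple promotion check: route a small slice of traffic to the new version, compare its error rate against the stable baseline, and only proceed if it stays within a tolerance. This is a minimal illustration, not a real deployment tool; the function name and the 1.5× tolerance are assumptions for the example.

```python
def promote_canary(canary_error_rate: float,
                   baseline_error_rate: float,
                   tolerance: float = 1.5) -> bool:
    """Decide whether a canary (gray) release should be promoted.

    Promote only if the canary's error rate stays within `tolerance`
    times the baseline's error rate (1.5x is an arbitrary example).
    """
    return canary_error_rate <= baseline_error_rate * tolerance

# Canary at 1.2% errors vs. 1.0% baseline: within tolerance, promote.
print(promote_canary(0.012, 0.010))  # True
# Canary at 2.0% errors vs. 1.0% baseline: roll back instead.
print(promote_canary(0.020, 0.010))  # False
```

In practice the comparison would run against real metrics (e.g., from the monitoring stack) and feed back into the deployment pipeline, which is what keeps rollouts traceable.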

On‑call involves receiving alerts, verifying issues, diagnosing root causes, and applying SOPs. Alerts and monitoring must be continuously refined, and SLI/SLO metrics are used to measure reliability and track error budgets.
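The error‑budget arithmetic behind an SLO is simple enough to show directly: a 99.9% availability target over a 30‑day window allows roughly 43 minutes of downtime. A minimal sketch (the function name is ours, not from any particular SRE tool):

```python
def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Downtime allowed by an availability SLO over a given window.

    slo_target: availability target as a fraction, e.g. 0.999 for 99.9%.
    window_minutes: length of the measurement window in minutes.
    """
    return window_minutes * (1.0 - slo_target)

# 99.9% availability over a 30-day window: ~43.2 minutes of budget.
budget = error_budget_minutes(0.999, 30 * 24 * 60)
print(round(budget, 1))  # 43.2
```

Teams typically stop risky releases once the remaining budget for the window is exhausted, which turns the SLO into a concrete decision rule rather than a dashboard number.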

Incident post‑mortems aim to reduce future failures, requiring documented timelines, actions, root‑cause analysis, and transparent sharing (preferably anonymized).

Capacity planning is complex because growth is hard to predict, but modeling resource consumption helps estimate what is needed to absorb traffic spikes.
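A basic capacity model takes an expected peak load, the measured per‑server throughput, and a safety headroom, then rounds up to whole servers. This is a deliberately simplified sketch; the 30% headroom and the numbers in the example are assumptions, and real models would also account for failure domains and non‑linear scaling.

```python
import math

def required_servers(peak_rps: float,
                     per_server_rps: float,
                     headroom: float = 0.3) -> int:
    """Estimate server count for a peak load with safety headroom.

    peak_rps: expected peak requests per second.
    per_server_rps: sustainable requests per second per server
        (ideally measured via load testing).
    headroom: extra capacity fraction, e.g. 0.3 for 30% slack.
    """
    return math.ceil(peak_rps * (1.0 + headroom) / per_server_rps)

# 12,000 RPS peak, 500 RPS per server, 30% headroom -> 32 servers.
print(required_servers(12_000, 500))  # 32
```

The per‑server figure is the part worth getting right: it should come from the load testing mentioned earlier, not from vendor specs.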

User support includes technical consulting and troubleshooting; good documentation reduces repetitive queries, and documentation should be clear, example‑driven, and regularly updated.

Career advice covers common misconceptions, the need for both development and ops skills, differences between small and large companies, interview topics (Kubernetes, monitoring, coding), and the importance of coding ability for SREs.

Tags: operations, SRE, infrastructure, SLO, Site Reliability Engineering, on‑call, SLI
Written by IT Architects Alliance

A forum for discussion of systems, internet‑scale, distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, and AI, including real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
