What Does an SRE Actually Do? A Deep Dive into Roles and Practices
This article explains the origins of Site Reliability Engineering and breaks down its three main layers: Infrastructure, Platform, and Business SRE. It covers Day 1 and Day 2+ deployment, on‑call processes, SLI/SLO design, post‑mortems, capacity planning and user support, and offers practical advice for aspiring SREs.
What is SRE?
Site Reliability Engineering (SRE) originated at Google as a discipline that applies software engineering principles to operations. The goal is to achieve standardisation, automation, scalability and high availability while reconciling rapid development cycles with stable production services.
Why hiring SREs is difficult
Effective SREs need either strong development experience combined with operations knowledge, or deep ops expertise together with solid coding skills. Misunderstanding SRE as merely “ops” narrows the talent pool and makes recruitment challenging.
Three‑layer model of SRE work
Infrastructure SRE : Manages hardware, networking, CMDB, server procurement, OS/kernel versioning, base agents (NTP, monitoring), authentication, and the observability stack (metrics, logs, traces). Also owns network services such as NAT, DNS, firewalls, L4/L7 load balancers, CDN and certificate management.
Platform SRE : Builds shared services that developers can consume out‑of‑the‑box, e.g., message queues (Kafka, RabbitMQ), caches, distributed cron jobs, RPC frameworks, object storage (S3‑compatible), databases (relational SQL, NoSQL such as MongoDB, search engines such as Elasticsearch), and internal developer tooling (GitLab, CI/CD pipelines, container registries, Sentry, etc.).
Business (Application) SRE : Owns the runtime of business‑critical services, participates in architecture design (circuit‑breaking, degradation, scaling), performs capacity planning and load testing, and handles on‑call duties to keep applications reliable.
Infrastructure SRE responsibilities
Procure servers, manage budgets, maintain a CMDB with ownership information for every asset.
Provide reliable VM or bare‑metal environments and enforce consistent OS and kernel versions.
Install and maintain base software agents (NTP, monitoring, logging).
Implement login methods, permission management and command‑audit mechanisms.
Operate the observability stack: metrics collection, log aggregation, distributed tracing.
Manage network connectivity, NAT, DNS, firewalls, L4/L7 load balancers, CDN and TLS certificate lifecycle.
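As one small illustration of the TLS certificate lifecycle item above, the sketch below checks how close a certificate is to expiry. It assumes the expiry string is in the format Python's ssl module reports in the "notAfter" field of a peer certificate; the function names and the 30‑day renewal threshold are hypothetical choices, not something the article prescribes.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now_epoch):
    """Days between `now_epoch` (seconds since epoch) and the certificate expiry."""
    # cert_time_to_seconds parses strings such as "Jan  5 09:34:43 2018 GMT".
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / SECONDS_PER_DAY


def needs_renewal(not_after, now_epoch, threshold_days=30):
    """True when the certificate expires within `threshold_days` (assumed policy)."""
    return days_until_expiry(not_after, now_epoch) <= threshold_days
```

In practice the current time would come from time.time() and the expiry string from the certificate itself; passing both in as arguments keeps the check easy to test and to run against an inventory of certificates.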
Platform SRE responsibilities
Expose RPC services for inter‑service communication.
Provide private‑cloud APIs and self‑service portals.
Run queueing systems (Kafka, RabbitMQ) and caching layers.
Offer API gateways and reverse‑proxy configurations.
Maintain object storage (S3‑compatible) and various databases (relational, NoSQL, search).
Run internal developer environments: self‑hosted GitLab, CI/CD runners, container image registries (Harbor), distributed compilers and error‑tracking tools (Sentry).
Business SRE responsibilities
Collaborate with platform teams to troubleshoot issues and share tooling (Ansible, Puppet, Grafana, Prometheus).
Design service‑level resilience patterns (circuit‑breakers, graceful degradation, auto‑scaling).
Conduct performance and load testing to inform capacity models.
Participate in on‑call rotations, responding to alerts, diagnosing incidents and restoring service.
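The circuit‑breaker and graceful‑degradation patterns mentioned above can be sketched roughly as follows. This is a minimal, single‑threaded illustration: the class name, the failure threshold and the cooldown are all assumptions made for the example, and a production service would more likely rely on an established library or a service mesh.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures and allows a trial call after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures tolerated before opening
        self.reset_after = reset_after    # cooldown in seconds while open
        self.clock = clock                # injectable clock, handy for tests
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        # While open and still cooling down, skip the dependency entirely
        # and serve the degraded response.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cooldown elapsed: fall through and let one trial call run.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a dependency, for example breaker.call(fetch_recommendations, lambda: []), where fetch_recommendations is a hypothetical function and the empty list is the degraded response shown to users.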
Deployment practices
Deployments are split into:
Day 1 : Initial launch of a service.
Day 2+ : Ongoing updates, configuration changes, migrations and scaling.
Day 2+ work requires robust change management, gray (canary) releases and reliable rollback procedures. GitOps, which tracks the desired deployment state in version‑controlled repositories, is a common approach.
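A canary gate can be as simple as comparing the canary's error rate with the baseline's before promoting or rolling back. The sketch below is one possible shape for that decision; the function name, the 2x ratio, the minimum request count and the 0.1% floor are assumptions chosen for illustration rather than recommended values.

```python
def canary_decision(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    max_ratio=2.0, min_requests=100):
    """Return 'wait', 'promote' or 'rollback' for a canary release."""
    if canary_requests < min_requests:
        return "wait"  # too little canary traffic to judge yet

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    # Tolerate up to max_ratio times the baseline error rate, with a small
    # absolute floor so a zero-error baseline does not fail every canary.
    allowed_rate = max(baseline_rate * max_ratio, 0.001)
    return "rollback" if canary_rate > allowed_rate else "promote"
```

Real rollouts usually look at more than one signal (latency, saturation, business metrics), but keeping the decision in a small pure function makes the rollback criteria explicit and reviewable.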
On‑call process
On‑call engineers receive alerts, verify whether they indicate a real incident, adjust alert rules and dashboards as needed, and follow a documented Standard Operating Procedure (SOP) to restore service quickly. Continuous refinement of alerts and monitoring thresholds is essential.
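One common refinement that reduces false pages is to require a threshold to be breached for several consecutive samples before alerting, similar in spirit to a "for" duration in a Prometheus alerting rule. The function below is a minimal sketch of that idea; its name and the default of three samples are assumptions.

```python
def should_page(samples, threshold, consecutive=3):
    """Page only if the most recent `consecutive` samples all exceed `threshold`."""
    if len(samples) < consecutive:
        return False  # not enough data to call it a sustained breach
    return all(value > threshold for value in samples[-consecutive:])
```

With a threshold of 0.5, a single spike such as [0.1, 0.9, 0.1] would not page, while a sustained breach such as [0.1, 0.9, 0.8, 0.7] would.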
SLI / SLO definition and monitoring
Key considerations when defining Service Level Indicators (SLI) and Service Level Objectives (SLO):
Granularity – decide whether the metric is measured per instance, per zone or globally.
Measurement window – define the time bucket (e.g., 1 minute, 5 minutes) that determines up/down status.
Evaluation period – choose a rolling week, month or custom window.
Monitoring – implement real‑time dashboards that compute SLI values and compare them against SLO targets.
Error budget – establish actions (e.g., throttle releases) when the budget is exhausted.
Proper SLI/SLO tracking informs release decisions, capacity planning and internal service‑level agreements.
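To make the relationship between measurement windows, the evaluation period and the error budget concrete, here is a rough sketch. It assumes an availability‑style SLI defined as the fraction of "good" measurement windows; the function and field names are hypothetical.

```python
def error_budget_report(slo_target, good_windows, total_windows):
    """Summarise the SLI and error budget over one evaluation period.

    slo_target    -- e.g. 0.999 for "99.9% of windows are good"
    good_windows  -- measurement windows that counted as "up"
    total_windows -- all measurement windows in the evaluation period
    """
    if total_windows <= 0:
        raise ValueError("total_windows must be positive")

    sli = good_windows / total_windows
    budget_windows = (1 - slo_target) * total_windows  # bad windows we may spend
    bad_windows = total_windows - good_windows
    budget_remaining = budget_windows - bad_windows

    return {
        "sli": sli,
        "budget_windows": budget_windows,
        "budget_remaining": budget_remaining,
        # Example policy (an assumption): freeze releases once the budget is spent.
        "freeze_releases": budget_remaining <= 0,
    }
```

For example, a 30‑day period of 1‑minute windows contains 43,200 windows, so a 99.9% target allows roughly 43 bad minutes; after 30 bad minutes about 13 remain, and after 100 bad minutes the budget is exhausted and the example policy would freeze releases.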
Post‑mortem practices
Post‑mortems should document a chronological timeline, actions taken, root‑cause analysis and concrete lessons learned. Anonymise individuals to encourage candid discussion, and focus on actionable improvements rather than excessive bureaucracy.
Capacity planning
Build a quantitative model of required machines, CPU, memory and network bandwidth based on projected traffic spikes and large‑scale events. Update the model regularly as usage patterns evolve.
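A very simple version of such a model sizes the fleet by whichever resource runs out first. The sketch below assumes peak demand and per‑machine capacity are known per resource, and that machines should run at no more than a chosen fraction of capacity; the names, the 50% headroom and the single spare machine are illustrative assumptions.

```python
import math


def machines_needed(peak_demand, per_machine, headroom=0.5, spare=1):
    """Return (machine_count, bottleneck_resource) for a projected peak.

    peak_demand -- dict of resource name -> projected peak demand
    per_machine -- dict of resource name -> capacity of one machine
    headroom    -- fraction of each machine's capacity we plan to use
    spare       -- extra machines kept for failures and maintenance
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")

    # Machines required if each resource were the only constraint.
    counts = {
        resource: math.ceil(demand / (per_machine[resource] * headroom))
        for resource, demand in peak_demand.items()
    }
    bottleneck = max(counts, key=counts.get)
    return counts[bottleneck] + spare, bottleneck
```

With a projected peak of 400 CPU cores, 1000 GB of memory and 40 Gbps of traffic, and machines offering 32 cores, 128 GB and 10 Gbps each, this returns (26, "cpu_cores"): CPU is the bottleneck at 25 machines, plus one spare. Re‑running the calculation with fresh traffic projections is one way to keep the model current.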
User support and documentation
High‑quality, up‑to‑date documentation enables users to self‑solve common issues, reducing support load. Documentation should be concise, include concrete examples, and be reviewed periodically.
Career advice for aspiring SREs
Develop strong programming skills, system‑design knowledge, and a solid understanding of operating systems and networking.
Practice with real‑world projects; classroom courses alone are insufficient.
Interview topics overlap with backend engineering and often include Kubernetes, monitoring stacks and reliability concepts.
Interview‑question repository: https://github.com/bregman-arie/devops-exercises
Choosing between companies
Small companies often have a “firefighter” who knows the entire stack, providing rapid learning opportunities. Large companies offer specialised teams, deeper domain expertise and more mature processes. Evaluate the SRE‑to‑machine ratio and the amount of bureaucracy.
Assessing a company’s reliability culture
Look for a reasonable number of SREs relative to system size; over‑staffed SRE groups may indicate misaligned incentives. Inquire about the company’s stance on AIOps, which many practitioners view skeptically.
IT Architects Alliance
Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
