How to Build a Practical SRE Operations Framework for Large‑Scale Systems

This article presents a hands‑on SRE framework covering the full product lifecycle—code development, resource planning, deployment, operational reliability, and decommissioning—derived from real‑world practices at Xiaomi and Sina to help teams manage massive internet services efficiently and cost‑effectively.

dbaplus Community

Apr 10, 2022

Overview

This document describes a pragmatic SRE workflow for large‑scale internet services. It divides the service lifecycle into five phases—code development, resource planning, system launch, operational assurance, and system decommission—and outlines the concrete responsibilities, processes, and tooling that SRE teams should adopt to ensure user‑centric stability, cost efficiency, and automated operations.

1. Code Development Phase

The primary task is to provision and maintain a source‑code management platform (typically GitLab) and to lay the groundwork for CI/CD pipelines.

GitLab setup: create project namespaces, configure access controls, and enable merge‑request workflows.

CI/CD preparation: define build, test, and deployment jobs; store pipeline definitions (e.g., .gitlab-ci.yml) in the repository.

2. Resource Planning Phase

System resources are evaluated, requested, and managed through a structured workflow.

Solution evaluation: specify hardware specs (CPU, memory, storage), data‑center location, network layer (L4/L7) design, domain certificates, and CDN usage.

Resource request: submit each evaluated item to an approval process (e.g., ticketing system) before provisioning.

CMDB management: record hosts, containers, domains, certificates, load balancers, CDNs, storage, and network links in a configuration‑management database; enforce a consistent naming convention.

Permission segregation: grant developers read‑only access to production resources; only SREs receive root or privileged rights. Auditing is mandatory for all privileged actions.

Bulk operations: use automation tools such as Ansible or SaltStack together with secure jump hosts to execute large‑scale changes (e.g., OS patching, certificate renewal).
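The naming-convention rule for CMDB records can be checked mechanically before a record is accepted. The following is a minimal sketch, assuming a hypothetical hostname pattern of `<datacenter>-<env>-<service>-<index>`; the article does not state the actual convention, so the pattern and the helper names `parse_hostname` and `find_violations` are illustrative only.

```python
import re
from typing import Dict, Iterable, List, Optional

# Hypothetical convention (not from the article):
#   <datacenter>-<env>-<service>-<3-digit index>, e.g. "bj01-prod-api-007"
HOSTNAME_RE = re.compile(
    r"(?P<dc>[a-z]{2}\d{2})-"
    r"(?P<env>dev|staging|pre|prod)-"
    r"(?P<service>[a-z][a-z0-9]*)-"
    r"(?P<index>\d{3})"
)


def parse_hostname(name: str) -> Optional[Dict[str, str]]:
    """Return the name's components, or None if it breaks the convention."""
    match = HOSTNAME_RE.fullmatch(name)
    return match.groupdict() if match else None


def find_violations(hostnames: Iterable[str]) -> List[str]:
    """List every hostname that does not follow the convention."""
    return [h for h in hostnames if parse_hostname(h) is None]
```

A check like this could run as part of the resource-request approval step, so that badly named assets never reach the CMDB in the first place.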

3. System Launch Phase

Four isolated environments are defined and prepared before production release.

Environment planning: design dev, staging, pre‑release, and prod environments with network segmentation (e.g., VLANs, security groups) to prevent accidental data promotion.

Environment provisioning: decide between bare‑metal and cloud instances based on cost‑performance trade‑offs; configure host OS, kernel parameters, and container runtimes as needed.

CI/CD integration: co‑create deployment pipelines, define approval gates, and assign responsibilities. Code‑related releases are typically performed by developers, while infrastructure scaling and migration are handled by SREs.
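One way to express an approval gate across the four environments is a small promotion check. This is a sketch under two assumptions that the article does not spell out: builds move one environment at a time in the order dev, staging, pre-release, prod, and only the final step into prod requires an explicit approval. The function name `can_promote` is hypothetical.

```python
# Assumed promotion order; the article names the environments but not the order of gates.
ENVIRONMENTS = ["dev", "staging", "pre-release", "prod"]


def can_promote(current: str, target: str, approved: bool = False) -> bool:
    """Allow promotion only to the next environment; prod also needs approval."""
    cur = ENVIRONMENTS.index(current)  # raises ValueError for unknown names
    tgt = ENVIRONMENTS.index(target)
    if tgt != cur + 1:
        return False  # no skipping environments, no moving backwards
    if target == "prod" and not approved:
        return False  # the approval gate for production
    return True
```

A pipeline job could call such a check before deploying, which keeps the gate logic in one reviewable place rather than scattered across job definitions.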

4. Operational Assurance Phase

This phase focuses on reliability, capacity, and incident management.

4.1 Change Control

Enforce strict change‑management policies (e.g., change tickets, peer review, automated testing) to reduce fault‑inducing deployments.
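The three example policies (change ticket, peer review, automated testing) can be combined into a single pre-deployment check. The sketch below is illustrative: the `ChangeRequest` fields and the `blocking_reasons` helper are assumptions, not part of any particular ticketing system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRequest:
    """Hypothetical record describing one proposed change."""
    ticket_id: str = ""
    author: str = ""
    reviewers: List[str] = field(default_factory=list)
    tests_passed: bool = False


def blocking_reasons(change: ChangeRequest) -> List[str]:
    """Return every policy the change violates; an empty list means it may proceed."""
    reasons = []
    if not change.ticket_id:
        reasons.append("missing change ticket")
    # Peer review here means at least one reviewer other than the author.
    if not any(r != change.author for r in change.reviewers):
        reasons.append("no independent peer review")
    if not change.tests_passed:
        reasons.append("automated tests have not passed")
    return reasons
```

Returning all reasons at once, rather than failing on the first, lets the author fix everything in a single pass.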

4.2 Capacity Management

Continuously monitor CPU, memory, and storage utilization; implement auto‑scaling rules; and conduct periodic right‑sizing reviews to avoid over‑provisioning.
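As an illustration of an auto-scaling rule, the sketch below uses proportional scaling: the replica count is scaled by the ratio of observed to target utilization and then clamped. This is the same basic formula the Kubernetes Horizontal Pod Autoscaler documents, but the article does not say which rule its authors use, so treat the function name, parameters, and default limits as assumptions.

```python
import math


def desired_replicas(
    current: int,
    observed_util: float,
    target_util: float,
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """Proportional scaling: ceil(current * observed / target), clamped to limits.

    Utilization values can be any consistent unit, for example CPU percent.
    """
    if current < 1 or target_util <= 0:
        raise ValueError("current must be >= 1 and target_util must be positive")
    wanted = math.ceil(current * observed_util / target_util)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 4 replicas running at 90% CPU against a 60% target yields 6 replicas, while the same 4 replicas at 30% would shrink to 2. The same calculation, run over a longer window, can feed the periodic right-sizing review.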

4.3 Disaster Recovery

Build hot‑standby solutions for access‑layer, data‑center, host, and data‑level components. Prefer hot standby over cold standby: cold‑standby capacity sits idle until a failure occurs, which wastes resources.

4.4 Business Inspection

Run regular health checks on machine metrics (CPU, memory, I/O, network) and business KPIs (error rate, QPS, latency) across the full service chain.
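The three business KPIs can be derived from raw request records. The following is a minimal sketch, assuming each record carries an HTTP status code and a latency, that status codes of 500 and above count as errors, and that latency is reported as a nearest-rank 99th percentile; none of these choices come from the article.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency in milliseconds


def summarize(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Compute QPS, error rate, and p99 latency for one inspection window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    n = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile: rank = ceil(0.99 * n), done in integer arithmetic.
    rank = (99 * n + 99) // 100
    return {
        "qps": n / window_seconds,
        "error_rate": errors / n,
        "p99_latency_ms": latencies[rank - 1],
    }
```

Running the same summary at each hop of the service chain makes it easier to see where an error-rate or latency regression first appears.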

4.5 Critical Event Support

Treat each major event as an independent project with a defined start, a defined end, and explicit closure criteria.

4.6 Incident Drills

Schedule periodic fault‑simulation exercises, validate runbooks, and measure mean‑time‑to‑detect (MTTD) and mean‑time‑to‑recover (MTTR).
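MTTD and MTTR can be computed directly from drill timestamps. Definitions vary between teams; this sketch assumes MTTD is the mean of (detected minus started) and MTTR is the mean of (recovered minus started). The `Drill` record and the `mttd_mttr` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Drill:
    started: datetime    # when the fault was injected
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when service was restored


def mttd_mttr(drills: List[Drill]) -> Tuple[timedelta, timedelta]:
    """Return (MTTD, MTTR), both measured from the moment the fault started."""
    if not drills:
        raise ValueError("at least one drill is required")
    n = len(drills)
    detect_total = sum((d.detected - d.started for d in drills), timedelta())
    recover_total = sum((d.recovered - d.started for d in drills), timedelta())
    return detect_total / n, recover_total / n
```

Tracking both numbers across successive drills shows whether alerting (MTTD) or the runbooks themselves (the gap between MTTD and MTTR) need the most attention.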

4.7 Log Management

Standardize collection of system logs (e.g., syslog, Nginx) and application logs; store them in a centralized log platform; define ownership (SRE for infrastructure logs, shared for application logs).

4.8 Technical Tuning

Adjust kernel parameters, service configurations, and architecture components (e.g., voice‑access layer) to improve latency and throughput.

4.9 Service Governance

Maintain a service registry containing owner, domain, deployment location, key metrics, and SLA definitions; manage onboarding/offboarding workflows.
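A service registry with the fields listed above can start as a very small data model. The sketch below is illustrative only: the class names, the default SLA value, and the rule that every service must have an owner are assumptions layered on the article's field list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServiceEntry:
    name: str
    owner: str
    domain: str
    location: str                       # deployment location, e.g. a data center
    key_metrics: List[str] = field(default_factory=list)
    sla_availability: float = 0.999     # assumed example target, not from the article


class ServiceRegistry:
    """In-memory registry with onboarding and offboarding workflows."""

    def __init__(self) -> None:
        self._entries: Dict[str, ServiceEntry] = {}

    def onboard(self, entry: ServiceEntry) -> None:
        if not entry.owner:
            raise ValueError("every service needs a named owner")
        if entry.name in self._entries:
            raise ValueError(f"{entry.name} is already registered")
        self._entries[entry.name] = entry

    def offboard(self, name: str) -> ServiceEntry:
        """Remove a service and return its record (KeyError if unknown)."""
        return self._entries.pop(name)

    def owned_by(self, owner: str) -> List[str]:
        return sorted(e.name for e in self._entries.values() if e.owner == owner)
```

In practice the registry would be backed by a database or the CMDB, but the same onboarding checks apply.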

4.10 On‑Call Management

Follow the “receive alerts → assist investigation → communicate frequently → close the loop” principle. Separate alert duty from routine ticket handling.

Incident Lifecycle

Pre‑incident: reduce the number of faults that slip through by configuring precise, low‑noise alerts; reserve “life‑or‑death” alerts for critical services.

During incident: use monitoring dashboards and link‑level architecture views to pinpoint the failing component quickly.

Post‑incident: conduct blameless post‑mortems, produce a fault report, and implement corrective actions to prevent recurrence.
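For the pre-incident goal of precise, low-noise alerts, one common technique is to fire only after a threshold has been breached for several consecutive samples, so that a single spike does not page anyone. The article does not prescribe a specific rule; the sketch below, including the `should_alert` name and the default of three samples, is an assumption.

```python
from typing import Sequence


def should_alert(samples: Sequence[float], threshold: float, consecutive: int = 3) -> bool:
    """Fire only if the most recent `consecutive` samples all exceed the threshold."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < consecutive:
        return False  # not enough data to be confident yet
    return all(s > threshold for s in samples[-consecutive:])
```

The trade-off is detection delay: requiring more consecutive breaches lowers noise but raises MTTD, so "life-or-death" alerts on critical services may warrant a smaller window.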

5. System Decommission Phase

When a service is retired, systematically reclaim all associated assets—servers, domains, load balancers, ACLs, and network links—to avoid orphaned resources.
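If the CMDB tags every asset with the service it belongs to, orphaned resources can be found by comparing those tags with the list of services still active. This is a sketch assuming each CMDB record is a dictionary with hypothetical `id`, `type`, and `service` keys.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def find_orphans(resources: Iterable[dict], active_services: Set[str]) -> Dict[str, List[str]]:
    """Group the IDs of resources whose owning service is no longer active, by type."""
    orphans: Dict[str, List[str]] = defaultdict(list)
    for res in resources:
        if res["service"] not in active_services:
            orphans[res["type"]].append(res["id"])
    return dict(orphans)
```

Grouping by resource type lets each reclamation step (releasing servers, deleting domains, removing ACLs) work from its own checklist.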

Automation, quality awareness, cost consciousness, and a product‑owner mindset are essential throughout the lifecycle.
