How to Build a Practical SRE Operations Framework for Large‑Scale Systems
This article presents a hands‑on SRE framework covering the full product lifecycle—code development, resource planning, deployment, operational reliability, and decommissioning—derived from real‑world practices at Xiaomi and Sina to help teams manage massive internet services efficiently and cost‑effectively.
Overview
This document describes a pragmatic SRE workflow for large‑scale internet services. It divides the service lifecycle into five phases—code development, resource planning, system launch, operational assurance, and system decommission—and outlines the concrete responsibilities, processes, and tooling that SRE teams should adopt to ensure user‑centric stability, cost efficiency, and automated operations.
1. Code Development Phase
The primary task is to provision and maintain a source‑code management platform (typically GitLab) and to lay the groundwork for CI/CD pipelines.
GitLab setup : create project namespaces, configure access controls, and enable merge‑request workflows.
CI/CD preparation : define build, test, and deployment jobs; store pipeline definitions (e.g., .gitlab-ci.yml) in the repository.
2. Resource Planning Phase
System resources are evaluated, requested, and managed through a structured workflow.
Solution evaluation : specify hardware specs (CPU, memory, storage), data‑center location, network layer (L4/L7) design, domain certificates, and CDN usage.
Resource request : submit each evaluated item to an approval process (e.g., ticketing system) before provisioning.
CMDB management : record hosts, containers, domains, certificates, load balancers, CDNs, storage, and network links in a configuration‑management database; enforce a consistent naming convention.
Permission segregation : grant developers read‑only access to production resources; only SREs receive root or privileged rights. Auditing is mandatory for all privileged actions.
Bulk operations : use automation tools such as Ansible or SaltStack together with secure jump hosts to execute large‑scale changes (e.g., OS patching, certificate renewal).
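The naming convention mentioned under CMDB management can be checked mechanically before a record is accepted. The following is a minimal sketch only: the `<env>-<service>-<role>-<index>` pattern, the environment list, and the function names are assumptions made for illustration, not a convention given by the source.

```python
import re
from typing import Dict, List, Optional

# Hypothetical convention: <env>-<service>-<role>-<two-digit index>,
# e.g. "prod-search-web-01". Both the pattern and the allowed
# environments are illustrative assumptions.
NAME_PATTERN = re.compile(
    r"^(?P<env>dev|staging|pre|prod)-"
    r"(?P<service>[a-z][a-z0-9]*)-"
    r"(?P<role>[a-z][a-z0-9]*)-"
    r"(?P<index>\d{2})$"
)


def parse_resource_name(name: str) -> Optional[Dict[str, str]]:
    """Return the components of a name, or None if it breaks the convention."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


def find_violations(names: List[str]) -> List[str]:
    """List every CMDB entry whose name does not follow the convention."""
    return [name for name in names if parse_resource_name(name) is None]
```

Running `find_violations` over an export of CMDB names gives a list of records to rename, and the parsed components can double as searchable CMDB fields.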
3. System Launch Phase
Four isolated environments are defined and prepared before production release.
Environment planning : design dev, staging, pre‑release, and prod environments with network segmentation (e.g., VLANs, security groups) so that test data and unreleased changes cannot reach production by accident.
Environment provisioning : decide between bare‑metal and cloud instances based on cost‑performance trade‑offs; configure host OS, kernel parameters, and container runtimes as needed.
CI/CD integration : co‑create deployment pipelines, define approval gates, and assign responsibilities—code‑related releases are typically performed by developers, while infrastructure scaling and migration are handled by SREs.
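The environment order and approval gates described above can be expressed as a small promotion check. This is a sketch under stated assumptions: the rule that a build may only move to the next environment, and that only prod requires explicit approval, is an illustrative policy rather than one spelled out in the source.

```python
# Environments in promotion order, as named in the text.
ENVIRONMENTS = ["dev", "staging", "pre-release", "prod"]


def can_promote(source: str, target: str, approved: bool = False) -> bool:
    """Allow promotion only to the next environment; prod also needs approval.

    Raises ValueError if either environment name is unknown.
    """
    src = ENVIRONMENTS.index(source)
    dst = ENVIRONMENTS.index(target)
    if dst != src + 1:
        # Skipping an environment or moving backwards is rejected.
        return False
    if target == "prod" and not approved:
        # Assumed policy: the final step requires an explicit approval gate.
        return False
    return True
```

A pipeline job could call `can_promote` before deploying, so that a build cannot jump from dev straight to prod.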
4. Operational Assurance Phase
This phase focuses on reliability, capacity, and incident management.
4.1 Change Control
Enforce strict change‑management policies (e.g., change tickets, peer review, automated testing) to reduce fault‑inducing deployments.
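As one way to make such a policy enforceable, a deployment tool can refuse any change that lacks a ticket, an independent review, or passing tests. The sketch below assumes a hypothetical `ChangeRequest` record; its field names are illustrative and do not correspond to any particular ticketing system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRequest:
    """Hypothetical change record; field names are assumptions."""
    ticket_id: str
    author: str
    reviewers: List[str] = field(default_factory=list)
    tests_passed: bool = False


def blocking_reasons(change: ChangeRequest) -> List[str]:
    """Return every policy violation; an empty list means the change may proceed."""
    reasons = []
    if not change.ticket_id:
        reasons.append("missing change ticket")
    # A review only counts if it comes from someone other than the author.
    independent = [r for r in change.reviewers if r != change.author]
    if not independent:
        reasons.append("no peer review by someone other than the author")
    if not change.tests_passed:
        reasons.append("automated tests have not passed")
    return reasons
```

Returning the full list of reasons, rather than failing on the first one, lets the author fix everything in a single pass.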
4.2 Capacity Management
Continuously monitor CPU, memory, and storage utilization; implement auto‑scaling rules; and conduct periodic right‑sizing reviews to avoid over‑provisioning.
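A simple scaling rule of this kind is proportional: scale the instance count by the ratio of observed to target utilization, then clamp it to agreed bounds. The sketch below follows that idea (the same proportional form used by the Kubernetes Horizontal Pod Autoscaler); the function name, parameters, and default bounds are assumptions for illustration.

```python
import math


def desired_replicas(current: int, observed_pct: float, target_pct: float,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Proportional scaling: replicas needed to bring utilization to the target."""
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    ratio = current * observed_pct / target_pct
    # Subtract a tiny tolerance so floating-point noise does not add a replica.
    wanted = math.ceil(ratio - 1e-9)
    # Clamp to the agreed bounds so a bad metric cannot scale without limit.
    return max(min_replicas, min(max_replicas, wanted))
```

The same function serves right-sizing reviews: a result well below `current` for a sustained period suggests the service is over-provisioned.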
4.3 Disaster Recovery
Build hot‑standby solutions for access‑layer, data‑center, host, and data‑level components. Prefer hot standby over cold standby, so that redundant capacity is continuously exercised instead of sitting idle until a failure.
4.4 Business Inspection
Run regular health checks on machine metrics (CPU, memory, I/O, network) and business KPIs (error rate, QPS, latency) across the full service chain.
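For the business KPIs, an inspection job can reduce a window of request samples to QPS, error rate, and tail latency, then compare them with thresholds. This is a minimal sketch: the sample format, the use of p99 as the latency measure, and the default thresholds are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InspectionResult:
    qps: float
    error_rate: float
    p99_latency_ms: float
    healthy: bool


def inspect_window(samples: List[Tuple[bool, float]], window_seconds: float,
                   max_error_rate: float = 0.01,
                   max_p99_ms: float = 500.0) -> InspectionResult:
    """Summarize (succeeded, latency_ms) samples collected over one window."""
    if not samples:
        raise ValueError("no samples in window")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(samples)
    errors = sum(1 for ok, _ in samples if not ok)
    latencies = sorted(latency for _, latency in samples)
    # Nearest-rank p99: ceil(0.99 * total), computed in integer arithmetic.
    rank = (99 * total + 99) // 100
    p99 = latencies[rank - 1]
    error_rate = errors / total
    healthy = error_rate <= max_error_rate and p99 <= max_p99_ms
    return InspectionResult(total / window_seconds, error_rate, p99, healthy)
```

Running the same check at each hop of the service chain makes it easier to see where a degradation first appears.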
4.5 Critical Event Support
Treat major incidents as independent projects with defined start, end, and closure criteria.
4.6 Incident Drills
Schedule periodic fault‑simulation exercises, validate runbooks, and measure mean‑time‑to‑detect (MTTD) and mean‑time‑to‑recover (MTTR).
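Both metrics fall out of three timestamps per drill. In the sketch below, MTTD is measured from fault injection to detection and MTTR from fault injection to recovery; teams that measure recovery from the moment of detection would adjust the second function. The `DrillRecord` type and its field names are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class DrillRecord:
    """Timestamps captured for one fault-simulation exercise."""
    injected_at: datetime
    detected_at: datetime
    recovered_at: datetime


def _mean(deltas: List[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("at least one drill record is required")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(records: List[DrillRecord]) -> timedelta:
    """Mean time from fault injection to detection."""
    return _mean([r.detected_at - r.injected_at for r in records])


def mttr(records: List[DrillRecord]) -> timedelta:
    """Mean time from fault injection to recovery (assumed definition)."""
    return _mean([r.recovered_at - r.injected_at for r in records])
```

Tracking these two values across successive drills shows whether alerting and runbooks are actually improving.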
4.7 Log Management
Standardize collection of system logs (e.g., syslog, Nginx) and application logs; store them in a centralized log platform; define ownership (SRE for infrastructure logs, shared for application logs).
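One way to standardize application logs is to emit every record as a single JSON object with a fixed set of fields, including an ownership tag. The sketch below uses Python's standard `logging` module; the field names and the `owner` attribute (defaulting to "shared" for application logs) are assumptions for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Assumed convention: infrastructure logs set owner="sre" via
            # the `extra` argument; application logs default to "shared".
            "owner": getattr(record, "owner", "shared"),
        }
        return json.dumps(payload, ensure_ascii=False)
```

A handler configured with `handler.setFormatter(JsonFormatter())` then produces lines that a centralized log platform can index without per-service parsing rules.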
4.8 Technical Tuning
Adjust kernel parameters, service configurations, and architecture components (e.g., voice‑access layer) to improve latency and throughput.
4.9 Service Governance
Maintain a service registry containing owner, domain, deployment location, key metrics, and SLA definitions; manage onboarding/offboarding workflows.
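The registry fields listed above map naturally onto a small record type with onboarding and offboarding operations. This is a sketch only: the class names, the in-memory storage, and the representation of the SLA as an availability percentage are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ServiceEntry:
    name: str
    owner: str
    domain: str
    location: str
    key_metrics: Tuple[str, ...]
    sla_availability_pct: float  # assumed SLA form, e.g. 99.9


class ServiceRegistry:
    """In-memory registry; a real one would persist entries."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceEntry] = {}

    def onboard(self, entry: ServiceEntry) -> None:
        if not entry.owner:
            raise ValueError("every service needs an owner")
        if entry.name in self._services:
            raise ValueError(f"{entry.name} is already registered")
        self._services[entry.name] = entry

    def offboard(self, name: str) -> ServiceEntry:
        """Remove and return the entry; raises KeyError if it is unknown."""
        return self._services.pop(name)

    def owned_by(self, owner: str) -> List[str]:
        return sorted(n for n, e in self._services.items() if e.owner == owner)
```

Rejecting entries without an owner at onboarding time keeps the registry usable for paging and escalation later.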
4.10 On‑Call Management
Follow the “receive alerts → assist investigation → communicate frequently → close the loop” principle. Separate alert duty from routine ticket handling.
Incident Lifecycle
Pre‑incident : Reduce the number of faults that go unnoticed or get lost in noise by configuring precise, low‑noise alerts; reserve “life‑or‑death” alerts for conditions that threaten critical services.
During incident : Use monitoring dashboards and link‑level architecture views to pinpoint the failing component quickly.
Post‑incident : Conduct blameless post‑mortems, produce a fault report, and implement corrective actions to prevent recurrence.
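A common way to keep alerts low-noise, as the pre-incident item asks, is to suppress repeats of the same alert inside a time window. The sketch below shows that idea; the class name, the `(service, check)` key, and the window semantics are assumptions rather than a described implementation.

```python
from typing import Dict, Tuple


class AlertDeduplicator:
    """Suppress repeats of the same (service, check) alert within a window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def should_notify(self, service: str, check: str, now: float) -> bool:
        """Return True if this alert should page; `now` is a timestamp in seconds."""
        key = (service, check)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            # Same alert fired recently: stay quiet to avoid paging fatigue.
            return False
        self._last_sent[key] = now
        return True
```

Passing `now` explicitly (for example from `time.time()`) keeps the class easy to test and makes the suppression window deterministic.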
5. System Decommission Phase
When a service is retired, systematically reclaim all associated assets—servers, domains, load balancers, ACLs, and network links—to avoid orphaned resources.
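If the CMDB links every resource to a service, the reclamation list can be generated rather than assembled by hand. The sketch below assumes CMDB records are dictionaries with `id`, `type`, and `service` keys; that schema and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Set


def reclaim_plan(cmdb: List[Dict[str, str]], service: str) -> Dict[str, List[str]]:
    """Group every resource owned by the retired service by resource type."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for record in cmdb:
        if record["service"] == service:
            plan[record["type"]].append(record["id"])
    return {rtype: sorted(ids) for rtype, ids in sorted(plan.items())}


def still_orphaned(plan: Dict[str, List[str]], reclaimed: Set[str]) -> List[str]:
    """Resources in the plan that have not been confirmed as reclaimed."""
    return sorted(rid for ids in plan.values() for rid in ids if rid not in reclaimed)
```

Decommissioning is complete only when `still_orphaned` returns an empty list, which gives the process a clear closure criterion.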
Automation, quality awareness, cost consciousness, and a product‑owner mindset are essential throughout the lifecycle.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
