How Alibaba’s ECS Team Built a Scalable SRE System for Massive Cloud Services
This article explains the origins of Site Reliability Engineering (SRE), outlines the responsibilities of SRE teams, and details Alibaba Cloud’s ECS SRE practices—including capacity planning, performance optimization, full‑stack stability governance, automated release pipelines, on‑call processes, and the core principles and mindset that guide modern SRE work.
What is Site Reliability Engineering (SRE)
SRE is a software‑engineering discipline that applies engineering methods to achieve reliability, availability and performance of large‑scale services. It originated at Google and has been adopted by many cloud providers.
Typical SRE responsibilities (per the book Site Reliability Engineering)
Infrastructure capacity planning
Production system monitoring and alerting
Load balancing and traffic shaping
Release and change management
On‑call rotation and firefighting
Collaboration with product teams to resolve complex incidents
Why an SRE team is needed for Alibaba Cloud Elastic Compute Service (ECS)
ECS OpenAPI receives hundreds of millions of calls per day and creates up to one million instances daily. Operating at this scale brings several problems:
Database storage exhaustion and slow‑SQL spikes
200+ alerts per day, many noisy or duplicate
Workflow engine bottlenecks
High manual‑ops frequency and long‑tail request spikes
Resource‑state inconsistency across multiple systems
Addressing these issues requires a dedicated SRE organization that can provide top‑down visibility and systematic governance.
ECS SRE Team – Core Technical Practices
1. Capacity and Performance Engineering
All core components (workflow engine, idempotent framework, cache framework, asynchronous data‑cleanup framework) were refactored and released as binary packages for reuse by other cloud products.
Component upgrades: The lightweight workflow engine, originally built in 2014 and revamped in 2018, now supports hundreds of millions of workflow instances. The idempotent, cache and data‑cleanup frameworks were also modernized.
Performance tuning: JVM parameter optimization to reduce GC pauses, multi‑level caching, SQL statement tuning, critical‑path latency reduction, and batch API processing.
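As an illustration of the multi‑level caching idea, here is a minimal sketch, not the ECS cache framework itself. It assumes a short‑lived in‑process cache in front of a shared cache, with a slow loader (for example a database query) behind both. The class name, the TTL value and the dict standing in for the shared cache are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TwoLevelCache:
    """Hypothetical two-level cache: in-process (L1) in front of a shared store (L2)."""

    def __init__(self, shared: Dict[str, str], loader: Callable[[str], str],
                 local_ttl_seconds: float = 5.0) -> None:
        self._local: Dict[str, Tuple[float, str]] = {}  # key -> (expiry time, value)
        self._shared = shared      # stand-in for a remote shared cache (assumption)
        self._loader = loader      # slow path, e.g. a database query (assumption)
        self._ttl = local_ttl_seconds

    def get(self, key: str, now: Optional[float] = None) -> str:
        # `now` can be injected so the expiry logic is easy to test.
        now = time.monotonic() if now is None else now
        entry = self._local.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                    # L1 hit: no remote call at all
        value = self._shared.get(key)          # L2 lookup
        if value is None:
            value = self._loader(key)          # miss on both levels: hit the backend
            self._shared[key] = value
        self._local[key] = (now + self._ttl, value)  # refresh the L1 entry
        return value
```

The short local TTL is the main design choice in this sketch: it bounds how stale an in‑process entry can get while still absorbing bursts of repeated reads.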
2. Full‑Stack Stability Governance
Stability is treated as a product with measurable Service‑Level Objectives (SLOs) and a top‑down governance model.
Database stability: Combine database‑level actions (archiving, partitioning, DDL review) with business‑level mitigations to control space growth, limit slow‑SQL, and reduce DDL failure rates.
Monitoring & alert governance: Consolidate >100 daily alerts, de‑duplicate channels, fix misconfigurations, and automate repetitive alert handling.
Fault diagnosis: Build a trace model (TraceID propagation) and train diagnosis models for high‑frequency failure scenarios; automate impact analysis.
Full‑stack SLO: Define and visualize SLOs across upstream and downstream services to maintain 99.999% availability.
Resource consistency: Apply CAP principles, maintain a data‑driven reconciliation system, and run both offline (T+1) and near‑real‑time (hourly) reconciliation jobs.
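The core of a reconciliation job can be sketched as a comparison between two views of resource state. This is a simplified illustration under stated assumptions, not the ECS reconciliation system: each system is modeled as a mapping from resource ID to state string, and the names `Mismatch` and `reconcile` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Mismatch:
    resource_id: str
    kind: str        # "missing_downstream", "missing_upstream" or "state_differs"
    upstream: str    # state in the upstream system, "" if the resource is absent
    downstream: str  # state in the downstream system, "" if the resource is absent


def reconcile(upstream: Dict[str, str], downstream: Dict[str, str]) -> List[Mismatch]:
    """Compare two resource-state snapshots and report every inconsistency."""
    mismatches: List[Mismatch] = []
    # Walk the union of IDs so resources missing on either side are detected.
    for rid in sorted(upstream.keys() | downstream.keys()):
        up = upstream.get(rid)
        down = downstream.get(rid)
        if down is None:
            mismatches.append(Mismatch(rid, "missing_downstream", up, ""))
        elif up is None:
            mismatches.append(Mismatch(rid, "missing_upstream", "", down))
        elif up != down:
            mismatches.append(Mismatch(rid, "state_differs", up, down))
    return mismatches
```

The same function could serve both cadences mentioned above: a T+1 job would pass full daily snapshots, while an hourly job would pass only the resources changed in the last window.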
3. Process & Workflow System
Standardized development, testing and release pipelines support hundreds of parallel developers and thousands of daily releases.
Design workflow: Unified design templates covering architecture, detailed design, test cases, monitoring, gray‑release and rollout plans; design reviews conducted both online and offline.
Code review: Migrated from an unstable GitLab setup to Aone CodeReview, with mandatory checklists (issue linkage, static analysis, 100% unit‑test coverage, coding standards, business‑critical reviews, MR documentation).
CI standardization: A common CI pipeline:
prepare environment
run unit tests
run coverage analysis
...
Parallelized unit‑test execution reduces CI time.
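To show why parallelizing unit tests shortens CI, the following sketch runs independent test shards concurrently and collects a pass/fail result per shard. It is an assumption‑laden illustration: the shard names and the function `run_shards_in_parallel` are hypothetical, and threads are used only to keep the example self‑contained. A real pipeline would more likely use separate processes or containers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_shards_in_parallel(shards: Dict[str, Callable[[], bool]],
                           max_workers: int = 4) -> Dict[str, bool]:
    """Run each test shard concurrently; return shard name -> passed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every shard first so they can overlap in time.
        futures = {name: pool.submit(fn) for name, fn in shards.items()}
        # result() blocks until that shard has finished.
        return {name: future.result() for name, future in futures.items()}


def all_passed(results: Dict[str, bool]) -> bool:
    """The unit-test stage passes only if every shard passed."""
    return bool(results) and all(results.values())
```

With shards of similar duration, wall‑clock time approaches that of the slowest shard rather than the sum of all shards, provided the shards do not share mutable state.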
Environment integration: Full‑stack containerization, third‑party service mocking, and unified daily & isolated test environments.
Pre‑release governance: Pre‑release and production use identical databases; DDL changes require review; CI must pass before deployment.
Functional verification testing (FVT): Nightly OpenAPI functional tests with 100% pass rate act as the final gate for daily releases.
Unattended release: Automated pre‑release deployment, auto‑gate based on FVT results, and fully automated production release once all CI checks succeed.
Change management: Integrated with the corporate strong‑control (strict change control) system (GOC), white‑screen (GUI‑based) change processes, and automated approvals.
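The gating rules described in this section (CI must pass, DDL changes must be reviewed, FVT must reach a 100% pass rate) can be expressed as a single decision function. This is a minimal sketch under those stated rules, not the actual release tooling. The names `GateInput` and `release_decision` are hypothetical, as is the choice to block a release when no FVT results exist.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GateInput:
    ci_passed: bool      # all CI checks succeeded
    fvt_total: int       # number of nightly FVT cases executed
    fvt_passed: int      # number of FVT cases that passed
    ddl_reviewed: bool   # True if there is no DDL change or it has been reviewed


def release_decision(gate: GateInput) -> Tuple[bool, List[str]]:
    """Return (allowed, blocking_reasons) for an unattended production release."""
    reasons: List[str] = []
    if not gate.ci_passed:
        reasons.append("CI checks did not pass")
    if gate.fvt_total == 0:
        # Assumption: missing FVT evidence blocks the release rather than allowing it.
        reasons.append("no FVT results available")
    elif gate.fvt_passed != gate.fvt_total:
        reasons.append(
            f"FVT pass rate below 100% ({gate.fvt_passed}/{gate.fvt_total})")
    if not gate.ddl_reviewed:
        reasons.append("DDL change not reviewed")
    return (not reasons, reasons)
```

Returning the list of blocking reasons, rather than a bare boolean, lets the pipeline report why a release was held without anyone having to dig through logs.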
4. Stability Operations
Stability is operated like a product with regular reporting and knowledge sharing.
On‑call duties: Alert handling, emergency firefighting, deep root‑cause analysis, full‑stack health checks, and post‑mortem documentation.
On‑call onboarding: Templated runbooks, knowledge‑base articles, and hands‑on rotation.
Post‑mortem practice: Focus on learning, involve owners, conduct reviews, and track action items.
Daily / bi‑weekly stability reports: Aggregate key metrics (workflow success rate, API success rate, resource consistency, loss) to surface issues early.
SRE Mindset, Capability Model and Core Principles
The author emphasizes that SRE is not merely operations; it requires deep business understanding, software‑engineering rigor, and cross‑team collaboration.
Technical capabilities: Development, operations, architecture design, engineering (reverse‑engineering, large‑scale system design).
Soft skills: Business domain knowledge, communication, teamwork, project management.
Core principles:
Apply software‑engineering methods to reliability problems.
Automate repetitive work.
Treat stability as a product with clear SLOs.
Empower teams through shared tooling and standards.
Focus on the vital 20% that solves 80% of problems.
These principles guide the ECS SRE team’s continuous improvement of capacity, performance, stability governance, and release automation.
dbaplus Community
