
How Alibaba’s ECS Team Built a Scalable SRE System for Massive Cloud Services

This article explains the origins of Site Reliability Engineering (SRE), outlines the responsibilities of SRE teams, and details Alibaba Cloud’s ECS SRE practices—including capacity planning, performance optimization, full‑stack stability governance, automated release pipelines, on‑call processes, and the core principles and mindset that guide modern SRE work.

dbaplus Community

What is Site Reliability Engineering (SRE)

SRE is a software‑engineering discipline that applies engineering methods to achieve reliability, availability and performance of large‑scale services. It originated at Google and has been adopted by many cloud providers.

Typical SRE responsibilities (per Google's Site Reliability Engineering book)

Infrastructure capacity planning

Production system monitoring and alerting

Load balancing and traffic shaping

Release and change management

On‑call rotation and firefighting

Collaboration with product teams to resolve complex incidents

Why an SRE team is needed for Alibaba Cloud Elastic Compute Service (ECS)

ECS OpenAPI receives hundreds of millions of calls per day and creates up to one million instances daily. This scale creates:

Database storage exhaustion and slow‑SQL spikes

200+ alerts per day, many noisy or duplicate

Workflow engine bottlenecks

High manual‑ops frequency and long‑tail request spikes

Resource‑state inconsistency across multiple systems

Addressing these issues requires a dedicated SRE organization that can provide top‑down visibility and systematic governance.

ECS SRE Team – Core Technical Practices

1. Capacity and Performance Engineering

All core components (workflow engine, idempotent framework, cache framework, asynchronous data‑cleanup framework) were refactored and released as binary packages for reuse by other cloud products.

Component upgrades : The lightweight workflow engine, originally built in 2014 and revamped in 2018, now supports hundreds of millions of workflow instances. The idempotent, cache and data‑cleanup frameworks were also modernized.

Performance tuning : JVM parameter optimization to reduce GC pauses, multi‑level caching, SQL statement tuning, critical‑path latency reduction, and batch API processing.
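As an illustration of the multi‑level caching mentioned above, here is a minimal read‑through sketch. It is not the ECS cache framework: the class name `TwoLevelCache`, the plain dict standing in for a shared cache, and the `loader` callback are all assumptions made for this example.

```python
from typing import Callable, Dict


class TwoLevelCache:
    """Read-through cache: a per-process dict (L1) in front of a shared store (L2).

    Hypothetical sketch; `shared` stands in for a remote cache and `loader`
    for the source of truth (for example, a database query).
    """

    def __init__(self, shared: Dict[str, str], loader: Callable[[str], str]):
        self.local: Dict[str, str] = {}  # L1: fastest, private to this process
        self.shared = shared             # L2: slower, shared between processes
        self.loader = loader             # called only when both levels miss
        self.loads = 0                   # number of trips to the source of truth

    def get(self, key: str) -> str:
        if key in self.local:            # L1 hit
            return self.local[key]
        if key in self.shared:           # L2 hit
            value = self.shared[key]
        else:                            # miss on both levels: load and fill L2
            value = self.loader(key)
            self.loads += 1
            self.shared[key] = value
        self.local[key] = value          # fill L1 on the way back
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key from both levels after the underlying data changes."""
        self.local.pop(key, None)
        self.shared.pop(key, None)
```

Repeated `get` calls for the same key reach the loader only once until `invalidate` is called. A production design would also need expiry and cross‑process invalidation of the local level, which this sketch leaves out.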

ECS capacity and performance diagram

2. Full‑Stack Stability Governance

Stability is treated as a product with measurable Service‑Level Objectives (SLOs) and a top‑down governance model.
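To make "measurable" concrete, the arithmetic behind an availability SLO can be sketched as an error budget. The function names below are hypothetical and the formulas are the standard error‑budget calculation, not something taken from the ECS tooling.

```python
def allowed_downtime_seconds(slo: float, window_days: int) -> float:
    """Seconds of full unavailability an availability SLO tolerates per window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    window_seconds = window_days * 24 * 60 * 60
    return (1.0 - slo) * window_seconds


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    Negative values mean the budget for the window is already overspent.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures
```

For a 99.999% target, `allowed_downtime_seconds(0.99999, 30)` comes to roughly 26 seconds per 30 days, which shows how little room such a target leaves for manual intervention.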

Database stability : Combine database‑level actions (archiving, partitioning, DDL review) with business‑level mitigations to control space growth, limit slow‑SQL, and reduce DDL failure rates.

Monitoring & alert governance : Consolidate >100 daily alerts, de‑duplicate channels, fix misconfigurations, and automate repetitive alert handling.

Fault diagnosis : Build a trace model (TraceID propagation) and train diagnosis models for high‑frequency failure scenarios; automate impact analysis.

Full‑stack SLO : Define and visualize SLOs across upstream and downstream services to maintain 99.999% availability.

Resource consistency : Apply CAP principles, maintain a data‑driven reconciliation system, and run both offline (T+1) and near‑real‑time (hourly) reconciliation jobs.
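The core of a reconciliation job can be sketched as a diff between two views of the same resources. This is a simplified illustration, not the ECS reconciliation system: the function name, the finding labels, and the use of in‑memory dicts keyed by resource ID are assumptions.

```python
from typing import Dict, List, Tuple


def reconcile(source: Dict[str, str], replica: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (resource_id, kind, detail) for every disagreement between two systems.

    `source` is treated as the system of record and `replica` as a downstream
    copy; both map a resource ID to its state. Labels are hypothetical.
    """
    findings: List[Tuple[str, str, str]] = []
    for rid in sorted(source.keys() | replica.keys()):
        if rid not in replica:
            findings.append((rid, "missing_in_replica", source[rid]))
        elif rid not in source:
            findings.append((rid, "orphan_in_replica", replica[rid]))
        elif source[rid] != replica[rid]:
            findings.append((rid, "state_mismatch", f"{source[rid]} != {replica[rid]}"))
    return findings
```

The same comparison could run against a full T+1 snapshot offline or against only the records changed in the last hour for the near‑real‑time job; only the way the two inputs are selected would differ.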

Stability governance diagram

3. Process & Workflow System

Standardized development, testing and release pipelines support hundreds of parallel developers and thousands of daily releases.

Design workflow : Unified design templates covering architecture, detailed design, test cases, monitoring, gray‑release and rollout plans; design reviews conducted both online and offline.

Code review : Migrated from an unstable GitLab deployment to Aone CodeReview, with a mandatory checklist: issue linkage, static analysis, 100% unit‑test coverage, coding standards, review of business‑critical changes, and MR documentation.

CI standardization : A common CI pipeline:

prepare environment
run unit tests
run coverage analysis
...

Parallelized unit‑test execution reduces CI time.

Environment integration : Full‑stack containerization, third‑party service mocking, and unified daily & isolated test environments.

Pre‑release governance : Pre‑release and production use identical databases; DDL changes require review; CI must pass before deployment.

Functional verification testing (FVT) : Nightly OpenAPI functional tests with 100% pass rate act as the final gate for daily releases.

Unattended release : Automated pre‑release deployment, auto‑gate based on FVT results, and fully automated production release once all CI checks succeed.

Change management : Integrated with the corporate strong‑control system (GOC), white‑screen (console‑based rather than command‑line) change processes, and automated approvals.
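The gating rules described in this section (CI must pass, the nightly FVT run must pass 100%, DDL changes must be reviewed) can be expressed as a single check. The sketch below is an assumption about how such a gate might look; `ReleaseCandidate`, its fields, and `release_gate` are hypothetical names, not part of Aone or the ECS pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReleaseCandidate:
    ci_passed: bool               # overall result of the CI pipeline
    fvt_passed: int               # nightly FVT cases that passed
    fvt_total: int                # nightly FVT cases that ran
    has_ddl_change: bool = False  # whether the release carries a schema change
    ddl_reviewed: bool = False    # whether that schema change was reviewed


def release_gate(rc: ReleaseCandidate) -> Tuple[bool, List[str]]:
    """Return (allowed, reasons); reasons lists every failed check."""
    reasons: List[str] = []
    if not rc.ci_passed:
        reasons.append("CI did not pass")
    if rc.fvt_total == 0:
        reasons.append("no FVT results available")
    elif rc.fvt_passed != rc.fvt_total:
        reasons.append(f"FVT pass rate below 100% ({rc.fvt_passed}/{rc.fvt_total})")
    if rc.has_ddl_change and not rc.ddl_reviewed:
        reasons.append("DDL change has not been reviewed")
    return (not reasons, reasons)
```

Collecting every failed check, rather than stopping at the first, gives the release owner the full list of blockers in one pass, which matters when nobody is watching the release.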

4. Stability Operations

Stability is operated like a product with regular reporting and knowledge sharing.

On‑call duties : Alert handling, emergency firefighting, deep root‑cause analysis, full‑stack health checks, and post‑mortem documentation.

On‑call onboarding : Templated runbooks, knowledge‑base articles, and hands‑on rotation.

Post‑mortem practice : Focus on learning, involve owners, conduct reviews, and track action items.

Daily / bi‑weekly stability reports : Aggregate key metrics (workflow success rate, API success, resource consistency, loss) to surface issues early.
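A stability report of this kind reduces to computing success rates from raw counters and flagging the ones that miss their target. The sketch below shows one possible shape; the metric names, the threshold values in the usage example, and the function names are illustrative assumptions, not the team's actual report format.

```python
from typing import Dict, List, Optional, Tuple


def success_rate(succeeded: int, total: int) -> Optional[float]:
    """Success ratio, or None when there was no traffic in the window."""
    return succeeded / total if total > 0 else None


def flag_metrics(counters: Dict[str, Tuple[int, int]],
                 thresholds: Dict[str, float]) -> List[str]:
    """Return report lines for metrics that miss their target or have no data.

    `counters` maps a metric name to (succeeded, total); `thresholds` must
    contain a target ratio for every metric in `counters`.
    """
    lines: List[str] = []
    for name in sorted(counters):
        succeeded, total = counters[name]
        rate = success_rate(succeeded, total)
        if rate is None:
            lines.append(f"{name}: no data")
        elif rate < thresholds[name]:
            lines.append(f"{name}: {rate:.4%} below target {thresholds[name]:.4%}")
    return lines


if __name__ == "__main__":
    # Hypothetical counters and targets for one reporting window.
    counters = {"workflow_success": (9950, 10000), "api_success": (99999, 100000)}
    targets = {"workflow_success": 0.999, "api_success": 0.9999}
    for line in flag_metrics(counters, targets):
        print(line)
```

Treating a window with no data as a finding, instead of silently reporting it as healthy, helps the report surface a broken metrics pipeline as well as a failing service.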

Stability operations diagram

SRE Mindset, Capability Model and Core Principles

The author emphasizes that SRE is not merely operations; it requires deep business understanding, software‑engineering rigor, and cross‑team collaboration.

Technical capabilities : Development, operations, architecture design, engineering (reverse‑engineering, large‑scale system design).

Soft skills : Business domain knowledge, communication, teamwork, project management.

Core principles :

Apply software‑engineering methods to reliability problems.

Automate repetitive work.

Treat stability as a product with clear SLOs.

Empower teams through shared tooling and standards.

Focus on the vital 20% that solves 80% of problems.

These principles guide the ECS SRE team’s continuous improvement of capacity, performance, stability governance, and release automation.
