How Alibaba’s ECS Team Built a Scalable SRE System: Lessons for Large R&D Teams

This article summarizes Alibaba Cloud Elastic Compute Service's four‑year SRE journey, covering why ECS created its own SRE organization, the five‑layer SRE framework, standards, automation platforms, empowerment practices, and team‑building insights that can guide large development teams toward reliable, high‑availability operations.

Efficient Ops

Jun 7, 2021

SRE was first introduced by Google over a decade ago and has become widely known with the rise of DevOps. In China, many SRE groups resemble traditional operations, handling the technical operations behind internet services. Building an SRE function distinct from classic ops and embedding it in product teams is a challenge many enterprises face.

At the GOPS 2021 conference in Shenzhen, Alibaba Cloud Elastic Compute (ECS) expert Yang Zeqiang presented "Exploration and Practice of SRE in Large R&D Teams," sharing the thinking and implementation of ECS's SRE system.

1 Why Did ECS Build Its Own SRE System?

ECS built a dedicated SRE organization for two reasons: the nature of the product and a change in how the team was organized.

ECS is Alibaba Cloud's core cloud product and underpins a large number of internal and external services. Its stability requirements are extremely high, and API call volume grows severalfold each year, which makes capacity planning difficult. An organizational change also removed the dedicated ops engineers, so the R&D team had to take on operational responsibilities itself.

2 ECS SRE Exploration and Practice

Since 2018, ECS's SRE system has drawn from Google and Netflix and adapted to its own scale, forming five layers:

Foundation – Establish a full‑link stability governance system and performance‑capacity engineering.

Standardization – Define standards across the software lifecycle (design → coding → CR → testing → deployment → operation → decommission) and enforce them through training, automation, and regular reviews.

Platform – Build automation platforms to reduce manual SRE work.

Empowerment – Provide tools, guidance, and on‑call support to product teams, handling alerts, incident response, and fault recovery.

Team Building – Define SRE team responsibilities, culture, and hiring criteria.

Foundation

Core Frameworks and Performance Tuning

ECS runs large Java‑based distributed systems (with some Go and Python). Alibaba developed internal frameworks (lightweight BPM, idempotency, caching, data‑cleaning) that support billions of daily workflows with scheduling overhead under 5 ms. JVM tuning, WISP coroutines for I/O‑intensive workloads, and multi‑level caching reduce latency.
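The article does not describe how the idempotency framework works internally. As a rough illustration of the general idea only, the sketch below records the result of an operation under a caller-supplied key so that a retry does not repeat the side effect. The class name `IdempotentExecutor`, the in-memory store, and the key format are all assumptions made for this example, and it is written in Python for brevity even though ECS itself is mostly Java.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs an operation at most once per idempotency key (in-memory sketch)."""

    def __init__(self) -> None:
        # Hypothetical store; a real system would need a durable, shared store.
        self._results: Dict[str, Any] = {}

    def run(self, key: str, operation: Callable[[], Any]) -> Any:
        # A repeated request with the same key gets the recorded result
        # instead of executing the side effect again.
        if key in self._results:
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result


created = []


def create_instance() -> str:
    """Stand-in for a side-effecting call such as creating an instance."""
    created.append("instance")
    return f"i-demo-{len(created)}"


executor = IdempotentExecutor()
first = executor.run("req-001", create_instance)
retry = executor.run("req-001", create_instance)  # client retry, same key
```

Here `retry` returns the result recorded for `first`, and `create_instance` runs only once.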

Full‑Link Stability Governance

Alert overload and a low signal‑to‑noise ratio are common problems that hinder troubleshooting. Two real alert‑handling stories illustrate this: a nighttime database outage that went unnoticed because the alert was sent only by email, and a chain‑reaction incident in which early alerts were ignored because of sheer volume.

Key practices include layered monitoring, a unified alert configuration platform, and optimized alert routing.
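One way to picture optimized alert routing is a table that maps severity to notification channels, combined with deduplication so that repeated alerts do not bury the important one. The sketch below is illustrative only: the severity levels, channel names, and `ROUTES` table are assumptions, not the configuration of the ECS alert platform.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    service: str
    name: str
    severity: str  # assumed levels: "critical", "warning", "info"


# Hypothetical routing table: severity -> notification channels.
ROUTES: Dict[str, List[str]] = {
    "critical": ["phone", "im"],  # wake someone up; never email-only
    "warning": ["im"],
    "info": ["email"],
}


def route_alerts(alerts: List[Alert]) -> Dict[str, List[Alert]]:
    """Drop exact duplicates, then group the remaining alerts by channel."""
    seen = set()
    by_channel: Dict[str, List[Alert]] = {}
    for alert in alerts:
        if alert in seen:  # frozen dataclasses are hashable
            continue
        seen.add(alert)
        # Unknown severities fall back to email in this sketch.
        for channel in ROUTES.get(alert.severity, ["email"]):
            by_channel.setdefault(channel, []).append(alert)
    return by_channel
```

With a table like this, a database outage tagged `critical` reaches a phone channel even at night, which is the gap the email-only story above exposed.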

Database Stability Governance

Slow SQL and large tables are the main challenges. Slow SQL statements are collected, sent to SLS for near‑real‑time analysis, and assigned to the responsible teams to fix. Large tables are handled by archiving historical data rather than by costly sharding.
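The collect, analyze, and assign loop can be sketched as follows. This is a simplified stand-in, not the SLS-based pipeline itself: the fingerprinting rules, the 1000 ms threshold, and the `TABLE_OWNERS` mapping are assumptions chosen for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical mapping from table name to owning team.
TABLE_OWNERS = {"instance": "team-compute", "disk": "team-storage"}


def fingerprint(sql: str) -> str:
    """Replace literals so queries that differ only in values group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def owner_of(sql_fingerprint: str) -> str:
    """Pick an owner from the first table after FROM (naive on purpose)."""
    match = re.search(r"\bfrom\s+(\w+)", sql_fingerprint)
    table = match.group(1) if match else ""
    return TABLE_OWNERS.get(table, "unassigned")


def summarize(records: Iterable[Tuple[str, int]],
              threshold_ms: int = 1000) -> Dict[str, dict]:
    """Aggregate (sql, duration_ms) records that exceed the slow threshold."""
    stats: Dict[str, dict] = defaultdict(
        lambda: {"count": 0, "max_ms": 0, "owner": ""})
    for sql, duration_ms in records:
        if duration_ms < threshold_ms:
            continue
        fp = fingerprint(sql)
        entry = stats[fp]
        entry["count"] += 1
        entry["max_ms"] = max(entry["max_ms"], duration_ms)
        entry["owner"] = owner_of(fp)
    return dict(stats)
```

Grouping by fingerprint rather than raw text means one recurring slow query produces one work item for its owner, not one per execution.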

High‑Availability Architecture

The HA model spans four layers: deployment, data, business, and operational processes.

Deployment – Deploy across multiple availability zones for better disaster tolerance.

Data – Use multi‑read, read‑write separation, and automatic read‑write downgrade to keep core APIs available.

Business – Mitigate complex dependencies through design for failure and dependency isolation.
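To make the data-layer idea concrete, the toy store below keeps serving reads from replicas and rejects writes explicitly once the primary is marked unhealthy. Everything here is hypothetical: the class names, the synchronous replication, and the `primary_healthy` flag (which a real system would set from health checks rather than by hand) are assumptions for illustration.

```python
from typing import Dict, List


class WritesDisabledError(Exception):
    """Raised when the primary is unavailable and writes are downgraded."""


class DowngradingStore:
    """Toy key-value store with one primary and several read replicas."""

    def __init__(self, replica_count: int = 2) -> None:
        self.primary: Dict[str, str] = {}
        self.replicas: List[Dict[str, str]] = [{} for _ in range(replica_count)]
        self.primary_healthy = True  # assumed to be driven by a health check

    def write(self, key: str, value: str) -> None:
        if not self.primary_healthy:
            # Downgrade: fail writes fast instead of letting callers block.
            raise WritesDisabledError("primary unavailable; store is read-only")
        self.primary[key] = value
        for replica in self.replicas:  # replication is synchronous in this toy
            replica[key] = value

    def read(self, key: str) -> str:
        # Read-write separation: reads are always served by a replica, so
        # they keep working while the primary is down.
        replica = self.replicas[hash(key) % len(self.replicas)]
        return replica[key]
```

The point of the sketch is the failure mode: when the primary is lost, read APIs stay available and write APIs return a clear error.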

In one failure case, a slowdown in a third‑party dependency blocked threads and caused latency to spike across the whole system. The incident highlighted the need for design‑for‑failure and failure‑as‑a‑service principles.
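A circuit breaker is one common design-for-failure pattern for this kind of incident; the article does not say which mechanism ECS used, so the sketch below is a generic illustration. It assumes the wrapped call enforces its own timeout and raises on failure, and the thresholds and fallback are hypothetical.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: fail fast, do not tie up a thread
            # Cool-down elapsed: allow one trial call (half-open). A single
            # further failure re-opens the circuit.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0  # success closes the circuit
        return result
```

While the circuit is open, callers get the fallback immediately, so a slow dependency cannot exhaust the thread pool.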

Standardization

ECS’s R&D team (>100 engineers) established standards for unit testing, CI, and code reviews. Automated pipelines enforce UT coverage, static analysis, and CI metrics before merging. Testing includes daily, pre‑release, functional, and gray‑release stages, with containerized environments for rapid provisioning.
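A pre-merge gate of this kind can be reduced to a function that checks pipeline metrics against thresholds. The sketch below is a minimal illustration; the field names and the 70% coverage threshold are assumptions, since the article does not give ECS's actual limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineResult:
    line_coverage: float           # unit-test line coverage, 0.0 to 1.0
    static_analysis_blockers: int  # blocking findings from static analysis
    failed_tests: int


MIN_COVERAGE = 0.70  # hypothetical threshold


def merge_gate(result: PipelineResult) -> List[str]:
    """Return the list of violations; an empty list means the merge may proceed."""
    violations: List[str] = []
    if result.failed_tests > 0:
        violations.append(f"{result.failed_tests} unit test(s) failed")
    if result.line_coverage < MIN_COVERAGE:
        violations.append(
            f"coverage {result.line_coverage:.0%} is below {MIN_COVERAGE:.0%}")
    if result.static_analysis_blockers > 0:
        violations.append(
            f"{result.static_analysis_blockers} blocking static-analysis finding(s)")
    return violations
```

Returning every violation at once, rather than stopping at the first, lets the author fix them in a single pass.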

Change management introduces checklists for DB changes, release batches, and middleware configuration, complemented by automated tools for log cleanup and process restarts.

Platform

The SRE automation platform implements high‑availability features (read‑write downgrade, rate limiting) via APIs and white‑box tools, enabling developers to adopt automated resilience.
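Rate limiting is commonly implemented with a token bucket. The sketch below shows that general technique, not the platform's API, which the article does not describe; the class name, parameters, and injectable clock are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate_per_s`."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request may proceed."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate_per_s)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would typically keep one bucket per API or per tenant and reject or queue a request when `allow()` returns `False`.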

ECS targets the “1‑5‑10” metric (detect in 1 min, diagnose in 5 min, recover in 10 min) through integrated monitoring, alerting, diagnosis, and rapid recovery loops, including automated root‑cause analysis and chaos engineering drills.
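Checking an incident against the 1‑5‑10 targets is simple arithmetic on timestamps. The sketch below assumes each target is measured from the start of the fault (the article does not say whether the windows are cumulative or per stage), and the `Incident` fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when an alert fired
    diagnosed: datetime  # when the cause was located
    recovered: datetime  # when service was restored


# Targets measured from fault start (an assumption of this sketch).
TARGETS: Dict[str, timedelta] = {
    "detect": timedelta(minutes=1),
    "diagnose": timedelta(minutes=5),
    "recover": timedelta(minutes=10),
}


def check_1_5_10(incident: Incident) -> Dict[str, bool]:
    """Report, per stage, whether the incident met its target."""
    elapsed = {
        "detect": incident.detected - incident.started,
        "diagnose": incident.diagnosed - incident.started,
        "recover": incident.recovered - incident.started,
    }
    return {stage: elapsed[stage] <= TARGETS[stage] for stage in TARGETS}
```

Running this over every post-mortem record gives a per-stage pass rate, which shows whether detection, diagnosis, or recovery is the stage that most often misses.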

Empowerment

Full‑link SLO quantification tracks dozens of dependencies and hundreds of core APIs. By defining SLIs, setting SLO targets, and visualizing real‑time and offline reports, ECS raised dependency availability from ~40% to >98%.
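As a minimal illustration of the SLI and SLO arithmetic, the sketch below computes a request-based availability SLI and the share of error budget remaining against a target. The function names and the request-count definition of availability are assumptions; the article does not specify which SLIs ECS uses.

```python
def availability(success: int, total: int) -> float:
    """Request-based availability SLI: successful requests / all requests."""
    if total == 0:
        return 1.0  # no traffic counts as fully available in this sketch
    return success / total


def error_budget_remaining(success: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, <= 0 is exhausted."""
    allowed_failures = (1 - slo_target) * total
    actual_failures = total - success
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no room for any failure.
        return 1.0 if actual_failures == 0 else 0.0
    return 1 - actual_failures / allowed_failures
```

For example, 99,950 successes out of 100,000 calls against a 99.9% target means 50 of the roughly 100 allowed failures are used, leaving about half the budget.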

A knowledge base stores incident post‑mortems, frameworks, and tools, shared across Alibaba Cloud products.

Team Building

ECS looks for T‑shaped SRE hires: deep expertise in one programming language, combined with the ability to abstract and standardize and a broad enough view of the system to empower other teams.

Core team responsibilities include building standards, automation platforms, foundational services, on‑call support, and fostering a stability‑first culture through daily/weekly/monthly reports, live streams, and training.

3 Review and Outlook

The SRE system evolved over four years: Build the System → Quantify → Automate → Add Intelligence.

Year 1: Systematic exploration, establishing foundations.

Year 2: SLO quantification across all dependencies.

Year 3: Automation of design, coding, testing, deployment, and incident response.

Year 4: Intelligent automation, including AI‑driven alert root‑cause analysis.

Future view: stability is a product, SRE skills will converge with development skills, and automation will become increasingly intelligent.
