
How Alibaba’s ECS SRE Team Built a Rock‑Solid Cloud Infrastructure for 100% Cloud Migration

This article explains how Alibaba's Elastic Compute Service (ECS) SRE team tackled massive traffic, database bottlenecks, alert overload, and resource inconsistencies by establishing a full‑stack reliability organization, upgrading core components, automating pipelines, and instituting rigorous monitoring, incident response, and change‑management processes.

Alibaba Cloud Developer

Nov 29, 2019


Introduction

Alibaba’s Double‑11 shopping festival generated 268.4 billion CNY in sales, showcasing the power of its underlying technology. A key highlight was the group‑wide 100% cloud migration, with ECS (Elastic Compute Service) serving as the foundational product responsible for ensuring extreme stability and performance.

What Is SRE?

Site Reliability Engineering (SRE) originated at Google over a decade ago and has since been adopted by leading internet companies such as Netflix. Unlike traditional operations, SRE treats reliability as a software engineering problem, emphasizing capacity planning, monitoring, load balancing, on‑call rotation, firefighting, and close collaboration with product teams.

SRE Responsibilities

Infrastructure capacity planning

Production system monitoring

Load balancing

Release and change management

On‑call rotation and firefighting

Collaboration with business teams to resolve complex issues

Why Establish an SRE Team for ECS?

ECS supports billions of daily OpenAPI calls and peaks of millions of instance creations per day. This scale creates challenges such as database capacity limits, exploding slow‑SQL counts, excessive alert volume, workflow bottlenecks, high manual‑operation frequency, long‑tail request failures, resource inconsistency, and rising 5XX errors.

Key SRE Initiatives

Capacity & Performance

The team upgraded core components (a lightweight workflow engine, an idempotency framework, a cache framework, and a data‑cleanup framework) to handle massive data volumes, such as 3 TB+ of workflow data per month, and to provide reusable binary packages for other cloud products.
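To illustrate what an idempotency framework guards against, here is a minimal in-memory sketch. It assumes callers send a unique token with each request; the class `IdempotentExecutor`, the token format, and `create_instance` are hypothetical names for illustration, not the ECS team's implementation.

```python
import threading
from typing import Any, Callable, Dict, List


class IdempotentExecutor:
    """Runs an operation at most once per caller-supplied token (hypothetical helper)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def run(self, token: str, operation: Callable[[], Any]) -> Any:
        # One global lock keeps the sketch simple. A production service would
        # persist tokens durably and lock per token instead.
        with self._lock:
            if token in self._results:
                # Retry of a request that already succeeded: replay the result.
                return self._results[token]
            result = operation()
            self._results[token] = result
            return result


created: List[str] = []


def create_instance() -> str:
    # Stand-in for an expensive, non-repeatable action.
    instance_id = f"i-{len(created) + 1:04d}"
    created.append(instance_id)
    return instance_id


executor = IdempotentExecutor()
first = executor.run("token-abc", create_instance)
retry = executor.run("token-abc", create_instance)  # same token, no second instance
```

A retried request with the same token gets the original result back, so a client timeout followed by a retry does not create a duplicate resource.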

Performance Optimizations

JVM tuning to reduce GC pauses

Multi‑level caching to lower database I/O

SQL performance tuning

Core‑path response‑time improvements

Batch API processing to increase throughput
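The multi‑level caching item in the list above can be sketched as a short-lived local tier in front of a shared tier, with the database consulted only when both miss. This is an assumption-laden illustration: `TwoLevelCache`, the TTL, and the dictionary standing in for a remote cache are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TwoLevelCache:
    """Local (per-process) tier backed by a shared tier, then a loader."""

    def __init__(
        self,
        loader: Callable[[str], str],
        local_ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._local_ttl = local_ttl
        self._clock = clock
        self._local: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)
        self._shared: Dict[str, str] = {}  # stand-in for a remote cache

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._local.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # level 1 hit
        value = self._shared.get(key)
        if value is None:
            value = self._loader(key)  # both tiers missed: read the database
            self._shared[key] = value
        self._local[key] = (now + self._local_ttl, value)
        return value


db_reads = []


def load_from_db(key: str) -> str:
    db_reads.append(key)
    return f"row-for-{key}"


fake_now = [0.0]
cache = TwoLevelCache(load_from_db, local_ttl=5.0, clock=lambda: fake_now[0])
first = cache.get("i-123")   # database read
second = cache.get("i-123")  # served from the local tier
fake_now[0] = 10.0
third = cache.get("i-123")   # local entry expired, served from the shared tier
```

Three lookups cost a single database read here, which is the effect the caching work aims for.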

Full‑Link Stability Governance

Database Stability

Addressed space exhaustion, slow‑SQL spikes, high DDL failure rates, performance anomalies, and alert misconfigurations by combining database‑level optimizations (archiving, partitioning, index tuning) with business‑level refactoring (data sharding, intermediate table rotation).
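As a rough illustration of the archiving idea, the sketch below moves finished rows from a hot table into an archive table in small batches so that each transaction stays short. It uses SQLite only so the example is self-contained; the table names `workflow` and `workflow_archive`, the `finished_at` column, and the batch size are assumptions, not the team's schema.

```python
import sqlite3


def archive_in_batches(conn: sqlite3.Connection, cutoff: int, batch_size: int) -> int:
    """Move rows with finished_at < cutoff into the archive; return rows moved."""
    moved = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM workflow WHERE finished_at < ? ORDER BY id LIMIT ?",
            (cutoff, batch_size),
        ).fetchall()
        if not rows:
            return moved
        ids = [row[0] for row in rows]
        placeholders = ",".join("?" * len(ids))
        # One short transaction per batch: copy, then delete the same ids.
        with conn:
            conn.execute(
                f"INSERT INTO workflow_archive SELECT * FROM workflow WHERE id IN ({placeholders})",
                ids,
            )
            conn.execute(f"DELETE FROM workflow WHERE id IN ({placeholders})", ids)
        moved += len(ids)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workflow (id INTEGER PRIMARY KEY, finished_at INTEGER)")
conn.execute("CREATE TABLE workflow_archive (id INTEGER PRIMARY KEY, finished_at INTEGER)")
conn.executemany(
    "INSERT INTO workflow (id, finished_at) VALUES (?, ?)",
    [(i, i) for i in range(10)],
)
conn.commit()

moved = archive_in_batches(conn, cutoff=7, batch_size=3)
remaining = conn.execute("SELECT COUNT(*) FROM workflow").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM workflow_archive").fetchone()[0]
```

Batching trades total runtime for shorter lock hold times, which matters on a table that is still serving production traffic.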

Monitoring & Alerting

Reduced alert noise, which had exceeded 200 alerts per day, by de‑duplicating alerts, consolidating notification channels, standardizing severity levels, and automating the handling of repetitive alerts.
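One simple way to de‑duplicate alerts is to suppress repeats of the same service and rule inside a time window. The sketch below assumes that approach; the `Alert` fields and the 300-second window are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point
    service: str
    rule: str
    severity: str


def deduplicate(alerts: List[Alert], window_seconds: float) -> List[Alert]:
    """Keep an alert only if the same (service, rule) was not sent within the window."""
    last_sent: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.rule)
        previous = last_sent.get(key)
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_sent[key] = alert.timestamp
    return kept


raw = [
    Alert(0, "openapi", "5xx-rate", "P2"),
    Alert(10, "workflow", "queue-depth", "P3"),
    Alert(30, "openapi", "5xx-rate", "P2"),   # repeat inside the window
    Alert(400, "openapi", "5xx-rate", "P2"),  # window elapsed, sent again
]
kept = deduplicate(raw, window_seconds=300)
```

Of the four raw alerts, three survive: the repeat at 30 seconds is suppressed, while the one at 400 seconds is treated as a fresh occurrence.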

Fault Diagnosis

Implemented a 1‑5‑10 model (detect in 1 min, locate in 5 min, recover in 10 min) using full‑link tracing, fault‑scenario models, and impact‑analysis tools.
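The 1‑5‑10 targets can be checked mechanically against an incident timeline. The sketch below assumes all three durations are measured from the moment the fault began, which the source does not specify; `IncidentTimeline` and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

# Targets from the 1-5-10 model, converted to seconds.
TARGET_SECONDS = {"detect": 60, "locate": 300, "recover": 600}


@dataclass
class IncidentTimeline:
    started_at: float    # all values in seconds on the same clock
    detected_at: float
    located_at: float
    recovered_at: float


def evaluate(timeline: IncidentTimeline) -> Dict[str, bool]:
    """Return, per phase, whether the incident met its target."""
    elapsed = {
        "detect": timeline.detected_at - timeline.started_at,
        "locate": timeline.located_at - timeline.started_at,
        "recover": timeline.recovered_at - timeline.started_at,
    }
    return {phase: elapsed[phase] <= TARGET_SECONDS[phase] for phase in TARGET_SECONDS}


incident = IncidentTimeline(started_at=0, detected_at=45, located_at=280, recovered_at=700)
result = evaluate(incident)
```

For this example incident, detection and location meet their targets, but recovery at 700 seconds misses the 10-minute goal.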

Service‑Level Objectives (SLO)

Established cross‑team SLO agreements, built visual dashboards, and drove upstream teams to meet reliability targets.
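A common companion to an SLO is an error budget: the number of failures the target tolerates over a period. The source does not say the ECS team used error budgets, so treat this as a general SRE illustration; the 99.9% target and the request counts are made-up inputs.

```python
def availability(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0
    return 1 - failed_requests / total_requests


def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - failed_requests / allowed_failures


# Hypothetical month: 1,000,000 calls, 250 failures, 99.9% target.
measured = availability(1_000_000, 250)
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

With a 99.9% target, one million calls allow roughly 1,000 failures, so 250 failures leave about three quarters of the budget. A dashboard could plot this value per upstream dependency.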

Resource Consistency

Created a data‑driven reconciliation system to detect and resolve inconsistencies across ECS, disks, and bandwidth resources, employing both offline (T+1) and near‑real‑time (hourly) reconciliation.
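At its core, reconciliation compares two views of the same resources and reports where they disagree. The sketch below assumes each system can export a mapping from resource ID to state; the names `control_plane` and `billing` and the state values are hypothetical.

```python
from typing import Dict, List


def reconcile(control_plane: Dict[str, str], billing: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two resource inventories keyed by resource ID."""
    control_ids = control_plane.keys()
    billing_ids = billing.keys()
    return {
        # Resources that exist but are not billed.
        "missing_in_billing": sorted(control_ids - billing_ids),
        # Resources that are billed but no longer exist.
        "missing_in_control_plane": sorted(billing_ids - control_ids),
        # Resources present in both views with different states.
        "state_mismatch": sorted(
            rid for rid in control_ids & billing_ids if control_plane[rid] != billing[rid]
        ),
    }


control_plane = {"i-1": "running", "i-2": "stopped", "d-7": "attached"}
billing = {"i-1": "running", "i-2": "running", "bw-3": "active"}
report = reconcile(control_plane, billing)
```

The same comparison can run as a daily batch over full snapshots (T+1) or hourly over recently changed resources; only the input selection differs.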

Process & Workflow Improvements

Development Process

Standardized design templates, introduced dual‑mode design reviews (online and offline), and enforced comprehensive checklists covering architecture, testing, monitoring, and rollback plans.

Code Review

Migrated to a unified Aone CodeReview platform with mandatory issue linking, static analysis, 100% unit‑test coverage, and adherence to Alibaba coding standards.

CI/CD Standardization

Unified CI pipelines across all core services, added parallel unit‑test execution, and automated coverage analysis to reduce manual intervention.

Environment Management

Containerized the full stack for both daily and isolated environments, mocked third‑party dependencies, and streamlined end‑to‑end testing.

Release & Change Management

Implemented automated pre‑release validation (FVT), enforced strict change‑review gates, and explored fully unattended releases driven by CI success metrics.
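An unattended release needs a machine-checkable gate. The sketch below shows one possible shape for such a gate; the `CiReport` fields, the coverage threshold, and the blocker messages are assumptions for illustration, not the team's actual release criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CiReport:
    unit_tests_passed: bool
    fvt_passed: bool           # result of automated pre-release validation
    line_coverage: float       # 0.0 to 1.0
    open_review_blockers: int  # unresolved change-review findings


def release_blockers(report: CiReport, min_coverage: float) -> List[str]:
    """Return the reasons a release must not proceed; empty means go."""
    blockers: List[str] = []
    if not report.unit_tests_passed:
        blockers.append("unit tests failed")
    if not report.fvt_passed:
        blockers.append("FVT failed")
    if report.line_coverage < min_coverage:
        blockers.append("coverage below threshold")
    if report.open_review_blockers > 0:
        blockers.append("change review has open blockers")
    return blockers


good = release_blockers(CiReport(True, True, 0.92, 0), min_coverage=0.9)
bad = release_blockers(CiReport(True, False, 0.85, 2), min_coverage=0.9)
```

Returning the list of reasons rather than a bare boolean makes a blocked release easier to act on, since the pipeline can surface exactly which gate failed.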

Stability Operations

On‑Call Rotation

Established a 24/7 on‑call schedule covering alert triage, emergency firefighting, in‑depth root‑cause analysis, full‑link health checks, and post‑mortem documentation.

Fault Post‑Mortem Culture

Promoted blameless post‑mortems, documented actions in a shared knowledge base, and emphasized learning over blame.

Operational Metrics

Published daily stability reports (T+1 FBI) and bi‑weekly summaries covering core metrics, incident trends, and improvement plans.

Personal Reflections on SRE

The author addresses common misconceptions (SRE is not merely operations, and SRE engineers must understand the business), outlines a capability model spanning technical skills (development, operations, architecture, engineering) and soft skills (business knowledge, communication, teamwork, project management), and shares core SRE principles such as treating stability as a product, automating repetitive work, and focusing on the 20% of effort that delivers the most impact.
