How Alibaba Cloud Achieves Rock‑Solid IaaS Stability: Design Principles, Metrics, and Engineering Practices
This article explains Alibaba Cloud's comprehensive approach to IaaS stability, covering the shared-responsibility model with customers, availability metrics, design principles, and the compute, storage, and network engineering practices that together deliver rock‑solid reliability for millions of workloads.
1. Introduction – Infrastructure Stability Philosophy
Alibaba Cloud’s IaaS services (compute, storage, network) form the core of both external customer workloads and Alibaba Group’s own critical systems. High availability at the application layer alone cannot guarantee seamless service during IaaS incidents; therefore, Alibaba Cloud pursues extreme stability at the infrastructure level.
Shared Responsibility: Alibaba provides highly reliable hardware, software, network, and data‑center facilities, while customers must select appropriate cloud products, configure availability settings, and follow design guidelines to build resilient applications.
Stability Measurement and Goals
Availability targets (e.g., 99.9%, 99.95%) are used to quantify uptime. Alibaba Cloud defines three core dimensions for stability:
Reduce the number of failures
Minimize the impact scope of each failure
Shorten the recovery time
Metrics cover interruption frequency, downtime, and affected scope across the full lifecycle (design, release, incident response).
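The availability targets above translate directly into downtime budgets, which is how the "shorten the recovery time" dimension becomes measurable. A minimal sketch (the helper name and 30‑day month are illustrative, not an Alibaba Cloud formula):

```python
# Convert an availability target into a monthly downtime budget.
# Assumes a 30-day month for simplicity.
def downtime_budget_minutes(availability: float,
                            period_minutes: float = 30 * 24 * 60) -> float:
    """Minutes of allowed downtime per period for a given availability target."""
    return period_minutes * (1.0 - availability)

for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.4%} availability -> "
          f"{downtime_budget_minutes(target):.1f} min/month budget")
```

A 99.9% target allows roughly 43 minutes of downtime per month, while 99.99% allows only about 4 minutes, which is why recovery-time metrics dominate at higher targets.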
2. Compute – Building ECS Stability
2.1 Goals and Metrics
ECS aims to deliver minicomputer‑class (“small‑machine”) stability on commodity x86 hardware, backed by a comprehensive metric system that spans backend technical indicators (instance crashes, performance jitter, etc.) and customer‑side feedback (ticket rate, NPS).
2.2 Stability Engineering
Key practices include:
Deep Autonomous Control: Self‑developed hardware, OS (AliOS), and storage platforms eliminate third‑party blind spots.
Multi‑Layer Redundancy: Multi‑AZ deployments, cluster‑level load balancing, and automatic failover reduce failure frequency.
Isolation Design: Unit‑level fault domains prevent a single component failure from cascading.
Resource Reservation: Over‑provisioning ensures service continuity during AZ outages.
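The interplay of multi‑AZ redundancy and resource reservation can be sketched as a placement policy that keeps failover headroom in every zone. This is a toy illustration under assumed names and ratios, not ECS's actual scheduler:

```python
# Hypothetical sketch: place instances only in healthy AZs, while keeping a
# reserved-capacity headroom in each zone for failover traffic.
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    healthy: bool
    capacity: int              # total instance slots
    used: int                  # slots currently in use
    reserve_ratio: float = 0.2 # headroom kept free for AZ-failure absorption


    def available(self) -> int:
        """Slots usable for normal placement, excluding the failover reserve."""
        usable = int(self.capacity * (1 - self.reserve_ratio))
        return max(0, usable - self.used)


def place_instance(zones: list) -> str:
    """Place on the healthy zone with the most non-reserved capacity."""
    candidates = [z for z in zones if z.healthy and z.available() > 0]
    if not candidates:
        raise RuntimeError("no capacity outside reserved headroom")
    best = max(candidates, key=lambda z: z.available())
    best.used += 1
    return best.name
```

During an AZ outage the reserved slots in surviving zones absorb the displaced instances, which is what "over‑provisioning ensures service continuity" means in practice.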
2.3 Controlled Change Management
Changes follow graded approval, automated pipelines, and extensive gray‑release testing. High‑risk updates undergo multi‑level validation, including unit, integration, chaos, and full‑link stress tests.
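A core building block of gray release is deterministic cohort assignment: each instance hashes into a stable bucket, so ramping the rollout percentage only ever adds instances to the exposed set. A minimal sketch (helper names are illustrative):

```python
# Deterministic percentage-based gray-release gate (illustrative sketch).
import hashlib


def in_gray_cohort(instance_id: str, rollout_percent: float) -> bool:
    """Bucket an instance into [0, 100) by stable hash; expose it if its
    bucket falls below the current rollout percentage."""
    digest = hashlib.sha256(instance_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    return bucket < rollout_percent
```

Because the bucket is a pure function of the instance ID, raising the rollout from 1% to 5% keeps the original 1% cohort exposed, which makes regressions attributable and rollbacks clean.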
2.4 “Fail‑Ops” – Proactive Fault Injection
Continuous chaos engineering, automated detection, and post‑mortem analysis turn incidents into permanent safeguards.
2.5 AI‑Ops
Machine‑learning models analyze massive monitoring data to predict anomalies early; large language models assist in log analysis and automated remediation.
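The anomaly-prediction idea can be illustrated with a toy rolling z‑score detector; Alibaba's production models are far more sophisticated, so treat this purely as a sketch of the monitoring-data-to-early-warning loop:

```python
# Toy rolling z-score anomaly detector over a metric stream (a sketch, not
# the production ML pipeline described in the article).
from collections import deque
import statistics


class ZScoreDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.window = deque(maxlen=window)  # recent metric samples
        self.threshold = threshold          # z-score alarm threshold

    def observe(self, value: float) -> bool:
        """Return True if value deviates sharply from the recent window."""
        anomalous = False
        if len(self.window) >= 10:  # wait for a minimal baseline
            mean = statistics.fmean(self.window)
            stdev = statistics.pstdev(self.window) or 1e-9
            anomalous = abs(value - mean) / stdev > self.threshold
        self.window.append(value)
        return anomalous
```
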
3. Storage – Multi‑Layer Data Reliability
3.1 Stability Risks
Risks stem from hardware failures, software bugs, and human error. OSS (Object Storage Service) mitigates these through data redundancy, rigorous change control, and silent‑error detection.
3.2 Data Reliability Guarantees
OSS employs locally redundant (LRS) and zone‑redundant (ZRS) architectures with erasure coding (EC) to achieve twelve nines (99.9999999999%) of data durability. Data fragments are distributed across racks and AZs, and AZ‑level coding reduces cross‑AZ bandwidth amplification during recovery.
3.3 Change Control Mechanism
Fine‑grained gray releases, automated anomaly detection, and full‑stack validation ensure that new code or hardware does not compromise data integrity.
3.4 Silent‑Error Handling
Scrub scans that sweep HDDs for silent errors, CPU silent‑data‑error (SDE) detection tools, and CRC checks on the write/read paths together detect and correct silent faults before they reach customers.
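The write/read-path CRC idea is simple end‑to‑end checksumming: compute a checksum when data enters the system, store it alongside the block, and verify it on every read. A minimal sketch (a dict stands in for the storage backend, and `zlib.crc32` stands in for whatever CRC variant the real path uses):

```python
# End-to-end CRC on a toy write/read path: verify the stored checksum on
# every read to catch silent corruption between write and read.
import zlib


def write_block(storage: dict, key: str, data: bytes) -> None:
    """Store the block together with its CRC computed at write time."""
    storage[key] = (data, zlib.crc32(data))


def read_block(storage: dict, key: str) -> bytes:
    """Recompute the CRC on read; mismatch means silent corruption."""
    data, crc = storage[key]
    if zlib.crc32(data) != crc:
        raise IOError(f"silent corruption detected in block {key!r}")
    return data
```

Scrub scans then do the same verification proactively in the background, so corruption is found even on cold data that no one is reading.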
3.5 Data‑Driven Operations
Real‑time metrics drive capacity planning and automatic throttling during AZ failures, preserving performance and availability.
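Automatic throttling during an AZ failure is commonly built on a rate limiter such as a token bucket; a toy sketch (rate and burst values are illustrative, and the real system is metric-driven rather than fixed):

```python
# Toy token bucket of the kind that might cap recovery or background traffic
# during an AZ failure so foreground workloads keep their performance.
import time


class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate = rate            # tokens replenished per second
        self.capacity = burst       # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; refuse (throttle) otherwise."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```
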
4. Network – Cloud Network Stability
4.1 Full‑Stack Self‑Developed Benefits
Alibaba’s self‑developed Luoshen network stack (SDN controller, programmable switches, and the CStar elastic data plane) eliminates reliance on black‑box components, enabling rapid root‑cause isolation and sub‑second recovery.
4.2 Proactive Fault‑Domain Isolation
Techniques such as AZ isolation, horizontal sharding, and shuffle sharding limit blast radius, ensuring that a single failure impacts only a small, random subset of tenants.
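Shuffle sharding in particular has a compact core: each tenant is deterministically assigned a small pseudo-random subset of backend nodes, so two tenants rarely share their entire shard and a failure (or noisy neighbor) on one shard touches only a small, random subset of tenants. A sketch under assumed names:

```python
# Shuffle-sharding sketch: derive a stable per-tenant seed, then sample a
# small shard of backend nodes for that tenant.
import hashlib
import random


def shuffle_shard(tenant_id: str, nodes: list, shard_size: int) -> list:
    """Deterministically pick `shard_size` nodes for this tenant."""
    seed = int.from_bytes(
        hashlib.sha256(tenant_id.encode()).digest()[:8], "big")
    return random.Random(seed).sample(nodes, shard_size)
```

With, say, 2‑node shards drawn from 16 nodes there are 120 possible shards, so the probability that two tenants share a complete shard, and thus fail together, is small, and it shrinks combinatorially as the node pool grows.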
4.3 Static Stability and Resource Reservation
Critical control‑plane services reserve sufficient resources to avoid overload, while data‑plane convergence handles failures within milliseconds.
4.4 Chaos Engineering & “Fail‑Ops”
Continuous fault injection (e.g., link degradation, AZ loss) validates resilience; lessons are encoded into automated safeguards.
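A common mechanical form for such fault injection is a wrapper that adds latency or errors to a call at a configured probability; the decorator below is a toy in that spirit (names and fault types are illustrative, not Alibaba's tooling):

```python
# Minimal fault-injection decorator: add latency (link degradation) and/or
# probabilistic failure (AZ loss) to a wrapped call, so resilience paths are
# exercised continuously.
import functools
import random
import time


def inject_fault(p_fail: float = 0.0, added_latency_s: float = 0.0,
                 rng=random):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if added_latency_s:
                time.sleep(added_latency_s)     # simulate link degradation
            if rng.random() < p_fail:
                raise ConnectionError("injected fault")  # simulate AZ loss
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Callers that survive with injection enabled demonstrate that their retry, failover, and timeout paths actually work; callers that don't become the "lessons encoded into automated safeguards."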
4.5 Deterministic Change Management
Standardized, console‑driven (“white‑screen,” i.e., GUI rather than raw command‑line) releases with automated rollback and AI‑driven monitoring enable zero‑downtime deployments at massive scale.
5. Conclusion
Reliance on application‑level high availability is insufficient for modern cloud workloads. Alibaba Cloud therefore invests heavily in infrastructure‑level stability across compute, storage, and network, combining deep self‑control, rigorous metrics, proactive fault injection, and AI‑driven operations to deliver “rock‑solid” IaaS services that support both external customers and Alibaba Group’s own critical systems.