From Taobao to the Cloud: Secrets of Building Ultra‑High‑Availability Systems
This talk shares practical high‑availability strategies learned from Alibaba’s Taobao platform and Alibaba Cloud, covering traditional IDC stability, cache and disaster‑recovery designs, cloud‑native fault‑tolerance, performance‑capacity trade‑offs, traffic shaping, multi‑region replication, and lessons from real‑world incidents like GitLab failures.
In this presentation the speaker discusses high‑availability practices drawn from recent years of work on Alibaba’s e‑commerce platform and Alibaba Cloud, dividing the content into two parts: traditional Taobao store stability and modern public‑cloud HA design.
Traditional Taobao Store Stability
The Taobao shop system is a classic high‑concurrency web service. At peak during Double‑11 it handled up to 200,000 page requests per second, and each page made about 20 RPC calls, for roughly 4 million RPC calls per second (200,000 × 20). The architecture comprises DNS, CDN, Layer 4 and Layer 7 load balancers, a unified access layer, application clusters, storage, and cache services.
Performance and stability have to be balanced against each other. Adding machines is not always feasible, so careful capacity planning, caching, and efficient RPC handling are essential.
Operating‑system‑level profiling with perf revealed costly exceptions and GC overhead. Replacing a third‑party library that threw EOFException for control flow reduced CPU usage dramatically. JVM constant‑pool sizing also impacted performance.
Warm‑up issues were addressed by collecting hot methods in production and pre‑compiling them to native code before traffic arrives, similar to Azul’s ReadyNow or IBM J9 AOT.
Cache Strategies
During Double‑11 a shop’s core services processed 10 billion calls daily. A “rich client” was introduced: it first checks a distributed cache and only falls back to the backend service on a miss. With a 98 % hit rate, only 2 % of traffic reaches the backend, dramatically reducing required server capacity.
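The cache-first lookup of the rich client can be sketched as follows. This is a minimal illustration, not the Taobao implementation: the class name `RichClient`, the `shop_backend` function, and the in-process dictionary standing in for the distributed cache are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional


class RichClient:
    """Client-side wrapper: check the cache first, call the backend only on a miss."""

    def __init__(self, backend: Callable[[str], str],
                 cache: Optional[Dict[str, str]] = None) -> None:
        self._backend = backend  # hypothetical backend RPC, passed in as a function
        # A plain dict stands in for the distributed cache (assumption).
        self._cache: Dict[str, str] = cache if cache is not None else {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._backend(key)   # only misses reach the backend
        self._cache[key] = value     # populate so the next read is a hit
        return value

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


backend_calls: List[str] = []


def shop_backend(key: str) -> str:
    """Hypothetical backend service; records each call so we can count them."""
    backend_calls.append(key)
    return f"shop-data-for-{key}"


client = RichClient(shop_backend)
for i in range(100):
    client.get(f"shop-{i % 2}")  # 100 reads spread over 2 distinct keys

print(len(backend_calls), client.hit_rate())
```

In this toy run only the first read of each key reaches the backend, so the backend sees a small fraction of the total traffic. A production client would also need expiry, invalidation, and handling for cache outages, which the sketch leaves out.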
Two classic cache deployment models are described. In the shared‑cluster model, one cache cluster spans two data centers, so a data‑center outage loses only part of the cached data (roughly half), but the hit rate drops until the lost entries are refilled. In the independent‑deployment model, each data center holds its own replica of the cache, which keeps the hit rate high during an outage at the cost of more hardware.
Database caching is also discussed. For MySQL InnoDB the buffer pool must be warm. With a cold buffer pool most reads go to disk, so a freshly started database in a newly added data center cannot sustain production traffic.
Edge caching at the CDN or API‑gateway level with a very short TTL (50‑100 ms) can mitigate DDoS attacks by absorbing burst traffic: repeated requests for the same resource within the TTL are served from the cache instead of reaching the origin.
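A rough calculation shows why the hit rate during an outage matters so much. The 10 billion daily calls and the 98 % hit rate come from the talk. The outage figures are assumptions for illustration only: that a shared cluster spreads keys evenly over both data centers, so that an outage loses about half of the cached keys, and that an independent deployment keeps a full copy in each data center.

```python
def backend_load(total_calls: float, hit_rate: float) -> float:
    """Number of calls that miss the cache and reach the backend."""
    return total_calls * (1.0 - hit_rate)


DAILY_CALLS = 10_000_000_000  # from the talk: 10 billion calls per day
NORMAL_HIT_RATE = 0.98        # from the talk

# Assumption: half of the cached keys disappear with the failed data center.
shared_outage_hit_rate = NORMAL_HIT_RATE / 2
# Assumption: the surviving data center still holds a full cache replica.
independent_outage_hit_rate = NORMAL_HIT_RATE

normal = backend_load(DAILY_CALLS, NORMAL_HIT_RATE)
shared = backend_load(DAILY_CALLS, shared_outage_hit_rate)
independent = backend_load(DAILY_CALLS, independent_outage_hit_rate)

print(f"normal backend calls/day:      {normal:,.0f}")
print(f"shared model during outage:    {shared:,.0f}")
print(f"independent model in outage:   {independent:,.0f}")
print(f"shared-model load multiplier:  {shared / normal:.1f}x")
```

Under these assumptions the backend in the shared model would see roughly 25 times its normal load until the cache refills, while the independent model leaves backend load unchanged. Real numbers depend on key distribution and refill speed.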
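The effect of a very short TTL can be sketched with a small in-memory cache. This is an illustrative model rather than a CDN or gateway configuration: `MicroCache`, `FakeClock`, and `origin` are hypothetical names, and the clock is injected so the example runs deterministically.

```python
import time
from typing import Callable, Dict, Tuple


class MicroCache:
    """Caches each response for a very short TTL so a burst collapses into one origin call."""

    def __init__(self, origin: Callable[[str], str], ttl_seconds: float = 0.1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._origin = origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]                  # still fresh: origin is not touched
        value = self._origin(key)            # expired or missing: one origin call
        self._entries[key] = (now + self._ttl, value)
        return value


class FakeClock:
    """Manually advanced clock, used only to make the demo deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


origin_calls = 0


def origin(path: str) -> str:
    """Hypothetical origin server; counts how often it is actually hit."""
    global origin_calls
    origin_calls += 1
    return f"page:{path}"


clock = FakeClock()
cache = MicroCache(origin, ttl_seconds=0.1, clock=clock)

for _ in range(10_000):
    cache.get("/shop/42")        # burst arriving inside one 100 ms window
calls_during_burst = origin_calls

clock.now = 0.2                  # move past the TTL
cache.get("/shop/42")
calls_after_expiry = origin_calls

print(calls_during_burst, calls_after_expiry)
```

With a 100 ms TTL the origin sees at most about ten requests per second per cached resource, regardless of how many clients ask for it. The sketch does not cover attacks that vary the URL to defeat the cache.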
Rate Limiting, Degradation, and Feature Toggles
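Two classic HA techniques are rate limiting and degradation. Rate limiting applies in both directions: a service rejects or queues incoming requests once they exceed its capacity, and it throttles its own calls to downstream dependencies (e.g., databases) when those are under stress. Degradation switches off or simplifies non‑essential functions so that the core path keeps working. Together they help keep the overall system alive during spikes.
Feature toggles make it possible to switch back to the previous behavior instantly, without a lengthy redeployment. This is crucial for large Java services, where a restart can take minutes.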
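One common way to combine the two techniques is a token bucket in front of the handler, with a degraded response when no token is available. The talk does not say which algorithm Taobao uses, so the token bucket here is a stand-in, and `TokenBucket`, `handle`, `full_page`, and `degraded_page` are hypothetical names.

```python
import time
from typing import Callable, List


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._tokens = capacity
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def handle(request_id: int, limiter: TokenBucket,
           full_handler: Callable[[int], str],
           degraded_handler: Callable[[int], str]) -> str:
    """Serve the full response if within the limit, otherwise a degraded one."""
    if limiter.try_acquire():
        return full_handler(request_id)
    return degraded_handler(request_id)


def full_page(request_id: int) -> str:
    return f"full:{request_id}"          # hypothetical expensive path


def degraded_page(request_id: int) -> str:
    return f"degraded:{request_id}"      # hypothetical cheap fallback, e.g. static content


fake_now = [0.0]                          # manually advanced clock for the demo
limiter = TokenBucket(rate_per_sec=5, capacity=5, clock=lambda: fake_now[0])

burst: List[str] = [handle(i, limiter, full_page, degraded_page) for i in range(8)]
fake_now[0] = 1.0                         # one second later the bucket has refilled
later = handle(8, limiter, full_page, degraded_page)

print(burst, later)
```

The same limiter can wrap outbound calls to a database or another downstream service to protect it when it is under stress.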
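A feature toggle can be as simple as a runtime flag that selects between the old and the new code path. The sketch below keeps flags in memory. In practice they would typically come from a configuration service, which is an assumption here, as are the names `FeatureToggles`, `new_shop_renderer`, and the two render functions.

```python
from typing import Dict, Optional


class FeatureToggles:
    """In-memory flag store; a real system would back this with a config service."""

    def __init__(self, initial: Optional[Dict[str, bool]] = None) -> None:
        self._flags: Dict[str, bool] = dict(initial or {})

    def is_enabled(self, name: str, default: bool = False) -> bool:
        return self._flags.get(name, default)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def render_shop_page_v1(shop_id: int) -> str:
    return f"v1 page for shop {shop_id}"   # previous, known-good behavior


def render_shop_page_v2(shop_id: int) -> str:
    return f"v2 page for shop {shop_id}"   # newly released behavior


def render_shop_page(shop_id: int, toggles: FeatureToggles) -> str:
    # Both versions ship in the same build; the flag picks one at request time.
    if toggles.is_enabled("new_shop_renderer"):
        return render_shop_page_v2(shop_id)
    return render_shop_page_v1(shop_id)


toggles = FeatureToggles({"new_shop_renderer": True})
before_rollback = render_shop_page(7, toggles)

toggles.set("new_shop_renderer", False)    # "rollback" without restart or redeploy
after_rollback = render_shop_page(7, toggles)

print(before_rollback)
print(after_rollback)
```

Because both code paths are deployed together, flipping the flag takes effect on the next request, which avoids the restart time mentioned above.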
Disaster Recovery and Cloud‑Native Design
Real‑world failure cases are examined: the GitLab incident in which a production database was deleted by mistake and had to be restored through a lengthy backup recovery, and the AWS S3 outage in which a mistyped operational command took core services offline. These illustrate the need for regular fault‑injection drills and a deep understanding of third‑party dependencies.
Designing for disaster recovery involves planning for component failures (e.g., new JARs, RPC services, distributed storage) and ensuring graceful degradation paths.
In the cloud era, resources can be provisioned via APIs (VMs, load balancers, storage). This enables software‑defined infrastructure (SDI) where infrastructure can be created or destroyed programmatically, supporting horizontal scaling and multi‑region designs.
Containers and micro‑services facilitate stateless service scaling. A typical architecture includes a load balancer, stateless front‑end services, stateless back‑end services, and underlying MQ, object storage, and databases.
Even cloud services can fail. Designs must therefore assume that any component (VM, load balancer, message queue, database) may become unavailable, and must include mechanisms to shift traffic away from the failed zone to healthy ones.
Choosing multiple Availability Zones (AZs) and Regions for resources (e.g., SLB, RDS) allows automatic failover. Health checks at the load‑balancer level and DNS failover are used to redirect traffic when a zone fails.
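Health-check-driven failover can be modeled in a few lines. This sketch only illustrates the decision logic and is not how SLB or DNS failover is implemented. The zone names, the `ZoneRouter` class, and the threshold of three consecutive failed checks are assumptions chosen for the example.

```python
from typing import Callable, Dict, List, Optional


class ZoneRouter:
    """Routes to the first preferred zone that has not failed too many checks in a row."""

    def __init__(self, zones: List[str], health_check: Callable[[str], bool],
                 unhealthy_threshold: int = 3) -> None:
        self._zones = list(zones)              # ordered by preference
        self._check = health_check
        self._threshold = unhealthy_threshold
        self._failures: Dict[str, int] = {zone: 0 for zone in self._zones}

    def run_health_checks(self) -> None:
        """One probing round: reset the counter on success, increment on failure."""
        for zone in self._zones:
            if self._check(zone):
                self._failures[zone] = 0
            else:
                self._failures[zone] += 1

    def healthy_zones(self) -> List[str]:
        return [z for z in self._zones if self._failures[z] < self._threshold]

    def pick_zone(self) -> Optional[str]:
        healthy = self.healthy_zones()
        return healthy[0] if healthy else None


# Simulated zone status; a real check would probe an endpoint in each zone.
zone_up: Dict[str, bool] = {"zone-a": True, "zone-b": True}
router = ZoneRouter(["zone-a", "zone-b"], health_check=lambda z: zone_up[z])

router.run_health_checks()
initial = router.pick_zone()

zone_up["zone-a"] = False                      # zone-a fails
for _ in range(3):
    router.run_health_checks()                 # three consecutive failed probes
after_failure = router.pick_zone()

zone_up["zone-a"] = True                       # zone-a recovers
router.run_health_checks()
after_recovery = router.pick_zone()

print(initial, after_failure, after_recovery)
```

Requiring several consecutive failures before removing a zone avoids flapping on a single lost probe, at the cost of a slightly slower cutover.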
Cross‑region replication is illustrated: data written in Beijing is replicated to Shanghai for read‑only access, enabling rapid traffic cutover when a region experiences issues.