
High Availability Practices: From Taobao to Cloud

This talk shares practical high‑availability strategies learned from years of building Taobao’s massive e‑commerce platform and migrating to Alibaba Cloud, covering traditional IDC stability mechanisms, cache and disaster‑recovery designs, cloud‑native fault‑tolerance, capacity planning, rate‑limiting, graceful degradation, and multi‑region resilience.

Architecture Digest

Sep 23, 2021

The speaker presents the topic "High Availability Practices: From Taobao to Cloud", explaining how traditional IDC stability methods and modern cloud‑native designs differ and complement each other.

In the first part, he describes the stability system built for Taobao shop platforms, including link design, caching, and disaster‑recovery solutions that supported billions of daily requests and peak QPS of millions during Double‑11 sales.

The request flow is detailed: DNS resolves to a CDN VIP, the CDN forwards to Layer 4 and Layer 7 load balancers, and traffic then reaches the application cluster, which interacts with caches and backend services. The system handled up to 4 million QPS, which required a careful balance between performance, stability, and cost.

He emphasizes the relationship between performance and stability, noting that sufficient capacity planning is essential for handling massive traffic spikes without sacrificing reliability.
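The capacity-planning point can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `instances_needed`, the per-instance throughput, and the headroom fraction are assumptions rather than figures from the talk. Only the 4 million QPS peak comes from the text.

```python
import math


def instances_needed(peak_qps, qps_per_instance, headroom=0.3):
    """Return how many instances are needed to serve peak_qps.

    headroom is the fraction of each instance's measured capacity that is
    deliberately left unused, so a spike or a lost machine does not push
    the remaining instances past their limit.
    """
    if peak_qps < 0 or qps_per_instance <= 0:
        raise ValueError("peak_qps must be >= 0 and qps_per_instance > 0")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = qps_per_instance * (1 - headroom)
    return math.ceil(peak_qps / usable_qps)


if __name__ == "__main__":
    # Hypothetical: each instance sustains 2,000 QPS in load tests,
    # and half of that is kept in reserve.
    print(instances_needed(4_000_000, 2_000, headroom=0.5))
```

Raising the headroom buys stability at the price of more machines, which is the performance, stability, and cost trade-off the talk describes.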

Using the Linux perf tool, the team identified high CPU overhead caused by excessive exception handling and JVM constant‑pool lookups, leading to code‑level optimizations such as replacing problematic third‑party libraries.

To reduce warm-up latency, they collect hot methods in production and pre-compile them to native code ahead of time (AOT). Tools mentioned in this context include Azul Zing ReadyNow and IBM J9, along with an open-source agent that makes Java method costs visible in perf output.

Cache strategies are discussed: a rich client first checks a distributed cache, achieving a hit rate above 98% and dramatically reducing backend load. Two deployment models are compared: a single cluster shared across data centers versus independent full-replica clusters, with trade-offs between cost and availability.
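A minimal sketch of the cache-first read path may help here. It is not Taobao's client: `CacheAsideReader` is a hypothetical name, and an in-process dictionary stands in for the distributed cache. The hit and miss counters show how a figure such as the 98% hit rate would be measured.

```python
class CacheAsideReader:
    """Check the cache first and fall through to the backend on a miss."""

    def __init__(self, load_from_backend):
        self._cache = {}  # stand-in for a distributed cache client
        self._load = load_from_backend
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Miss: pay the backend cost once, then populate the cache.
        self.misses += 1
        value = self._load(key)
        self._cache[key] = value
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    backend_calls = []

    def load_shop(key):
        backend_calls.append(key)
        return {"id": key}

    reader = CacheAsideReader(load_shop)
    for _ in range(50):
        reader.get("shop:1")
    print(reader.hit_rate(), len(backend_calls))
```

With a hit rate of 98%, only about 2 in every 100 reads reach the backend, which is why losing the cache (cold start, failover to an empty replica) is itself a capacity risk.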

Database caching considerations include warming the InnoDB buffer pool before traffic shifts, and the importance of consistent cache‑database behavior for high‑consistency workloads.

Rate limiting and degradation are presented as classic HA techniques: upstream throttling protects the system from overload, while downstream throttling safeguards dependent services. Regular fault‑injection drills validate the resilience of these mechanisms.
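One common way to implement throttling is a token bucket. The sketch below is a generic single-process version and is not claimed to be the mechanism used at Taobao. The class name, the parameters, and the injectable `clock` are assumptions made for illustration and testability.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # caller should reject or degrade the request


if __name__ == "__main__":
    bucket = TokenBucket(rate=100, capacity=10)
    accepted = sum(bucket.allow() for _ in range(50))
    print(f"accepted {accepted} of 50 back-to-back requests")
```

The same limiter can sit at the entry point to protect the system itself, or wrap calls to a dependency so that the caller does not overwhelm it.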

Feature toggles enable instant rollback of problematic logic, avoiding prolonged downtime during large‑scale releases.
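As a rough illustration, a feature toggle can be as simple as a flag consulted on every request. The names `FeatureToggles`, `render_shop_page`, and `new_recommendation_module` are hypothetical. A production system would typically read flags from a configuration service so that they change without a redeploy.

```python
class FeatureToggles:
    """In-memory flag store; unknown flags default to off."""

    def __init__(self, defaults=None):
        self._flags = dict(defaults or {})

    def is_enabled(self, name):
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)


def render_shop_page(toggles, shop_id):
    # New logic runs only while its flag is on; the stable path stays in
    # place so switching the flag off acts as an immediate rollback.
    if toggles.is_enabled("new_recommendation_module"):
        return f"shop {shop_id}: new layout"
    return f"shop {shop_id}: stable layout"


if __name__ == "__main__":
    toggles = FeatureToggles({"new_recommendation_module": True})
    print(render_shop_page(toggles, 42))
    toggles.set("new_recommendation_module", False)  # roll back
    print(render_shop_page(toggles, 42))
```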

Disaster‑recovery design is integrated into every solution, with clear fallback procedures for each dependent service, and rigorous dependency analysis to prevent hidden failure points.
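One way to make "a fallback for each dependent service" hard to forget is to require it when the dependency is declared. The sketch below assumes a hypothetical `Dependency` wrapper and a made-up recommendation service. It shows the shape of the idea, not the speaker's actual framework.

```python
class Dependency:
    """A downstream call paired with the fallback to use when it fails."""

    def __init__(self, name, call, fallback):
        if fallback is None:
            raise ValueError(f"dependency {name!r} must declare a fallback")
        self.name = name
        self._call = call
        self._fallback = fallback
        self.fallback_count = 0  # worth exporting as a metric

    def invoke(self, *args, **kwargs):
        try:
            return self._call(*args, **kwargs)
        except Exception:
            # Degrade gracefully instead of failing the whole request.
            self.fallback_count += 1
            return self._fallback(*args, **kwargs)


if __name__ == "__main__":
    def fetch_recommendations(shop_id):
        raise TimeoutError("recommendation service unavailable")

    def default_recommendations(shop_id):
        return []  # render the page without the recommendation block

    recs = Dependency("recommendations", fetch_recommendations,
                      default_recommendations)
    print(recs.invoke(42), recs.fallback_count)
```

Listing every `Dependency` in one place also gives a starting point for the dependency analysis mentioned above.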

The speaker introduces cloud‑native HA concepts such as Software‑Defined Infrastructure (SDI), where APIs can provision or destroy VMs, load balancers, and storage on demand, enabling flexible horizontal scaling.

Containerization and micro‑services are advocated to achieve stateless services that can be auto‑scaled horizontally, provided underlying storage remains reliable.

He warns that cloud services themselves can fail; therefore, health checks, DNS failover, and multi‑AZ deployments are essential to maintain service continuity.
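Health checking usually reduces to counting consecutive probe failures per endpoint and withdrawing an endpoint once a threshold is crossed. The following sketch assumes hypothetical endpoint names and a threshold of three. Real load balancers and DNS failover services expose similar settings under their own names.

```python
class HealthTracker:
    """Track consecutive probe failures and report healthy endpoints."""

    def __init__(self, endpoints, unhealthy_after=3):
        self._failures = {endpoint: 0 for endpoint in endpoints}
        self._threshold = unhealthy_after

    def record(self, endpoint, ok):
        # A single success resets the counter; failures accumulate.
        if ok:
            self._failures[endpoint] = 0
        else:
            self._failures[endpoint] += 1

    def healthy(self):
        return [endpoint for endpoint, count in self._failures.items()
                if count < self._threshold]


if __name__ == "__main__":
    tracker = HealthTracker(["az-a.example.internal", "az-b.example.internal"])
    for _ in range(3):
        tracker.record("az-a.example.internal", ok=False)
    print(tracker.healthy())  # only the endpoint that is still passing probes
```

Requiring several consecutive failures avoids withdrawing an endpoint because of one dropped probe, at the cost of slightly slower detection.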

Traffic cutover is presented as the fastest way to recover from failures: multi-region active-active designs allow traffic to be redirected quickly to healthy zones, provided those zones have enough spare capacity to absorb it.
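The capacity condition on a cutover can be checked before any traffic moves. This sketch assumes per-zone load and capacity figures are known and spreads the displaced traffic in proportion to each surviving zone's spare capacity. The function name `cut_over` and that proportional policy are illustrative choices, not something the talk specifies.

```python
def cut_over(load_qps, capacity_qps, failed):
    """Return the new per-zone load after moving traffic off failed zones.

    Raises RuntimeError when the surviving zones cannot absorb the
    displaced traffic, which is the signal to shed load or scale out first.
    """
    survivors = [zone for zone in capacity_qps if zone not in failed]
    displaced = sum(load_qps[zone] for zone in failed)
    spare = {zone: capacity_qps[zone] - load_qps[zone] for zone in survivors}
    total_spare = sum(spare.values())
    if displaced > total_spare:
        raise RuntimeError("surviving zones lack capacity for cutover")
    if displaced == 0:
        return {zone: float(load_qps[zone]) for zone in survivors}
    # Share the displaced traffic in proportion to spare capacity.
    return {zone: load_qps[zone] + displaced * spare[zone] / total_spare
            for zone in survivors}


if __name__ == "__main__":
    load = {"zone-a": 100, "zone-b": 100, "zone-c": 100}
    capacity = {"zone-a": 200, "zone-b": 200, "zone-c": 200}
    print(cut_over(load, capacity, failed={"zone-c"}))
```

Running this check continuously, not just during an incident, shows whether the active-active deployment still has the headroom it was designed with.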

Choosing multiple regions and availability zones for VMs, load balancers, and databases is crucial for building resilient architectures.

A typical public‑cloud architecture diagram is described, showing two cities (Beijing and Shanghai) each with multiple zones, cross‑region replication, and load‑balancing strategies to achieve high availability.

The presentation concludes with a call for engineers to join the team, emphasizing the opportunity to work on cutting‑edge high‑availability systems for massive e‑commerce events.

