From Single Server to Cloud‑Native: How Taobao Scaled to Millions of Users
This article traces Taobao’s architectural evolution—from a single‑machine setup to distributed clusters, caching layers, load‑balancing proxies, database sharding, microservices, ESB, containerization, and finally cloud‑native deployment—highlighting the technologies and design principles that enable scaling from hundreds to tens of millions of concurrent users.
1. Overview
This article uses Taobao as an example to illustrate the evolution of server‑side architecture from handling a few hundred requests to supporting tens of millions of concurrent users, listing the technologies encountered at each stage and summarizing key design principles.
Note: The example of Taobao is for illustration only and does not reflect the actual technical evolution of Taobao.
2. Basic Concepts
Before discussing architecture, the article defines fundamental concepts:
Distributed: Multiple modules of a system are deployed on different servers, e.g., Tomcat and the database on separate machines.
High Availability: The system continues to provide service when some nodes fail.
Cluster: A group of servers that together offer one unified service, typically with automatic failover when a node goes down.
Load Balancing: Distributing requests evenly across multiple nodes.
Forward and Reverse Proxy: A forward proxy handles outbound requests from internal systems to external networks; a reverse proxy receives inbound requests from external clients and forwards them to internal servers.
3. Architecture Evolution
3.1 Single‑Machine Architecture
Initially, Tomcat and the database run on the same server. As user count grows, resource contention makes this setup insufficient.
Tomcat and database compete for resources, and single‑machine performance cannot sustain the business.
3.2 First Evolution: Separate Tomcat and Database
Deploy Tomcat and the database on separate servers, significantly improving performance of each.
Concurrent reads/writes to the database become the new bottleneck.
3.3 Second Evolution: Introduce Local and Distributed Caches
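Add a local cache (inside the JVM, or a memcached instance running on the Tomcat host) and an external distributed cache (e.g., Redis) to hold hot product data or rendered HTML pages, so that most reads never reach the database. Caching brings its own problems that must be handled: cache consistency, cache penetration (repeated lookups for keys that exist nowhere), cache breakdown (a hot key expiring under heavy load), and many hot entries expiring at the same time.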
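As a rough illustration of how a cache in front of the database can guard against penetration and breakdown, here is a minimal in-process sketch. The class name `CacheAside`, the `load_from_db` callback, and the TTL value are assumptions made for this example, not part of any system described in the article; a production setup would more likely use Redis with per-key locks.

```python
import threading
import time


class CacheAside:
    """Cache-aside lookup with negative caching and a single-loader lock."""

    _MISSING = object()  # sentinel cached for keys the database does not have

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db   # hypothetical loader: key -> value or None
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}          # key -> (expires_at, stored_value)
        self._lock = threading.Lock()

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry
        return None

    def get(self, key):
        hit = self._fresh(key)
        if hit is None:
            # Only one thread reloads at a time, so an expired hot key does
            # not send a burst of identical queries to the database.
            with self._lock:
                hit = self._fresh(key)  # another thread may have loaded it
                if hit is None:
                    value = self._load(key)
                    # Cache "not found" too, so repeated lookups for absent
                    # keys (penetration) stop at the cache.
                    stored = self._MISSING if value is None else value
                    hit = (self._clock() + self._ttl, stored)
                    self._entries[key] = hit
        return None if hit[1] is self._MISSING else hit[1]
```

The single global lock keeps the sketch short but serializes all reloads; a per-key lock would be the usual refinement. Cache consistency (invalidating entries when the database changes) is not addressed here.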
Cache handles most traffic, but Tomcat becomes the new performance bottleneck.
3.4 Third Evolution: Reverse Proxy for Load Balancing
Deploy multiple Tomcat instances behind a reverse proxy (Nginx or HAProxy) to distribute requests. This raises the concurrent capacity dramatically.
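The distribution logic a reverse proxy applies can be sketched as a round-robin selector that skips backends marked unhealthy. This is a simplified model for illustration only; it is not how Nginx or HAProxy are implemented, and the class and backend names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Pick backends in rotation, skipping any that are marked down."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._cycle = itertools.cycle(self._backends)

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call before giving up.
        for _ in range(len(self._backends)):
            candidate = next(self._cycle)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backend available")
```

With three Tomcat instances, successive calls to `next_backend()` cycle through them in order; after `mark_down` on one instance, the rotation continues over the remaining two.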
Reverse proxy increases application capacity, but the database becomes the next bottleneck.
3.5 Fourth Evolution: Database Read/Write Separation
Separate read and write databases (e.g., using Mycat) and synchronize data from the write master to multiple read replicas.
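The routing decision behind read/write separation can be sketched as follows. Middleware such as Mycat makes this decision transparently at the proxy layer; the class below is only a hypothetical application-level model and does not reflect Mycat's configuration or API.

```python
import itertools


class ReadWriteRouter:
    """Send plain SELECT statements to replicas and everything else to the primary."""

    def __init__(self, primary, replicas):
        self._primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql, in_transaction=False):
        words = sql.split()
        verb = words[0].upper() if words else ""
        # Reads inside a transaction stay on the primary so they observe
        # the transaction's own writes.
        if verb == "SELECT" and not in_transaction and self._replicas is not None:
            return next(self._replicas)
        return self._primary
```

Because replicas are synchronized from the primary with some delay, a read issued immediately after a write may return stale data; routing such reads to the primary is one common way to handle that.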
Different business workloads compete for database resources, affecting performance.
3.6 Fifth Evolution: Business‑Level Database Sharding
Store data for different business domains in separate databases, reducing contention. Cross‑business queries require additional solutions.
Write database eventually hits performance limits as user count rises.
3.7 Sixth Evolution: Split Large Tables into Small Tables
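Split large tables into smaller ones, for example by hashing a key to choose a table or by partitioning on time, so that each table stays small and the tables can be spread across servers for horizontal scaling. Taken further, this approach leads to distributed databases and MPP (massively parallel processing) architectures such as Greenplum, TiDB, and PostgreSQL-XC.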
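Hash-based routing and time-based partitioning can each be expressed in a few lines. In this sketch the function names, the choice of CRC32 as the hash, and the `orders_YYYYMM` naming scheme are all assumptions for illustration.

```python
import datetime
import zlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # CRC32 is stable across processes, unlike Python's built-in hash()
    # for strings, which is randomized per interpreter run by default.
    return zlib.crc32(key.encode("utf-8")) % shard_count


def monthly_table(base: str, day: datetime.date) -> str:
    """Name of the per-month table that holds rows for the given date."""
    return f"{base}_{day:%Y%m}"
```

Note that simple modulo routing remaps most keys when `shard_count` changes, so growing the number of shards requires data migration or a scheme such as consistent hashing.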
Both Tomcat and database can scale horizontally, but Nginx eventually becomes the bottleneck.
3.8 Seventh Evolution: LVS or F5 for Multi‑Nginx Load Balancing
Introduce Layer‑4 load balancers (LVS software or F5 hardware) to distribute traffic among multiple Nginx instances, achieving higher concurrency. High availability is ensured with keepalived and virtual IPs.
LVS also becomes a bottleneck at hundreds of thousands of concurrent connections.
3.9 Eighth Evolution: DNS Round‑Robin Across Data Centers
Configure DNS to return multiple IPs, each pointing to a different data‑center virtual IP, achieving data‑center‑level load balancing and horizontal scaling.
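The effect of DNS round-robin can be modeled with a toy resolver that rotates the order of the addresses it returns on each query. The class, the domain `shop.example`, and the IP addresses (taken from documentation-reserved ranges) are hypothetical.

```python
from collections import deque


class RoundRobinDns:
    """Toy resolver that rotates the address list after every answer."""

    def __init__(self, records):
        # records: domain name -> list of data-center virtual IPs
        self._records = {name: deque(ips) for name, ips in records.items()}

    def resolve(self, name):
        ips = self._records[name]
        answer = list(ips)
        ips.rotate(-1)  # the next query sees a different first address
        return answer
```

Clients typically connect to the first address in the answer, so rotating the order spreads new connections across data centers. Real DNS caching at resolvers and clients makes the distribution coarser than this model suggests.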
Database alone cannot satisfy increasingly rich analytical and retrieval requirements.
3.10 Ninth Evolution: Introduce NoSQL and Search Engines
Adopt specialized solutions such as HDFS for file storage, HBase/Redis for key-value data, Elasticsearch for full-text search, and Kylin/Druid for OLAP analytics.
Adding many components increases system complexity and operational overhead.
3.11 Tenth Evolution: Split Large Application into Smaller Services
Divide the codebase by business module so that each application can be upgraded independently. Shared configuration can be managed via ZooKeeper.
Duplicated common modules across applications make coordinated upgrades difficult.
3.12 Eleventh Evolution: Extract Reusable Functions as Microservices
Common functions (user management, orders, payments, authentication) become independent microservices accessed via HTTP, TCP, or RPC. Service governance frameworks such as Dubbo or Spring Cloud provide rate limiting, circuit breaking, and similar protections.
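To make circuit breaking concrete, here is a minimal sketch of the pattern. It is not the Dubbo or Spring Cloud API; the class names, thresholds, and the injectable `clock` parameter are assumptions chosen so the example stays self-contained and testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit is open; call rejected")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures,
            # (re)opens the circuit and restarts the cool-down.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each remote invocation, for example `breaker.call(fetch_user, user_id)` with a hypothetical `fetch_user` function, and fall back to a default response when `CircuitOpenError` is raised. The sketch is not thread-safe.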
Different access protocols and inter‑service calls increase complexity.
3.13 Twelfth Evolution: Enterprise Service Bus (ESB) for Unified Access
Use an ESB to convert between access protocols so that applications and services communicate through one uniform interface, which resembles a service-oriented architecture (SOA).
Increasing number of services and deployment environments makes operations challenging, especially for dynamic scaling during peak events.
3.14 Thirteenth Evolution: Containerization
Adopt Docker for container images and Kubernetes for orchestration, enabling rapid deployment, isolation, and scaling of services.
While containers solve dynamic scaling, the underlying machines still need to be provisioned and managed, leading to under‑utilized resources outside peak periods.
3.15 Fourteenth Evolution: Move to Cloud Platform
Deploy the system on public cloud (IaaS, PaaS, SaaS), leveraging elastic resources to handle traffic spikes and releasing them afterward, achieving cost‑effective scalability.
IaaS: Infrastructure as a Service – dynamic hardware resources.
PaaS: Platform as a Service – ready‑made technical components.
SaaS: Software as a Service – fully developed applications.
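The elastic-scaling idea (acquire capacity for a spike, release it afterward) reduces to a small sizing calculation. The function name, the per-instance capacity, and the instance bounds below are hypothetical values for illustration, not figures from any real deployment.

```python
import math


def desired_instances(current_rps, rps_per_instance, min_instances=2, max_instances=100):
    """Number of instances needed for the current load, within fixed bounds."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    needed = math.ceil(current_rps / rps_per_instance)
    # The floor keeps some redundancy at idle; the ceiling caps cost.
    return max(min_instances, min(max_instances, needed))
```

Re-evaluating this on a schedule and adjusting the instance count is the essence of autoscaling; cloud platforms add smoothing and cool-down periods on top to avoid oscillation.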
Although the article covers many scaling solutions, topics such as cross‑region data synchronization and distributed transactions are omitted for future discussion.
4. Architecture Design Summary
Must the architecture follow this exact evolution? No; real scenarios may require addressing multiple issues simultaneously or following a different order.
How detailed should the design be? For a one‑off project, meet the performance targets; for continuously growing systems, design for future growth and iterate.
Difference between server‑side and big‑data architecture? Big‑data architecture focuses on data collection, storage, processing, and analysis, while server‑side architecture deals with how applications are organized, and it often relies on big‑data components underneath.
Design principles
N+1 design – no single point of failure.
Rollback capability.
Feature toggle for quick disable.
Built‑in monitoring.
Multi‑active data centers for high availability.
Use mature, well‑supported technologies.
Resource isolation.
Horizontal scalability.
Buy non‑core components.
Enterprise‑grade hardware.
Rapid iteration of small features.
Stateless service interfaces.
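Of the principles above, the feature toggle is the easiest to show in code. The sketch below uses an in-memory flag store with hypothetical names; in practice the flags would usually live in a shared configuration service so they can be flipped without a redeploy.

```python
class FeatureToggles:
    """In-memory flag store; unknown flags fall back to a default."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name, default=False):
        return self._flags.get(name, default)

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)


def recommendations_for(toggles, user_id):
    # When the flag is off, the feature degrades to an empty result
    # instead of calling the (hypothetical) recommendation backend.
    if not toggles.is_enabled("recommendations"):
        return []
    return [f"item-for-{user_id}"]
```

Turning the flag off with `toggles.set("recommendations", False)` disables the feature immediately, which is the quick-disable behavior the principle calls for.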
21CTO