How Taobao Scaled from 100 to Millions of Users: Backend Evolution
Using Taobao as a case study, this article traces the architectural evolution from a single-server setup handling hundreds of requests to a multi-layered, distributed system capable of supporting millions of concurrent users, detailing each stage’s challenges, technologies such as caching, load balancing, microservices, and cloud deployment.
1. Overview
This article uses Taobao as an example to illustrate the evolution of server‑side architecture from handling a hundred concurrent requests to supporting tens of millions, listing relevant technologies at each stage and summarizing architectural design principles.
2. Basic Concepts
Before discussing architecture, basic concepts are introduced:
Distributed : Multiple modules deployed on different servers, e.g., Tomcat and database on separate machines.
High Availability : System continues to provide service when some nodes fail.
Cluster : A group of servers offering a unified service, with automatic failover.
Load Balancing : Distributing requests evenly across nodes.
Forward and Reverse Proxy : Forward proxy handles outbound requests from internal systems; reverse proxy forwards inbound external requests to internal servers.
3. Architecture Evolution
3.1 Single‑machine Architecture
Initially, Tomcat and the database are deployed on the same server. Users access www.taobao.com, DNS resolves to an IP, and the browser contacts the Tomcat on that server.
As user numbers grow, Tomcat and the database compete for resources, and a single machine cannot meet performance needs.
3.2 First Evolution: Separate Tomcat and Database
Tomcat and the database are placed on separate servers, significantly improving performance of each.
With increasing users, concurrent reads/writes to the database become a bottleneck.
3.3 Second Evolution: Introduce Local and Distributed Caches
Local cache is added within Tomcat/JVM, and a distributed cache (e.g., Redis) is added externally to store hot product data or HTML pages, intercepting most requests before they hit the database.
Technologies include Memcached for local cache, Redis for distributed cache, and concerns such as cache consistency, penetration, breakdown, avalanche, and hot‑data expiration.
Cache handles most traffic, but as users increase, load shifts to Tomcat, slowing responses.
3.4 Third Evolution: Reverse Proxy for Load Balancing
Multiple Tomcat instances are deployed, and a reverse‑proxy like Nginx distributes requests evenly. Assuming each Tomcat handles 100 concurrent connections and Nginx 50,000, Nginx can forward traffic to 500 Tomcat instances.
Technologies: Nginx, HAProxy, session sharing, file upload/download.
Reverse proxy greatly increases application concurrency, but the database becomes the new bottleneck.
3.5 Fourth Evolution: Database Read‑Write Separation
Database is split into a write master and multiple read replicas, synchronized to keep data consistent. For the latest data, an extra copy can be kept in cache.
Technology: Mycat middleware for read/write splitting and sharding.
Business traffic varies widely; different services compete for the same database, affecting performance.
3.6 Fifth Evolution: Business‑Level Sharding
Different business data are stored in separate databases, reducing resource contention. High‑traffic businesses can have more servers. Cross‑business joins require additional solutions.
As users grow, the single write database eventually hits performance limits.
3.7 Sixth Evolution: Splitting Large Tables
Large tables are partitioned into smaller ones, e.g., hashing comments by product ID or creating hourly tables for payment records, routing data accordingly.
This enables horizontal scaling of the database, but increases DBA workload and leads to distributed database (MPP) architectures.
Popular MPP databases: Greenplum, TiDB, PostgreSQL‑XC, HAWQ, GBase, SnowballDB, LibrA, each with different focuses (OLTP vs OLAP).
Both database and Tomcat can scale horizontally, but eventually Nginx becomes the bottleneck.
3.8 Seventh Evolution: LVS or F5 for Multi‑Nginx Load Balancing
When Nginx is a bottleneck, layer‑4 load balancers like LVS (software) or F5 (hardware) distribute traffic among multiple Nginx instances, supporting hundreds of thousands of concurrent connections.
LVS can be made highly available with keepalived and virtual IPs.
Even with multiple Nginx, as concurrency reaches hundreds of thousands, the load balancer itself becomes a bottleneck, and geographic latency emerges.
3.9 Eighth Evolution: DNS Round‑Robin Across Data Centers
DNS is configured with multiple IPs, each pointing to a different data center. Clients receive one IP via round‑robin, achieving data‑center‑level load balancing.
As data volume and business complexity increase, the database alone cannot satisfy retrieval and analysis needs.
3.10 Ninth Evolution: Introduce NoSQL and Search Engines
When data scale exceeds relational database capabilities, technologies such as HDFS, HBase, MongoDB, Redis, ElasticSearch, Kylin, and Druid are introduced for storage, key‑value access, full‑text search, and multidimensional analysis.
Adding more components raises system complexity, requiring synchronization, consistency, and operational management.
3.11 Tenth Evolution: Split Monolithic Application into Smaller Services
Code is divided by business domain, allowing independent deployment and scaling. Shared configuration can be managed by Zookeeper.
Shared modules duplicated across applications increase maintenance effort during upgrades.
3.12 Eleventh Evolution: Extract Reusable Functions as Microservices
Common functionalities (user management, order, payment, authentication) are extracted into independent services accessed via HTTP, TCP, or RPC. Frameworks like Dubbo and Spring Cloud provide governance, rate limiting, circuit breaking, and degradation.
Different services may use different access protocols, leading to complex call chains.
3.13 Twelfth Evolution: Enterprise Service Bus (ESB) to Hide Interface Differences
ESB performs protocol conversion, allowing applications and services to communicate uniformly, reducing coupling. This resembles SOA, which can be confused with microservices.
Growing numbers of applications and services complicate deployment and environment conflicts; dynamic scaling for large events becomes operationally heavy.
3.14 Thirteenth Evolution: Containerization for Isolation and Dynamic Management
Docker packages applications/services as images; Kubernetes orchestrates deployment, scaling, and resource isolation.
During peak periods, containers can be launched on reserved machines; after the event, they are stopped, leaving other services unaffected.
Containers solve dynamic scaling but still require on‑premise hardware, leading to idle resources and high costs.
3.15 Fourteenth Evolution: Cloud Platform as the Host
Deploying to public cloud provides elastic resources; services can be provisioned on demand during traffic spikes and released afterward, achieving pay‑as‑you‑go and reducing operational costs.
Cloud layers:
IaaS – infrastructure as a service, providing virtualized hardware.
PaaS – platform as a service, offering common components (e.g., Hadoop, MPP databases).
SaaS – software as a service, delivering ready‑made applications.
4. Architecture Design Summary
Architecture adjustments need not follow a strict sequence; multiple issues may be addressed simultaneously based on actual bottlenecks.
For a system with clear performance targets, design enough to meet those targets while leaving room for future expansion. For continuously evolving platforms like e‑commerce, design for the next growth stage and iterate.
Service‑side architecture differs from big‑data architecture: the former focuses on application organization, while the latter provides underlying storage, computation, and analysis capabilities.
Key design principles include:
N+1 design – no single point of failure.
Rollback capability.
Feature toggle for quick disablement.
Built‑in monitoring.
Multi‑active data centers for high availability.
Use mature, supported technologies.
Resource isolation to prevent one business from monopolizing resources.
Horizontal scalability.
Purchase non‑core solutions when appropriate.
Commercial hardware for reliability.
Rapid iteration of small features.
Stateless service design.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Interview Crash Guide
Dedicated to sharing Java interview Q&A; follow and reply "java" to receive a free premium Java interview guide.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
