How Did Taobao’s Backend Architecture Evolve to Support Millions of Users?
Using Taobao as a case study, this article traces the step‑by‑step evolution of a high‑traffic backend—from a single‑machine setup through database sharding, caching, load balancing, microservices, containerization, and finally cloud deployment—highlighting the technologies and design principles that enable scaling to millions of concurrent users.
Overall Architecture Evolution Path
Single‑machine architecture
First evolution: Separate Tomcat and database deployment
Second evolution: Introduce local and distributed caching
Third evolution: Add reverse proxy for load balancing
Fourth evolution: Database read/write separation
Fifth evolution: Business‑based database sharding
Sixth evolution: Split large tables into smaller ones
Seventh evolution: Use LVS or F5 for multi‑Nginx load balancing
Eighth evolution: DNS round‑robin for inter‑data‑center load balancing
Ninth evolution: Introduce NoSQL databases and search engines
Tenth evolution: Split a large application into smaller applications
Eleventh evolution: Extract reusable functions into microservices
Twelfth evolution: Introduce Enterprise Service Bus (ESB) to mask service interface differences
Thirteenth evolution: Adopt containerization for environment isolation and dynamic service management
Fourteenth evolution: Host the system on a cloud platform
1. Overview
This article uses Taobao as an example to illustrate the evolution of a server‑side architecture from hundreds to tens of millions of concurrent users, listing the relevant technologies at each stage and summarizing architectural design principles at the end.
Note: The example is for illustration only and does not reflect Taobao’s actual technical evolution.
2. Basic Concepts
Before discussing the architecture, the following fundamental concepts are introduced:
Distributed: Multiple modules deployed on different servers, e.g., Tomcat and database on separate machines.
High Availability: The system continues to provide service when some nodes fail.
Cluster: A group of servers providing a unified service, with automatic failover.
Load Balancing: Distributing requests evenly across multiple nodes.
Forward and Reverse Proxy: Forward proxy handles outbound requests from internal systems; reverse proxy forwards inbound external requests to internal servers.
3. Architecture Evolution
3.1 Single‑machine Architecture
Initially, Tomcat and the database are deployed on the same server. As user numbers grow, resource contention between Tomcat and the database becomes a bottleneck.
3.2 First Evolution: Separate Tomcat and Database
Tomcat and the database each occupy dedicated servers, significantly improving performance. However, concurrent database reads/writes soon become the new bottleneck.
3.3 Second Evolution: Local and Distributed Caching
Introduce local cache (e.g., memcached) and distributed cache (e.g., Redis) to store hot product data or HTML pages, intercepting most requests before they hit the database. Issues such as cache consistency, penetration, breakdown, and hot‑data expiration are addressed.
3.4 Third Evolution: Reverse Proxy for Load Balancing
Deploy multiple Tomcat instances and use a reverse‑proxy (Nginx or HAProxy) to distribute requests. This greatly increases the supported concurrency, but the database soon becomes the limiting factor.
3.5 Fourth Evolution: Database Read/Write Separation
Separate the database into read and write instances; multiple read replicas synchronize from the primary write database. Middleware such as Mycat manages read/write routing and sharding.
3.6 Fifth Evolution: Business‑Based Database Sharding
Store different business data in separate databases to reduce contention. This introduces cross‑business query challenges.
3.7 Sixth Evolution: Split Large Tables into Small Tables
Hash‑based routing splits tables (e.g., comments by product ID, payments by hour). Horizontal scaling is achieved, but operational complexity rises. MPP databases such as Greenplum, TiDB, and PostgreSQL‑XC are introduced.
3.8 Seventh Evolution: LVS/F5 for Multi‑Nginx Load Balancing
LVS (software) or F5 (hardware) operate at layer 4, providing higher performance and protocol support than Nginx. High availability is achieved with keepalived and virtual IPs.
3.9 Eighth Evolution: DNS Round‑Robin for Inter‑Data‑Center Load Balancing
Configure a domain to resolve to multiple IPs, each pointing to a different data‑center, achieving geographic load balancing and horizontal scaling to tens of millions of concurrent users.
3.10 Ninth Evolution: Introduce NoSQL and Search Engines
Adopt HDFS for file storage, HBase/Redis for key‑value, ElasticSearch for full‑text search, and Kylin/Druid for multidimensional analysis, increasing system complexity and consistency challenges.
3.11 Tenth Evolution: Split Large Application into Smaller Applications
Divide code by business modules, allowing independent upgrades. Shared configuration can be managed via Zookeeper.
3.12 Eleventh Evolution: Extract Reusable Functions as Microservices
Common functionalities (user management, order, payment, authentication) become independent services accessed via HTTP, TCP, or RPC. Frameworks such as Dubbo or Spring Cloud provide governance, rate limiting, circuit breaking, and degradation.
3.13 Twelfth Evolution: Enterprise Service Bus (ESB) for Interface Unification
ESB abstracts protocol differences, allowing applications to access backend services uniformly and reducing coupling, similar to SOA architecture.
3.14 Thirteenth Evolution: Containerization (Docker & Kubernetes)
Package services as Docker images and deploy them dynamically with Kubernetes, simplifying scaling and resource isolation.
3.15 Fourteenth Evolution: Cloud Platform Hosting
Deploy the system on public cloud (IaaS, PaaS, SaaS) to leverage elastic resources, reduce operational costs, and achieve on‑demand scaling.
4. Architecture Design Summary
Must the architecture follow the exact evolution path? No; real scenarios may require addressing multiple bottlenecks simultaneously.
How detailed should the design be? Design to meet current performance targets while leaving room for future expansion.
Difference between backend and big‑data architectures? Big‑data architecture focuses on data collection, storage, and analysis (e.g., HDFS, Spark), whereas backend architecture concerns application organization and service delivery.
Design Principles
N+1 design: No single point of failure.
Rollback design: Ensure forward compatibility and version rollback.
Feature toggle design: Ability to disable features quickly.
Monitoring design: Incorporate observability from the start.
Multi‑active data‑center design for high availability.
Adopt mature technologies to avoid hidden bugs.
Resource isolation to prevent one business from monopolizing resources.
Horizontal scalability as a core requirement.
Buy non‑core components when development cost is high.
Use commercial hardware for reliability.
Rapid iteration: Deploy small features quickly for validation.
Stateless service design.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
