From Single Server to Cloud Native: How Taobao Scaled to Millions of Concurrent Users
This article uses Taobao as a case study to trace the evolution of a high‑performance backend architecture from a single‑machine setup to a cloud‑native, micro‑service ecosystem, highlighting the technical challenges and design principles at each scaling stage.
1. Overview
Using Taobao as the example, the walkthrough follows the server‑side architecture as load grows from about a hundred concurrent requests to tens of millions, names the technologies introduced at each stage, and closes with a summary of key design principles.
2. Basic Concepts
Before discussing architecture, the following fundamental concepts are introduced:
Distributed: Multiple modules of a system are deployed on different servers, e.g., Tomcat and the database on separate machines.
High Availability: When some nodes fail, other nodes take over and the service keeps running.
Cluster: A group of servers that together provide one service, such as a ZooKeeper ensemble made up of a leader and several followers.
Load Balancing: Requests are distributed evenly across multiple nodes so that no single node is overloaded.
Forward and Reverse Proxy: A forward proxy accesses external networks on behalf of internal systems; a reverse proxy receives external requests and forwards them to internal servers.
3. Architecture Evolution
Single‑Machine Architecture
In the early days, Tomcat and the database were deployed on the same server. When a browser requests www.taobao.com, DNS first resolves the domain name to an IP address (e.g., 10.102.4.1), and the browser then connects to the Tomcat instance at that address.
Architecture bottleneck: As user count grows, Tomcat and the database compete for resources, and a single machine cannot sustain the load.
First Evolution: Separate Tomcat and Database
Tomcat and the database each occupy their own server, significantly improving performance of both.
Architecture bottleneck: Database read/write becomes the new limiting factor as concurrency rises.
Second Evolution: Introduce Local and Distributed Caches
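A local cache is added on the Tomcat server, either in‑process inside the JVM or as a memcached instance on the same host (memcached runs as its own process, not inside the JVM). A distributed cache (Redis) is deployed externally to store hot product data or rendered HTML pages. Together the caches intercept most read requests before they reach the database, which reduces database pressure dramatically.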
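The lookup order described above (local cache, then distributed cache, then database) can be sketched roughly as follows. This is a minimal illustration, not Taobao's implementation: the two dictionaries stand in for an in‑process cache and Redis, and `TwoLevelCache`, `load_from_db`, and the sample product data are hypothetical names introduced here.

```python
from typing import Callable, Dict, Optional


class TwoLevelCache:
    """Read-through lookup: local cache -> distributed cache -> database."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self.local: Dict[str, str] = {}        # stand-in for an in-process cache
        self.distributed: Dict[str, str] = {}  # stand-in for Redis
        self.load_from_db = load_from_db       # hypothetical database loader
        self.db_hits = 0                       # counts requests that reached the database

    def get(self, key: str) -> Optional[str]:
        # 1. Cheapest lookup: memory of the application server itself.
        if key in self.local:
            return self.local[key]
        # 2. Shared cache: one network hop, still no database work.
        if key in self.distributed:
            value = self.distributed[key]
            self.local[key] = value
            return value
        # 3. Cache miss on both levels: fall through to the database.
        self.db_hits += 1
        value = self.load_from_db(key)
        if value is not None:
            self.distributed[key] = value
            self.local[key] = value
        return value


products = {"sku-1": "Hot product page"}  # hypothetical database content
cache = TwoLevelCache(products.get)
first = cache.get("sku-1")   # goes to the database
second = cache.get("sku-1")  # served from the local cache
```

The sketch leaves out expiry and invalidation, which a real deployment needs so that cached entries do not drift from the database.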
Architecture bottleneck: Cache handles most traffic, but the remaining load stresses the single Tomcat, causing response latency.
Third Evolution: Reverse Proxy for Load Balancing
Multiple Tomcat instances are deployed, and a reverse proxy (Nginx or HAProxy) distributes requests evenly among them. Assuming each Tomcat handles 100 concurrent connections and Nginx handles 50,000, the system can theoretically support 50,000 concurrent users once enough Tomcat instances sit behind the proxy (500 in this example, since 500 × 100 = 50,000).
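The arithmetic behind that figure, and the even distribution the proxy performs, can be illustrated with a short sketch. It assumes plain round‑robin selection; the backend names and helper functions are hypothetical, and real proxies such as Nginx offer other policies as well.

```python
from itertools import cycle
from typing import Iterator, List


def round_robin(backends: List[str]) -> Iterator[str]:
    """Yield backends in turn, forever, so each receives an equal share."""
    return cycle(backends)


def theoretical_capacity(tomcat_count: int, per_tomcat: int, proxy_limit: int) -> int:
    """Concurrent connections the tier can hold: the smaller of the two limits."""
    return min(tomcat_count * per_tomcat, proxy_limit)


def tomcats_needed(per_tomcat: int, proxy_limit: int) -> int:
    """Instances required to saturate the proxy (ceiling division)."""
    return -(-proxy_limit // per_tomcat)


picker = round_robin(["tomcat-1", "tomcat-2", "tomcat-3"])
first_four = [next(picker) for _ in range(4)]
```

With the numbers from the text, `tomcats_needed(100, 50_000)` gives 500, and adding instances beyond that no longer raises `theoretical_capacity` because the proxy itself becomes the limit.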
Architecture bottleneck: While application servers scale, the database becomes the next limiting factor.
Fourth Evolution: Database Read/Write Separation
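The database is split into a write master and multiple read replicas. Writes go to the master and are replicated to the replicas, which serve the read traffic; because replication takes time, a replica can briefly lag behind the master. Middleware such as Mycat handles read/write separation as well as sharding.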
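The routing decision that such middleware makes can be approximated as below. This is a simplified sketch, not how Mycat works internally: `ReadWriteRouter` and the node names are hypothetical, and classifying statements by their first keyword is an assumption that ignores cases such as `SELECT ... FOR UPDATE` or reads inside a transaction, which belong on the master.

```python
import itertools
from typing import List


class ReadWriteRouter:
    """Send SELECT statements to replicas in turn; everything else to the master."""

    def __init__(self, master: str, replicas: List[str]) -> None:
        if not replicas:
            raise ValueError("at least one read replica is required")
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        parts = sql.split()
        verb = parts[0].upper() if parts else ""
        if verb == "SELECT":
            return next(self._replicas)  # spread reads over the replicas
        return self.master               # writes and anything unrecognized


router = ReadWriteRouter("master", ["replica-1", "replica-2"])
targets = [
    router.route("SELECT * FROM item WHERE id = 1"),
    router.route("select name FROM item"),
    router.route("UPDATE item SET stock = stock - 1 WHERE id = 1"),
    router.route("  INSERT INTO item (id) VALUES (2)"),
]
```

A read that must see its own preceding write would also need to go to the master, since a replica may not have received the change yet.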
Architecture bottleneck: Different business modules compete for database resources, affecting performance.
Fifth Evolution: Business‑Level Database Sharding
Data for each business line is stored in separate databases, reducing contention. Cross‑business queries require additional solutions, which are beyond the scope of this article.
Architecture bottleneck: The single write database eventually reaches its performance ceiling.
Sixth Evolution: Split Large Tables into Small Tables
Large tables are split by hashing a key or by time range, and rows are routed to many small tables spread across multiple servers. This allows the database to scale horizontally. Distributed and MPP (Massively Parallel Processing) databases such as Greenplum, TiDB, PostgreSQL‑XC, and commercial products such as GBase or LibrA provide these capabilities.
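Both routing styles mentioned above can be sketched in a few lines. The database and table naming scheme (`order_db_N`, `orders_N`, `orders_YYYYMM`) is hypothetical, and CRC32 is used here only because it is stable across processes, unlike Python's built‑in `hash()` for strings.

```python
import zlib
from typing import Tuple


def shard_for(key: str, db_count: int, tables_per_db: int) -> Tuple[str, str]:
    """Hash routing: map a key to one (database, table) pair deterministically."""
    slot = zlib.crc32(key.encode("utf-8")) % (db_count * tables_per_db)
    db_index, table_index = divmod(slot, tables_per_db)
    return f"order_db_{db_index}", f"orders_{table_index}"


def table_for_month(year: int, month: int) -> str:
    """Time routing: one table per calendar month."""
    return f"orders_{year:04d}{month:02d}"


location = shard_for("user-42", db_count=4, tables_per_db=8)
monthly = table_for_month(2019, 6)
```

Modulo routing like this is simple, but changing `db_count` later moves most keys to a different shard, which is one reason such splits are usually planned with headroom.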
Architecture bottleneck: After both application servers and databases scale horizontally, the Nginx layer becomes the next limiting factor.
Seventh Evolution: LVS/F5 for Multi‑Nginx Load Balancing
LVS (software) or F5 (hardware) operates at layer 4, offering higher throughput than Nginx. Keepalived can provide virtual IP failover for high availability.
Architecture bottleneck: A single LVS instance eventually caps at hundreds of thousands of concurrent connections, and geographic latency becomes noticeable.
Eighth Evolution: DNS Round‑Robin Across Data Centers
Multiple IPs are associated with a domain; DNS returns different IPs (each pointing to a different data center) using round‑robin or other policies, achieving data‑center‑level load balancing.
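As a rough illustration of the round‑robin policy, the toy resolver below rotates the record list on every query so that successive clients see a different first address. `RoundRobinDns` is a hypothetical in‑memory stand‑in, not a real DNS server, and the domain and addresses are reserved documentation values rather than real data‑center IPs.

```python
from collections import deque
from typing import Deque, Dict, List


class RoundRobinDns:
    """Return all A records for a name, rotating their order on each query."""

    def __init__(self, records: Dict[str, List[str]]) -> None:
        self._records: Dict[str, Deque[str]] = {
            name: deque(ips) for name, ips in records.items()
        }

    def resolve(self, name: str) -> List[str]:
        ips = self._records[name]
        answer = list(ips)  # clients typically try the first address
        ips.rotate(-1)      # next query starts with the next data center
        return answer


dns = RoundRobinDns({
    "www.example.com": ["192.0.2.1", "198.51.100.1", "203.0.113.1"],
})
first_choices = [dns.resolve("www.example.com")[0] for _ in range(4)]
```

Other policies, such as returning the data center closest to the client, would replace the rotation step with a lookup based on the client's location.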
Architecture bottleneck: Richer data and business demands eventually outgrow pure relational databases.
Ninth Evolution: Introduce NoSQL and Search Engines
For massive data, solutions such as HDFS, HBase, Redis, Elasticsearch, Kylin, or Druid are adopted to handle key‑value storage, full‑text search, and multidimensional analytics.
Architecture bottleneck: Adding many components increases system complexity and operational overhead.
Tenth Evolution: Split Large Application into Smaller Services
Applications are divided by business domain, allowing independent development and deployment. Shared configuration can be managed via Zookeeper.
Architecture bottleneck: Duplicate code across applications makes coordinated upgrades difficult.
Eleventh Evolution: Extract Common Functions as Micro‑services
Functions such as user management, order processing, and authentication become independent services accessed via HTTP, TCP, or RPC. Frameworks like Dubbo or Spring Cloud provide service governance, rate limiting, circuit breaking, etc.
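Circuit breaking, one of the governance features mentioned above, can be sketched as follows. This is a minimal illustration of the general pattern under assumed thresholds, not the Dubbo or Spring Cloud API; `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are names introduced here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a broken service.
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each remote invocation, for example `breaker.call(lambda: user_service.get(user_id))` where `user_service` is a hypothetical client, and treats `CircuitOpenError` as a signal to return a fallback response.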
Architecture bottleneck: Diverse access protocols and inter‑service calls increase coupling and complexity.
Twelfth Evolution: Enterprise Service Bus (ESB) for Unified Access
ESB abstracts protocol conversion, allowing applications to call backend services uniformly and reducing coupling, similar to SOA architecture.
Architecture bottleneck: Rapid growth of services and components makes deployment and scaling increasingly difficult.
Thirteenth Evolution: Containerization
Docker packages applications into images; Kubernetes orchestrates dynamic deployment, scaling, and resource isolation, simplifying operations especially during traffic spikes.
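The scaling decision during a traffic spike can be illustrated with the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current replicas × current metric / target metric). The sketch below is a simplification: the function name, the clamping bounds, and the sample numbers are assumptions, and the real autoscaler also applies a tolerance band and stabilization windows.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Scale the replica count in proportion to how far the metric is from target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Never go below the floor or above the ceiling configured for the service.
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas running at 90% average CPU against a 60% target.
scale_up = desired_replicas(4, current_metric=90, target_metric=60)
# The same service at 30% average CPU once the spike has passed.
scale_down = desired_replicas(4, current_metric=30, target_metric=60)
```

Here the spike grows the service from 4 to 6 replicas, and the quiet period shrinks it to 2, which is the elasticity the text attributes to container orchestration.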
Architecture bottleneck: Even with containers, the underlying hardware must still be provisioned, leading to under‑utilized resources outside peak periods.
Fourteenth Evolution: Cloud Platform Adoption
The system is migrated to a public cloud, leveraging IaaS for elastic compute, PaaS for common components, and SaaS for ready‑made services, achieving on‑demand resource allocation and cost efficiency.
4. Architecture Design Summary
Architecture adjustments need not follow a strict linear path; multiple bottlenecks may be addressed simultaneously.
Design depth should match system goals: a fixed‑scope project needs only enough architecture to meet performance targets, while a continuously evolving platform should anticipate future growth.
Server‑side architecture differs from big‑data architecture: the former focuses on how applications are organized, the latter on data processing pipelines.
Key design principles include N+1 redundancy, rollback capability, feature toggles, built‑in monitoring, multi‑active data centers, mature technology adoption, horizontal scalability, purchasing non‑core components, using commercial hardware, rapid iteration, and stateless service interfaces.