How Taobao Scaled: 14 Evolution Steps of a Massive Backend Architecture
This article walks through the step‑by‑step evolution of a large‑scale e‑commerce backend—from a single‑server setup to microservices, containerization, and cloud platforms—highlighting the technical challenges, key technologies, and design principles that enable millions of concurrent users.
Overview
This article uses an e‑commerce site (illustrated with Taobao) to demonstrate how a web service can evolve from a single‑machine deployment serving a few hundred users to a multi‑data‑center architecture handling tens of millions of concurrent requests. Each evolution step addresses a specific bottleneck and introduces the relevant technologies.
Basic Concepts
Distributed : Deploying components on different physical machines (e.g., separating the application server and the database).
High Availability (HA) : The system continues to serve requests when one or more nodes fail.
Cluster : A group of servers that provide a single logical service and can replace each other on failure.
Load Balancing : Evenly distributing incoming requests across multiple nodes.
Forward/Reverse Proxy : A forward proxy forwards outbound traffic on behalf of internal services; a reverse proxy (e.g., Nginx, HAProxy) receives inbound traffic and forwards it to internal servers.
Architecture Evolution
1. Single‑Machine Architecture
Initially the application server (Tomcat) and the database run on the same host. DNS resolves the domain to a single IP address. This setup quickly reaches resource contention as traffic grows.
2. Separate Application Server and Database
Deploy Tomcat and the database on separate machines. This eliminates CPU, memory, and I/O competition between the two, improving performance and simplifying scaling of each tier.
3. Introduce Local and Distributed Caches
Add an in‑process cache (e.g., ConcurrentHashMap or Ehcache) inside each Tomcat instance and a distributed cache such as Redis or Memcached. Cache hot product data and rendered HTML to intercept the majority of read requests before they hit the database. Important considerations include cache consistency, cache penetration, cache avalanche, and hot‑spot expiration.
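The cache-aside flow described above can be sketched as a small tiered cache. This is a minimal illustration, not production code: a `ConcurrentHashMap` stands in for the in-process L1 cache, a plain `Map` stands in for the Redis client, and `dbLoader` stands in for the database query.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal cache-aside sketch: an in-process L1 map in front of a shared
// L2 store (standing in for Redis), falling back to the database on miss.
public class TieredCache {
    private final Map<String, String> l1 = new ConcurrentHashMap<>();
    private final Map<String, String> l2;              // stands in for a Redis client
    private final Function<String, String> dbLoader;   // stands in for a DB query

    public TieredCache(Map<String, String> l2, Function<String, String> dbLoader) {
        this.l2 = l2;
        this.dbLoader = dbLoader;
    }

    public String get(String key) {
        String v = l1.get(key);
        if (v != null) return v;          // L1 hit: no network round trip
        v = l2.get(key);
        if (v == null) {
            v = dbLoader.apply(key);      // miss in both tiers: hit the database
            if (v != null) l2.put(key, v);
        }
        if (v != null) l1.put(key, v);    // populate L1 for subsequent reads
        return v;
    }

    // On writes, update the database first, then invalidate both tiers so
    // readers re-load the fresh value (one common consistency strategy).
    public void invalidate(String key) {
        l2.remove(key);
        l1.remove(key);
    }
}
```

With this shape, repeated reads of a hot key never reach the database, which is exactly how the cache tier intercepts the bulk of read traffic.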
4. Reverse Proxy for Load Balancing
Deploy multiple Tomcat instances behind a reverse proxy (Nginx or HAProxy). The proxy distributes requests at layer 7, allowing horizontal scaling of the application tier. Typical capacity assumptions: a single Tomcat instance handles on the order of a few hundred concurrent connections, while Nginx can handle roughly 50,000. Session replication or sticky sessions may be required for stateful interactions.
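A hedged sketch of what this looks like in an Nginx configuration; the upstream addresses and names are illustrative placeholders:

```nginx
upstream app_servers {
    # illustrative backend addresses; least_conn sends each request to the
    # instance with the fewest active connections (ip_hash would instead
    # give sticky sessions)
    least_conn;
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```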
5. Database Read/Write Splitting
Introduce a middleware (e.g., Mycat) to split traffic: write operations go to a primary database and read operations to one or more replicas. Writes are routed to the primary; reads are load‑balanced across replicas. This reduces read latency and increases overall throughput. Ensure data consistency for read‑after‑write scenarios, for example by reading from the primary or using cache‑based write‑through.
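The core routing decision such middleware makes can be sketched in a few lines. This is a simplified illustration (data sources are represented as plain strings, and statement classification is a naive prefix check), not how Mycat actually parses SQL:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of read/write-splitting routing: statements that modify data go
// to the primary; reads are round-robined across replicas.
public class ReadWriteRouter {
    private final String primary;
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    public ReadWriteRouter(String primary, List<String> replicas) {
        this.primary = primary;
        this.replicas = replicas;
    }

    public String route(String sql) {
        // naive classification: anything that is not a SELECT is a write
        boolean isRead = sql.trim().toLowerCase().startsWith("select");
        if (!isRead) return primary;   // writes (and DDL) always hit the primary
        // round-robin across read replicas
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i);
    }
}
```

A real middleware also handles transactions (pinning a whole transaction to the primary) and read-after-write consistency, which this sketch omits.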
6. Business‑Level Database Sharding
Group related tables into separate logical databases based on business domains (e.g., order, user, inventory). This reduces cross‑domain contention and allows independent scaling of each shard. Cross‑domain queries become more complex and may require data‑warehouse solutions or service‑level joins.
7. Split Large Tables into Small Tables
Apply hash‑based or time‑based partitioning to massive tables (e.g., comments by product ID, payment logs by hour). Each partition is stored in its own table, enabling horizontal scaling of the storage layer. Tools such as Mycat can route queries to the correct partition; modern MPP databases (Greenplum, TiDB, PostgreSQL‑XC, HAWQ) provide built‑in support for massive parallel processing and automatic partition management.
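The hash-based routing a tool like Mycat performs internally reduces to mapping a sharding key to a physical table name. A minimal sketch, with illustrative table names:

```java
// Hash-based table sharding sketch: comments are spread across N physical
// tables (comment_0 .. comment_{N-1}) keyed by product ID.
public class ShardRouter {
    public static String tableFor(long productId, int shards) {
        // floorMod keeps the index non-negative even for unusual IDs
        int idx = (int) Math.floorMod(productId, shards);
        return "comment_" + idx;
    }
}
```

Time-based partitioning works the same way, except the routing key is derived from a timestamp (e.g., `payment_log_2024030514` for an hourly partition) rather than a hash.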
8. Layer‑4 Load Balancing (LVS/F5)
Introduce a layer‑4 load balancer (Linux Virtual Server or commercial F5) to distribute traffic among multiple Nginx instances. LVS operates in the kernel, supporting TCP/UDP forwarding with very high throughput (hundreds of thousands of connections). Use keepalived to provide virtual IP failover for HA.
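The keepalived failover piece can be illustrated with a minimal VRRP configuration; interface names, router ID, and the virtual IP are placeholders:

```conf
# Illustrative keepalived config: two load-balancer hosts share a virtual IP;
# if the MASTER fails, the BACKUP node takes over the address.
vrrp_instance VI_1 {
    state MASTER            # the peer node uses "state BACKUP"
    interface eth0
    virtual_router_id 51
    priority 100            # the BACKUP uses a lower priority, e.g. 90
    advert_int 1
    virtual_ipaddress {
        10.0.0.100
    }
}
```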
9. DNS Round‑Robin Across Data Centers
Configure DNS to return multiple IP addresses, each pointing to a different data center. Clients receive one IP per request (or an IP chosen by geographic routing), achieving inter‑site load balancing and enabling horizontal scaling to tens of millions of users.
10. Adopt NoSQL and Search Engines
When relational databases become a bottleneck for complex analytics or unstructured data, integrate specialized stores:
HDFS for massive file storage.
HBase or Cassandra for wide‑column key‑value access.
MongoDB for flexible document models.
Elasticsearch for full‑text search.
Kylin or Druid for OLAP and multidimensional analysis.
These components introduce additional consistency and operational complexity.
11. Split Monolithic Application into Smaller Services
Divide the codebase by business modules (e.g., user, order, payment) into independent applications. Use a distributed configuration service such as Zookeeper to share runtime configuration across services.
12. Extract Common Functions as Microservices
Identify reusable capabilities (authentication, user management, payment) and expose them as independent services accessed via HTTP, gRPC, or RPC. Frameworks like Dubbo or Spring Cloud provide service governance, rate limiting, circuit breaking, and fallback mechanisms.
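The circuit-breaking behavior mentioned above can be sketched in miniature. This is an illustration of the idea only: real implementations (Hystrix, Resilience4j, Dubbo's cluster fault tolerance) add a half-open state, time windows, and metrics, all omitted here:

```java
import java.util.function.Supplier;

// Minimal circuit-breaker sketch: after `threshold` consecutive failures
// the circuit opens and calls are short-circuited straight to the fallback.
public class CircuitBreaker {
    private final int threshold;
    private int consecutiveFailures = 0;

    public CircuitBreaker(int threshold) { this.threshold = threshold; }

    public synchronized String call(Supplier<String> remote, Supplier<String> fallback) {
        if (consecutiveFailures >= threshold) {
            return fallback.get();            // circuit open: fail fast
        }
        try {
            String result = remote.get();
            consecutiveFailures = 0;          // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;            // count the failure
            return fallback.get();
        }
    }
}
```

The point of the pattern is the open state: once the downstream service is known to be failing, callers stop waiting on it and degrade immediately, which protects the rest of the call chain.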
13. Introduce an Enterprise Service Bus (ESB)
Deploy an ESB to perform protocol conversion and unified routing, reducing coupling between services. This pattern resembles SOA: services remain independent, but the ESB abstracts communication details.
14. Containerization
Package each service as a Docker image and orchestrate with Kubernetes. Containers provide isolated runtime environments, enable rapid scaling, and simplify deployment pipelines.
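A hedged sketch of what the Kubernetes side looks like; the service name, image registry, and labels are placeholders:

```yaml
# Illustrative Deployment: three replicas of a containerized order service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3               # scale out by raising this number
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: registry.example.com/order-service:1.0.0
        ports:
        - containerPort: 8080
```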
15. Move to Public Cloud
Leverage IaaS/PaaS offerings to obtain elastic compute, managed storage (e.g., managed Hadoop, managed MPP databases), and pay‑as‑you‑go pricing. Cloud resources can be provisioned for traffic spikes (e.g., large promotions) and released afterward, dramatically improving resource utilization and reducing operational overhead.
Architecture Design Summary
Evolution steps are not linear; address the most pressing bottleneck first.
Design for the current performance targets while leaving room for future growth.
Service‑side architecture focuses on request handling and HA; big‑data architecture provides storage and analytical capabilities that service layers consume.
Key design principles:
N+1 redundancy : No single point of failure.
Rollback capability : Ability to revert to a previous version.
Feature toggles : Configurable enable/disable of functionality.
Built‑in monitoring : Metrics, tracing, and alerting from the start.
Multi‑active data centers : Geographic redundancy for high availability.
Mature technology adoption : Prefer proven, well‑supported components.
Resource isolation : Prevent one business from monopolizing CPU, memory, or I/O.
Horizontal scalability : Design all tiers to scale out by adding nodes.
Non‑core components as commercial products : Reduce development effort.
Commercial‑grade hardware : Improves reliability.
Rapid iteration : Small, incremental releases.
Stateless services : Avoid session affinity where possible.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
