Inside Taobao’s Massive Data Architecture: From Hadoop “Cloud Ladder” to Real‑Time “Galaxy”
This article details Taobao’s multi‑layer massive data platform, covering its five‑tier architecture, the 1500‑node Hadoop “Cloud Ladder” for batch processing, the low‑latency “Galaxy” stream engine, MySQL‑based MyFOX, HBase‑based Prom storage, the glider middle‑layer, and sophisticated caching strategies that together support petabytes of data and millions of daily queries.
The most popular topic during the double‑11 shopping festival is Taobao (TB). A recent discussion with an Alibaba colleague revealed many interesting aspects of Taobao’s technical architecture, which are summarized below.
Five‑Layer Architecture
Taobao’s massive data product is divided into five layers based on data flow: data source, computation, storage, query, and product layers.
Data Source Layer
Raw data originates from the main site’s user, shop, item, transaction databases and behavioral logs (browsing, search, etc.). Real‑time data is transferred via Alibaba‑developed components DataX, DbSync, and Timetunnel to a 1500‑node Hadoop cluster called “Cloud Ladder”.
Computation Layer – Cloud Ladder
Approximately 40,000 daily MapReduce jobs process 1.5 PB of raw data on Cloud Ladder, typically completing before 2 AM. Results are often intermediate, balancing redundancy and front‑end computation.
Real‑Time Stream Layer – Galaxy
For latency‑critical data (e.g., search‑term statistics), a separate stream platform called “Galaxy” receives real‑time messages from Timetunnel, performs in‑memory calculations, and quickly writes results to NoSQL stores for front‑end consumption.
Storage Layer
Two main clusters serve the front‑end: MyFOX, a distributed MySQL‑based relational cluster, and Prom, an HBase‑based NoSQL cluster. MyFOX stores over 10 TB of statistical results (≈95 % of total data) across 20 nodes, with hot and cold nodes differentiated by storage media (SAS vs. SATA) to balance performance and cost.
NoSQL Complement – Prom
Prom addresses use cases where relational databases struggle, such as full‑attribute selection. Raw transaction data is stored in HBase with attribute‑value pairs as row keys, using two column families (index and data). Queries are executed locally on each node, and results are aggregated globally.
Middle‑Layer – Glider
Glider provides a unified RESTful HTTP interface that abstracts heterogeneous storage modules, performing JOIN/UNION operations across MyFOX, Prom, and external services, and managing a two‑level cache (per‑datasource and per‑request).
Caching Strategy
Glider’s caching system leverages the read‑only nature of data within defined time windows. It employs multi‑level caches, cache‑control commands via URL query strings and HTTP headers, and mechanisms such as Bloom filters and short‑TTL empty‑result caching to mitigate cache penetration and avalanche effects.
Performance & Outlook
The platform currently compresses 80 TB of data, handles 40 million daily queries with an average response time of 28 ms (as of June 1), and continues to evolve to meet growing traffic and data volume.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITFLY8 Architecture Home
ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
