Big Data 16 min read

Inside Taobao’s Massive Data Architecture: From Hadoop “Cloud Ladder” to Real‑Time “Galaxy”

This article details Taobao’s multi‑layer massive data platform, covering its five‑tier architecture, the 1500‑node Hadoop “Cloud Ladder” for batch processing, the low‑latency “Galaxy” stream engine, MySQL‑based MyFOX, HBase‑based Prom storage, the glider middle‑layer, and sophisticated caching strategies that together support petabytes of data and millions of daily queries.

ITFLY8 Architecture Home
ITFLY8 Architecture Home
ITFLY8 Architecture Home
Inside Taobao’s Massive Data Architecture: From Hadoop “Cloud Ladder” to Real‑Time “Galaxy”

The most popular topic during the double‑11 shopping festival is Taobao (TB). A recent discussion with an Alibaba colleague revealed many interesting aspects of Taobao’s technical architecture, which are summarized below.

Five‑Layer Architecture

Taobao’s massive data product is divided into five layers based on data flow: data source, computation, storage, query, and product layers.

Data Source Layer

Raw data originates from the main site’s user, shop, item, transaction databases and behavioral logs (browsing, search, etc.). Real‑time data is transferred via Alibaba‑developed components DataX, DbSync, and Timetunnel to a 1500‑node Hadoop cluster called “Cloud Ladder”.

Computation Layer – Cloud Ladder

Approximately 40,000 daily MapReduce jobs process 1.5 PB of raw data on Cloud Ladder, typically completing before 2 AM. Results are often intermediate, balancing redundancy and front‑end computation.

Real‑Time Stream Layer – Galaxy

For latency‑critical data (e.g., search‑term statistics), a separate stream platform called “Galaxy” receives real‑time messages from Timetunnel, performs in‑memory calculations, and quickly writes results to NoSQL stores for front‑end consumption.

Storage Layer

Two main clusters serve the front‑end: MyFOX, a distributed MySQL‑based relational cluster, and Prom, an HBase‑based NoSQL cluster. MyFOX stores over 10 TB of statistical results (≈95 % of total data) across 20 nodes, with hot and cold nodes differentiated by storage media (SAS vs. SATA) to balance performance and cost.

NoSQL Complement – Prom

Prom addresses use cases where relational databases struggle, such as full‑attribute selection. Raw transaction data is stored in HBase with attribute‑value pairs as row keys, using two column families (index and data). Queries are executed locally on each node, and results are aggregated globally.

Middle‑Layer – Glider

Glider provides a unified RESTful HTTP interface that abstracts heterogeneous storage modules, performing JOIN/UNION operations across MyFOX, Prom, and external services, and managing a two‑level cache (per‑datasource and per‑request).

Caching Strategy

Glider’s caching system leverages the read‑only nature of data within defined time windows. It employs multi‑level caches, cache‑control commands via URL query strings and HTTP headers, and mechanisms such as Bloom filters and short‑TTL empty‑result caching to mitigate cache penetration and avalanche effects.

Performance & Outlook

The platform currently compresses 80 TB of data, handles 40 million daily queries with an average response time of 28 ms (as of June 1), and continues to evolve to meet growing traffic and data volume.

Taobao massive data product architecture diagram
Taobao massive data product architecture diagram
MyFOX data growth curve
MyFOX data growth curve
MyFOX query process
MyFOX query process
MyFOX node structure
MyFOX node structure
Prom storage structure
Prom storage structure
Prom query process
Prom query process
Glider architecture diagram
Glider architecture diagram
Cache control system
Cache control system
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsBig DatacachingmysqlHBaseHadoop
ITFLY8 Architecture Home
Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.