Architecture of Taobao's Massive Data Products: From Data Sources to the Glider Middleware
The article details Taobao's massive data product architecture, describing a five‑layer system that processes billions of daily records using Hadoop, real‑time streams, distributed MySQL and HBase clusters, and a middleware layer called Glider that unifies queries, caching, and front‑end integration.
Taobao operates one of the largest domestic e‑commerce data platforms, handling over 30 billion shop and product view records daily, 1 billion online items, and tens of millions of transactions, reviews, and favorites, aiming to extract commercial value for merchants and rational decisions for consumers.
To meet this demand, Taobao has built data products such as Quantum Statistics, Data Cube, and Taobao Index; however, the sheer scale turns computation, storage, and retrieval into significant challenges, which the article illustrates using the Data Cube as an example.
Taobao Massive Data Product Technical Architecture
Data products are primarily write‑once and read‑many; within a given time window the data can be treated as read‑only, which simplifies cache design.
The system is divided into five layers: data source, compute, storage, query, and product. The data source layer gathers user, shop, product, transaction databases and behavior logs from the main Taobao site.
Real‑time data from the source layer is transferred via Taobao‑developed components DataX, DbSync, and Timetunnel to a 1 500‑node Hadoop cluster called “Cloud Ladder”, which performs roughly 40 000 MapReduce jobs on 1.5 PB of raw data each day, typically completing before 2 am. Results are often intermediate, balancing redundancy and front‑end computation.
For latency‑sensitive data, such as search‑term statistics, Taobao built a real‑time streaming platform named “Galaxy”. Galaxy receives messages from Timetunnel, processes them in‑memory, and writes results quickly to NoSQL stores for front‑end consumption.
Because neither Cloud Ladder nor Galaxy is suited for direct real‑time queries, a dedicated storage layer was created, consisting of a distributed MySQL cluster (MyFOX) and an HBase‑based NoSQL cluster (Prom). A middleware layer called Glider provides a unified RESTful interface that abstracts the heterogeneous storage back‑ends.
Relational Databases Remain King
Relational DBMSs have been dominant since the 1970s; Taobao uses MySQL’s MyISAM engine and builds a distributed query‑proxy layer (MyFOX) to make sharding transparent to front‑end applications.
MyFOX stores over 10 TB of statistical results—about 95 % of the Data Cube’s data—and grows by more than 600 million rows daily, distributed across 20 MySQL nodes. Hot and cold nodes are distinguished: hot nodes use 15 000 rpm SAS disks (≈4.5 W/TB) for frequently accessed recent data, while cold nodes use 7 500 rpm SATA disks (≈1.6 W/TB) for older data, reducing memory‑to‑disk ratios.
NoSQL Is a Beneficial Complement to SQL
When faced with “full‑attribute selector” queries (e.g., filtering laptops by size, positioning, disk capacity, Bluetooth), traditional relational databases struggle due to sparse attribute distributions. Taobao therefore introduced Prometheus (Prom), built on HBase, which stores raw transaction details with attribute‑pair row keys and two column families (index and data). Fixed‑length fields enable fast offset‑based lookups, avoiding costly random reads.
Prom performs local computation on each node and returns partial results; the final answer is a simple aggregation of these partial results, leveraging parallelism while minimizing data transfer.
Use Middleware to Isolate Front‑End and Back‑End
MyFOX and Prom solve storage and query needs, but heterogeneous sources still require joining data (e.g., combining a hot‑sale product ID from MyFOX with product details from the main Taobao site). Glider acts as a middle layer that joins and unions data from different back‑ends, exposing a unified RESTful API.
Caching Is a Systematic Engineering Effort
Glider implements two‑level caching: a second‑level cache per heterogeneous data source and a first‑level cache for integrated requests. MyFOX caches per‑shard rather than aggregated results to improve hit rates and reduce redundancy.
Cache control commands (query‑string parameters and HTTP If‑None‑Match headers) propagate through all layers, with each storage module returning its own TTL; Glider reports the minimum TTL to the client.
To mitigate cache‑penetration, Taobao uses a simple approach: empty results are cached for a short period (max five minutes). To alleviate cache‑avalanche on expiration, Glider staggers TTLs across clients, distributing load over time.
Conclusion
With the described architecture, the Data Cube now offers 80 TB of compressed storage, Glider handles 40 million daily queries with an average latency of 28 ms (as of June 1), and the system is prepared for near‑future growth, though challenges such as short‑lived HTTP connections and high TCP connection counts remain.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
