How Taobao Scales Massive Data Products: Architecture Insights from Data Cube
This article explores Taobao's massive data product architecture, detailing its five-layer design, the use of Hadoop and real‑time systems, hybrid relational and NoSQL storage, a middleware layer for data integration, and systematic caching strategies that enable petabyte‑scale analytics and fast query responses.
Taobao operates one of the largest commercial data platforms in China, handling billions of daily shop and product view records, tens of billions of online items, and massive transaction, collection, and review data. The goal is to extract commercial value to support data‑driven operations for merchants and rational shopping decisions for consumers.
Taobao Massive Data Product Architecture
Data products are primarily write‑once, read‑many, allowing the system to be treated as read‑only over certain periods, which simplifies cache design.
The architecture is divided into five layers: data source, compute, storage, query, and product. The data source layer comprises the user, shop, product, and transaction databases, together with behavior logs from the main Taobao site.
Real‑time data from the source layer is transferred via Taobao‑developed components DataX, DbSync, and Timetunnel to a 1,500‑node Hadoop cluster called “Cloud Ladder”, the core of the compute layer. Approximately 40,000 MapReduce jobs process 1.5 PB of raw data daily, completing before 2 am.
For latency‑sensitive data such as search‑term statistics, a real‑time streaming platform named “Galaxy” processes messages from Timetunnel in memory and writes results to NoSQL stores for fast front‑end access.
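The in-memory counting step described above can be sketched as follows. This is only an illustration, not Galaxy's actual design: the class name SearchTermCounter, the flush_every parameter, and the use of a plain dict as a stand-in for the NoSQL store are all assumptions made for the example.

```python
from collections import Counter


class SearchTermCounter:
    """Hypothetical in-memory aggregator for a stream of search terms."""

    def __init__(self, store, flush_every=1000):
        self.store = store              # stand-in for a NoSQL table: term -> count
        self.flush_every = flush_every  # assumed batching interval, in messages
        self.pending = Counter()        # counts accumulated since the last flush
        self.seen = 0

    def on_message(self, term):
        """Count one message in memory; flush every `flush_every` messages."""
        self.pending[term.strip().lower()] += 1
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        """Add the pending counts to the store and reset the buffer."""
        for term, n in self.pending.items():
            self.store[term] = self.store.get(term, 0) + n
        self.pending.clear()
```

Batching the writes keeps per-message work in memory, at the cost of the store lagging by at most one batch.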
Relational Databases Still King
Taobao uses MySQL with the MyISAM engine as the foundation for its relational storage. To handle massive scale, a distributed MySQL query proxy layer called MyFOX provides transparent sharding across 20 nodes, storing over 10 TB (95% of Data Cube’s data) and growing by more than 600 million rows daily.
Nodes are classified as hot or cold: hot nodes store recent, frequently accessed data on 15,000 RPM SAS disks, while cold nodes store older data on 7,500 RPM SATA disks, optimizing storage cost and memory‑to‑disk ratios.
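A proxy that shards transparently has to decide, for every row or query, which node to talk to. The sketch below shows one way such routing could work, first choosing a tier by data age and then a node by a stable hash. The article does not describe MyFOX's real routing rules, so the 8/12 node split, the 30-day hot window, the hash scheme, and all function names here are assumptions.

```python
from datetime import date, timedelta
import zlib

# Hypothetical layout: the article gives 20 nodes but not the hot/cold split.
HOT_NODES = [f"hot-{i}" for i in range(8)]
COLD_NODES = [f"cold-{i}" for i in range(12)]
HOT_WINDOW_DAYS = 30  # assumed cutoff between "recent" and "older" data


def pick_node(shard_key: str, record_date: date, today: date) -> str:
    """Route one row to a node: tier by age, then node by stable hash."""
    age = (today - record_date).days
    tier = HOT_NODES if age <= HOT_WINDOW_DAYS else COLD_NODES
    # crc32 is stable across processes, unlike the built-in hash() for str.
    return tier[zlib.crc32(shard_key.encode("utf-8")) % len(tier)]


def nodes_for_range(shard_key: str, start: date, end: date, today: date):
    """Return the distinct nodes a date-range query has to touch, in date order."""
    nodes = []
    day = start
    while day <= end:
        node = pick_node(shard_key, day, today)
        if node not in nodes:
            nodes.append(node)
        day += timedelta(days=1)
    return nodes
```

With this scheme a query that spans the hot window boundary fans out to one cold node and one hot node, and the proxy merges the two result sets before replying.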
NoSQL Is a Beneficial Complement to SQL
When full‑attribute queries proved impractical for relational databases, Taobao introduced Prometheus (Prom), a NoSQL service built on HBase. Raw transaction data is stored with attribute‑value pairs as row keys, using two column families, one for the index and one for the data. Because the fields are fixed‑length, a record can be located by computing its offset, which reduces random disk I/O.
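The row-key layout and offset-based lookup can be illustrated with a small in-memory model. It does not call HBase; a dict stands in for the table, and the 16-byte record format, the sample keys ("color:red", "size:XL"), and the function names are assumptions chosen for the example.

```python
import struct

# Assumed fixed-length record: transaction id (uint64) + amount in cents (uint64).
RECORD = struct.Struct(">QQ")


def pack_records(rows):
    """Concatenate (txn_id, amount) pairs into one fixed-width byte string."""
    return b"".join(RECORD.pack(txn_id, amount) for txn_id, amount in rows)


def read_record(blob: bytes, position: int):
    """Fetch the record at `position` by offset arithmetic, without scanning."""
    return RECORD.unpack_from(blob, position * RECORD.size)


# Toy stand-in for the store: row key = "attribute:value",
# "index" holds transaction ids, "data" holds the packed records.
rows_red = [(101, 2500), (104, 990), (109, 12000)]
rows_xl = [(104, 990), (105, 4300), (109, 12000)]
STORE = {
    "color:red": {"index": [r[0] for r in rows_red], "data": pack_records(rows_red)},
    "size:XL": {"index": [r[0] for r in rows_xl], "data": pack_records(rows_xl)},
}


def query(*row_keys):
    """Intersect the index lists, then read the matching records by offset."""
    common = set(STORE[row_keys[0]]["index"])
    for key in row_keys[1:]:
        common &= set(STORE[key]["index"])
    first = STORE[row_keys[0]]
    return [
        read_record(first["data"], pos)
        for pos, txn_id in enumerate(first["index"])
        if txn_id in common
    ]
```

Calling query("color:red", "size:XL") intersects the two index lists and then reads only the matching positions from the data block, which is the property that fixed-length fields make cheap.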
Prom performs local computation on each node and aggregates results globally, supporting not only SUM but a range of statistical operations while minimizing data transfer.
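The local-then-global pattern works for any statistic that can be rebuilt from small per-node summaries. A minimal sketch follows, assuming each node reduces its own values to a Partial summary (a hypothetical type introduced here) and only those summaries cross the network.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Per-node summary; small enough to ship instead of the raw rows."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def merge(self, other: "Partial") -> "Partial":
        """Combine two summaries; the result does not depend on merge order."""
        return Partial(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )


def local_aggregate(values) -> Partial:
    """Runs on each node over the rows that node holds."""
    partial = Partial()
    for v in values:
        partial = partial.merge(Partial(1, v, v, v))
    return partial


def global_aggregate(partials) -> dict:
    """Runs once, on the coordinator, over the per-node summaries."""
    merged = Partial()
    for p in partials:
        merged = merged.merge(p)
    avg = merged.total / merged.count if merged.count else None
    return {"count": merged.count, "sum": merged.total, "avg": avg,
            "min": merged.minimum, "max": merged.maximum}
```

Note that the average is derived from the merged sum and count rather than by averaging per-node averages, which would be wrong when nodes hold different numbers of rows.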
Using Middleware to Isolate Front and Back Ends
To hide heterogeneity among storage modules, Taobao built a middle‑layer service called glider, which exposes a unified RESTful API. The glider service performs JOIN and UNION operations across disparate “tables”, consolidating data from MyFOX, Prom, and external APIs before delivering it to front‑end products.
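Once every back end returns rows in a common shape, JOIN and UNION reduce to ordinary in-memory operations. The sketch below assumes each source has already been fetched as a list of dicts; the function names and the sample columns (shop_id, sales, top_attr) are hypothetical and not taken from glider itself.

```python
def join(left, right, key):
    """Inner-join two lists of row dicts on `key` using a hash table."""
    by_key = {}
    for row in right:
        by_key.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in by_key.get(l[key], [])]


def union(*tables):
    """Concatenate row lists, dropping exact duplicates, keeping first-seen order."""
    seen = set()
    out = []
    for table in tables:
        for row in table:
            fingerprint = tuple(sorted(row.items()))
            if fingerprint not in seen:
                seen.add(fingerprint)
                out.append(row)
    return out


# Example rows as they might come back from two different stores.
from_myfox = [{"shop_id": 1, "sales": 100}, {"shop_id": 2, "sales": 50}]
from_prom = [{"shop_id": 1, "top_attr": "red"}]
combined = join(from_myfox, from_prom, "shop_id")
```

Here combined holds a single row for shop 1 carrying both the sales figure and the attribute, so the front end never needs to know that two stores were involved.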
Caching Is a Systematic Engineering Effort
The glider layer implements a two‑level cache: a secondary cache for each heterogeneous table and a primary cache for integrated requests. Every cache entry carries a TTL; because an integrated response is only as fresh as its shortest‑lived input, glider returns the minimum TTL among all sources and propagates it to the client via HTTP headers.
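The minimum-TTL rule can be sketched with two small TTL caches. This is an illustration under several assumptions: the class and function names are invented, the secondary cache is keyed by source name only for brevity, and Cache-Control: max-age is used as the header because the article does not say which header glider sets.

```python
import time


class TTLCache:
    """Tiny TTL cache with an injectable clock so behaviour is testable."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expires_at)

    def get(self, key):
        """Return (value, remaining_ttl), or None if missing or expired."""
        hit = self._items.get(key)
        if hit is None:
            return None
        value, expires_at = hit
        remaining = expires_at - self._clock()
        if remaining <= 0:
            del self._items[key]
            return None
        return value, remaining

    def put(self, key, value, ttl):
        self._items[key] = (value, self._clock() + ttl)


def serve(request_key, sources, primary, secondary):
    """sources maps a source name to fetch(), which returns (rows, ttl_seconds)."""
    hit = primary.get(request_key)
    if hit is not None:
        body, ttl = hit
    else:
        body, ttls = {}, []
        for name, fetch in sources.items():
            cached = secondary.get(name)
            if cached is None:
                rows, source_ttl = fetch()
                secondary.put(name, rows, source_ttl)
                cached = (rows, source_ttl)
            body[name] = cached[0]
            ttls.append(cached[1])
        ttl = min(ttls)  # the response is only as fresh as its stalest input
        primary.put(request_key, body, ttl)
    # Assumed header choice; the article only says "HTTP headers".
    return body, {"Cache-Control": f"max-age={int(ttl)}"}
```

Because the remaining TTL is reported rather than the original one, a response served from cache tells the client how long the data is still valid, not how long it was valid when first computed.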
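To mitigate cache penetration, where repeated requests for data that does not exist bypass the cache and hit the back end every time, empty results are also cached, with a short TTL of at most five minutes. To reduce cache‑avalanche effects, expiration times are staggered across clients so that entries do not all expire at once. Communication uses short‑lived HTTP connections, which can cause high TCP connection counts under peak load.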
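Both mitigations come down to how the TTL for a cache entry is chosen. The sketch below caps the TTL of empty results at five minutes and, for non-empty results, adds random jitter as one common way to stagger expirations. The jitter approach and its 10% default are assumptions; the article does not say how glider staggers expiry.

```python
import random

EMPTY_TTL_CAP = 300  # seconds; the five-minute ceiling for empty results


def ttl_for(result, base_ttl, rng=random, jitter=0.1):
    """Pick a TTL: short for empty results, jittered otherwise."""
    if not result:
        # Cache the miss so repeated lookups stop reaching the back end,
        # but keep it short so newly arrived data becomes visible quickly.
        return min(base_ttl, EMPTY_TTL_CAP)
    # Spread expirations over +/- `jitter` so entries do not all expire together.
    spread = base_ttl * jitter
    return base_ttl + rng.uniform(-spread, spread)
```

Passing a seeded random.Random instance as rng makes the jitter reproducible in tests while leaving production behaviour random.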
Conclusion
With the described architecture, Data Cube now stores compressed data equivalent to 80 TB, handles 40 million daily queries with an average response time of 28 ms (as of June 1), and meets near‑future growth demands. Ongoing challenges include optimizing inter‑layer communication and adapting the architecture as data volume and traffic evolve.
21CTO
