Big Data 15 min read

How Taobao Scales Massive Data Products: Architecture Behind 1.5PB Daily Processing

This article explains how Taobao processes over 1.5 PB of daily data through a five‑layer architecture, combining batch Hadoop jobs, a streaming platform, distributed MySQL and HBase storage, and a unified caching middle layer to deliver fast, scalable data services.

21CTO

Jan 6, 2016

How Taobao Scales Massive Data Products: Architecture Behind 1.5PB Daily Processing

Taobao holds the most commercially valuable massive data in China, with billions of daily shop and product views, millions of transactions, and extensive user behavior logs. The goal is to turn this data into business value for merchants and rational decisions for shoppers.

Taobao Massive Data Product Architecture

Data products are effectively read‑only within a time window, which enables a robust caching strategy.

The architecture is divided into five layers: data source, compute, storage, query, and product. The data source layer aggregates user, shop, item, transaction databases and behavior logs.

Real‑time logs are transmitted via Taobao‑developed components DataX, DbSync and Timetunnel to a 1,500‑node Hadoop cluster called “Yunti” (cloud ladder). Approximately 40,000 MapReduce jobs process 1.5 PB of raw data each day, typically finishing before 2 am.

For latency‑sensitive data such as search‑term statistics, a streaming platform named “Galaxy” consumes Timetunnel messages, performs in‑memory calculations, and writes results quickly to NoSQL stores for front‑end consumption.

Storage Layer: Relational and NoSQL

MyFOX is a distributed MySQL (MyISAM) cluster that acts as a transparent query proxy. It stores over 10 TB of statistical results—about 95 % of the data cube—distributed across 20 nodes. Hot nodes use 15,000 rpm SAS disks, while cold nodes use 7,500 rpm SATA disks to balance performance and cost.

Prom uses HBase on HDFS as its storage engine. Raw transaction data are stored with attribute‑value pairs as row‑keys, and two column families (index and data) enable fast local computation before aggregation.

Glider: Data Middle Layer

Glider provides a RESTful HTTP interface that joins heterogeneous tables (MyFOX, Prom, etc.) and isolates front‑end products from back‑end storage, handling data JOIN and UNION operations centrally.

Cache Management

Glider implements two‑level caching: a second‑level cache per datasource and a first‑level cache per request. Cache‑control commands travel with HTTP requests (query string or If‑None‑Match header) and propagate down to each storage module, which returns its TTL; Glider’s TTL is the minimum of these values.

To mitigate cache penetration, a Bloom filter blocks requests for definitely absent keys, and empty results are cached with a short TTL (max five minutes). To avoid cache‑avalanche on expiration, Glider staggers TTLs across clients, reducing simultaneous load spikes.

Conclusion

The described architecture currently supports 80 TB of compressed storage, handles 40 million daily queries with an average response time of 28 ms, and meets near‑term growth needs, though challenges such as short‑connection HTTP overhead and evolving traffic patterns remain.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Caching

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.