Inside Taobao’s Massive Data Architecture: How 1.5 PB Daily Is Processed and Served
The article explains Taobao’s five‑layer data product architecture—covering data sources, compute, storage, query, and product layers—and describes how massive volumes of data are ingested, processed in batch and streaming, stored in MySQL and HBase clusters, and served efficiently through a unified middle‑layer and sophisticated caching mechanisms.
During the recent "Double 11" shopping festival, Taobao’s data product architecture attracted attention; the system processes massive data volumes using a five‑layer design: data source, compute, storage, query, and product layers.
The data source layer gathers user, shop, item, transaction databases and behavior logs from the main site.
Data from the source layer is transferred in near real time via the Alibaba-developed components DataX, DbSync, and Timetunnel to a 1,500-node Hadoop cluster called "Yunti" (the compute layer), where about 40,000 MapReduce jobs per day process 1.5 PB of raw data.
For latency‑sensitive data such as search‑term statistics, a streaming platform named "Galaxy" receives real‑time messages from Timetunnel, performs in‑memory calculations, and writes results to NoSQL stores for fast front‑end access.
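The streaming path can be pictured with a minimal sketch. The class name, the `on_message` and `flush` methods, and the plain dict standing in for the NoSQL store are all assumptions made for illustration; they are not Galaxy's actual API.

```python
from collections import Counter


class SearchTermCounter:
    """Keeps running search-term counts in memory (hypothetical sketch)."""

    def __init__(self, sink):
        self.counts = Counter()
        self.sink = sink  # stand-in for a NoSQL store; here just a dict

    def on_message(self, term):
        # Called once per incoming message; only memory is touched.
        self.counts[term] += 1

    def flush(self):
        # Publish the latest totals so the front end can read them quickly.
        self.sink.update(self.counts)


sink = {}
counter = SearchTermCounter(sink)
for term in ["phone", "shoes", "phone"]:
    counter.on_message(term)
counter.flush()
```

Keeping the hot counters in memory and writing only aggregated totals to the store is what keeps the per-message cost low in a design like this.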
To serve front-end products, a storage layer was built from a distributed MySQL cluster (MyFOX) and an HBase-based NoSQL cluster (Prom). MyFOX stores aggregated statistics (over 10 TB, about 95% of the total data) across 20 MySQL nodes. The nodes are split into hot and cold groups that use different storage media, which balances cost against performance.
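One way to picture the hot/cold split is a routing rule that sends recent data to the faster nodes. The sketch below assumes the split is made by data age with a 30-day window; both the criterion and the window are illustrative assumptions, as are the function and constant names.

```python
from datetime import date, timedelta

HOT_WINDOW_DAYS = 30  # assumed cutoff, not taken from the article


def pick_node_class(stat_date, today):
    """Return "hot" for recent statistics, "cold" for older ones."""
    age = today - stat_date
    return "hot" if age <= timedelta(days=HOT_WINDOW_DAYS) else "cold"


today = date(2011, 11, 11)
recent = pick_node_class(date(2011, 11, 1), today)
old = pick_node_class(date(2011, 1, 1), today)
```

Under this assumption, frequently queried recent rows land on the faster and more expensive media, while rarely read history sits on cheaper disks.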
Prom uses HBase as its underlying store. It keeps raw transaction details keyed by row keys composed of attribute-value pairs and provides on-the-fly aggregation: each node computes a partial result locally, and the partial results are then merged.
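The local-calculation-then-merge pattern can be sketched as follows. The row layout, field names, and function names are hypothetical; the point is only that each node reduces its own rows before anything is sent for merging.

```python
from collections import Counter


def local_aggregate(rows, key, value):
    """Runs on one node: sum `value` per `key` over that node's rows."""
    partial = Counter()
    for row in rows:
        partial[row[key]] += row[value]
    return partial


def merge_partials(partials):
    """Runs once: add up the per-node partial sums."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


node_a = [{"brand": "A", "amount": 10}, {"brand": "B", "amount": 5}]
node_b = [{"brand": "A", "amount": 7}]
partials = [local_aggregate(rows, "brand", "amount") for rows in (node_a, node_b)]
result = merge_partials(partials)
```

Only the small per-key partial sums cross the network, not the raw transaction rows, which is why sums and counts suit this approach well.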
A middle‑layer called "glider" offers a RESTful HTTP interface that abstracts heterogeneous storage modules, performs JOIN/UNION operations across them, and manages a two‑level cache (per‑datasource and per‑request) to improve latency.
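A minimal sketch of such a middle layer is shown below. It joins rows from two stand-in data sources on a shared key and keeps one cache per data source and one for the final joined result. The `Glider` class, its methods, and the two source functions are hypothetical illustrations, and the HTTP interface and cache expiry are left out for brevity.

```python
class Glider:
    """Toy middle layer: joins heterogeneous sources behind two cache levels."""

    def __init__(self, sources):
        self.sources = sources    # source name -> callable(key) -> dict
        self.source_cache = {}    # level 1: (source name, key) -> row
        self.request_cache = {}   # level 2: key -> joined row

    def _fetch(self, name, key):
        cache_key = (name, key)
        if cache_key not in self.source_cache:
            self.source_cache[cache_key] = self.sources[name](key)
        return self.source_cache[cache_key]

    def query(self, key):
        if key not in self.request_cache:
            joined = {}
            for name in self.sources:
                joined.update(self._fetch(name, key))  # JOIN on the shared key
            self.request_cache[key] = joined
        return self.request_cache[key]


calls = []


def myfox(item_id):
    calls.append("myfox")
    return {"item_id": item_id, "sales": 120}


def prom(item_id):
    calls.append("prom")
    return {"item_id": item_id, "top_attr": "color=red"}


glider = Glider({"myfox": myfox, "prom": prom})
first = glider.query(42)
second = glider.query(42)  # served from the request-level cache
```

The per-datasource cache lets different requests reuse the same source rows, while the per-request cache skips the join entirely for a repeated query.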
Caching is treated as a systematic engineering effort: glider’s caches, MyFOX’s shard‑level caches, and TTL propagation ensure data consistency, while techniques such as Bloom filters and short‑TTL empty‑result caching mitigate cache‑penetration and avalanche effects.
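The two mitigation techniques named above can be sketched together. This is a generic illustration rather than glider's implementation: the class names, TTL values, and the injected clock are assumptions. The Bloom filter rejects keys that were never stored, and keys that exist but currently have no data are cached as empty for a short time.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class GuardedCache:
    def __init__(self, backend, known_keys, clock, ttl=300, empty_ttl=30):
        self.backend, self.clock = backend, clock
        self.ttl, self.empty_ttl = ttl, empty_ttl  # assumed values, in seconds
        self.entries = {}  # key -> (value, expiry time)
        self.bloom = BloomFilter()
        for key in known_keys:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely unknown: never reaches the backend
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # fresh hit, possibly a cached empty result
        value = self.backend(key)
        # Empty results get a short TTL so real data shows up soon after it exists.
        ttl = self.ttl if value is not None else self.empty_ttl
        self.entries[key] = (value, now + ttl)
        return value


store = {"a": 1}  # "b" is a known key that currently has no data
backend_calls = []


def backend(key):
    backend_calls.append(key)
    return store.get(key)


now = [0]
cache = GuardedCache(backend, known_keys=["a", "b"], clock=lambda: now[0])
```

Repeated lookups of `"b"` hit the backend once per 30-second window instead of on every request, which is the penetration case the short-TTL empty entry is meant to absorb.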
Overall, the architecture supports up to 80 TB of compressed storage, handles 40 million daily queries with an average response time of 28 ms, and demonstrates how a layered, cache‑aware design can sustain rapid growth in data volume and traffic.
ITFLY8 Architecture Home
