
Mastering Massive Data: MapReduce, Hadoop, and Taobao’s Architecture

This article introduces the fundamental MapReduce model and Hadoop framework, explains their roles in large‑scale data processing, and then examines Taobao’s massive‑data product architecture—including its data source, compute, storage, query, and product layers, as well as the MyFOX, Prom, and Glider components and caching strategies.

21CTO

We begin with the basic MapReduce model and the Hadoop framework, then turn to massive‑data processing in general, and finally to the technical architecture of Taobao’s massive‑data products, aiming for both accessibility and depth.

Part 1: MapReduce Model and Hadoop Framework

To understand the following content, readers should know three points:

MapReduce is a programming model.

Hadoop is a framework.

Hadoop implements the MapReduce model as an open‑source distributed parallel programming framework.

MapReduce is a programming model for large‑scale parallel computation that splits a problem into a map phase and a reduce phase. The diagram below illustrates this flow.

During the map phase, the input is split and processed in parallel across the cluster, producing intermediate key‑value pairs. The framework then shuffles and sorts those pairs by key, and the reduce phase aggregates the values for each key into the final output.
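The map, shuffle, and reduce steps can be sketched in a few lines of plain Python. This is a single‑process illustration of the model using the classic word‑count task, not Hadoop's API; the function names are chosen here for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle/sort: group all values by key and order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, each input split would be mapped on a different node.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


counts = word_count(["a b a", "b c"])
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can run in parallel; only the shuffle needs to move data between nodes.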

Around this model, the Hadoop ecosystem provides a distributed file system (HDFS) and a distributed database (HBase, a separate Apache project that stores its data on HDFS). In simple terms:

Hadoop = HDFS (storage) + HBase (database) + MapReduce (processing).

HDFS follows a master/slave architecture with a single NameNode managing the namespace and multiple DataNodes storing blocks of data.
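The division of labor between the NameNode and the DataNodes can be illustrated with a toy model. The sketch below is not HDFS code: the class, the tiny block size, and the round‑robin placement are all simplifying assumptions (real HDFS uses much larger blocks and rack‑aware placement). It only shows that the NameNode keeps metadata while the DataNodes hold the bytes.

```python
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self, datanodes, replication=3, block_size=4):
        self.datanodes = datanodes      # names of the storage nodes
        self.replication = replication  # copies kept of every block
        self.block_size = block_size    # bytes per block; tiny on purpose
        self.namespace = {}             # path -> [(block_id, [datanode, ...])]
        self._next_block = 0

    def add_file(self, path, data, storage):
        blocks = []
        for offset in range(0, len(data), self.block_size):
            block_id = self._next_block
            self._next_block += 1
            # Round-robin placement (an assumption made for this sketch).
            copies = min(self.replication, len(self.datanodes))
            targets = [self.datanodes[(block_id + i) % len(self.datanodes)]
                       for i in range(copies)]
            for node in targets:
                storage[node][block_id] = data[offset:offset + self.block_size]
            blocks.append((block_id, targets))
        self.namespace[path] = blocks

    def read_file(self, path, storage):
        # A client asks the NameNode for block locations, then reads the
        # bytes from a DataNode that holds each block.
        return b"".join(storage[nodes[0]][block_id]
                        for block_id, nodes in self.namespace[path])


nodes = ["dn1", "dn2", "dn3", "dn4"]
storage = {name: {} for name in nodes}  # stands in for the DataNodes' disks
namenode = ToyNameNode(nodes)
namenode.add_file("/logs/a", b"hello world", storage)
```

Note that `namespace` never contains file contents, which is why a single NameNode can manage a very large cluster, and also why it is a critical component.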

MapReduce jobs split input into independent chunks, process them in parallel (map), sort the intermediate results, and then feed them to reducers.

Hive, built on Hadoop, offers a SQL‑like interface (HiveQL) that translates queries into MapReduce jobs, while HBase provides a column‑oriented NoSQL store modeled on Google’s Bigtable.
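To make the Bigtable‑style data model behind HBase more concrete, here is a minimal in‑memory sketch: a sorted map from row key to "family:qualifier" columns, where each cell keeps timestamped versions. `ToyTable` and the example row keys are hypothetical and only illustrate the shape of the data, not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """row key -> "family:qualifier" -> list of (timestamp, value), newest first."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        versions = self._rows[row_key][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest version first

    def get(self, row_key, column):
        """Return the newest value of one cell, or None if the cell is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield (row_key, newest cells) for start_row <= key < stop_row, in key order."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key, {col: vs[0][1] for col, vs in self._rows[key].items()}


table = ToyTable()
table.put("shop1#20240101", "stats:sales", 10, timestamp=1)
table.put("shop1#20240101", "stats:sales", 12, timestamp=2)
table.put("shop2#20240101", "stats:sales", 7, timestamp=1)
```

Since rows are ordered by key, a well‑chosen key prefix (here a hypothetical shop identifier) turns "all rows for one shop" into a cheap range scan.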

Part 2: Taobao’s Massive Data Product Architecture

Taobao’s data‑product platform is organized into five layers from top to bottom: data source, compute, storage, query, and product.

Data source layer : transaction data from shops, transferred in near‑real‑time via DataX, DbSync, and Timetunnel.

Compute layer : a Hadoop cluster (referred to as “cloud ladder”) that runs daily MapReduce jobs.

Storage layer : two systems – MyFOX (a distributed MySQL cluster) and Prom (an HBase‑based NoSQL cluster).

Query layer : the Glider service exposing RESTful HTTP APIs; queries are ultimately served by the storage layer (MyFOX or Prom).

Product layer : the front‑end data products.

MyFOX uses MySQL’s MyISAM engine and adds a distributed query‑proxy layer. The data‑query flow is shown below.

Each MyFOX node stores hot data (frequently accessed) on high‑speed SAS disks and cold data (less accessed) on larger SATA disks.
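One way a query proxy could exploit this split is to route each date range to the tier that holds it. The sketch below is an assumption‑laden illustration, not MyFOX's actual logic: the 30‑day threshold, the tier names, and both functions are hypothetical.

```python
from datetime import date, timedelta

HOT_WINDOW_DAYS = 30  # assumed threshold; the article does not state one


def storage_tier(record_date, today, hot_window_days=HOT_WINDOW_DAYS):
    """Classify one day's data as 'hot' (recent) or 'cold' (older)."""
    age_days = (today - record_date).days
    return "hot" if age_days <= hot_window_days else "cold"


def split_range_by_tier(start, end, today, hot_window_days=HOT_WINDOW_DAYS):
    """Split an inclusive date range (start <= end) into per-tier sub-ranges."""
    cutoff = today - timedelta(days=hot_window_days)  # first hot day
    plan = {}
    if start < cutoff:
        plan["cold"] = (start, min(end, cutoff - timedelta(days=1)))
    if end >= cutoff:
        plan["hot"] = (max(start, cutoff), end)
    return plan


today = date(2024, 3, 31)
plan = split_range_by_tier(date(2024, 2, 20), date(2024, 3, 10), today)
```

A query that spans the cutoff is split into two sub‑queries whose results the proxy would then merge, so recent data is always read from the faster disks.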

Prom’s storage structure and query process are illustrated below.

Glider acts as an intermediate layer that performs joins and unions across heterogeneous tables, isolates front‑end products from back‑end storage, and provides a unified query service.
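The join and union operations described above can be sketched over plain lists of dictionaries. This is a minimal illustration under the assumption that each back end returns rows as dicts; the `sources` mapping, the field names, and the functions are hypothetical and not Glider's real interface.

```python
def union(*result_sets):
    """Concatenate rows coming from several back ends."""
    rows = []
    for result_set in result_sets:
        rows.extend(result_set)
    return rows


def join(left, right, key):
    """Inner hash join of two row lists on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def query(sources, left_name, right_name, key):
    """Fetch from two named back ends and join them; the caller never
    needs to know which storage system answered."""
    return join(sources[left_name](), sources[right_name](), key)


# Stand-ins for the two storage systems (hypothetical data).
sources = {
    "myfox": lambda: [{"shop_id": 1, "sales": 100}, {"shop_id": 2, "sales": 50}],
    "prom": lambda: [{"shop_id": 1, "visitors": 900}],
}
joined = query(sources, "myfox", "prom", "shop_id")
```

Because products only see the merged rows, a table can move from one store to another without any change on the front end.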

Glider also implements a two‑level caching system: a second‑level cache per data source and a first‑level cache for integrated requests. Cache control commands travel from the client request down to the underlying storage modules.
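A two‑level arrangement of this kind might look like the following sketch. It assumes dict‑shaped results and models the cache control command as a simple `use_cache` flag passed down from the request; the class and its names are illustrative, not Glider's implementation.

```python
class TwoLevelCache:
    def __init__(self, fetchers):
        self.fetchers = fetchers  # data source name -> callable(key) -> dict
        self.l1 = {}              # first level: integrated result per request
        self.l2 = {}              # second level: (source, key) -> raw result
        self.backend_calls = 0    # counts real trips to storage

    def _fetch(self, source, key, use_cache):
        slot = (source, key)
        if use_cache and slot in self.l2:
            return self.l2[slot]
        self.backend_calls += 1
        value = self.fetchers[source](key)
        self.l2[slot] = value
        return value

    def get(self, request_key, parts, use_cache=True):
        """parts is a list of (source, key) pairs whose results are merged.
        use_cache=False stands in for a cache-control command that
        bypasses both levels and refreshes them."""
        if use_cache and request_key in self.l1:
            return self.l1[request_key]
        merged = {}
        for source, key in parts:
            merged.update(self._fetch(source, key, use_cache))
        self.l1[request_key] = merged
        return merged


cache = TwoLevelCache({
    "myfox": lambda key: {"sales": 100},
    "prom": lambda key: {"visitors": 900},
})
parts = [("myfox", "shop1"), ("prom", "shop1")]
first = cache.get("req-a", parts)
```

The second level pays off when different requests share a data source: a new request that misses the first level can still be assembled without touching storage.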

To mitigate cache‑penetration, Taobao caches empty results for a short period (up to five minutes). To reduce cache‑avalanche on expiration, the system staggers TTLs and may use locking or queuing to limit concurrent miss traffic.
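Both mitigations can be combined in one small wrapper. The sketch below is a generic illustration, not Taobao's code: `GuardedCache`, the default TTLs, the 10% jitter, and the single coarse lock are assumptions chosen to keep the example short.

```python
import random
import threading
import time

_MISSING = object()  # sentinel: the backend has no value for this key


class GuardedCache:
    def __init__(self, loader, ttl=3600.0, empty_ttl=60.0, jitter=0.1,
                 clock=time.monotonic):
        self.loader = loader        # callable(key) -> value, or None if absent
        self.ttl = ttl              # base lifetime of normal entries (seconds)
        self.empty_ttl = empty_ttl  # short lifetime for cached empty results
        self.jitter = jitter        # +/- fraction used to stagger expirations
        self.clock = clock
        self._store = {}            # key -> (expires_at, value or _MISSING)
        self._lock = threading.Lock()
        self.loads = 0              # counts real backend loads

    def _expiry(self, base_ttl):
        # Staggered TTLs: entries loaded together do not expire together.
        return self.clock() + base_ttl * (1 + random.uniform(-self.jitter, self.jitter))

    def _fresh(self, key):
        entry = self._store.get(key)
        if entry and entry[0] > self.clock():
            return True, (None if entry[1] is _MISSING else entry[1])
        return False, None

    def get(self, key):
        hit, value = self._fresh(key)
        if hit:
            return value
        with self._lock:  # limits concurrent miss traffic to one loader
            hit, value = self._fresh(key)  # another thread may have reloaded
            if hit:
                return value
            self.loads += 1
            value = self.loader(key)
            if value is None:
                # Cache the empty result briefly to stop cache penetration.
                self._store[key] = (self._expiry(self.empty_ttl), _MISSING)
            else:
                self._store[key] = (self._expiry(self.ttl), value)
            return value


database = {"a": 1}
guarded = GuardedCache(database.get)
```

Repeated lookups of a key that does not exist now reach the backend once per `empty_ttl` window instead of on every request. A production version would likely use one lock per key rather than a single global lock.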
