Mastering Hadoop: From MapReduce Basics to Taobao’s Massive Data Architecture
This article introduces the fundamental MapReduce model and Hadoop framework, explains their components such as HDFS, MapReduce, and HBase, and then examines Taobao’s large‑scale data product architecture—including storage, computation, query, and caching layers—to illustrate practical big‑data processing techniques.
We start with the most basic MapReduce pattern and the Hadoop framework, then expand to massive data processing and finally discuss Taobao's large‑scale data product technical architecture, aiming for both simplicity and depth.
Part 1: In‑depth MapReduce Pattern and Hadoop Framework
Architecture Overview
To understand the article, readers should first grasp three points:
MapReduce is a pattern.
Hadoop is a framework.
Hadoop is an open‑source distributed parallel programming framework that implements the MapReduce pattern.
Thus, the core idea is to process massive data on the Hadoop framework using the MapReduce model.
MapReduce Pattern
MapReduce is a distributed programming model, central to cloud computing, that splits a problem into a Map (mapping) phase and a Reduce (aggregation) phase.
The workflow runs as follows: the input data is split, the Map function turns each split into intermediate key/value pairs, and those pairs are shuffled, sorted, and reduced to the final results.
The MapReduce implementation follows functional programming ideas: a Map function transforms input key/value pairs into intermediate pairs, and a Reduce function merges values with the same intermediate key.
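The Map and Reduce functions described above can be sketched with a word count in plain Python. This is a single-process illustration of the pattern, not Hadoop's API; the names map_fn, reduce_fn, and run_job are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all values that share the same intermediate key."""
    return word, sum(counts)


def run_job(records):
    """Run map, shuffle, sort, and reduce in one process (illustrative only)."""
    # Map phase: apply map_fn to every input (key, value) record.
    intermediate = chain.from_iterable(map_fn(k, v) for k, v in records)
    # Shuffle: group intermediate values by their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Sort the keys, then reduce each group to a final result.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    records = [(0, "hadoop runs mapreduce"), (1, "mapreduce runs on hadoop")]
    print(run_job(records))
```

In a real cluster the map calls run on many machines and the shuffle moves data across the network; here a dictionary stands in for that step.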
MapReduce leverages data locality, non‑shared architecture, replication, and fault tolerance to achieve efficient large‑scale processing.
During the Map phase, intermediate data is first buffered in memory, partially sorted, and only later written to disk. The Reduce phase proceeds through Copy → Sort → Reduce, using merge sort.
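The buffering and merge-sort behaviour can be approximated as follows. This is a simplified sketch in which in-memory lists stand in for spill files on disk; buffer_limit and the function names are assumptions made for the example, not Hadoop settings.

```python
import heapq


def map_side_spills(pairs, buffer_limit):
    """Buffer pairs in memory; when the buffer fills, sort it and spill it."""
    spills, buffer = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= buffer_limit:
            spills.append(sorted(buffer))  # each spill is a sorted run
            buffer = []
    if buffer:
        spills.append(sorted(buffer))  # flush whatever is left at the end
    return spills


def reduce_side_merge(spills):
    """Merge the already-sorted runs into one sorted stream (k-way merge)."""
    return list(heapq.merge(*spills))


if __name__ == "__main__":
    pairs = [("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
    spills = map_side_spills(pairs, buffer_limit=2)
    print(spills)
    print(reduce_side_merge(spills))
```

Because every run is already sorted, the merge reads each run sequentially and never needs all of the data in memory at once, which is why merge sort suits this phase.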
Hadoop Framework
Hadoop is an open‑source distributed parallel programming framework that implements the MapReduce model. It provides a distributed file system (HDFS) and a distributed database (HBase).
In short:
Hadoop = HDFS (storage) + HBase (database) + MapReduce (processing)
Hadoop integrates storage and processing to handle massive data sets.
Hadoop Components
Hadoop's core components, as covered here, are HDFS, MapReduce, and HBase; Hive is a tool built on top of them.
HDFS is the open‑source implementation of Google’s GFS, providing a master/slave architecture with a NameNode and multiple DataNodes.
Hadoop MapReduce offers a simple software framework that runs on thousands of commodity machines, handling TB‑scale data sets with fault‑tolerant parallel execution.
Hive is a data‑warehouse tool on Hadoop that translates SQL‑like queries into MapReduce jobs.
HBase is a distributed, column‑oriented NoSQL database modeled after Google’s BigTable.
Part 2: Taobao’s Massive Data Product Architecture
Taobao’s architecture is divided into five layers from top to bottom: Data Source, Compute, Storage, Query, and Product.
Data Source layer: transaction data from shops, transferred in near real‑time via DataX, DbSync, and Timetunnel.
Compute layer: a Hadoop cluster known as Yunti ("cloud ladder") runs daily MapReduce jobs.
Storage layer: uses MyFOX (distributed MySQL) and Prom (HBase‑based NoSQL cluster).
Query layer: Glider provides RESTful HTTP APIs; MyFOX also handles queries.
Product layer: final data products.
MyFOX
MyFOX is a query‑proxy layer for a distributed MySQL cluster (MyISAM engine). It stores hot and cold data on different disks to balance performance and cost.
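One way a query proxy could split work between hot and cold storage is to route by date. The sketch below is hypothetical: the 30-day window, the tier names, and the functions are assumptions for illustration, not MyFOX's actual routing rules.

```python
from datetime import date, timedelta

HOT_WINDOW_DAYS = 30  # assumed threshold; the article does not give one


def pick_tier(record_date, today):
    """Rows newer than the window count as hot, older rows as cold."""
    age = today - record_date
    return "hot" if age <= timedelta(days=HOT_WINDOW_DAYS) else "cold"


def route_query(start, end, today):
    """Split an inclusive date-range query into per-tier sub-ranges."""
    boundary = today - timedelta(days=HOT_WINDOW_DAYS)  # first hot day
    plan = {}
    if start < boundary:
        plan["cold"] = (start, min(end, boundary - timedelta(days=1)))
    if end >= boundary:
        plan["hot"] = (max(start, boundary), end)
    return plan


if __name__ == "__main__":
    today = date(2024, 3, 31)
    print(route_query(date(2024, 2, 20), date(2024, 3, 10), today))
```

A proxy built this way would send each sub-range to the matching nodes and merge the results before replying, so callers never need to know where a row lives.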
Prom
Figure (not reproduced here): Prom's storage structure and query process.
Glider Technical Architecture
Glider acts as an intermediate layer that performs joins/unions across heterogeneous tables, isolates front‑end products from back‑end storage, and provides unified query services.
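The join and union role of such an intermediate layer can be sketched as below. The two fetch functions and their rows are invented stand-ins for back-end sources; nothing here reflects Glider's real interfaces.

```python
def fetch_sales(shop_ids):
    """Stand-in for a relational source (hypothetical data)."""
    rows = {1: {"shop_id": 1, "sales": 120}, 2: {"shop_id": 2, "sales": 75}}
    return [rows[s] for s in shop_ids if s in rows]


def fetch_traffic(shop_ids):
    """Stand-in for a NoSQL source (hypothetical data)."""
    rows = {1: {"shop_id": 1, "visitors": 900}, 3: {"shop_id": 3, "visitors": 40}}
    return [rows[s] for s in shop_ids if s in rows]


def join_on(key, left_rows, right_rows):
    """Inner join: keep left rows that have a match on `key` and merge fields."""
    right_index = {row[key]: row for row in right_rows}
    return [
        {**left, **right_index[left[key]]}
        for left in left_rows
        if left[key] in right_index
    ]


def union(left_rows, right_rows):
    """Union: concatenate rows from both sources."""
    return left_rows + right_rows


if __name__ == "__main__":
    ids = [1, 2, 3]
    print(join_on("shop_id", fetch_sales(ids), fetch_traffic(ids)))
```

The front end only ever calls the middle layer, so a storage back end can be swapped without changing product code.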
Caching
Glider implements a two‑level cache: a second‑level cache per data source and a first‑level cache for integrated requests. It also respects cache‑control directives in URLs and HTTP headers.
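A two-level cache of this shape might look like the following minimal sketch. The class, its attributes, and the lambda sources are assumptions for illustration; expiry and cache-control handling are left out to keep the example short.

```python
class TwoLevelCache:
    """First level caches integrated requests; second level caches per source."""

    def __init__(self, sources):
        self.sources = sources    # source name -> callable(key) returning a dict
        self.request_cache = {}   # first level: merged result per request
        self.source_cache = {}    # second level: result per (source, key)
        self.backend_calls = 0    # counts real fetches, for demonstration

    def _from_source(self, name, key):
        cache_key = (name, key)
        if cache_key not in self.source_cache:
            self.backend_calls += 1
            self.source_cache[cache_key] = self.sources[name](key)
        return self.source_cache[cache_key]

    def query(self, names, key):
        request_key = (tuple(names), key)
        if request_key in self.request_cache:
            return self.request_cache[request_key]  # first-level hit
        merged = {}
        for name in names:
            merged.update(self._from_source(name, key))  # second-level lookup
        self.request_cache[request_key] = merged
        return merged


if __name__ == "__main__":
    cache = TwoLevelCache({
        "sales": lambda key: {"sales": 10},
        "traffic": lambda key: {"visitors": 5},
    })
    print(cache.query(["sales", "traffic"], "shop1"), cache.backend_calls)
    print(cache.query(["sales"], "shop1"), cache.backend_calls)
```

The second query is a different integrated request, so it misses the first level, yet it still avoids a back-end call because the per-source entry is already cached.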
To mitigate cache penetration, Taobao caches empty results for a short time and uses Bloom filters to reject keys that do not exist. To avoid a cache avalanche when many entries expire at once, cache lifetimes are staggered.
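The three techniques can be combined in one lookup path, as in this sketch. The Bloom filter sizing, the TTL values, and the lookup function are assumptions chosen for the example, not Taobao's settings.

```python
import hashlib
import random

EMPTY_TTL = 30     # assumed short lifetime for cached empty results (seconds)
NORMAL_TTL = 600   # assumed base lifetime for normal results (seconds)


class BloomFilter:
    """May report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def jittered_ttl(base_seconds, spread=0.2):
    """Stagger expirations: vary the base TTL by up to `spread` either way."""
    return base_seconds * (1 + random.uniform(-spread, spread))


def lookup(key, bloom, cache, backend, now):
    """cache maps key -> (value, expires_at); backend is any dict-like store."""
    if not bloom.might_contain(key):
        return None  # key was never registered, so skip cache and backend
    entry = cache.get(key)
    if entry is not None and entry[1] > now:
        return entry[0]  # fresh hit, possibly a cached empty result
    value = backend.get(key)
    ttl = jittered_ttl(NORMAL_TTL) if value is not None else EMPTY_TTL
    cache[key] = (value, now + ttl)
    return value
```

Keys must be added to the filter whenever they are written to the back end; otherwise valid keys would be rejected.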
Original source: http://blog.csdn.net/v_july_v/article/details/6704077
21CTO