Big Data 17 min read

Mastering Hadoop: From MapReduce Basics to Taobao’s Massive Data Architecture

This article introduces the fundamental MapReduce model and Hadoop framework, explains their components such as HDFS, MapReduce, and HBase, and then examines Taobao’s large‑scale data product architecture—including storage, computation, query, and caching layers—to illustrate practical big‑data processing techniques.

21CTO

Dec 9, 2015

Mastering Hadoop: From MapReduce Basics to Taobao’s Massive Data Architecture

We start with the most basic MapReduce pattern and the Hadoop framework, then expand to massive data processing and finally discuss Taobao's large‑scale data product technical architecture, aiming for both simplicity and depth.

Part 1: In‑depth MapReduce Pattern and Hadoop Framework

Architecture Overview

To understand the article, readers should first grasp three points:

MapReduce is a pattern.

Hadoop is a framework.

Hadoop is an open‑source distributed parallel programming framework that implements the MapReduce pattern.

Thus, the core idea is to process massive data on the Hadoop framework using the MapReduce model.

MapReduce Pattern

MapReduce is a core cloud‑computing model, a distributed programming technique that splits a problem into a Map (mapping) phase and a Reduce (aggregation) phase.

The workflow is illustrated below:

Data is split, mapped to intermediate key/value pairs by the Map function, then shuffled, sorted, and reduced to final results.

The MapReduce implementation follows functional programming ideas: a Map function transforms input key/value pairs into intermediate pairs, and a Reduce function merges values with the same intermediate key.

MapReduce leverages data locality, non‑shared architecture, replication, and fault tolerance to achieve efficient large‑scale processing.

During the Map phase, intermediate data is first buffered in memory, partially sorted, and only later written to disk. The Reduce phase proceeds through Copy → Sort → Reduce, using merge sort.

Hadoop Framework

Hadoop is an open‑source distributed parallel programming framework that implements the MapReduce model. It provides a distributed file system (HDFS) and a distributed database (HBase).

In short:

Hadoop = HDFS (storage) + HBase (database) + MapReduce (processing)

Hadoop integrates storage and processing to handle massive data sets.

Hadoop Components

Hadoop consists of HDFS, MapReduce, and HBase.

HDFS is the open‑source implementation of Google’s GFS, providing a master/slave architecture with a NameNode and multiple DataNodes.

Hadoop MapReduce offers a simple software framework that runs on thousands of commodity machines, handling TB‑scale data sets with fault‑tolerant parallel execution.

Hive is a data‑warehouse tool on Hadoop that translates SQL‑like queries into MapReduce jobs.

HBase is a distributed, column‑oriented NoSQL database modeled after Google’s BigTable.

Part 2: Taobao’s Massive Data Product Architecture

Taobao’s architecture is divided into five layers from top to bottom: Data Source, Compute, Storage, Query, and Product.

Data Source layer: transaction data from shops, transferred in near real‑time via DataX, DbSync, and Timetunnel.

Compute layer: a Hadoop cluster (called “cloud ladder”) runs daily MapReduce jobs.

Storage layer: uses MyFOX (distributed MySQL) and Prom (HBase‑based NoSQL cluster).

Query layer: Glider provides RESTful HTTP APIs; MyFOX also handles queries.

Product layer: final data products.

MyFOX

MyFOX is a query‑proxy layer for a distributed MySQL cluster (MyISAM engine). It stores hot and cold data on different disks to balance performance and cost.

Prom

Prom’s storage structure and query process are illustrated below.

Glider Technical Architecture

Glider acts as an intermediate layer that performs joins/unions across heterogeneous tables, isolates front‑end products from back‑end storage, and provides unified query services.

Caching

Glider implements a two‑level cache: a second‑level cache per data source and a first‑level cache for integrated requests. It also respects cache‑control directives in URLs and HTTP headers.

To mitigate cache‑penetration, Taobao uses short‑lived caching for empty results and employs Bloom filters to filter non‑existent keys. To avoid cache‑snow‑avalanche on expiration, cache lifetimes are staggered.

Original source: http://blog.csdn.net/v_july_v/article/details/6704077

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems MapReduce Hadoop Data Architecture

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.