Master MapReduce: From Fundamentals to Real‑World Hadoop Projects

This guide walks through MapReduce fundamentals, the complete execution flow, and eight hands-on Hadoop examples (WordCount, custom serialization, custom partitioning, a grouping comparator, small-file merging, multiple outputs, a join, and common-friend analysis), along with environment setup steps, Maven commands, and Hadoop CLI examples.

dbaplus Community

Jun 7, 2017

This article provides a fast-track introduction to MapReduce, covering its basic principles, execution process, core workflow, and practical development methods through a series of eight detailed examples.

1. MapReduce Basics

MapReduce is a programming model for distributed processing of large data sets. It consists of two core operations: map, where each worker processes a chunk of input data and emits intermediate (key, value) pairs, and reduce, where the framework groups the values by key and the user code aggregates each group.

A library inventory illustrates the idea: ten students each count the books on one shelf (map), then the librarian sums the counts (reduce).
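The library analogy can be sketched in plain Java with no Hadoop dependency. The class name `LibraryInventorySketch`, the three shelves (instead of ten), and the book titles are illustrative assumptions, not part of the original article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the map and reduce roles; no Hadoop API is used.
class LibraryInventorySketch {

    // "Map": one student counts the books on a single shelf.
    static int countShelf(List<String> shelf) {
        return shelf.size();
    }

    // "Reduce": the librarian sums the per-shelf counts.
    static int sumCounts(List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    static int inventory(List<List<String>> shelves) {
        List<Integer> counts = new ArrayList<>();
        for (List<String> shelf : shelves) {
            // Each call is independent, so it could run on a different worker.
            counts.add(countShelf(shelf));
        }
        return sumCounts(counts);
    }

    public static void main(String[] args) {
        // Hypothetical shelves, used only for this demonstration.
        List<List<String>> shelves = Arrays.asList(
                Arrays.asList("Dune", "Emma"),
                Arrays.asList("Ulysses"),
                Arrays.asList("Hamlet", "Walden", "Beloved"));
        System.out.println(inventory(shelves));
    }
}
```

The point of the split is that `countShelf` needs no knowledge of the other shelves, which is what lets a framework run the map step in parallel.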

2. Development Environment

Two options are offered: build your own Hadoop cluster or use a pre‑packaged virtual machine (Hadoop 2.7.3) that runs on VirtualBox with Vagrant. Essential commands include:

vagrant box add {name} {path_to_box}
cd d:\hdfstest
vagrant init hadoop
vagrant up

After the VM starts, launch HDFS and YARN:

start-dfs.sh
start-yarn.sh

3. Example 1 – WordCount

The classic WordCount program demonstrates a complete MapReduce job.

Create a Maven project with pom.xml (shown as an image).

Define WordcountMapper (extends Mapper<LongWritable, Text, Text, IntWritable>) to split each line into words and emit (word, 1).

Define WordCountReducer (extends Reducer<Text, IntWritable, Text, IntWritable>) to sum the counts for each word.

Assemble the job in WordCountMapReduce and submit it with hadoop jar ....

Compilation command:

mvn package

Run the job:

hadoop jar mapreduce-wordcount-0.0.1-SNAPSHOT.jar WordCountMapReduce /wordcount/input /wordcount/output
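The article's mapper and reducer source is only available as images, so the following is a plain-Java simulation of the WordCount data flow rather than the actual Hadoop classes. Splitting on whitespace and the class name `WordCountSketch` are assumptions; in the real job the map and reduce logic would live in the `Mapper` and `Reducer` subclasses described above.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates the WordCount data flow in memory; no Hadoop classes are involved.
class WordCountSketch {

    // Mirrors the mapper: split one line into words and emit (word, 1).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Mirrors the reducer: sum all counts emitted for one word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Stand-in for the shuffle: group values by key, keys kept sorted.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }
}
```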

4. Example 2 – Custom Object Serialization

Goal: aggregate mobile‑user traffic logs by phone number.

Define a custom FlowBean (serializable) holding up‑flow, down‑flow, and total flow.

Mapper emits (phone, FlowBean); reducer merges beans for the same phone.

Build, package, and run similarly with Maven and Hadoop commands (see source images for exact code).
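As a rough sketch of what such a bean involves: Hadoop's `Writable` interface declares `write(DataOutput)` and `readFields(DataInput)`, and both parameter types come from `java.io`, so the serialization logic can be shown without Hadoop on the classpath. The class name `FlowBeanSketch`, the field names, the use of `long`, and the `toBytes`/`fromBytes` helpers are assumptions made for illustration; a real bean would declare `implements Writable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// In a real job this class would implement org.apache.hadoop.io.Writable,
// whose two methods have the same shape as write and readFields below.
class FlowBeanSketch {
    long upFlow;
    long downFlow;
    long sumFlow;

    // A no-argument constructor is needed so the framework can create
    // an empty instance before calling readFields.
    FlowBeanSketch() { }

    FlowBeanSketch(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    // Fields must be read in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }

    // Reducer-side merge of two beans for the same phone number.
    FlowBeanSketch merge(FlowBeanSketch other) {
        return new FlowBeanSketch(upFlow + other.upFlow, downFlow + other.downFlow);
    }

    // Hypothetical helpers that exercise the round trip in memory.
    static byte[] toBytes(FlowBeanSketch bean) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            bean.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static FlowBeanSketch fromBytes(byte[] bytes) {
        try {
            FlowBeanSketch bean = new FlowBeanSketch();
            bean.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return bean;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```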

5. Example 3 – Custom Partitioner

Goal: route records to reducers based on phone‑number prefix (province).

Implement ProvincePartitioner that extracts the prefix and looks it up in a hard‑coded map.

Set the partitioner in the job via job.setPartitionerClass(ProvincePartitioner.class).
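The lookup at the heart of such a partitioner might look like the sketch below. The prefixes, the partition numbers, and the catch-all partition for unknown prefixes are hypothetical, since the article's table is not reproduced here. In a real job the method would override `getPartition` in a subclass of Hadoop's `Partitioner`.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the lookup a province partitioner performs.
class ProvincePartitionerSketch {

    // Hypothetical prefix-to-partition table; real values would come from the article's code.
    private static final Map<String, Integer> PROVINCES = new HashMap<>();
    static {
        PROVINCES.put("136", 0);
        PROVINCES.put("137", 1);
        PROVINCES.put("138", 2);
        PROVINCES.put("139", 3);
    }

    // Returns a partition index in [0, numPartitions).
    static int getPartition(String phone, int numPartitions) {
        String prefix = phone.length() >= 3 ? phone.substring(0, 3) : phone;
        Integer known = PROVINCES.get(prefix);
        // Unknown prefixes share one extra partition after the known ones.
        int partition = (known != null) ? known : PROVINCES.size();
        // Guard against a job configured with fewer reducers than partitions.
        return partition % numPartitions;
    }
}
```

With this table the job would need five reduce tasks (four provinces plus the catch-all) for every province to get its own output file. Hadoop rejects a partition index outside the configured reducer range, which is why the sketch applies the modulo guard.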

6. Example 4 – Grouping Comparator (Maximum Order Amount)

Goal: find the highest‑value transaction per order.

Create OrderBean (implements WritableComparable) with orderId and amount.

Custom ItemIdPartitioner ensures all records of the same order go to the same reducer.

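Custom MyGroupingComparator treats all keys with the same orderId as one reduce group. Because the keys are sorted by orderId and then by amount in descending order, the first record in each group carries the maximum amount.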
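The interplay between the sort order and the grouping comparator can be simulated in plain Java. This sketch assumes `amount` is a `double` and uses two `Comparator` instances in place of `OrderBean.compareTo` and `MyGroupingComparator`; the names `MaxOrderSketch`, `SORT`, and `GROUP` are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simulates secondary sort plus grouping; no Hadoop classes are involved.
class MaxOrderSketch {

    static final class OrderBean {
        final String orderId;
        final double amount;

        OrderBean(String orderId, double amount) {
            this.orderId = orderId;
            this.amount = amount;
        }
    }

    // Sort order (the role of compareTo): orderId ascending, then amount descending.
    static final Comparator<OrderBean> SORT = (a, b) -> {
        int byId = a.orderId.compareTo(b.orderId);
        return byId != 0 ? byId : Double.compare(b.amount, a.amount);
    };

    // Grouping order (the role of the grouping comparator): orderId only.
    static final Comparator<OrderBean> GROUP = (a, b) -> a.orderId.compareTo(b.orderId);

    static Map<String, Double> maxPerOrder(List<OrderBean> records) {
        List<OrderBean> sorted = new ArrayList<>(records);
        sorted.sort(SORT);
        Map<String, Double> result = new LinkedHashMap<>();
        OrderBean previous = null;
        for (OrderBean current : sorted) {
            // A new group starts when the grouping comparator sees a different key.
            // Its first record has the largest amount because of the sort order.
            if (previous == null || GROUP.compare(previous, current) != 0) {
                result.put(current.orderId, current.amount);
            }
            previous = current;
        }
        return result;
    }
}
```

The reducer never has to scan a whole group: it takes the first key it sees and ignores the rest.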

7. Example 5 – Merging Small Files

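Large numbers of tiny files are inefficient to process, because each file typically becomes its own input split and map task. A custom MyInputFormat and MyRecordReader read each whole file as a single record, outputting (filename, fileContent). The job writes its results with SequenceFileOutputFormat so that each pair is stored as a binary key/value record in a merged file.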
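The "one file, one record" idea can be shown against the local file system using only the JDK. This is a sketch of the record shape, not of the Hadoop `InputFormat` and `RecordReader` classes themselves; `WholeFileSketch` and `readWholeFiles` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Reads every regular file in a local directory as one (filename, content) record.
class WholeFileSketch {

    static Map<String, byte[]> readWholeFiles(Path dir) {
        Map<String, byte[]> records = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    // The whole file becomes the value; it is never split by line.
                    records.put(file.getFileName().toString(), Files.readAllBytes(file));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```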

8. Example 6 – MultipleOutputs (One File per Key)

Goal: write each order’s records to a separate file named after the order ID.

Mapper emits (orderId, line).

Reducer uses MultipleOutputs to create files like Order_0000001-r-00000.
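The file naming and per-key routing can be sketched without Hadoop as follows. The comma-separated line format with the order ID in the first field is an assumption, as are the names `PerKeyOutputSketch`, `outputName`, and `route`. In the real reducer the routing would be done by calling `MultipleOutputs.write(key, value, baseOutputPath)` with the order ID as the base path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates routing each order's lines to a file named after the order ID.
class PerKeyOutputSketch {

    // Reduce-side output names take the form <base>-r-<five-digit partition>.
    static String outputName(String base, int partition) {
        return String.format("%s-r-%05d", base, partition);
    }

    // Groups input lines by the file they would be written to.
    static Map<String, List<String>> route(List<String> lines, int partition) {
        Map<String, List<String>> files = new TreeMap<>();
        for (String line : lines) {
            // Assumption: the order ID is the first comma-separated field.
            String orderId = line.split(",", 2)[0];
            files.computeIfAbsent(outputName(orderId, partition), k -> new ArrayList<>())
                 .add(line);
        }
        return files;
    }
}
```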

9. Example 7 – Join Operation

Goal: perform an inner join between an order table and a product table.

Define InfoBean with a flag indicating source (order or product).

Mapper tags each record with the flag and emits (productId, InfoBean).

Reducer receives all beans for a productId, separates order and product beans, and enriches orders with product details.
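The reducer-side logic for one productId might be sketched as below. The `InfoBean` fields, the factory methods, and the tab-separated output format are assumptions chosen for illustration; only the flag-based separation and the enrichment step come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of a reduce-side inner join for a single productId.
class ReduceSideJoinSketch {

    static final class InfoBean {
        final boolean fromProductTable; // the source flag
        final String orderId;           // empty for product records
        final String productId;
        final String productName;       // empty for order records

        private InfoBean(boolean fromProductTable, String orderId,
                         String productId, String productName) {
            this.fromProductTable = fromProductTable;
            this.orderId = orderId;
            this.productId = productId;
            this.productName = productName;
        }

        static InfoBean order(String orderId, String productId) {
            return new InfoBean(false, orderId, productId, "");
        }

        static InfoBean product(String productId, String productName) {
            return new InfoBean(true, "", productId, productName);
        }
    }

    // Receives every bean that shares one productId, as a reducer would.
    static List<String> reduce(String productId, List<InfoBean> beans) {
        InfoBean product = null;
        List<InfoBean> orders = new ArrayList<>();
        for (InfoBean bean : beans) {
            if (bean.fromProductTable) {
                product = bean;
            } else {
                orders.add(bean);
            }
        }
        List<String> joined = new ArrayList<>();
        if (product == null) {
            return joined; // inner join: orders without a matching product are dropped
        }
        for (InfoBean order : orders) {
            joined.add(order.orderId + "\t" + productId + "\t" + product.productName);
        }
        return joined;
    }
}
```

One practical difference in a real reducer: Hadoop reuses the same value instance while iterating, so each bean would have to be copied before being stored in a list.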

10. Example 8 – Common Friends (Two‑Stage MapReduce)

The first job's mapper emits (friend, user) pairs. Its reducer receives all users who share a given friend and emits (userA-userB, friend) for every pair of those users. The second job groups these records by user pair and collects the friends, producing the final common-friend list for each pair.
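Both stages can be simulated in memory to make the data flow concrete. The input shape (a map from each user to that user's friend list) and the name `CommonFriendsSketch` are assumptions. Users within a pair are ordered alphabetically so that A-B and B-A land on the same key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory simulation of the two-stage common-friends computation.
class CommonFriendsSketch {

    static Map<String, List<String>> commonFriends(Map<String, List<String>> friendsOf) {
        // Job 1 map and shuffle: emit (friend, user), grouped by friend.
        Map<String, TreeSet<String>> usersByFriend = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : friendsOf.entrySet()) {
            for (String friend : entry.getValue()) {
                usersByFriend.computeIfAbsent(friend, k -> new TreeSet<>())
                             .add(entry.getKey());
            }
        }

        // Job 1 reduce: emit (userA-userB, friend) for each pair of users.
        // Job 2 shuffle and reduce: group by pair and collect the friends.
        Map<String, List<String>> friendsByPair = new TreeMap<>();
        for (Map.Entry<String, TreeSet<String>> entry : usersByFriend.entrySet()) {
            List<String> users = new ArrayList<>(entry.getValue()); // already sorted
            for (int i = 0; i < users.size(); i++) {
                for (int j = i + 1; j < users.size(); j++) {
                    String pair = users.get(i) + "-" + users.get(j);
                    friendsByPair.computeIfAbsent(pair, k -> new ArrayList<>())
                                 .add(entry.getKey());
                }
            }
        }
        return friendsByPair;
    }
}
```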

11. Core MapReduce Workflow Recap

The overall process consists of:

Client computes the input splits and submits the job to the YARN ResourceManager (Hadoop 2.x has no JobTracker).

AppMaster is launched to coordinate map and reduce tasks.

Map tasks read input splits, invoke the user map method, and write sorted, partitioned intermediate data to local spill files.

After map completion, reducers fetch their assigned partitions, merge and sort the data, apply the GroupingComparator, and invoke the user reduce method.

Reducer output is written to the final HDFS destination.
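The partition, sort, and group steps in the middle of this workflow can be approximated in memory. The partition formula matches the one used by Hadoop's default `HashPartitioner`. Everything else (the class name `ShuffleSketch`, the use of `TreeMap` for sorting, and holding all data in memory rather than in spill files) is a simplification for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory approximation of partitioning, sorting, and grouping map output.
class ShuffleSketch {

    // Same formula as the default hash partitioner: a non-negative hash modulo reducer count.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Returns one sorted key -> values map per reducer.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReduceTasks) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapOutput) {
            int partition = partitionFor(pair.getKey(), numReduceTasks);
            // TreeMap keeps keys sorted; the list holds the group a reduce call would see.
            partitions.get(partition)
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
        }
        return partitions;
    }
}
```

Each entry of a returned map corresponds to one invocation of the user's reduce method on that reducer.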

Figures illustrating each step are included as images in the original article.

12. Conclusion

The tutorial equips readers with a solid understanding of MapReduce theory and hands‑on experience building real Hadoop jobs, from simple word counts to complex joins and graph analyses.
