Understanding Hadoop Data Splitting and InputFormat Mechanisms

This article explains how Hadoop splits input data: the distinction between physical HDFS blocks and logical InputSplits, the source code behind the built‑in InputFormats (TextInputFormat, CombineTextInputFormat, KeyValueTextInputFormat, NLineInputFormat), and how to write a custom InputFormat. It provides complete Java examples of Mapper, Reducer, and driver classes.

Big Data Technology & Architecture

Sep 1, 2021

Hadoop stores large files on HDFS as multiple physical blocks, while logical InputSplits determine the number of parallel Map tasks.

The article explains the difference between HDFS blocks and InputSplits: a 512 MB file with the default 128 MB block size is stored as four blocks, but with a 100 MB split size it is divided into 100 MB logical splits (five full splits, with the remaining 12 MB forming a sixth). The number of Map tasks follows the split count, not the block count.
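The block and split counts for this example can be checked with a small calculation. The following is a minimal sketch in plain Java, not Hadoop code: the class and method names are hypothetical, and it assumes the 1.1 slack factor (SPLIT_SLOP) that FileInputFormat applies, under which the final split may run up to 10% over the split size.

```java
// Sketch only: counts physical blocks and logical splits for one file.
// SplitCount, countBlocks and countSplits are illustrative names, not Hadoop APIs.
final class SplitCount {
    static final long MB = 1024L * 1024L;

    // Slack factor assumed from FileInputFormat: a remainder of up to
    // 1.1 * splitSize is kept as one split instead of being cut again.
    static final double SPLIT_SLOP = 1.1;

    // Physical blocks: ceiling of fileSize / blockSize.
    static long countBlocks(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Logical splits: cut full-size splits while the remainder is larger than
    // SPLIT_SLOP * splitSize, then count whatever is left as the final split.
    static long countSplits(long fileSize, long splitSize) {
        long count = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            count++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBlocks(512 * MB, 128 * MB)); // prints 4
        System.out.println(countSplits(512 * MB, 100 * MB)); // prints 6
    }
}
```

For the 512 MB file, this gives four blocks and, at a 100 MB split size, five full splits plus one split for the remaining 12 MB. A 505 MB file would produce only five splits, because the last 105 MB stays within the 10% slack.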

It provides the source code of the getSplits method from FileInputFormat, illustrating how minimum and maximum split sizes are calculated, how block locations are retrieved, and how splits are generated.
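The split size calculation in getSplits comes down to one formula: the block size is capped by the configured maximum and then raised to the configured minimum. The sketch below reproduces that formula in plain Java so it runs without Hadoop; the class name SplitSize is hypothetical. It assumes the defaults of 1 for the minimum and Long.MAX_VALUE for the maximum, which correspond to the properties mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.

```java
// Sketch only: the split size formula used by FileInputFormat, isolated
// from Hadoop so that it can be run and tested on its own.
final class SplitSize {
    static final long MB = 1024L * 1024L;

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 128 * MB;
        // Defaults: the split size equals the block size.
        System.out.println(computeSplitSize(block, 1L, Long.MAX_VALUE) / MB); // prints 128
        // Lowering the maximum below the block size shrinks the splits.
        System.out.println(computeSplitSize(block, 1L, 100 * MB) / MB);       // prints 100
        // Raising the minimum above the block size enlarges the splits.
        System.out.println(computeSplitSize(block, 256 * MB, Long.MAX_VALUE) / MB); // prints 256
    }
}
```

In other words, a split smaller than a block is obtained by lowering the maximum, and a split larger than a block by raising the minimum.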

Various built‑in InputFormats are described:

TextInputFormat: the default InputFormat. The split size defaults to the HDFS block size (128 MB), and each line is read as a (byte offset, line content) key‑value pair.

CombineTextInputFormat: merges many small files into larger logical splits to reduce the number of Map tasks.

KeyValueTextInputFormat: splits each line at the first occurrence of a configurable delimiter (tab by default) to produce a key‑value pair.

NLineInputFormat: creates splits from a fixed number of lines, so every N lines form one split regardless of their size in bytes.
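How KeyValueTextInputFormat and NLineInputFormat shape their input can be illustrated without a cluster. The following plain‑Java sketch imitates the two behaviors on in‑memory strings; RecordShapes, splitKeyValue, and groupLines are hypothetical names, not Hadoop APIs. It assumes that a line is split at the first delimiter only, and that a line without a delimiter becomes the key with an empty value.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: imitates how two InputFormats shape records and splits.
final class RecordShapes {

    // KeyValueTextInputFormat style: split at the FIRST delimiter.
    // If the delimiter is absent, the whole line is the key and the value is empty.
    static String[] splitKeyValue(String line, char delimiter) {
        int pos = line.indexOf(delimiter);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    // NLineInputFormat style: every n consecutive lines form one split;
    // the last split holds whatever lines remain.
    static List<List<String>> groupLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, end)));
        }
        return splits;
    }
}
```

With a tab delimiter, the line "a\tb\tc" yields the key "a" and the value "b\tc". Five input lines grouped with n = 2 yield three splits of two, two, and one line, which means three Map tasks.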

For each InputFormat, the article includes complete Java examples of Mapper, Reducer, and driver classes, e.g., the WordCount example for TextInputFormat:

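package com.lzj.hadoop.input;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in each input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across map() calls to avoid allocating new objects per record.
    private final Text k = new Text();
    private final IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line; value is the line content.
        String line = value.toString().trim();
        if (line.isEmpty()) {
            return;
        }
        // Split on runs of whitespace so repeated spaces do not emit empty words.
        for (String word : line.split("\\s+")) {
            k.set(word);
            context.write(k, v);
        }
    }
}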
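To see what the mapper and a summing reducer compute together, the job can be imitated in memory. This is a sketch in plain Java under stated assumptions, not the article's Reducer: LocalWordCount is a hypothetical class, words are separated by whitespace, and the shuffle phase is replaced by a sorted map that adds up the 1s emitted for each word.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: map, shuffle and reduce of WordCount collapsed into one loop.
final class LocalWordCount {

    static Map<String, Integer> count(Iterable<String> lines) {
        // TreeMap keeps keys sorted, similar to the sorted keys a reducer sees.
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // the mapper skips empty lines
            }
            for (String word : trimmed.split("\\s+")) {
                // "Map" emits (word, 1); "reduce" sums the 1s per word.
                totals.merge(word, 1, Integer::sum);
            }
        }
        return totals;
    }
}
```

For the lines "hello world" and "hello hadoop", the result maps hadoop to 1, hello to 2, and world to 1.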

Similar code snippets are provided for CombineTextInputFormat, KeyValueTextInputFormat, and NLineInputFormat, demonstrating how to configure the job, set the input format class, and specify split parameters.
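To give a feel for what a maximum split size controls when small files are combined, here is a deliberately simplified sketch in plain Java. It is not the algorithm that CombineTextInputFormat uses: the real implementation also takes node and rack locality into account and can break up large files. The class name SmallFilePacker and the greedy rule are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: greedy packing of file sizes into groups, where each group
// stands for one combined split. Not Hadoop's actual combining algorithm.
final class SmallFilePacker {

    static List<List<Long>> pack(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Close the current group if adding this file would exceed the limit.
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            // A file larger than the limit still gets a group of its own.
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

With file sizes 1, 2, 3, 4, 5 and a limit of 6, the sketch produces three groups ([1, 2, 3], [4], [5]) instead of five separate splits. Raising the limit yields fewer and larger groups, and therefore fewer Map tasks.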

The article also covers creating a custom InputFormat by implementing a non‑splittable CustomFileInputFormat and a corresponding CustomRecordReader, followed by Mapper, Reducer, and driver classes that read whole files as BytesWritable values.
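The core idea of a whole‑file record reader is that it produces exactly one record per file. In Hadoop this is paired with an InputFormat whose isSplitable method returns false, so that each file maps to a single split. The sketch below imitates only the reader logic in plain Java: WholeFileReader is a hypothetical stand‑in that uses java.nio instead of HDFS streams, a String key instead of a Hadoop key type, and a byte array instead of BytesWritable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: a stand-in for a RecordReader over a non-splittable file.
final class WholeFileReader {
    private final Path file;
    private byte[] value;
    private boolean processed = false;

    WholeFileReader(Path file) {
        this.file = file;
    }

    // Returns true exactly once: the first call loads the entire file
    // as a single record, every later call reports that no records remain.
    boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        value = Files.readAllBytes(file);
        processed = true;
        return true;
    }

    // The file path serves as the key in this sketch.
    String getCurrentKey() {
        return file.toString();
    }

    byte[] getCurrentValue() {
        return value;
    }

    // Progress is either nothing read or everything read.
    float getProgress() {
        return processed ? 1.0f : 0.0f;
    }
}
```

Because the whole file is held in memory as one value, this pattern suits small files; a large file would need a splittable format instead.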

Finally, it discusses Hadoop serialization by defining a custom User bean that implements Writable, and shows Mapper, Reducer, and driver code that aggregates user wages, illustrating how custom writable objects can be transferred between Map and Reduce phases.
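The serialization contract behind a Writable bean is small: write the fields to a DataOutput, and read them back from a DataInput in the same order. The following sketch shows that round trip in plain Java. SimpleWritable is a local stand‑in that mirrors the two methods of org.apache.hadoop.io.Writable so the example runs without Hadoop, and the UserRecord class with its name and wage fields is an assumption, not the article's User bean.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Local stand-in with the same two methods as Hadoop's Writable interface.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical bean; the field names are illustrative.
final class UserRecord implements SimpleWritable {
    private String name = "";
    private double wage;

    // A no-argument constructor lets a framework create an empty
    // instance first and then fill it through readFields.
    UserRecord() {
    }

    UserRecord(String name, double wage) {
        this.name = name;
        this.wage = wage;
    }

    String getName() { return name; }
    double getWage() { return wage; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeDouble(wage);
    }

    // Fields must be read in exactly the order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        wage = in.readDouble();
    }

    static byte[] toBytes(UserRecord user) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            user.write(out);
        }
        return buffer.toByteArray();
    }

    static UserRecord fromBytes(byte[] data) throws IOException {
        UserRecord user = new UserRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            user.readFields(in);
        }
        return user;
    }
}
```

Calling fromBytes(toBytes(user)) returns a copy with the same name and wage, which is in essence what happens when a custom value travels from the Map phase to the Reduce phase.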

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
