Handling Non‑Splittable gzip Files in Hadoop and Spark: MapReduce Splits and Performance Considerations

This article explains how a 10 GB gzip file is stored and processed on HDFS, details the MapReduce split calculation using GzipCodec, and discusses why Spark reads such non‑splittable files with a single task, recommending file splitting or format conversion for better performance.

Big Data Technology & Architecture

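A 10 GB gzip file stored on HDFS is divided into blocks like any other file: with the default 128 MB block size it occupies ceil(10 GB / 128 MB) = 80 blocks, which HDFS spreads across DataNodes and replicates according to the replication factor (3 by default). Being non‑splittable does not change how the file is stored; it only limits how the file can be read in parallel.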
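The block arithmetic can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and the 128 MB block size and replication factor of 3 are assumed defaults that a cluster may override.

```java
// Sketch: how many HDFS blocks a file of a given size occupies.
// HdfsBlockCount and blockCount are hypothetical names, not Hadoop APIs.
class HdfsBlockCount {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    /** Ceiling division: number of blocks needed to hold fileSize bytes. */
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blocks = blockCount(10 * GB, 128 * MB);
        int replication = 3; // assumed default replication factor
        System.out.println("blocks = " + blocks);
        System.out.println("block replicas stored = " + blocks * replication);
    }
}
```

The count is the same whether or not the file is compressed, because HDFS cuts blocks by byte offset without looking at the content.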

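MapReduce decompresses such a file with GzipCodec. The number of map tasks equals the number of input splits, which FileInputFormat.getSplits computes. For each input file it derives minSize, maxSize, and splitSize (from the block size). If isSplitable returns true, it creates splits of splitSize while the remaining bytes divided by splitSize exceed the SPLIT_SLOP threshold (1.1), and any remaining bytes form a final split. If isSplitable returns false, which is what TextInputFormat reports for a gzip file because GzipCodec is not a splittable codec, the whole file becomes a single split and is therefore processed by a single map task.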

public List<InputSplit> getSplits(JobContext job) throws IOException {
    // Lower and upper bounds on the split size, taken from the format and the job configuration.
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);
    List<InputSplit> splits = new ArrayList<>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
        Path path = file.getPath();
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        long length = file.getLen();
        BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
        if ((length != 0) && isSplitable(job, path)) {
            // Splittable file: cut it into splits of splitSize bytes.
            long blockSize = file.getBlockSize();
            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            long bytesRemaining = length;
            // Keep cutting full splits while more than 1.1 x splitSize is left.
            while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                        blkLocations[blkIndex].getHosts()));
                bytesRemaining -= splitSize;
            }
            // The tail (at most 1.1 x splitSize) becomes the last split.
            if (bytesRemaining != 0) {
                splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                        blkLocations[blkLocations.length - 1].getHosts()));
            }
        } else if (length != 0) {
            // Non-splittable file (for example gzip): one split covering the whole file.
            splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
        } else {
            // Empty file: a zero-length split with no preferred hosts.
            splits.add(new FileSplit(path, 0, length, new String[0]));
        }
    }
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    LOG.debug("Total # of splits: " + splits.size());
    return splits;
}

The helper method to compute split size is:

protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
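To see what this logic yields for the 10 GB example, the split loop can be replayed without Hadoop. The sketch below is a standalone simplification: SplitCountSketch and countSplits are hypothetical names, the splittable flag stands in for the result of isSplitable, and the min and max sizes passed in main are assumed values that leave the block size in control.

```java
// Sketch: counts the splits the getSplits loop would create for one file.
// Standalone simplification; not the Hadoop class itself.
class SplitCountSketch {
    static final double SPLIT_SLOP = 1.1; // same threshold as in the excerpt above

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** splittable stands in for the result of isSplitable(job, path). */
    static int countSplits(long length, long blockSize, long minSize, long maxSize,
                           boolean splittable) {
        if (length == 0 || !splittable) {
            return 1; // empty or non-splittable file: exactly one split
        }
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long bytesRemaining = length;
        int splits = 0;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++; // tail split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long tenGb = 10L * 1024L * mb;
        // Assumed bounds: minSize = 1, maxSize = Long.MAX_VALUE, so splitSize = blockSize.
        System.out.println("splittable 10 GB:  "
                + countSplits(tenGb, 128 * mb, 1, Long.MAX_VALUE, true));
        System.out.println("gzip 10 GB:        "
                + countSplits(tenGb, 128 * mb, 1, Long.MAX_VALUE, false));
        System.out.println("splittable 140 MB: "
                + countSplits(140 * mb, 128 * mb, 1, Long.MAX_VALUE, true));
    }
}
```

The 140 MB case shows the purpose of SPLIT_SLOP: a file only slightly larger than one split is kept as a single split instead of producing a tiny trailing one.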

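For splittable input, a record may span two splits. The map task whose split contains the start of the record reads past the end of its split until the record is complete, fetching the missing bytes from the next block (possibly over the network). The task that owns the following split skips the partial record at its beginning, so every record is processed exactly once. A gzip file forms a single split, so no boundary handling is needed, but that one task has to read every block of the file.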
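The boundary rule for newline-delimited records can be illustrated with a small in-memory simulation. This is a sketch modeled on the way a line-oriented reader typically handles splits (skip the first line unless the split starts at offset 0, then keep reading lines while the position is at or before the split end); the class and method names are hypothetical, and it is not the Hadoop reader itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch: how line records that straddle split boundaries are assigned to splits.
class SplitBoundarySketch {

    /** Returns the records read by the task that owns the split [start, end). */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: skip the (possibly partial) first line.
            // The task for the previous split reads it.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;
        }
        List<String> records = new ArrayList<>();
        // Read whole lines while the line starts at or before the split end,
        // running past 'end' if needed to finish the last one.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') lineEnd++;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.US_ASCII));
            pos = lineEnd + 1;
        }
        return records;
    }

    /** Cuts data into fixed-size splits and concatenates what each split reads. */
    static List<String> readAllSplits(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            all.addAll(readSplit(data, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.US_ASCII);
        for (int start = 0; start < data.length; start += 8) {
            int end = Math.min(start + 8, data.length);
            System.out.println("[" + start + ", " + end + ") -> " + readSplit(data, start, end));
        }
    }
}
```

With 8-byte splits, "bravo" begins in the first split and ends in the second; the first split's reader finishes it and the second split's reader skips its tail, so the combined output contains each record once.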

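In Spark, a non‑splittable gzip file is read by a single task per file, so one 10 GB file is decompressed and parsed by a single core regardless of cluster size, which can become a performance bottleneck. The recommended approach is to split the data into smaller files or to use a splittable compression format (for example bzip2, or a container format such as Parquet or ORC that compresses data block by block). Calling repartition after reading is another option, but it only parallelizes the stages that follow: the initial read still runs in one task, and the repartition itself adds a shuffle.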

Therefore, gzip files are best pre‑processed into smaller chunks or converted to a splittable format before processing with Spark or Hadoop.
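One way to pre‑process a large gzip file into smaller chunks is to stream it once and write the lines back out as several smaller gzip files, each of which can then be read by its own task. The sketch below uses only the JDK; the class name, the part-NNNNN.gz naming, and the lines-per-part parameter are illustrative assumptions, and a production job would more likely size parts by bytes and write directly to HDFS.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: re-chunk one large line-oriented gzip file into several smaller gzip files.
class GzipRechunker {

    static BufferedReader openGzipReader(Path file) throws IOException {
        return new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8));
    }

    static BufferedWriter openGzipWriter(Path file) throws IOException {
        return new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8));
    }

    /** Streams gzIn once and writes parts of at most linesPerPart lines into outDir. */
    static List<Path> rechunk(Path gzIn, Path outDir, int linesPerPart) {
        List<Path> parts = new ArrayList<>();
        try (BufferedReader in = openGzipReader(gzIn)) {
            Files.createDirectories(outDir);
            BufferedWriter out = null;
            try {
                int linesInPart = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    if (out == null || linesInPart == linesPerPart) {
                        if (out != null) out.close(); // finish the previous part
                        Path part = outDir.resolve(String.format("part-%05d.gz", parts.size()));
                        parts.add(part);
                        out = openGzipWriter(part);
                        linesInPart = 0;
                    }
                    out.write(line);
                    out.write('\n');
                    linesInPart++;
                }
            } finally {
                if (out != null) out.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parts;
    }

    /** Helper for demos: writes lines to a gzip file, creating parent directories. */
    static void writeGzipLines(Path file, List<String> lines) {
        try {
            if (file.getParent() != null) Files.createDirectories(file.getParent());
            try (BufferedWriter out = openGzipWriter(file)) {
                for (String line : lines) {
                    out.write(line);
                    out.write('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for demos: reads all lines of a gzip file. */
    static List<String> readGzipLines(Path file) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = openGzipReader(file)) {
            String line;
            while ((line = in.readLine()) != null) lines.add(line);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Splitting on line boundaries keeps every record intact inside one part, so the parts can be processed independently and in any order.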

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
