Handling Non‑Splittable gzip Files in Hadoop and Spark: MapReduce Splits and Performance Considerations
This article explains how a 10 GB gzip file is stored and processed on HDFS, details the MapReduce split calculation using GzipCodec, and discusses why Spark reads such non‑splittable files with a single task, recommending file splitting or format conversion for better performance.
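HDFS stores a 10 GB gzip file like any other file: at the default 128 MB block size it is cut into ceil(10 GB / 128 MB) = 80 blocks, and each block is replicated (3 copies by default) across different DataNodes. Being non‑splittable affects how the file is read, not how it is stored.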
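The block arithmetic can be checked with a few lines of plain Java. This is only a sketch: the class name HdfsBlockMath is hypothetical, and the 128 MB block size and replication factor of 3 are assumed defaults that a cluster may override.

```java
public class HdfsBlockMath {
    /** Number of blocks needed for a file: ceil(fileBytes / blockBytes). */
    public static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long fileBytes = 10L * 1024 * 1024 * 1024; // 10 GB
        long blockBytes = 128L * 1024 * 1024;      // assumed default block size
        int replication = 3;                       // assumed default replication

        long blocks = blockCount(fileBytes, blockBytes);
        System.out.println("blocks = " + blocks);                       // blocks = 80
        System.out.println("block replicas = " + blocks * replication); // block replicas = 240
    }
}
```

The count is the same whether or not the file is compressed, because HDFS cuts blocks by byte offset without looking at the content.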
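MapReduce decompresses such a file with GzipCodec. The number of map tasks equals the number of input splits, which FileInputFormat.getSplits computes. For a splittable file, it derives minSize, maxSize, and splitSize (from the block size), then cuts splits of splitSize for as long as the remaining bytes divided by splitSize exceed the SPLIT_SLOP threshold (1.1); whatever remains forms the final split. For a gzip file, an input format such as TextInputFormat reports the file as not splittable (isSplitable returns false), so the whole file becomes one split and a single map task reads all of its blocks.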
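The split arithmetic described above can be tried without a cluster. The following is a minimal sketch in plain Java, not Hadoop's code: the class SplitSimulation and the method splitLengths are hypothetical names, and minSize = 1 and maxSize = Long.MAX_VALUE are assumed to match an unconfigured job.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSimulation {
    static final double SPLIT_SLOP = 1.1; // threshold named in the text

    /** Clamp the block size between minSize and maxSize. */
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Lengths of the splits produced for one file (hypothetical helper). */
    public static List<Long> splitLengths(long length, long blockSize,
                                          long minSize, long maxSize,
                                          boolean splittable) {
        List<Long> splits = new ArrayList<>();
        if (length == 0 || !splittable) {
            splits.add(length); // one split covering the whole (possibly empty) file
            return splits;
        }
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long remaining = length;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(remaining); // tail, at most 1.1 * splitSize
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long tenGb = 10L * 1024 * mb;
        long block = 128 * mb;
        // Splittable 10 GB file: one split per block.
        System.out.println(splitLengths(tenGb, block, 1, Long.MAX_VALUE, true).size());  // 80
        // Same file as gzip: a single split, hence a single map task.
        System.out.println(splitLengths(tenGb, block, 1, Long.MAX_VALUE, false).size()); // 1
        // SPLIT_SLOP in action: 138 MB stays whole, 150 MB is cut in two.
        System.out.println(splitLengths(138 * mb, block, 1, Long.MAX_VALUE, true).size()); // 1
        System.out.println(splitLengths(150 * mb, block, 1, Long.MAX_VALUE, true).size()); // 2
    }
}
```

The 138 MB case shows why the threshold exists: a tail of 10 MB is folded into the preceding split instead of getting a map task of its own.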
The corresponding logic in Hadoop's FileInputFormat.getSplits is:

    public List<InputSplit> getSplits(JobContext job) throws IOException {
        long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
        long maxSize = getMaxSplitSize(job);
        List<InputSplit> splits = new ArrayList<>();
        List<FileStatus> files = listStatus(job);
        for (FileStatus file : files) {
            Path path = file.getPath();
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            long length = file.getLen();
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
            if ((length != 0) && isSplitable(job, path)) {
                // Splittable file: cut splits of splitSize until the tail is small enough.
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);
                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts()));
                    bytesRemaining -= splitSize;
                }
                // The tail (at most 1.1 * splitSize) becomes the last split.
                if (bytesRemaining != 0) {
                    splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkLocations.length - 1].getHosts()));
                }
            } else if (length != 0) {
                // Non-splittable file (e.g. gzip): one split covering the whole file.
                splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
            } else {
                // Empty file: a zero-length split with no preferred hosts.
                splits.add(new FileSplit(path, 0, length, new String[0]));
            }
        }
        job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
        LOG.debug("Total # of splits: " + splits.size());
        return splits;
    }

The helper method to compute split size is:
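    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

With default settings this evaluates to the block size. For splittable text input, a record may span two splits. The map task that owns the split where the record starts reads past the end of its split, into the next block if necessary, until the record is complete. The task for the following split skips everything up to the first record boundary in its split, so the record is processed exactly once.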
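The boundary convention for splittable text input can be illustrated with a small in-memory reader. This is a simplified sketch modelled on the behaviour of Hadoop's LineRecordReader, not its actual code: the class SplitLineReader and its readSplit method are hypothetical, lines are assumed to end with '\n', and the "file" is just a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitLineReader {
    /** Returns the lines owned by the split [start, start + length). */
    public static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: discard everything up to and including the
            // next newline. The reader of the previous split emits that line.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;
        }
        List<String> records = new ArrayList<>();
        // Keep reading while the line STARTS at or before the split end,
        // even if it finishes beyond it.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step over the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // Three 10-byte splits over a 26-byte input.
        System.out.println(readSplit(data, 0, 10));  // [alpha, bravo]
        System.out.println(readSplit(data, 10, 10)); // [charlie, delta]
        System.out.println(readSplit(data, 20, 6));  // []
    }
}
```

Here "bravo" straddles the first boundary and is finished by the first reader, while the second reader skips its tail. None of this applies to a gzip file, which arrives as a single split and is read front to back by one task.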
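In Spark, a non‑splittable gzip file is read by a single task per file, so a 10 GB file is decompressed on one core while the rest of the cluster sits idle. The better options are to split the data into smaller files beforehand or to store it in a splittable compression format such as bzip2. Calling repartition after reading spreads the later stages across the cluster, but the initial read is still done by one task and the shuffle adds overhead.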
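One way to pre-split the data is to re-compress a large gzip text file into several smaller gzip files before handing it to Spark or MapReduce. The following is a minimal sketch using only the JDK's java.util.zip classes. The class GzipChunker, the part-file naming, and the lines-per-part parameter are hypothetical choices for illustration, and the input is assumed to be UTF-8 text with one record per line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipChunker {
    /** Re-compresses one gzip text file into parts of at most linesPerPart lines. */
    public static List<Path> split(Path input, Path outDir, int linesPerPart) throws IOException {
        Files.createDirectories(outDir);
        List<Path> parts = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(input)), StandardCharsets.UTF_8))) {
            BufferedWriter writer = null;
            int linesInPart = 0;
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (writer == null || linesInPart == linesPerPart) {
                        if (writer != null) writer.close(); // finishes the current gzip part
                        Path part = outDir.resolve(String.format("part-%05d.gz", parts.size()));
                        parts.add(part);
                        writer = new BufferedWriter(new OutputStreamWriter(
                                new GZIPOutputStream(Files.newOutputStream(part)),
                                StandardCharsets.UTF_8));
                        linesInPart = 0;
                    }
                    writer.write(line);
                    writer.write('\n');
                    linesInPart++;
                }
            } finally {
                if (writer != null) writer.close();
            }
        }
        return parts;
    }

    /** Reads all lines of a gzip text file back (used to check the parts). */
    public static List<String> readLines(Path gz) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gz)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GzipChunker <input.gz> <outputDir> <linesPerPart>
        List<Path> parts = split(Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
        System.out.println("wrote " + parts.size() + " part files");
    }
}
```

Each part is an independent gzip file, so a job reading the output directory gets one task per part instead of one task in total. The split itself is still a single sequential pass over the original file, which is the unavoidable cost of the format.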
Therefore, gzip files are best pre‑processed into smaller chunks or converted to a splittable format before processing with Spark or Hadoop.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
