Big Data Technology & Architecture
Sep 23, 2021 · Big Data
Handling Non‑Splittable gzip Files in Hadoop and Spark: MapReduce Splits and Performance Considerations
This article explains how a 10 GB gzip file is stored and processed on HDFS, details how MapReduce calculates input splits for a file compressed with GzipCodec, and discusses why Spark reads such a non‑splittable file with a single task. It recommends splitting the file or converting it to a splittable format for better performance.
Data Splits · Gzip · Hadoop
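The gap between how the file is stored and how it is processed can be illustrated with a little arithmetic. The sketch below is a simplified model of the split calculation, not Hadoop's actual code: it assumes a 128 MB HDFS block size (the summary does not state one), and the function names `hdfs_block_count` and `input_split_count` are hypothetical helpers introduced only for this illustration. It follows the general shape of FileInputFormat's logic, where a non-splittable file yields a single split regardless of how many blocks it occupies.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

# FileInputFormat lets the last split be up to 10% larger than the split size
# instead of creating a tiny trailing split.
SPLIT_SLOP = 1.1


def hdfs_block_count(file_size, block_size):
    """Number of HDFS blocks used to store a file (storage view)."""
    return math.ceil(file_size / block_size)


def input_split_count(file_size, block_size, splittable,
                      min_size=1, max_size=2 ** 63 - 1):
    """Simplified model of how many input splits one non-empty file yields.

    Assumption: this mirrors only the general shape of the calculation and
    ignores details such as block locations and empty files.
    """
    if not splittable:
        # A gzip stream can only be decompressed from its beginning,
        # so the whole file becomes one split (one map task).
        return 1
    split_size = max(min_size, min(max_size, block_size))
    splits = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    size, block = 10 * GB, 128 * MB  # 128 MB block size is an assumption
    print("HDFS blocks:", hdfs_block_count(size, block))
    print("splits if splittable:", input_split_count(size, block, True))
    print("splits for gzip:", input_split_count(size, block, False))
```

Under these assumptions the file occupies 80 blocks on HDFS but is handed to a single task, which then has to pull most of those blocks over the network. Splitting the input into several smaller gzip files, or converting it to a splittable format, restores the parallelism.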
