Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions
This article explains the small‑file problem in Spark SQL on HDFS and its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing, and adaptive query execution to control file count.
In large‑scale data processing, a massive number of tiny files stored on HDFS can severely degrade cluster scalability and performance. Each file’s metadata occupies roughly 150 bytes of NameNode memory, so 10 million small files alone cost about 1.5 GB of heap for metadata; frequent metadata lookups increase read/write latency, and on the Spark side the driver must track excessive task and partition metadata.
When using Spark SQL to write data, small files are produced mainly by two factors: (1) the source data already contains many small files, and (2) dynamic partition inserts without proper shuffling. In the worst case, the number of generated files equals the number of Spark tasks (M) times the number of dynamic partitions (N), i.e., M × N files, potentially reaching millions of files for a 1 TB TPC‑DS store_sales dataset.
use tpcds_1t_parquet;
INSERT overwrite table store_sales partition (ss_sold_date_sk)
SELECT * FROM tpcds_1t_ext.et_store_sales;

The physical plan shows a large number of HiveTableScan and InsertIntoHiveTable operations, confirming the explosion of file count. Adjusting spark.sql.shuffle.partitions influences the number of output files: a high value creates many small files, while a low value reduces parallelism.
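The trade‑off around spark.sql.shuffle.partitions can be sketched as follows. The concrete values are illustrative assumptions, not recommendations, and the setting only takes effect when the query plan actually contains a shuffle:

```sql
-- With a shuffle in the plan, each of the (up to) 2000 reduce tasks may
-- write a file into every dynamic partition it touches: worst case 2000 × N files.
SET spark.sql.shuffle.partitions=2000;
INSERT OVERWRITE TABLE store_sales PARTITION (ss_sold_date_sk)
SELECT * FROM tpcds_1t_ext.et_store_sales;

-- A much lower value caps the file count, but also caps write parallelism,
-- so a single task may have to absorb far more data.
SET spark.sql.shuffle.partitions=10;
```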
To mitigate the issue, the article recommends several strategies:
Shuffle data by the partition column (e.g., DISTRIBUTE BY ss_sold_date_sk) so each partition is written by a single task, producing roughly N files.
Combine partition‑based distribution with a random bucket (e.g., CAST(rand()*5 AS int)) to balance file count and data skew.
Separate skewed keys (such as null partition values) into distinct INSERT statements and apply random bucketing only to the skewed subset.
Enable Spark Adaptive Query Execution (AQE) with spark.sql.adaptive.enabled=true and optionally set spark.sql.adaptive.shuffle.targetPostShuffleInputSize to align with HDFS block size.
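The skew‑separation and AQE strategies above can be sketched in SQL as follows. The null‑key split and the bucket count of 5 mirror the article’s examples; the 128 MB target size and the assumption that dynamic‑partition overwrite only replaces the partitions present in each statement’s output are illustrative, so verify them against your Spark version:

```sql
-- 1) Write the non-skewed partitions, one reduce task per partition.
INSERT OVERWRITE TABLE store_sales PARTITION (ss_sold_date_sk)
SELECT * FROM tpcds_1t_ext.et_store_sales
WHERE ss_sold_date_sk IS NOT NULL
DISTRIBUTE BY ss_sold_date_sk;

-- 2) Handle the skewed null partition separately, spreading it across a few
--    random buckets so no single task has to write all of the skewed data.
INSERT OVERWRITE TABLE store_sales PARTITION (ss_sold_date_sk)
SELECT * FROM tpcds_1t_ext.et_store_sales
WHERE ss_sold_date_sk IS NULL
DISTRIBUTE BY CAST(rand()*5 AS int);

-- 3) Alternatively, let AQE coalesce small shuffle partitions; aligning the
--    target with the HDFS block size (e.g., 128 MB) keeps output files well sized.
SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.shuffle.targetPostShuffleInputSize=134217728;
```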
use tpcds_1t_parquet;
INSERT overwrite table store_sales partition (ss_sold_date_sk)
SELECT * FROM tpcds_1t_ext.et_store_sales
DISTRIBUTE BY ss_sold_date_sk, CAST(rand()*5 AS int);

Experimental results on the NetEase YouShu big‑data platform show that using the combined distribution reduces the total file count dramatically (e.g., from millions to ~1,800 files) while keeping data skew manageable.
In summary, controlling small files in pure‑SQL Spark workloads involves careful partition‑key selection, optional random bucketing, and leveraging adaptive execution features to let Spark coalesce output files automatically.
Author Bio: Yan Qing, senior engineer at NetEase, Apache Spark contributor (SQL/Core), founder of Kyuubi and spark‑authorizer projects, and Apache Submarine committer focusing on machine‑learning infrastructure.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies