Why Small Files Are a Problem in Big Data and How Delta Lake Compaction Solves It
This article examines the root causes and performance impact of massive small-file proliferation in traditional data warehouses, explains why HDFS metadata limits scalability, and details how a custom compaction process built on Delta Lake can safely merge these files for append-only tables without disrupting reads or writes.
The article discusses the long‑standing issue of numerous small files in traditional data warehouses and the reasons they arise, such as fine‑grained partitioning, continuous incremental writes, and multiple storage states during updates.
It explains why a large number of files becomes a bottleneck: HDFS keeps all file metadata in the memory of a single node (the NameNode), so the file count a cluster can hold is bounded by that node's memory, and later federation attempts did not remove this scalability limit.
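A back-of-the-envelope calculation makes the metadata pressure concrete. The sketch below is not from the article. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block), a 128 MiB block size, and that every file fits in one block. The function names are hypothetical.

```python
# Rule-of-thumb assumption (not from the article): each namespace object
# (a file entry or a block entry) costs roughly 150 bytes of NameNode heap.
BYTES_PER_OBJECT = 150

MiB = 1024 ** 2
TiB = 1024 ** 4
BLOCK_SIZE = 128 * MiB  # assumed HDFS block size


def files_needed(total_bytes: int, file_size_bytes: int) -> int:
    """Number of files needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // file_size_bytes)


def namenode_heap_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimated heap for the file entries plus their block entries."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# The same 10 TiB of data, written as 1 MiB files versus block-sized files.
small_files = files_needed(10 * TiB, 1 * MiB)
large_files = files_needed(10 * TiB, BLOCK_SIZE)

small_heap = namenode_heap_bytes(small_files)
large_heap = namenode_heap_bytes(large_files)

print(f"1 MiB files:   {small_files:>10,} files, ~{small_heap / MiB:,.0f} MiB of heap")
print(f"128 MiB files: {large_files:>10,} files, ~{large_heap / MiB:,.0f} MiB of heap")
```

Under these assumptions the data volume is identical, yet the small-file layout needs 128 times as much metadata memory, which is the kind of gap compaction is meant to close.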
The author describes how Delta Lake can address the small‑file problem through compaction, noting that while Delta currently lacks built‑in compaction, its extensible API allows custom implementations that avoid affecting ongoing reads and writes.
A specific Delta‑based compaction implementation is outlined, with the constraint that tables containing update/delete (upsert) operations cannot be compacted; only append‑only tables are safe because their existing files never change.
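The append-only constraint can be checked before compaction is attempted. The following is a minimal sketch on a simplified, hypothetical in-memory model of a transaction log: each commit is a list of actions shaped loosely like Delta's `add` and `remove` actions, and a `remove` with `dataChange` set to false is assumed to come from an earlier compaction rather than from an update or delete. It is not the article's implementation and does not read a real Delta log.

```python
from typing import Dict, List

# Hypothetical simplified model: one commit = a list of actions, each action a
# dict with a single key, "add" or "remove", loosely following Delta's log shape.
Action = Dict[str, dict]


def is_append_only(commits: List[List[Action]]) -> bool:
    """Return True if no commit removed a file as part of a data change.

    Assumption: removes written by earlier compactions carry dataChange=False,
    while removes caused by update/delete/upsert carry dataChange=True.
    A missing flag is treated as a data change, to stay on the safe side.
    """
    for actions in commits:
        for action in actions:
            remove = action.get("remove")
            if remove is not None and remove.get("dataChange", True):
                return False
    return True


appends_only = [
    [{"add": {"path": "d=1/a.parquet", "dataChange": True}}],
    [{"add": {"path": "d=1/b.parquet", "dataChange": True}}],
]
with_update = appends_only + [
    [
        {"remove": {"path": "d=1/a.parquet", "dataChange": True}},
        {"add": {"path": "d=1/a2.parquet", "dataChange": True}},
    ]
]

print(is_append_only(appends_only))
print(is_append_only(with_update))
```

A compaction job built this way would refuse to run on the second table, since an update may rewrite files that the job is in the middle of merging.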
The compaction workflow consists of reading data up to a target version, physically deleting files marked for removal, merging added files per partition into larger files, and finally acquiring a transaction and committing the changes.
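The four steps can be sketched end to end on a toy model. Everything here is hypothetical and in-memory: `storage` stands in for the file system, `log` for the transaction log, and `compact` and the file-naming scheme are illustrative names rather than Delta Lake APIs. The sketch assumes an append-only table, so files at or below the target version never change while they are being merged.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical stand-ins: `storage` maps file path -> rows, and `log` is a list
# of commits (index = version), each commit being a list of add/remove actions.
Storage = Dict[str, List[dict]]
Log = List[List[dict]]


def compact(storage: Storage, log: Log, target_version: int) -> int:
    """Compact an append-only table as of target_version; return the latest version."""
    # 1. Replay the log up to the target version.
    live: Dict[str, str] = {}  # path -> partition
    removed: List[str] = []
    for actions in log[: target_version + 1]:
        for action in actions:
            if "add" in action:
                live[action["add"]["path"]] = action["add"]["partition"]
            elif "remove" in action:
                path = action["remove"]["path"]
                live.pop(path, None)
                removed.append(path)

    # 2. Physically delete files that the log already marks as removed.
    for path in removed:
        storage.pop(path, None)

    # 3. Merge the live files of each partition into one larger file.
    by_partition: Dict[str, List[str]] = defaultdict(list)
    for path, partition in live.items():
        by_partition[partition].append(path)

    new_actions: List[dict] = []
    for partition, paths in sorted(by_partition.items()):
        if len(paths) < 2:
            continue  # a single file has nothing to merge with
        paths = sorted(paths)
        merged = f"{partition}/compacted-v{target_version}.parquet"
        storage[merged] = [row for p in paths for row in storage[p]]
        new_actions.append(
            {"add": {"path": merged, "partition": partition, "dataChange": False}}
        )
        new_actions.extend({"remove": {"path": p, "dataChange": False}} for p in paths)

    # 4. Only now take the "transaction": publish everything as one commit.
    if new_actions:
        log.append(new_actions)
    return len(log) - 1


storage = {"d=1/a": [{"x": 1}], "d=1/b": [{"x": 2}], "d=2/c": [{"x": 3}]}
log = [
    [{"add": {"path": "d=1/a", "partition": "d=1"}}],
    [
        {"add": {"path": "d=1/b", "partition": "d=1"}},
        {"add": {"path": "d=2/c", "partition": "d=2"}},
    ],
]
print(compact(storage, log, target_version=1), sorted(storage))
```

Two design points are worth noting in this sketch. The merged files are written before the commit, so readers keep seeing the old small files until the single commit swaps them out. The old files are only marked as removed in that commit; they are physically deleted by step 2 of the next compaction run, which also means readers pinned to an older version would lose access to them at that point.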
The author, William Zhu, a senior data architect with 11 years of R&D experience, concludes with a friendly request for likes, shares, and follows.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
