Tagged articles

13 articles

Page 1 of 1

Nov 3, 2025 · Big Data

Taming Small Files in Paimon: Proven Tuning Strategies for Better Performance

This article explains how small‑file issues in Paimon's streaming data lake architecture degrade system stability and query speed, and presents practical parameter‑tuning, table‑level settings, asynchronous compaction, and monitoring techniques to mitigate those problems.

Big DataData LakeFlink

0 likes · 7 min read

Taming Small Files in Paimon: Proven Tuning Strategies for Better Performance

Qunar Tech Salon

Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS, why they harm memory, compute and cluster load, outlines common sources such as data sources, streaming and dynamic partitioning, and provides detailed Hive and Spark solutions—including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions—to efficiently merge small files and improve job performance.

Big DataHiveMapReduce

0 likes · 23 min read

Understanding and Solving Small File Problems in Hive and Spark

Data Thinking Notes

May 10, 2023 · Big Data

Mastering Hive Small File Management: Strategies to Boost Performance

This article explains why tiny Hive files degrade storage and query efficiency, outlines how they are created, and presents practical Spark and Hive configuration techniques—including dynamic partitioning, AQE, Reduce tuning, and automated daily merge jobs—to effectively consolidate small files and improve overall data‑warehouse performance.

HiveSmall FilesSpark

0 likes · 10 min read

Mastering Hive Small File Management: Strategies to Boost Performance

Big Data Technology & Architecture

May 5, 2023 · Big Data

Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impacts of small file proliferation in Hive and Spark environments, and presents multiple mitigation techniques—including Spark 3 adaptive query execution, reducing reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling—to improve performance and storage efficiency.

Big DataHiveSmall Files

0 likes · 9 min read

Strategies for Handling Small Files in Hive and Spark

Big Data Technology & Architecture

Mar 4, 2022 · Big Data

Managing Small Files in Apache Hudi and Spark Optimization Guide

The article explains how Apache Hudi automatically manages file sizes to avoid small‑file issues, details key configuration parameters, provides a step‑by‑step example of file merging, and offers practical Spark tuning recommendations for optimal performance in data‑lake workloads.

Apache HudiBig DataData Lake

0 likes · 11 min read

Managing Small Files in Apache Hudi and Spark Optimization Guide

Big Data Technology & Architecture

Jul 19, 2021 · Big Data

Understanding Hadoop: MapReduce, HDFS, YARN, and Core Big Data Concepts

This article provides a comprehensive overview of Hadoop’s core components—including MapReduce programming model, HDFS storage architecture, and YARN resource management—while discussing common challenges like data skew and small files, and offering learning resources for aspiring big‑data engineers.

Data SkewHDFSHadoop

0 likes · 9 min read

Understanding Hadoop: MapReduce, HDFS, YARN, and Core Big Data Concepts

Big Data Technology Architecture

Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big DataHadoopSmall Files

0 likes · 12 min read

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

Big Data Technology & Architecture

Dec 9, 2020 · Big Data

Handling Small Files in Hive: Configuration, Compression, and File Format Optimization

The article explains why Hive tables generate many small files on HDFS, describes the performance impact on NameNode and MapReduce, and provides detailed configuration steps and compression techniques—including input and output file merging, various Hive file formats, and partition optimization—to efficiently manage storage and resource consumption in big‑data environments.

HadoopHiveSmall Files

0 likes · 19 min read

Handling Small Files in Hive: Configuration, Compression, and File Format Optimization

dbaplus Community

Feb 25, 2020 · Backend Development

How to Merge Small Files in Flink Checkpoints to Reduce HDFS Load

This article explains a small‑file‑merging technique for Apache Flink checkpoints that reuses FSDataOutputStreams to combine multiple state files into a single HDFS file, detailing design considerations such as concurrent checkpoint support, reference‑counted deletion, space amplification reduction, fault handling, compatibility, and observed production performance gains.

Apache FlinkCheckpointHDFS

0 likes · 13 min read

How to Merge Small Files in Flink Checkpoints to Reduce HDFS Load

Big Data Technology & Architecture

Jan 7, 2020 · Big Data

Why Small Files Are a Problem in Big Data and How Delta Lake Compaction Solves It

This article examines the root causes and performance impact of massive small-file proliferation in traditional data warehouses, explains why HDFS metadata limits scalability, and details how Delta Lake’s custom compaction process can safely merge these files for append-only tables without disrupting reads or writes.

Delta LakeHDFSSmall Files

0 likes · 5 min read

Why Small Files Are a Problem in Big Data and How Delta Lake Compaction Solves It

360 Quality & Efficiency

Jun 11, 2019 · Backend Development

NebulasFs: A Distributed High‑Availability Small‑File Storage System Developed by 360 Infrastructure Team

NebulasFs is a self‑developed distributed, highly available, and persistent storage system designed to efficiently store billions of small files, offering simple RESTful APIs, automatic request routing, multi‑tenant isolation, customizable replication, automated scaling, rebalancing, and fault‑tolerant replica recovery for large‑scale unstructured data workloads.

NebulasFsSmall Filescloud

0 likes · 8 min read

NebulasFs: A Distributed High‑Availability Small‑File Storage System Developed by 360 Infrastructure Team

360 Tech Engineering

Jun 11, 2019 · Databases

NebulasFs: A Distributed High‑Availability Small‑File Storage System

NebulasFs is a self‑developed distributed, highly available and durable storage system designed to efficiently store billions of small files by using a master‑datanode architecture, multi‑tenant isolation, customizable replication, automatic scaling, and automated replica repair, addressing the challenges of massive unstructured data generated by modern applications.

Cloud NativeNebulasFsReplication

0 likes · 7 min read

NebulasFs: A Distributed High‑Availability Small‑File Storage System

Big Data Technology & Architecture

Apr 12, 2019 · Big Data

Weekly Knowledge Summary: Yarn Resource Scheduler, Hadoop Rack Awareness, HDFS Data Flow, and Small File Solutions

This weekly note shares personal updates and a concise technical overview covering Yarn's resource scheduling, Hadoop's rack‑aware architecture, HDFS data flow, and practical solutions to the HDFS small‑file problem, along with links to further reading and upcoming work plans.

Big DataHDFSHadoop

0 likes · 5 min read

Weekly Knowledge Summary: Yarn Resource Scheduler, Hadoop Rack Awareness, HDFS Data Flow, and Small File Solutions