Tagged articles
Qunar Tech Salon
Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS and why such files strain memory, compute, and cluster load. It outlines common sources (upstream data sources, streaming ingestion, and dynamic partitioning) and provides detailed Hive and Spark solutions, including CombineHiveInputFormat, merge parameters, DISTRIBUTE BY, and custom Spark extensions, to merge small files and improve job performance.

Big Data · Hive · MapReduce
0 likes · 23 min read
Data Thinking Notes
May 10, 2023 · Big Data

Mastering Hive Small File Management: Strategies to Boost Performance

This article explains why tiny Hive files degrade storage and query efficiency, outlines how they are created, and presents practical Spark and Hive configuration techniques—including dynamic partitioning, AQE, Reduce tuning, and automated daily merge jobs—to effectively consolidate small files and improve overall data‑warehouse performance.

Hive · Small Files · Spark
0 likes · 10 min read
Big Data Technology & Architecture
May 5, 2023 · Big Data

Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impacts of small file proliferation in Hive and Spark environments, and presents multiple mitigation techniques—including Spark 3 adaptive query execution, reducing reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling—to improve performance and storage efficiency.

Big Data · Hive · Small Files
0 likes · 9 min read
Big Data Technology Architecture
Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big Data · Hadoop · Small Files
0 likes · 12 min read
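
The file-count effect named in the summary above can be illustrated with a small simulation. This is a minimal sketch, not code from the article: it assumes a simplified model in which every write task produces one file per partition it holds rows for, and the names `count_output_files`, `stable_hash`, and the `dt` partition column are hypothetical.

```python
import random
import zlib


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return zlib.crc32(key.encode("utf-8"))


def count_output_files(rows, num_tasks, assign_task):
    """Count files under the assumed model: one file per (task, partition) pair."""
    files = set()
    for row in rows:
        task = assign_task(row) % num_tasks
        files.add((task, row["dt"]))
    return len(files)


# 10 partitions ("dt" values) with 1000 rows each, written by 200 tasks.
rows = [{"dt": f"day-{p:02d}", "id": i} for p in range(10) for i in range(1000)]
num_tasks = 200
buckets = 4
rng = random.Random(0)

# No redistribution: rows of every partition are spread across all tasks,
# so each task writes a file into each partition.
scattered = count_output_files(rows, num_tasks, lambda r: r["id"])

# Distribute by the partition column: each partition lands in a single task.
by_partition = count_output_files(rows, num_tasks, lambda r: stable_hash(r["dt"]))

# Distribute by partition column plus a random bucket: at most `buckets`
# files per partition, which also limits skew on large partitions.
bucketed = count_output_files(
    rows,
    num_tasks,
    lambda r: stable_hash(f"{r['dt']}#{rng.randrange(buckets)}"),
)

print(scattered, by_partition, bucketed)
```

Under this model the unshuffled write yields tasks × partitions files, distributing by the partition column yields one file per partition, and random bucketing caps the count at `buckets` files per partition.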
Big Data Technology & Architecture
Dec 9, 2020 · Big Data

Handling Small Files in Hive: Configuration, Compression, and File Format Optimization

The article explains why Hive tables generate many small files on HDFS, describes the performance impact on NameNode and MapReduce, and provides detailed configuration steps and compression techniques—including input and output file merging, various Hive file formats, and partition optimization—to efficiently manage storage and resource consumption in big‑data environments.

Hadoop · Hive · Small Files
0 likes · 19 min read
dbaplus Community
Feb 25, 2020 · Backend Development

How to Merge Small Files in Flink Checkpoints to Reduce HDFS Load

This article explains a small‑file‑merging technique for Apache Flink checkpoints that reuses FSDataOutputStreams to combine multiple state files into a single HDFS file, detailing design considerations such as concurrent checkpoint support, reference‑counted deletion, space amplification reduction, fault handling, compatibility, and observed production performance gains.

Apache Flink · Checkpoint · HDFS
0 likes · 13 min read
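
The reference-counted deletion mentioned in the summary above can be sketched as follows. This is an illustrative in-memory model only, not Flink's API and not the article's implementation: the class `SharedCheckpointFile` and its methods are hypothetical, and a `bytearray` stands in for the shared HDFS file.

```python
class SharedCheckpointFile:
    """Several state chunks share one physical file; the file is deleted
    only after every chunk that references it has been released."""

    def __init__(self, path: str):
        self.path = path
        self.data = bytearray()  # stand-in for the single shared file
        self.ref_count = 0
        self.deleted = False

    def append(self, payload: bytes):
        """Append one state chunk and return an (offset, length) handle."""
        offset = len(self.data)
        self.data.extend(payload)
        self.ref_count += 1
        return (offset, len(payload))

    def read(self, handle):
        """Read a chunk back using the handle returned by append()."""
        offset, length = handle
        return bytes(self.data[offset:offset + length])

    def release(self):
        """Drop one reference; mark the file deleted when none remain."""
        if self.ref_count == 0:
            raise ValueError("release() called more often than append()")
        self.ref_count -= 1
        if self.ref_count == 0:
            self.deleted = True


shared = SharedCheckpointFile("/checkpoints/chk-merged")  # hypothetical path
h1 = shared.append(b"operator-state-a")
h2 = shared.append(b"operator-state-b")

shared.release()  # first chunk discarded: file must survive for the second
still_alive = not shared.deleted
shared.release()  # last reference gone: the shared file can now be removed
```

Storing an offset and length per chunk is what lets many small state files collapse into one physical file while each chunk stays independently readable.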
Big Data Technology & Architecture
Jan 7, 2020 · Big Data

Why Small Files Are a Problem in Big Data and How Delta Lake Compaction Solves It

This article examines the root causes and performance impact of massive small-file proliferation in traditional data warehouses, explains why HDFS metadata limits scalability, and details how Delta Lake’s custom compaction process can safely merge these files for append-only tables without disrupting reads or writes.

Delta Lake · HDFS · Small Files
0 likes · 5 min read
360 Quality & Efficiency
Jun 11, 2019 · Backend Development

NebulasFs: A Distributed High‑Availability Small‑File Storage System Developed by 360 Infrastructure Team

NebulasFs is a self‑developed distributed, highly available, and persistent storage system designed to efficiently store billions of small files, offering simple RESTful APIs, automatic request routing, multi‑tenant isolation, customizable replication, automated scaling, rebalancing, and fault‑tolerant replica recovery for large‑scale unstructured data workloads.

NebulasFs · Small Files · cloud
0 likes · 8 min read
360 Tech Engineering
Jun 11, 2019 · Databases

NebulasFs: A Distributed High‑Availability Small‑File Storage System

NebulasFs is a self‑developed distributed, highly available and durable storage system designed to efficiently store billions of small files by using a master‑datanode architecture, multi‑tenant isolation, customizable replication, automatic scaling, and automated replica repair, addressing the challenges of massive unstructured data generated by modern applications.

Cloud Native · NebulasFs · Replication
0 likes · 7 min read