Tagged articles
3 articles
Python Crawling & Data Mining
Sep 10, 2024 · Backend Development

Merging Files by Keyword with Python and Pandas

This article walks through a Python solution that selects files sharing specific keywords, pulls the numeric data from each file's second column, and concatenates the results horizontally with pandas. It includes code snippets and practical tips for automating such file-processing tasks.
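
The workflow summarized above can be sketched roughly as follows. This is a minimal illustration, not the article's own code: it assumes the keyword appears in the file names, that the inputs are CSV files with a header row, and that the function name `merge_second_columns` and its parameters are hypothetical.

```python
from pathlib import Path

import pandas as pd


def merge_second_columns(folder, keyword, pattern="*.csv"):
    """Concatenate the second column of every matching file side by side.

    Assumption: a file "matches" when `keyword` occurs in its file name.
    """
    columns = []
    for path in sorted(Path(folder).glob(pattern)):
        if keyword not in path.name:
            continue
        frame = pd.read_csv(path)
        # Take the second column by position; non-numeric entries become NaN.
        values = pd.to_numeric(frame.iloc[:, 1], errors="coerce")
        # Name each column after its source file so the result stays traceable.
        columns.append(values.rename(path.stem))
    if not columns:
        return pd.DataFrame()
    # axis=1 places each file's values in its own column (horizontal concat).
    return pd.concat(columns, axis=1)
```

Calling `merge_second_columns("data", "sales")` would return one DataFrame with a column per matching file, aligned on row position. Files of different lengths are padded with NaN.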

Automation · Python · data-processing
6 min read
Big Data Technology & Architecture
Jul 12, 2022 · Big Data

Analyzing Spark's Iceberg Data Reading Process and Small‑File Merging

This article explains how Spark reads data from Apache Iceberg tables: it parses snapshots and manifest files into DataFile objects, creates Batch and InputPartition objects, and uses readers to materialize InternalRows. It then demonstrates how Iceberg's RewriteDataFilesAction can merge tiny Parquet files into larger ones through Spark-driven tasks.

Big Data · Data Lake · Iceberg
17 min read
Big Data Technology & Architecture
Dec 27, 2020 · Big Data

Understanding and Solving the Small File Problem in Big Data Systems

This article examines the pervasive small-file issue in big-data environments, explains its impact on storage and processing performance, and presents a set of solutions (file merging, Hadoop archives, SequenceFiles, HBase, CombineFileInputFormat, and Spark/Flink strategies) to mitigate metadata overhead and improve I/O efficiency.
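
To make the file-merging idea concrete, here is a small sketch of how small files might be grouped into merge batches near a target size. It is an illustration under stated assumptions, not the article's method or any Hadoop, Spark, or Flink API: the function `plan_merge_batches`, its inputs, and the greedy in-order strategy are all hypothetical.

```python
def plan_merge_batches(file_sizes, target_bytes):
    """Group (name, size) pairs into batches of roughly target_bytes each.

    Greedy and order-preserving: files are never split, so a single file
    larger than the target ends up alone in its own batch.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch lists the files that would be rewritten into one larger file. Fewer, larger files mean fewer metadata entries for the file system to track, which is the overhead the summary refers to.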

Flink · Hadoop · NameNode
41 min read