
Understanding Apache Iceberg File Storage Format and Write Processes in Spark and Flink

This article explains the Apache Iceberg file storage format and its metadata hierarchy, and shows how Spark and Flink write data to Iceberg tables, covering code examples, manifest handling, snapshot management, and the commit process.

Big Data Technology & Architecture

Apache Iceberg is a modern table format for data lakes that abstracts storage and supports multiple file formats such as Parquet, ORC, and Avro. It organizes data into a metadata layer (VersionMetadata, Snapshot, Manifest) and a data storage layer, enabling features like time travel and serializable isolation.

The metadata layout is illustrated by a directory structure in which metadata/ holds the version files (e.g., v1.metadata.json) and the snapshot manifest lists (e.g., snap-2080639593951710914-1-1f8279fb-5b2d-464c-af12-d9d6fbe9b5ae.avro), while data/ stores the partitioned data files (e.g., id=1/00000-0-04ae60eb-...-00001.parquet). The VersionMetadata JSON records the schema, partition specs, snapshot list, and table properties.

The VersionMetadata JSON includes fields like "format-version", "schema" (listing columns and types), "partition-spec", and "snapshots" (each with snapshot-id, timestamp-ms, and summary of operations).
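To make the role of these fields concrete, here is a minimal Python sketch that reads a trimmed metadata document and resolves a time-travel request by picking the latest snapshot at or before a given timestamp. The JSON values are invented for illustration, and the helper names (column_names, snapshot_as_of) are hypothetical; this is not Iceberg's own reader.

```python
import json

# Hypothetical, heavily trimmed metadata document. Only the field names
# mentioned in the text are modeled; all values are invented.
METADATA_JSON = """
{
  "format-version": 1,
  "schema": {"type": "struct", "fields": [
    {"id": 1, "name": "id", "required": false, "type": "int"},
    {"id": 2, "name": "name", "required": false, "type": "string"}
  ]},
  "partition-spec": [
    {"name": "id", "transform": "identity", "source-id": 1, "field-id": 1000}
  ],
  "snapshots": [
    {"snapshot-id": 101, "timestamp-ms": 1700000000000,
     "summary": {"operation": "append"}},
    {"snapshot-id": 102, "timestamp-ms": 1700000600000,
     "summary": {"operation": "overwrite"}}
  ]
}
"""


def column_names(meta):
    """Return the column names listed in the table schema."""
    return [field["name"] for field in meta["schema"]["fields"]]


def snapshot_as_of(meta, ts_ms):
    """Return the newest snapshot committed at or before ts_ms, or None."""
    candidates = [s for s in meta["snapshots"] if s["timestamp-ms"] <= ts_ms]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s["timestamp-ms"])


meta = json.loads(METADATA_JSON)
print(column_names(meta))
print(snapshot_as_of(meta, 1700000500000)["snapshot-id"])
```

Because every snapshot stays listed in the metadata until it is expired, a reader can select an older snapshot without touching the data files written after it.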

Each Snapshot references a manifest list, which points to one or more manifest files; each manifest lists data files together with their partition values and statistics (record counts, lower and upper value bounds). This information allows Iceberg to prune files during query planning, before any data file is opened.
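The pruning idea can be sketched in a few lines of Python. The DataFileEntry class and the prune function below are hypothetical stand-ins for manifest entries and the planner; they only model a single integer column with lower and upper bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    """Toy stand-in for one manifest entry (not Iceberg's actual class)."""
    path: str
    record_count: int
    lower: int  # smallest value of the hypothetical column "id" in the file
    upper: int  # largest value of the hypothetical column "id" in the file


def prune(entries, lo, hi):
    """Keep only files whose [lower, upper] range can overlap lo <= id <= hi."""
    return [e for e in entries if e.upper >= lo and e.lower <= hi]


manifest = [
    DataFileEntry("data/a.parquet", 10, 1, 10),
    DataFileEntry("data/b.parquet", 10, 11, 20),
    DataFileEntry("data/c.parquet", 10, 21, 30),
]

# Only files that may contain ids in the range 12..15 need to be scanned.
print([e.path for e in prune(manifest, 12, 15)])
```

The check is conservative: a file that passes may still contain no matching rows, but a file that is pruned is guaranteed to contain none.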

Writing with Spark involves two steps: executors write data files using a FileAppender for the chosen format, while the driver collects the resulting file statistics and creates the manifest files. The Scala example in the original article defines a schema and a partition spec, creates the table via HadoopTables, and writes a DataFrame with df.write.format("iceberg").mode("overwrite").save(...).

During the executor phase, data is written to partition directories derived from transform functions, and statistics (e.g., added-data-files, added-records) are gathered for the driver.
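The division of labor described above can be modeled as follows. This is a toy sketch under stated assumptions: only the identity transform on a column named id is shown, the file naming (task-0.parquet) is invented, and executor_write and driver_summary are hypothetical helpers rather than Spark or Iceberg functions.

```python
from collections import defaultdict


def identity(value):
    """Identity partition transform: the partition value is the column value."""
    return value


def partition_dir(field_name, transform, value):
    """Build a partition directory name such as 'id=1'."""
    return f"{field_name}={transform(value)}"


def executor_write(rows, task_id=0):
    """Group rows by partition directory and report one file per partition."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_dir("id", identity, row["id"])].append(row)
    return [
        {"path": f"data/{directory}/task-{task_id}.parquet",
         "record_count": len(group)}
        for directory, group in sorted(groups.items())
    ]


def driver_summary(files):
    """Aggregate per-file results into snapshot summary counters.

    Summary values are kept as strings, since a snapshot summary is a
    string-to-string map.
    """
    return {
        "added-data-files": str(len(files)),
        "added-records": str(sum(f["record_count"] for f in files)),
    }


rows = [{"id": 1, "name": "a"}, {"id": 1, "name": "b"}, {"id": 2, "name": "c"}]
files = executor_write(rows)
print(files)
print(driver_summary(files))
```

In a real job each executor task would produce its own list of files, and the driver would aggregate the lists from all tasks before committing.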

Writing with Flink follows a similar pattern. The call FlinkSink.forRowData(...).tableLoader(...).build() creates an IcebergStreamWriter, which uses a TaskWriterFactory to produce TaskWriter instances (partitioned or unpartitioned). Records are written in processElement, and on each checkpoint the writer emits a WriteResult, which is serialized into a manifest via FlinkManifestUtil.writeCompletedFiles.

On the committer side (the counterpart of the Spark driver), IcebergFilesCommitter restores its state on recovery, writes a manifest for each checkpoint, and on checkpoint completion merges the pending manifests and commits a new snapshot using either AppendFiles (if there are no deletes) or RowDelta (if deletes are present). The commit updates the table metadata and cleans up the temporary manifest files.
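The checkpoint-driven commit flow can be approximated with a small Python model. ToyCommitter and its method names are hypothetical; they mirror the roles described above but simplify the real operator, which persists manifests to files and keeps its pending state in Flink state. In particular, this sketch merges all pending results into a single commit, which is an assumption made for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class WriteResult:
    """Toy version of what a writer emits at a checkpoint."""
    data_files: list
    delete_files: list = field(default_factory=list)


class ToyCommitter:
    """Simplified model of a checkpoint-driven committer (not Iceberg code)."""

    def __init__(self):
        self.pending = {}        # checkpoint id -> list of WriteResult
        self.max_committed = -1  # highest checkpoint id already committed
        self.commits = []        # (operation, data files, delete files)

    def snapshot_state(self, checkpoint_id, results):
        # Stage the results of this checkpoint; nothing is visible to readers yet.
        self.pending[checkpoint_id] = list(results)

    def notify_checkpoint_complete(self, checkpoint_id):
        # Ignore late or duplicate notifications for already committed checkpoints.
        if checkpoint_id <= self.max_committed:
            return
        data, deletes = [], []
        for cid in sorted(c for c in self.pending if c <= checkpoint_id):
            for result in self.pending.pop(cid):
                data.extend(result.data_files)
                deletes.extend(result.delete_files)
        if data or deletes:
            # Append-only commits suffice unless delete files are present.
            operation = "RowDelta" if deletes else "AppendFiles"
            self.commits.append((operation, data, deletes))
        self.max_committed = checkpoint_id


committer = ToyCommitter()
committer.snapshot_state(1, [WriteResult(["a.parquet"])])
committer.snapshot_state(2, [WriteResult(["b.parquet"], ["d.parquet"])])
committer.notify_checkpoint_complete(1)
committer.notify_checkpoint_complete(2)
committer.notify_checkpoint_complete(2)  # duplicate notification, no new commit
print(committer.commits)
```

Tying commits to completed checkpoints is what lets the job avoid committing the same files twice after a failure: results that were staged but never committed are recovered from state and committed on the next completed checkpoint.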

Both Spark and Flink rely on Iceberg's snapshot mechanism. For a Hadoop-based table such as the one in the example, they read version-hint.text to locate the current metadata version, generate a new snapshot with an updated manifest list, and then write the next version file (v<N+1>.metadata.json) and update the hint file, which completes the write transaction.
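The final step can be sketched as an optimistic commit against a local directory. This is a simplified illustration, not Iceberg's implementation: read_version and commit are hypothetical helpers, and an exclusive file create stands in for whatever atomic operation the underlying file system provides.

```python
import json
import os
import tempfile


def read_version(metadata_dir):
    """Read the current version number from version-hint.text (0 if absent)."""
    hint = os.path.join(metadata_dir, "version-hint.text")
    if not os.path.exists(hint):
        return 0
    with open(hint) as f:
        return int(f.read().strip())


def commit(metadata_dir, base_version, new_metadata):
    """Try to publish base_version + 1; return False if another writer won."""
    next_version = base_version + 1
    target = os.path.join(metadata_dir, f"v{next_version}.metadata.json")
    try:
        # Mode "x" fails if the file already exists, which models a
        # concurrent writer having committed the same version first.
        with open(target, "x") as f:
            json.dump(new_metadata, f)
    except FileExistsError:
        return False
    with open(os.path.join(metadata_dir, "version-hint.text"), "w") as f:
        f.write(str(next_version))
    return True


with tempfile.TemporaryDirectory() as metadata_dir:
    base = read_version(metadata_dir)
    first = commit(metadata_dir, base, {"snapshots": [{"snapshot-id": 1}]})
    # A second writer that still holds the stale base version is rejected.
    stale = commit(metadata_dir, base, {"snapshots": [{"snapshot-id": 2}]})
    current = read_version(metadata_dir)
    print(first, stale, current)
```

A writer whose commit is rejected would typically re-read the current metadata, reapply its changes on top of it, and retry.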

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
