
Understanding Apache Iceberg: Design, Architecture, and Its Application at NetEase Cloud Music

This article explains Apache Iceberg’s table‑format design, compares it with Hive’s limitations, details its snapshot‑based architecture and metadata handling, and describes how NetEase Cloud Music leveraged Iceberg to dramatically improve large‑scale log processing performance and stability.

Big Data Technology Architecture

This article introduces Apache Iceberg from a new perspective, building on a previous quick‑start guide, and shares practical experiences of using Iceberg at NetEase Cloud Music.

Unlike traditional file formats such as Parquet or ORC, Iceberg is a table format that addresses Hive’s shortcomings—including unreliable updates, costly column renames, excessive partition counts, and fragmented metadata stored separately from files—by providing atomic, versioned table metadata.

Iceberg’s design goals are to become an open, language‑agnostic static data storage standard, offer strong extensibility and reliability, and solve storage availability issues through robust schema management, time‑travel, and multi‑version support.
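The multi-version idea can be sketched in plain Scala (a toy model for illustration, not Iceberg's actual classes): every commit produces a new immutable snapshot, and time travel is simply reading the file list of an older snapshot.

```scala
// Toy model of snapshot-based versioning: commits append immutable
// snapshots, so older versions of the table stay readable.
case class Snapshot(id: Long, timestampMs: Long, dataFiles: List[String])

class VersionedTable {
  private var snapshots: List[Snapshot] = Nil

  // A commit atomically adds a new snapshot and returns its id.
  def commit(files: List[String]): Long = {
    val next = Snapshot(snapshots.size + 1L, System.currentTimeMillis(), files)
    snapshots = snapshots :+ next
    next.id
  }

  // Reading the latest state.
  def currentFiles: List[String] = snapshots.last.dataFiles

  // "Time travel": read the file list as of an earlier snapshot.
  def filesAt(snapshotId: Long): List[String] =
    snapshots.find(_.id == snapshotId).map(_.dataFiles).getOrElse(Nil)
}
```

In the real format, the snapshot list lives in the table's metadata file and the catalog swaps in the new metadata atomically, which is what makes commits safe.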

Architecturally, each snapshot references a manifest list; each manifest records data files together with their partition values and column-level statistics. MVCC lets concurrent readers and writers proceed without interfering, and the rich file-level metadata enables predicate push-down and file-level pruning at query-planning time.
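The pruning idea can be sketched in plain Scala (the `FileEntry` type is hypothetical, not Iceberg's API): each manifest entry carries per-column min/max statistics, so files whose value range cannot satisfy the predicate are skipped without ever being opened.

```scala
// Hypothetical manifest entry: a data file plus min/max stats for one column.
case class FileEntry(path: String, minHour: Int, maxHour: Int)

// Keep only files whose [minHour, maxHour] range can contain the queried hour;
// everything else is pruned at planning time.
def pruneByHour(entries: Seq[FileEntry], hour: Int): Seq[FileEntry] =
  entries.filter(e => e.minHour <= hour && hour <= e.maxHour)
```

This is why globally sorted writes matter: sorting keeps each file's min/max range narrow, so far more files can be eliminated per query.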

At NetEase Cloud Music, daily user-behavior logs reach 25-30 TB across more than 110,000 files, putting severe pressure on the NameNode and stretching task-initialization times. By creating an Iceberg table via HadoopCatalog, partitioning it by hour and behavior type, and cleaning the logs before ingestion, initialization time dropped from 30-60 minutes to 5-10 minutes, greatly improving ETL speed and stability.

The Iceberg table consists of metadata and data directories; each metadata file represents a snapshot containing schema, task, and manifest information. Manifest‑list files aggregate manifest metadata, as illustrated by the following avro‑tools commands and JSON output:

java -jar avro-tools-1.9.2.jar tojson --pretty snap-8844883026140670978-1-0e32a3de-51d1-4641-9235-181c87a8a2f8.avro
{
  "manifest_path": "/user/da_music/out/.../metadata/0e32a3de-51d1-4641-9235-181c87a8a2f8-m0.avro",
  "manifest_length": 790541,
  ...
}

Writing data to Iceberg requires globally sorted files per partition; Spark settings such as increasing spark.driver.maxResultSize and tuning spark.sql.shuffle.partitions are essential. Example code for sorting and writing:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.expr

// Globally sort by partition and clustering columns, then overwrite the table.
uaDF.sort(expr("hour"), expr("group"), expr("action"), expr("logtime"))
  .write.format("iceberg")
  .option("write.parquet.row-group-size-bytes", 256 * 1024 * 1024) // 256 MB row groups
  .mode(SaveMode.Overwrite)
  .save(output)
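The driver and shuffle settings mentioned above can be applied when building the session. The values below are purely illustrative, not recommendations; tune them to your data volume and cluster size.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative configuration sketch for large sorted Iceberg writes.
val spark = SparkSession.builder()
  .appName("iceberg-ua-write")
  // The global sort samples data back to the driver to compute range
  // boundaries, which can exceed the default result-size limit.
  .config("spark.driver.maxResultSize", "4g")
  // More shuffle partitions keep each sort task's data manageable.
  .config("spark.sql.shuffle.partitions", "2000")
  .getOrCreate()
```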

Iceberg is file-format agnostic (it supports Avro, ORC, and Parquet) and shines with very large tables; future work includes file-merge (compaction) support to mitigate the small-file problem, and MERGE INTO capabilities, similar to Hudi and Delta Lake, for unified batch-stream warehousing.

References: official Iceberg site, Table Format video, Iceberg introduction videos, and Netflix’s Iceberg streaming case study.

Tags: Big Data, Data Lake, Metadata Management, Spark, Apache Iceberg, Table Format
Written by Big Data Technology Architecture, exploring open source big data and AI technologies.
