NetEase’s Data Lake Iceberg: Challenges, Core Principles, and Practical Implementation
This article examines the pain points of traditional data‑warehouse platforms, explains the core concepts and advantages of the Iceberg data‑lake table format, compares it with the Hive Metastore, reviews the current Iceberg community ecosystem, and details NetEase’s practical integrations with Hive, Impala, and Flink to improve ETL efficiency and support unified batch‑stream processing.
Introduction
NetEase data‑lake expert Fan Xinxin shares the motivations behind adopting Iceberg, starting from the limitations of their existing data‑warehouse platform and the need for a more efficient, reliable, and scalable solution.
Data‑Warehouse Platform Pain Points
Large offline jobs suffer unpredictable latency: massive data volumes and heavy NameNode request loads drag down ETL efficiency, and failed tasks add costly retries.
Update operations are unreliable: reads can fail when the partitions they cover are modified concurrently.
Schema changes are costly because they require full data rewrites.
Lambda architecture incurs high maintenance cost, duplicate pipelines, and NameNode pressure.
Iceberg Core Principles
Iceberg is an open‑source table format that provides a high‑level abstraction independent of any execution engine. Its key features include:
Schema definition supporting primitive and complex types.
Partitioning expressed through table columns, with partition values tracked in table metadata rather than directory paths, eliminating extra NameNode list calls.
File‑level metadata (statistics per data file) enabling more effective predicate push‑down.
ACID‑compliant read/write APIs with snapshot commits.
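The partition and statistics features above can be sketched as follows. This is an illustrative Python sketch, not the Iceberg API: the manifest structure, field names, and `day_transform`/`plan_files` helpers are hypothetical, chosen only to show how tracking partition values and per‑file min/max statistics in metadata lets a planner prune files without listing directories.

```python
from datetime import datetime

# Hypothetical sketch (not the Iceberg API): each data file is tracked in
# table metadata with its partition value (derived from a column via a
# transform) and per-file column statistics, so query planning prunes
# files from metadata alone -- no NameNode directory listings needed.

def day_transform(ts):
    """Iceberg-style 'day' transform: map a timestamp to a partition value."""
    return ts.strftime("%Y-%m-%d")

# Assumed manifest structure: file name, partition value, min/max stats.
manifest = [
    {"file": "data-001.parquet",
     "partition": day_transform(datetime(2020, 6, 1, 9)),
     "min_id": 1, "max_id": 500},
    {"file": "data-002.parquet",
     "partition": day_transform(datetime(2020, 6, 2, 9)),
     "min_id": 501, "max_id": 900},
]

def plan_files(manifest, day, wanted_id):
    """Prune by partition value, then by file-level min/max statistics."""
    return [m["file"] for m in manifest
            if m["partition"] == day and m["min_id"] <= wanted_id <= m["max_id"]]

print(plan_files(manifest, "2020-06-02", 600))  # ['data-002.parquet']
```

The second pruning step is what the Metastore cannot do: its statistics stop at table/partition granularity, while file-level min/max values let the planner skip individual files inside a partition.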
Comparison with Hive Metastore
Both provide comparable schema support.
Iceberg stores partition values in the table itself, while Metastore treats partitions as directory structures, leading to extra HDFS list operations.
Iceberg’s statistics are at file granularity, offering finer‑grained pruning than Metastore’s table/partition level stats.
Iceberg writes use snapshot commits, providing atomicity and enabling incremental reads; Metastore relies on add‑partition calls.
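The snapshot-commit point can be made concrete with a minimal sketch. This is not Iceberg's real implementation (which swaps a metadata-file pointer in a catalog); it only models the idea, with assumed names like `Table` and `commit`: the table state is an immutable snapshot, and a commit atomically replaces it only if the writer's base snapshot is still current (optimistic concurrency), so readers always see a complete snapshot and snapshot diffs enable incremental reads.

```python
import threading

# Minimal sketch of snapshot-style atomic commits (assumed structure, not
# Iceberg's implementation): the table points at an immutable snapshot; a
# commit swaps the pointer only if it still points at the writer's base
# snapshot, otherwise the writer must rebase and retry.

class Table:
    def __init__(self):
        self.current = {"id": 0, "files": []}   # immutable snapshot
        self._lock = threading.Lock()

    def commit(self, base, new_files):
        """Publish a new snapshot; reject (for retry) if base is stale."""
        with self._lock:
            if self.current["id"] != base["id"]:
                return False  # another writer committed first
            self.current = {"id": base["id"] + 1,
                            "files": base["files"] + new_files}
            return True

t = Table()
base = t.current
assert t.commit(base, ["data-001.parquet"])      # succeeds
assert not t.commit(base, ["data-002.parquet"])  # stale base -> rejected
print(t.current)  # {'id': 1, 'files': ['data-001.parquet']}
```

By contrast, a Metastore-style add-partition call exposes files to readers as soon as directories appear, with no single atomic point at which a multi-file write becomes visible.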
Community Status
Iceberg currently supports Spark 2.4.5, Spark 3.x, and Presto. Ongoing work includes Hive and Flink integrations and adding update/delete capabilities.
NetEase Practical Implementation
Integrated Iceberg with Hive for table creation, deletion, and SQL queries.
Contributed Iceberg support to Impala, allowing both internal and external Iceberg tables.
Implemented a Flink sink for Iceberg, enabling streaming writes from Kafka and asynchronous small‑file merging via snapshot commits.
These integrations dramatically improve ETL job performance by reducing NameNode pressure, leveraging file‑level statistics for pruning, and providing a unified batch‑stream storage model.
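The asynchronous small-file merging mentioned above can be sketched in miniature. This is a hypothetical illustration, not the NetEase/Flink code: the 32 MB threshold, `pick_small_files`, and `compact` are all assumptions, showing only the core idea that streaming commits accumulate many small files, and a background pass merges those below a size threshold and publishes the result as a new snapshot.

```python
# Hypothetical compaction sketch (assumed threshold and helper names):
# streaming writes from Kafka commit frequently, producing small files;
# an async pass merges files under the threshold into one larger file
# and the merged result would be published via a new snapshot commit.

SMALL = 32 * 1024 * 1024  # files under 32 MB count as "small" (assumed)

def pick_small_files(files):
    """files: list of (name, size_bytes). Return the candidates to merge."""
    return [f for f in files if f[1] < SMALL]

def compact(files):
    """Replace small files with one merged file; large files pass through."""
    small = pick_small_files(files)
    if len(small) < 2:
        return files  # nothing worth merging
    merged = ("merged-000.parquet", sum(size for _, size in small))
    return [f for f in files if f not in small] + [merged]

snapshot = [("f1.parquet", 5 << 20), ("f2.parquet", 8 << 20),
            ("f3.parquet", 512 << 20)]
print(compact(snapshot))
# [('f3.parquet', 536870912), ('merged-000.parquet', 13631488)]
```

Because the merge lands as an ordinary snapshot commit, it runs asynchronously without blocking the streaming writer, and readers switch from the small files to the merged file atomically.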
Conclusion
Iceberg’s new partition model, metadata granularity, and API design address the four major pain points of traditional data warehouses, offering higher query performance, lower operational overhead, and seamless batch‑stream processing.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.