How NetEase Yanxuan Migrated from Lambda to Iceberg for Real‑Time Batch‑Stream Integration
This article details how NetEase Yanxuan transformed its data platform from a dual Lambda architecture to a unified batch‑stream solution built on Apache Iceberg, covering the original challenges, the evaluation of Iceberg versus Hudi and Delta Lake, implementation of stream‑batch pipelines, message ordering fixes, snapshot generation, and extensive table‑governance optimizations.
Introduction
Iceberg has become a popular data‑lake format for unified batch‑stream storage. The article explains how NetEase Yanxuan upgraded from a traditional Lambda architecture to an Iceberg‑based batch‑stream system and the technical problems solved during the migration.
Yanxuan Data Architecture
Online data originates from MySQL binlogs and log files and is collected into Kafka. Offline batch jobs run on Spark, while real‑time stream jobs run on Flink. After ingestion, data is stored in an ODS layer, then modeled and finally written to Doris, Redis, Elasticsearch, and other stores.
Problems with the Existing Lambda Architecture
Two separate code bases for batch and streaming increase development effort.
Offline processing suffers from low timeliness because snapshot frequency dictates latency.
Multiple components and long pipelines raise maintenance costs.
Solution Selection
After evaluating Iceberg, Hudi, and Delta Lake, Yanxuan chose Iceberg because it is engine‑agnostic, supports upserts and time‑travel, and had the most suitable feature set for their needs.
Iceberg Batch‑Stream Implementation
Data ingestion now includes an AutoETL component that parses raw Kafka messages, performs lightweight cleaning, and writes structured data to an ODS layer, which is then upserted into Iceberg tables. Iceberg’s upsert handles database changes, while Flink checkpoints control data latency. Real‑time scenarios that require sub‑millisecond latency still use direct Kafka pipelines.
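Upsert here means each binlog change event replaces the row sharing its primary key. A minimal sketch of that semantics, with hypothetical event shapes (a real pipeline relies on Flink's Iceberg sink with equality-field upserts rather than an in-memory table):

```python
def apply_changes(table, changes):
    """Apply binlog-style change events to a primary-key table.

    table:   dict mapping primary key -> row dict
    changes: events like {"op": "insert"|"update"|"delete", "id": ..., "row": ...}
    """
    for ev in changes:
        if ev["op"] in ("insert", "update"):
            table[ev["id"]] = ev["row"]   # upsert: replace any existing row
        elif ev["op"] == "delete":
            table.pop(ev["id"], None)     # ignore deletes for unknown keys
    return table
```

Because inserts and updates collapse into one replace-by-key operation, replaying the same change stream is idempotent, which is what makes checkpoint-based recovery safe.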
Message Ordering and Deduplication
Two approaches were considered for out‑of‑order or duplicate Kafka messages:
Discard the late‑arriving record when the base table already holds a record with a later timestamp.
Write the later record first, then back‑fill the earlier one, preserving all versions for accurate snapshot creation.
Yanxuan adopted the second approach to avoid data loss for historical snapshots.
To reduce the cost of frequent base‑table lookups, a cache and statistical metadata were added to filter messages before querying the base table. The cache stores recent primary‑key timestamps; statistical metadata records the maximum processing time per topic‑partition, allowing early detection of potential disorder.
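The pre-filter described above can be sketched as follows; the class and field names are hypothetical, and a production service would size and persist its state more carefully:

```python
from collections import OrderedDict

class DisorderFilter:
    """Sketch of the cache + statistics pre-filter.

    A bounded LRU cache keeps the latest event time per primary key, and a
    per-partition maximum flags messages that may be out of order, so only
    suspect messages require a costly base-table lookup.
    """

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.cache = OrderedDict()   # primary key -> latest event time seen
        self.partition_max = {}      # (topic, partition) -> max event time seen

    def classify(self, key, event_time, topic, partition):
        """Return 'apply', 'backfill', or 'lookup' for an incoming message."""
        tp = (topic, partition)
        in_order = event_time >= self.partition_max.get(tp, event_time)
        self.partition_max[tp] = max(self.partition_max.get(tp, event_time), event_time)

        cached = self.cache.get(key)
        if cached is not None:
            self.cache.move_to_end(key)
            if event_time < cached:
                return "backfill"    # late message for a known key: keep it for snapshots

        self.cache[key] = event_time if cached is None else max(cached, event_time)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least-recently-used key

        if cached is not None or in_order:
            return "apply"
        return "lookup"              # cache miss and possible disorder: check base table
```

Only the `"lookup"` path touches the base table; cache hits and in-order messages are resolved entirely in memory, which is where the cost saving comes from.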
Consistent Snapshot Generation
Iceberg records a new snapshot on each commit. To obtain a point‑in‑time view, the system finds the latest snapshot with max(eventTime) ≤ T0, gathers subsequent data and delete files, filters out records with eventTime > T0, and merges the remaining data with the selected snapshot.
Find snapshot s1 where max(eventTime) ≤ T0.
Collect dataFile and deleteFile changes after s1 as set {F0}.
Remove files with min(eventTime) > T0 to obtain {F1}.
From {F1}, extract records with eventTime ≤ T0 as set {D}.
Merge s1 with {D} to produce the consistent snapshot for T0.
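The five steps above can be sketched with simplified, hypothetical in-memory shapes (a real implementation walks Iceberg snapshot metadata and manifest files instead):

```python
def snapshot_at(snapshots, files, t0):
    """Build a consistent point-in-time view at t0.

    snapshots: list of (snapshot_id, max_event_time), ordered by commit time
    files:     list of {"snapshot_id": ..., "min_et": ...,
                        "records": [(key, event_time, row), ...]}
    """
    # 1. Find s1, the latest snapshot with max(eventTime) <= T0.
    s1 = None
    for sid, max_et in snapshots:
        if max_et <= t0:
            s1 = sid
    # 2. {F0}: data/delete files committed after s1 ...
    # 3. ... minus files whose min(eventTime) > T0, giving {F1}.
    f1 = [f for f in files if f["snapshot_id"] > s1 and f["min_et"] <= t0]
    # 4. {D}: records in {F1} with eventTime <= T0.
    d = [r for f in f1 for r in f["records"] if r[1] <= t0]
    # 5. Merging s1 with {D} yields the consistent view at T0.
    return s1, d
```

Step 3 is the cheap, file-level cut (using file statistics only); step 4 is the precise, record-level cut applied to the few files that survive it.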
Iceberg Table Governance
Frequent checkpoints (every 10 minutes) generate many small files, degrading query performance. Yanxuan built three services:
DataCompactionService: merges data and delete files.
DataRewriteService: rewrites EqualDeleteFiles to PositionDeleteFiles and reorders files.
DataCleanService: removes orphan files and expired snapshots.
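The core of a compaction pass like DataCompactionService's can be sketched as greedy bin-packing of small files into merge groups (a simplification; Iceberg's built-in rewrite actions also account for partitioning and attached delete files):

```python
def plan_compaction(files, target_bytes=128 * 1024 * 1024):
    """Group small files into merge groups of roughly target_bytes each.

    files: list of (path, size_bytes)
    Returns a list of groups; each group's files are rewritten into one file.
    """
    groups, current, current_size = [], [], 0
    for path, size in sorted(files, key=lambda f: f[1]):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if len(current) > 1:   # a lone leftover file needs no rewrite
        groups.append(current)
    return groups
```

Each group then becomes one rewrite task: read the member files, merge them, and commit a single replacement file, shrinking the file count that every query must plan over.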
DeleteFile Rewrite
EqualDeleteFile requires row‑level filtering, which is costly when many files exist. Converting them to PositionDeleteFile eliminates per‑row checks. Multiple small PositionDeleteFiles are then merged into larger ones to reduce file count.
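The rewrite can be sketched as a single scan over each data file that records the positions of rows matched by the equality predicates (hypothetical in-memory shapes; real Iceberg delete files are Parquet/Avro files referenced from manifests):

```python
def equality_to_position_deletes(data_files, equality_deletes):
    """Convert equality-delete predicates into (file, position) pairs.

    data_files:       dict of file path -> list of row dicts
    equality_deletes: list of predicates, e.g. {"id": 42}
    """
    position_deletes = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            # An equality delete matches when every predicate column agrees.
            if any(all(row.get(col) == val for col, val in pred.items())
                   for pred in equality_deletes):
                position_deletes.append((path, pos))
    return position_deletes
```

The scan cost is paid once, in the background; afterwards, readers skip a deleted row by position instead of evaluating every equality predicate against every row.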
File Reordering
Iceberg stores column min/max statistics for each data file. By reordering files according to the primary key, queries can prune irrelevant files more effectively, improving the hit rate of the cache‑based filter.
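The pruning itself is an interval-overlap test on each file's min/max statistics; a sketch, assuming a range query on the primary key:

```python
def prune_files(file_stats, key_lo, key_hi):
    """Keep only files whose [min, max] key range overlaps [key_lo, key_hi].

    file_stats: list of (path, min_key, max_key) from file-level column stats
    """
    return [path for path, lo, hi in file_stats
            if not (hi < key_lo or lo > key_hi)]
```

Before reordering, key ranges of different files overlap heavily and few files can be skipped; after reordering by primary key the ranges are nearly disjoint, so a point or range query typically touches only one or two files.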
Performance Impact
Combining cache‑based filtering, small‑file compaction, delete‑file rewriting, and file reordering reduced query latency by more than tenfold in most cases.
Deployment Results and Future Plans
Completed batch‑stream fusion at the ODS layer.
Reduced offline data latency to five minutes.
Moved T+1 snapshot creation forward by half an hour.
More than 500 tasks running stably.
Future work includes extending batch‑stream integration to feature engineering and DWD layers, adding Presto support for Iceberg, integrating Alluxio for metadata caching, and implementing Z‑order sorting and Bloom‑filter indexes to further boost query efficiency. Table‑monitoring and health‑check features will also be productized for better usability.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
NetEase Yanxuan Technology Product Team
The NetEase Yanxuan Technology Product Team shares practical tech insights for the e‑commerce ecosystem. This official channel periodically publishes technical articles, team events, recruitment information, and more.
