How NetEase Yanxuan Migrated from Lambda to Iceberg for Seamless Batch‑Stream Integration

This article explains how NetEase Yanxuan upgraded its legacy Lambda architecture to an Iceberg‑based batch‑stream unified platform, detailing the original data pipeline, the challenges faced, the evaluation of Iceberg versus Hudi and DeltaLake, and the concrete engineering optimizations and governance measures implemented to achieve lower latency and higher query performance.

dbaplus Community

Sep 3, 2023

Background

In the original Lambda architecture, online data originates from MySQL binlog and log files, which are collected into Kafka. Offline batch processing uses Spark, while real‑time stream processing uses Flink. After synchronization to the ODS layer, data developers model the data and sync results to downstream stores such as Doris, Redis and Elasticsearch.

Issues with the Lambda Architecture

High development cost: separate code bases and engines for batch and real‑time increase effort.

Low offline timeliness: snapshot frequency determines freshness, but higher frequency consumes more resources.

High maintenance cost: multiple components and long pipelines raise operational overhead.

Solution Evaluation

Iceberg, Hudi and Delta Lake were compared. Iceberg was chosen because its upsert support is engine‑agnostic and does not tie the platform tightly to Spark or Flink.

Iceberg‑Based Batch‑Stream Architecture

Data‑Ingestion Changes

Log collection remains unchanged (Flume → Kafka). An AutoETL component (Kafka‑to‑Kafka) parses and lightly cleans raw messages before writing structured data to the ODS layer and persisting it into Iceberg.

New Challenges

Out‑of‑order and duplicate Kafka messages.

Absence of T+1 snapshots for offline warehouses.

Handling Message Disorder

Two strategies were evaluated:

Discard late messages after checking the latest timestamp in the base table.

Write the late message first, then back‑fill the earlier message (chosen). This retains all messages and enables correct snapshot generation.
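The chosen strategy can be illustrated with a toy upsert table. This is a minimal sketch of one reading of "write the late message, then back‑fill": the late message is written so that it appears in the change history, and the row it overwrote is written again so that the newest state stays visible. The names `Message`, `UpsertTable` and `upsert_with_backfill` are hypothetical and do not come from the Yanxuan code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    key: str
    event_time: int
    value: str


class UpsertTable:
    """Toy upsert table: the last write per key wins, and every write is logged."""

    def __init__(self):
        self.current = {}      # key -> Message currently visible to readers
        self.change_log = []   # every message written, in write order

    def _write(self, msg):
        self.current[msg.key] = msg
        self.change_log.append(msg)

    def upsert_with_backfill(self, msg):
        existing = self.current.get(msg.key)
        self._write(msg)
        if existing is not None and existing.event_time > msg.event_time:
            # msg arrived late: write the newer row again so it stays visible.
            self._write(existing)


table = UpsertTable()
table.upsert_with_backfill(Message("order-1", 200, "paid"))
table.upsert_with_backfill(Message("order-1", 100, "created"))  # arrives late
```

After both calls the visible row is still the `paid` message, while the change log holds all three writes, including the late `created` message that a later snapshot computation can pick up.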

Caching and Statistics

A write‑time cache stores the latest event time per primary key. If the cache confirms correct ordering, the expensive base‑table lookup is skipped. On a cache miss, partition‑level statistics (maximum event time per partition) are consulted to decide whether a full lookup is required.
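The decision path described above can be sketched as a small classifier. This is an illustration under assumptions: the function name, the `Decision` values and the exact comparison rules (for example, treating an event time equal to the cached one as in order) are not from the source.

```python
from enum import Enum


class Decision(Enum):
    IN_ORDER = "in_order"          # safe to write without a base-table lookup
    LATE = "late"                  # the cache proves the message is out of order
    NEEDS_LOOKUP = "needs_lookup"  # only the base table can tell


def classify(key, event_time, partition, latest_by_key, max_time_by_partition):
    """Decide how to handle a message using the cache first, then partition stats."""
    cached = latest_by_key.get(key)
    if cached is not None:
        # Cache hit: ordering is known without touching the base table.
        return Decision.IN_ORDER if event_time >= cached else Decision.LATE
    partition_max = max_time_by_partition.get(partition)
    if partition_max is None or event_time >= partition_max:
        # No row in this partition can be newer than the incoming message.
        return Decision.IN_ORDER
    return Decision.NEEDS_LOOKUP


latest_by_key = {"k1": 100}
max_time_by_partition = {"p1": 150}

on_time = classify("k1", 120, "p1", latest_by_key, max_time_by_partition)
late = classify("k1", 90, "p1", latest_by_key, max_time_by_partition)
miss_safe = classify("k2", 160, "p1", latest_by_key, max_time_by_partition)
miss_unsure = classify("k2", 140, "p1", latest_by_key, max_time_by_partition)
```

Only the last case falls through to the expensive lookup: the key is not cached and the partition already contains an event time newer than the incoming one.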

Consistent Snapshot Generation

Iceberg’s versioned metadata enables point‑in‑time snapshots. The algorithm for a snapshot at timestamp T0 is:

1. Find the latest snapshot s1 where max(eventTime) ≤ T0.
2. Collect all dataFile and deleteFile changes after s1 → set {F0}.
3. Remove files from {F0} whose min(eventTime) > T0 → set {F1}.
4. From {F1} read records with eventTime ≤ T0 → set {D}.
5. Merge {D} into s1 to produce the T0 consistent snapshot.
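The five steps can be traced on an in‑memory model. The sketch below is a simplification under stated assumptions: snapshots are a plain list in commit order, each snapshot carries the cumulative max(eventTime) of everything committed so far, delete files are not modeled, and merging keeps the record with the greatest event time per key. `DataFile`, `Snapshot`, `merge` and `snapshot_at` are hypothetical names, not Iceberg APIs.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    records: list  # list of (key, event_time, value) tuples

    @property
    def min_event_time(self):
        return min(t for _, t, _ in self.records)


@dataclass
class Snapshot:
    max_event_time: int  # max eventTime over everything committed up to here
    added_files: list    # DataFile objects committed by this snapshot


def merge(state, records):
    """Apply records to state, keeping the greatest event time per key."""
    for key, event_time, value in records:
        if key not in state or event_time >= state[key][0]:
            state[key] = (event_time, value)
    return state


def snapshot_at(snapshots, t0):
    # Step 1: latest snapshot s1 whose max(eventTime) <= T0.
    base_idx = -1
    for i, snap in enumerate(snapshots):
        if snap.max_event_time <= t0:
            base_idx = i
    state = {}
    for snap in snapshots[: base_idx + 1]:
        for f in snap.added_files:
            merge(state, f.records)
    # Step 2: all files committed after s1 -> {F0}.
    f0 = [f for snap in snapshots[base_idx + 1:] for f in snap.added_files]
    # Step 3: drop files whose min(eventTime) > T0 -> {F1}.
    f1 = [f for f in f0 if f.min_event_time <= t0]
    # Step 4: keep records with eventTime <= T0 -> {D}.
    d = [r for f in f1 for r in f.records if r[1] <= t0]
    # Step 5: merge {D} into s1.
    return merge(state, d)


snapshots = [
    Snapshot(100, [DataFile([("k1", 90, "a"), ("k2", 100, "b")])]),
    Snapshot(210, [DataFile([("k1", 205, "c"), ("k3", 150, "d")])]),
    Snapshot(300, [DataFile([("k2", 300, "e")]), DataFile([("k2", 180, "late")])]),
]
result = snapshot_at(snapshots, t0=200)
```

With T0 = 200, the first snapshot is the base, the file containing only the event at 300 is skipped in step 3, and the late records at 150 and 180 are merged in even though they were committed after the base snapshot.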

Iceberg Table Governance

Frequent small files caused query slowdown and storage inefficiency. Three governance services were built:

DataCompactionService: merges dataFile, deleteFile and metadata.

DataRewriteService: rewrites EqualDeleteFile to PositionDeleteFile and merges small PositionDeleteFiles.

DataCleanService: removes orphan files and expired snapshots.
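As a rough illustration of the cleaning step, the sketch below models each snapshot as the set of file paths it references. It is a toy under assumptions: the function names are hypothetical, retention is expressed as "keep the newest N snapshots", and files referenced only by expired snapshots are lumped together with files that were never referenced at all.

```python
def expire_snapshots(snapshots, keep_last):
    """Keep only the newest `keep_last` snapshots (each is a set of file paths)."""
    return snapshots[-keep_last:] if keep_last > 0 else []


def removable_files(files_on_storage, live_snapshots):
    """Files on storage that no live snapshot references any more."""
    referenced = set()
    for snap in live_snapshots:
        referenced |= snap
    return sorted(set(files_on_storage) - referenced)


snapshots = [{"a.parquet"}, {"a.parquet", "b.parquet"}, {"c.parquet"}]
storage = ["a.parquet", "b.parquet", "c.parquet", "tmp-123.parquet"]

live = expire_snapshots(snapshots, keep_last=1)
to_delete = removable_files(storage, live)
```

Here only the newest snapshot survives, so every file except `c.parquet` becomes eligible for deletion, including `tmp-123.parquet`, which no snapshot ever referenced.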

DeleteFile Rewrite and Merge

EqualDeleteFile records deletions by primary key, so a reader must check every record against the deleted keys. PositionDeleteFile records deletions by data file and row position, which removes that per‑record check. Small PositionDeleteFiles are then merged into larger ones to lower the file count.
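The rewrite and merge can be sketched on in‑memory data. This is a simplified model under assumptions: a data file is a list of primary keys, an equality delete is a set of keys, and the rule that equality deletes only apply to data written before them is ignored. All function names are hypothetical.

```python
def to_position_deletes(data_files, deleted_keys):
    """Resolve key-based deletes into sorted (file path, row position) pairs.

    data_files maps a file path to the list of primary keys it stores.
    """
    positions = []
    for path, keys in data_files.items():
        for pos, key in enumerate(keys):
            if key in deleted_keys:
                positions.append((path, pos))
    return sorted(positions)


def merge_position_deletes(*delete_files):
    """Merge several small position-delete lists into one deduplicated list."""
    merged = set()
    for entries in delete_files:
        merged.update(entries)
    return sorted(merged)


def read_with_position_deletes(path, keys, position_deletes):
    """Read a data file, skipping rows by position instead of comparing keys."""
    dead = {pos for p, pos in position_deletes if p == path}
    return [k for pos, k in enumerate(keys) if pos not in dead]


data_files = {"f1.parquet": ["a", "b", "c"], "f2.parquet": ["b", "d"]}
pos_b = to_position_deletes(data_files, {"b"})
pos_d = to_position_deletes(data_files, {"d"})
merged = merge_position_deletes(pos_b, pos_d)
remaining = read_with_position_deletes("f1.parquet", data_files["f1.parquet"], merged)
```

The key lookup happens once, at rewrite time. Afterwards a reader only needs the set of dead positions for the file it is scanning.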

File Reordering

Iceberg stores column min/max statistics. Reordering files by primary key improves pruning, allowing queries to skip irrelevant files early and dramatically reducing query latency.
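The effect of reordering on min/max pruning can be shown with a small example. The sketch below assumes integer primary keys, a fixed number of rows per file and a point lookup; the helper names are hypothetical.

```python
def file_stats(files):
    """Per-file (min, max) of the primary key, as a column statistic would hold."""
    return [(min(f), max(f)) for f in files]


def files_to_scan(files, key):
    """Indexes of files whose [min, max] range may contain the key."""
    return [i for i, (lo, hi) in enumerate(file_stats(files)) if lo <= key <= hi]


def reorder_by_key(files, rows_per_file):
    """Rewrite all rows sorted by primary key into files of fixed size."""
    rows = sorted(k for f in files for k in f)
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


unsorted_files = [[1, 50, 99], [2, 51, 98], [3, 52, 97]]
sorted_files = reorder_by_key(unsorted_files, rows_per_file=3)

scan_before = files_to_scan(unsorted_files, 51)
scan_after = files_to_scan(sorted_files, 51)
```

Before reordering, every file spans almost the whole key range, so a lookup for key 51 must open all three. After reordering, the ranges no longer overlap and only one file qualifies.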

Production Results and Future Work

ODS‑level batch‑stream fusion completed.

Offline latency reduced to approximately 5 minutes.

T+1 snapshots available about 30 minutes earlier than before.

More than 500 stable tasks running in production.

Future plans include extending batch‑stream integration to feature‑engineering and DWD layers, adding Presto support for Iceberg, using Alluxio for metadata caching, introducing Z‑order sorting and Bloom‑filter indexes, and productizing monitoring and health‑check capabilities.
