
Migrating from Lambda Architecture to an Iceberg‑Based Unified Batch‑Stream Architecture at NetEase Yanxuan

This article details how NetEase Yanxuan upgraded its legacy Lambda data pipeline to a unified batch‑stream architecture built on Apache Iceberg, covering the original challenges, the evaluation of Iceberg versus Hudi and DeltaLake, implementation specifics, table‑governance techniques, and future roadmap.

DataFunTalk

Guest Speaker: Zhu Jiajun, Senior Development Engineer at NetEase Yanxuan (DataFunTalk platform).

Overview: NetEase Yanxuan collects MySQL binlog and log‑event data via Kafka, processes offline batches with Spark and real‑time streams with Flink, and stores results in Doris, Redis, Elasticsearch, etc. The existing Lambda architecture suffers from duplicated code bases, high offline latency, and high maintenance cost.

Problem Summary:

Two separate pipelines (Lambda) requiring separate code and engines.

Offline data freshness is limited by snapshot frequency; taking snapshots more often improves freshness but consumes more resources.

Complex component chain leads to high maintenance overhead.

Solution Comparison: Iceberg, Hudi, and DeltaLake all support upserts, transactions, and time travel. At the time of evaluation, however, Hudi was tightly coupled with Spark, and DeltaLake's streaming support relied on Spark Structured Streaming, whereas Yanxuan's real‑time stack is Flink. That left Iceberg as the most engine‑agnostic choice.

Iceberg Overview: A table format that decouples storage from compute, supports batch‑stream reads, transactions, upserts, schema evolution, hidden partitioning, and stores file‑level statistics for query pruning.

Unified Batch‑Stream Implementation:

Log collection unchanged (Flume → Kafka). Added an AutoETL (Kafka‑to‑Kafka) to parse, clean, and write structured data to an ODS topic, then stream it into Iceberg.

Iceberg upserts handle DB changes; however, latency depends on Flink checkpoints, so some millisecond‑level use‑cases still use direct Kafka pipelines.

Message Disorder & Duplication: Two mitigation strategies were evaluated. Strategy 1 discards out‑of‑order records after checking the latest timestamp in the base table. Strategy 2 writes the later record first, then appends the earlier one to preserve all data for accurate snapshot generation. Yanxuan chose Strategy 2 to avoid data loss for historical snapshots.
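The difference between the two strategies can be shown with a minimal in-memory model (the record layout with `key`/`ts` fields is an assumption for illustration):

```python
def apply_strategy1(table, record):
    """Strategy 1: check the latest timestamp for the key and silently
    discard out-of-order arrivals. Simpler, but the dropped versions are
    lost, so past snapshots can no longer be rebuilt accurately."""
    current = table.get(record["key"])
    if current is not None and current["ts"] > record["ts"]:
        return False  # late arrival dropped
    table[record["key"]] = record
    return True

def apply_strategy2(table, record):
    """Strategy 2: keep every version. A late arrival is appended as an
    older version, so any historical snapshot remains reconstructible."""
    versions = table.setdefault(record["key"], [])
    versions.append(record)
    versions.sort(key=lambda r: r["ts"])  # current value is versions[-1]
```

Strategy 2 trades extra storage and merge work for the ability to answer "what did this row look like at time T" correctly, which is why Yanxuan chose it.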

To reduce the heavy base‑table lookups, caching and statistical metadata were introduced: a cache of recent primary‑key timestamps and a partition‑level statistic that flags potential disorder, allowing many messages to be filtered without a table scan.
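A rough sketch of how such a filter might combine the two signals (the class and method names are hypothetical, not from the article):

```python
class DisorderFilter:
    """Cache of recent primary-key timestamps plus a per-partition
    max-timestamp statistic, used to decide cheaply whether an incoming
    record could be out of order before paying for a base-table lookup."""

    def __init__(self):
        self.recent_ts = {}      # primary key -> latest event time seen
        self.partition_max = {}  # partition -> max event time written

    def maybe_out_of_order(self, partition, key, ts):
        cached = self.recent_ts.get(key)
        if cached is not None:
            return ts < cached  # cache answers without a table scan
        # Cache miss: a record newer than everything in its partition
        # cannot be out of order; only the remainder needs a real lookup.
        return ts < self.partition_max.get(partition, float("-inf"))

    def record(self, partition, key, ts):
        self.recent_ts[key] = max(ts, self.recent_ts.get(key, ts))
        self.partition_max[partition] = max(
            ts, self.partition_max.get(partition, ts))
```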

Consistent Snapshot Generation: Iceberg’s versioned metadata enables time‑travel. To obtain a snapshot for time T₀, the process is:

Find the latest snapshot s₁ with max(eventTime) ≤ T₀.

Collect all data‑files and delete‑files added after s₁ ({F₀}).

Filter out files whose min(eventTime) > T₀, yielding {F₁}.

Read records from {F₁} with eventTime ≤ T₀, forming set {D}.

Merge s₁ with {D} to produce the consistent T₀ snapshot.
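The five steps above can be sketched against a simplified in-memory model of snapshots and file metadata (the dictionary layout is an assumption; real Iceberg metadata lives in manifest files, and this assumes a qualifying base snapshot exists):

```python
def snapshot_at(snapshots, files_after, t0):
    """Reconstruct a consistent snapshot for time t0.

    snapshots: time-ordered list of {id, max_event_time, records};
    files_after: snapshot id -> files added after it, each file carrying
    min_event_time and its records.
    """
    # 1. Latest snapshot s1 with max(eventTime) <= t0.
    base = None
    for snap in snapshots:
        if snap["max_event_time"] <= t0:
            base = snap
    # 2-3. Files added after s1, minus those entirely newer than t0.
    candidates = [f for f in files_after.get(base["id"], [])
                  if f["min_event_time"] <= t0]
    # 4. Records from the surviving files with eventTime <= t0.
    delta = [r for f in candidates for r in f["records"]
             if r["event_time"] <= t0]
    # 5. Merge s1 with the delta by primary key (upsert semantics).
    merged = {r["key"]: r for r in base["records"]}
    for r in sorted(delta, key=lambda r: r["event_time"]):
        merged[r["key"]] = r
    return list(merged.values())
```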

Iceberg Table Governance:

Small File Explosion: Checkpoint-driven commits every 10 minutes create many files under 100 KB, degrading query performance. A governance service (DataCompactionService, DataRewriteService, DataCleanService) merges small files, rewrites delete files, and cleans up orphan files.
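The planning side of small-file compaction amounts to bin-packing undersized files into merge tasks. A minimal sketch follows; the thresholds are illustrative defaults, not Yanxuan's actual settings:

```python
def plan_compaction(file_sizes, small_threshold=100 * 1024,
                    target=128 * 1024 * 1024):
    """Group files smaller than `small_threshold` into merge tasks of
    roughly `target` bytes each, as a compaction service might schedule."""
    small = sorted(s for s in file_sizes if s < small_threshold)
    groups, current, size = [], [], 0
    for s in small:
        if size + s > target and current:
            groups.append(current)  # task is full; start the next one
            current, size = [], 0
        current.append(s)
        size += s
    if current:
        groups.append(current)
    return groups
```

Each group would then be rewritten as a single larger data file and committed atomically as a new Iceberg snapshot.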

Delete‑File Rewrite: Convert equality delete files (which match rows by key) into position delete files (which reference a file path and row position), and merge small position delete files to reduce the overall file count.
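Conceptually, the rewrite resolves each key-based delete to the concrete positions of the matching rows, so readers no longer re-evaluate key predicates at scan time. A simplified in-memory sketch (real Iceberg delete files are Parquet/Avro files described in manifests):

```python
def equality_to_position_deletes(data_files, equality_deletes):
    """Rewrite equality deletes (match by key) into position deletes
    (data-file path + row offset within that file)."""
    deleted_keys = {d["key"] for d in equality_deletes}
    position_deletes = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if row["key"] in deleted_keys:
                position_deletes.append({"file": path, "pos": pos})
    return position_deletes
```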

Reordering: Record statistics (min/max per column) enable predicate push‑down. Reordering data by primary key makes files internally sorted, improving filter effectiveness.
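The payoff of reordering can be seen in a toy pruning model: once data is sorted by primary key, each file's [min, max] range is narrow and disjoint, so a range query skips most files (illustrative sketch, not Iceberg's actual scan planner):

```python
def prune_files(file_stats, lo, hi):
    """Keep only files whose [min, max] key range can intersect the
    query range [lo, hi]; all other files are skipped without reading."""
    return [f for f in file_stats if not (f["max"] < lo or f["min"] > hi)]
```

With unsorted data, every file's min/max range tends to span nearly the whole key space, so almost nothing is pruned; sorting is what makes the statistics selective.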

Combined caching, small‑file compaction, delete‑file rewrite, and reordering yielded >10× query‑performance improvements in production.

Future Plans: Extend batch‑stream integration to feature‑engineering and DWD layers, add Presto support for Iceberg, introduce Alluxio caching, Z‑order sorting, Bloom‑filter indexes, and productize monitoring and health‑check capabilities.

Thank you for reading.

Big Data · Flink · Data Lake · Spark · Iceberg · Batch-Stream · Table Governance
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
