How NetEase Yanxuan Migrated from Lambda to Iceberg for Seamless Batch‑Stream Integration
This article explains how NetEase Yanxuan upgraded its legacy Lambda architecture to an Iceberg‑based batch‑stream unified platform, detailing the original data pipeline, the challenges faced, the evaluation of Iceberg versus Hudi and DeltaLake, and the concrete engineering optimizations and governance measures implemented to achieve lower latency and higher query performance.
Background
In the original Lambda architecture, online data originates from MySQL binlog and log files, which are collected into Kafka. Offline batch processing uses Spark, while real‑time stream processing uses Flink. After synchronization to the ODS layer, data developers model the data and sync results to downstream stores such as Doris, Redis and Elasticsearch.
Issues with the Lambda Architecture
High development cost : separate code bases and engines for batch and real‑time increase effort.
Low offline timeliness : snapshot frequency determines freshness, but higher frequency consumes more resources.
High maintenance cost : multiple components and long pipelines raise operational overhead.
Solution Evaluation
Iceberg, Hudi and DeltaLake were compared. Iceberg was chosen because it provides engine‑agnostic upsert support without tight coupling to Spark or Flink.
Iceberg‑Based Batch‑Stream Architecture
Data‑Ingestion Changes
Log collection remains unchanged (Flume → Kafka). An AutoETL component (Kafka‑to‑Kafka) parses and lightly cleans raw messages before writing structured data to the ODS layer and persisting it into Iceberg.
New Challenges
Kafka message disorder and duplication.
Absence of T+1 snapshots for offline warehouses.
Handling Message Disorder
Two strategies were evaluated:
Discard late messages after checking the latest timestamp in the base table.
Write the late message first, then back‑fill the earlier message (chosen). This retains all messages and enables correct snapshot generation.
Caching and Statistics
A write‑time cache stores the latest event time per primary key. If the cache confirms correct ordering, the expensive base‑table lookup is skipped. On a cache miss, partition‑level statistics (maximum event time per partition) are consulted to decide whether a full lookup is required.
Consistent Snapshot Generation
Iceberg’s versioned metadata enables point‑in‑time snapshots. The algorithm for a snapshot at timestamp T0 is:
1. Find the latest snapshot s1 where max(eventTime) ≤ T0.
2. Collect all dataFile and deleteFile changes after s1 → set {F0}.
3. Remove files from {F0} whose min(eventTime) > T0 → set {F1}.
4. From {F1} read records with eventTime ≤ T0 → set {D}.
5. Merge {D} into s1 to produce the T0 consistent snapshot.Iceberg Table Governance
Frequent small files caused query slowdown and storage inefficiency. Three governance services were built:
DataCompactionService : merges dataFile, deleteFile and metadata.
DataRewriteService : rewrites EqualDeleteFile to PositionDeleteFile and merges small PositionDeleteFiles.
DataCleanService : removes orphan files and expired snapshots.
DeleteFile Rewrite and Merge
EqualDeleteFile records deletions by primary key, while PositionDeleteFile records deletions by file offset, reducing per‑record checks. Small PositionDeleteFiles are merged into larger ones to lower file count.
File Reordering
Iceberg stores column min/max statistics. Reordering files by primary key improves pruning, allowing queries to skip irrelevant files early and dramatically reducing query latency.
Production Results and Future Work
ODS‑level batch‑stream fusion completed.
Offline latency reduced to approximately 5 minutes.
T+1 snapshot creation advanced by about 30 minutes.
More than 500 stable tasks running in production.
Future plans include extending batch‑stream integration to feature‑engineering and DWD layers, adding Presto support for Iceberg, using Alluxio for metadata caching, introducing Z‑order sorting and Bloom‑filter indexes, and productizing monitoring and health‑check capabilities.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
