How NetEase’s Arctic Unifies Streaming and Batch with Iceberg for Real‑Time Lakehouse

This article explains the challenges of a Lambda‑architecture data pipeline, introduces NetEase’s Arctic lakehouse built on Apache Iceberg, details its table‑store design, optimization cycles, consistency mechanisms, real‑time features, practical use cases, and future roadmap, highlighting its advantages over similar solutions.

ITPUB

Jan 26, 2023

Background and Business Challenge

NetEase’s data platform originally followed a Lambda architecture that separated streaming and batch processing, leading to data silos, low developer efficiency, and inconsistent metrics. Real‑time data (CDC and log data) was ingested via a message queue and processed with Flink, with results written to Kudu for low‑latency access and to Hive for batch workloads.

Because streaming and batch pipelines were isolated, the system suffered from fragmented ecosystems, duplicated storage, and difficulty reusing offline data for online queries.

Arctic Overview and Positioning

Arctic is a lakehouse‑as‑a‑service built on Apache Iceberg. It sits between Hive/Iceberg and the compute engine, providing a TableService that optimizes table schemas and encapsulates message queues such as Kafka and KV stores such as Redis and HBase for real‑time access.

Arctic introduces two logical stores per table: a Change store for streaming writes and a Base store for batch writes. Both stores are implemented as Iceberg tables, enabling MVCC, ACID guarantees, and incremental queries.

Table Store Architecture

Data written to the Change store is periodically merged into the Base store through asynchronous optimizing jobs. This design provides:

Small‑file management and unique‑key enforcement.

Upsert semantics via merge on read.

Support for primary‑key based deduplication.
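The upsert behaviour above can be sketched as a merge on read over the two stores. This is a minimal illustration, not Arctic's implementation: the row shapes, the `UPSERT`/`DELETE` operation names, and the `merge_on_read` function are hypothetical, and the change rows are assumed to arrive in write order.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shapes, not Arctic's actual file format.
BaseRow = Dict[str, object]                # must contain the primary-key column
ChangeRow = Tuple[str, Dict[str, object]]  # (op, row); op is "UPSERT" or "DELETE"


def merge_on_read(base: Iterable[BaseRow], change: Iterable[ChangeRow],
                  key: str = "id") -> List[BaseRow]:
    """Overlay the change log on the base rows, keeping one row per primary key."""
    merged: Dict[object, BaseRow] = {row[key]: row for row in base}
    for op, row in change:  # assumption: change rows are in write order
        if op == "DELETE":
            merged.pop(row[key], None)
        else:
            merged[row[key]] = row  # insert or overwrite by primary key
    return sorted(merged.values(), key=lambda r: r[key])


base = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
change = [
    ("UPSERT", {"id": 2, "name": "b2"}),
    ("UPSERT", {"id": 3, "name": "c"}),
    ("DELETE", {"id": 1}),
]
result = merge_on_read(base, change)
```

Because the merge happens at read time, a query sees the latest row per key even before an optimizing job has physically rewritten the Base store.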

Two optimization cycles are defined:

Minor Optimize (every 5‑10 minutes) – cleans up small files and converts equality‑delete files to position‑delete files.

Major Optimize (daily) – merges change files into base files, producing Hive‑compatible snapshots.
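The delete conversion in Minor Optimize can be illustrated with a small sketch. In Iceberg, an equality delete identifies rows by column values, while a position delete identifies them by data file and row position. The code below is a simplified, in-memory illustration: the file names, the `to_position_deletes` function, and the dictionary layout are hypothetical, and it assumes every data file shown was written before the delete.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory stand-in for data files: path -> rows in file order.
DataFiles = Dict[str, List[Dict[str, object]]]


def to_position_deletes(data_files: DataFiles, deleted_keys: Set[object],
                        key: str = "id") -> List[Tuple[str, int]]:
    """Resolve key-based (equality) deletes into (file_path, row_position) pairs."""
    positions: List[Tuple[str, int]] = []
    for path in sorted(data_files):
        for pos, row in enumerate(data_files[path]):
            if row[key] in deleted_keys:
                positions.append((path, pos))
    return positions


data_files = {
    "change-0001.parquet": [{"id": 1}, {"id": 2}, {"id": 3}],
    "change-0002.parquet": [{"id": 4}, {"id": 2}],
}
position_deletes = to_position_deletes(data_files, deleted_keys={2, 4})
```

Doing this lookup once during optimization means readers can skip rows by position instead of comparing keys against every row on each query.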

Real‑Time Features: Hidden Queue and Consistency

Arctic embeds a hidden Kafka queue inside the table. When enabled, upstream Flink tasks perform dual writes: one to the Change store and one to the hidden queue, allowing downstream consumers to achieve sub‑second CDC latency.

To guarantee eventual consistency across the dual writes, Arctic uses a checkpoint‑based rollback mechanism. Each message carries the upstream writer’s checkpoint index. If a downstream failure occurs, a special Flip message triggers a retract operation that rolls back incomplete writes. All of this is encapsulated in the Arctic‑Flink‑connector.
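A rough sketch of how a consumer might act on such a message stream follows. The message layout is an assumption made for illustration: the article does not describe the wire format, so the `Message` class, the `flip` field, and the rule that a flip message names the checkpoint to roll back to are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Message:
    checkpoint: int                # checkpoint index of the upstream writer
    payload: Optional[str] = None  # None for a flip message
    flip: bool = False             # hypothetical marker: roll back to `checkpoint`


def consume(messages: List[Message]) -> List[Tuple[str, Optional[str]]]:
    """Return the (op, payload) events a downstream consumer would emit.

    "+" is a normal emit; "-" retracts a payload that came from a checkpoint
    newer than the one the flip message rolls back to.
    """
    emitted: List[Message] = []
    out: List[Tuple[str, Optional[str]]] = []
    for msg in messages:
        if msg.flip:
            # Retract, newest first, everything written after the restored checkpoint.
            while emitted and emitted[-1].checkpoint > msg.checkpoint:
                out.append(("-", emitted.pop().payload))
        else:
            emitted.append(msg)
            out.append(("+", msg.payload))
    return out


stream = [
    Message(1, "a"),
    Message(2, "b"),
    Message(2, "c"),
    Message(1, flip=True),  # assumption: the writer recovered to checkpoint 1
    Message(2, "b"),        # replayed after recovery
]
events = consume(stream)
```

In this sketch the consumer ends up with the same net result as if the uncommitted writes had never been read, which is the property the rollback mechanism is meant to provide.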

Temporal Tables and Future Consistency Improvements

Arctic plans to expose a hidden index that abstracts HBase/Redis as a built‑in dimension table, eventually supporting temporal joins without external KV stores, leveraging Flink 1.12’s temporal table capabilities.

Transaction ID (txId) Management

Each Flink checkpoint requests a unique txId from Arctic; this ID is stored alongside the written files. Spark jobs similarly acquire a txId during planning. During reads, merge on read uses the ordering of txId values to resolve the latest record, ensuring deterministic results even when stream and batch writes interleave.
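The role of txId ordering can be shown with a short sketch. The `Versioned` record, its field names, and the `resolve_latest` function are hypothetical, and the sketch assumes each key appears at most once per txId; it only illustrates that the winner is decided by txId rather than by the order in which files happen to be read.

```python
from typing import Dict, List, NamedTuple


class Versioned(NamedTuple):
    tx_id: int   # transaction ID assigned at checkpoint or planning time
    key: int     # primary key
    value: str
    source: str  # "flink" or "spark", for illustration only


def resolve_latest(records: List[Versioned]) -> Dict[int, Versioned]:
    """Keep, per primary key, the record with the highest tx_id."""
    latest: Dict[int, Versioned] = {}
    for rec in records:
        current = latest.get(rec.key)
        if current is None or rec.tx_id > current.tx_id:
            latest[rec.key] = rec
    return latest


# Files may be read in any order; tx_id, not arrival order, decides the winner.
records = [
    Versioned(tx_id=7, key=1, value="stream-v2", source="flink"),
    Versioned(tx_id=5, key=1, value="batch-v1", source="spark"),
    Versioned(tx_id=6, key=2, value="batch-v1", source="spark"),
    Versioned(tx_id=4, key=2, value="stream-v0", source="flink"),
]
latest = resolve_latest(records)
```

Reading the same records in reverse order yields the same result, which is what makes the outcome deterministic when stream and batch writes interleave.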

Practical Use Cases

In NetEase Cloud Music, Arctic powers a push‑notification attribution pipeline. Two log streams (main‑site and algorithmic) and a MySQL dimension table are joined in real time via Kafka‑driven left joins, delivering a single source of truth for both batch reports and online analytics without pipeline changes.

Future Roadmap

Enhance stream‑batch integration with roll‑up aggregation views, sort‑key support, and partial column upserts.

Improve lineage tracking and self‑service query capabilities in the dashboard.

Introduce more open‑source permission plugins (e.g., Ranger integration).

Extend support to additional object storage backends such as S3 and OSS.

Comparison with Hudi and Advantages

Arctic shares a similar positioning with Apache Hudi but differs in several ways:

Hudi’s CDC latency is minute‑level, while Arctic can achieve second‑level latency via the hidden queue.

Arctic’s underlying storage is Iceberg, offering broader future compatibility and better Hive integration.

Key advantages of Arctic include:

Full Iceberg compatibility and seamless Hive migration.

Automatic, minute‑level merge on read for near‑real‑time data warehouses.

Hidden queue enabling sub‑second streaming joins.

Comprehensive meta‑service (AMS) that manages table metadata, transaction IDs, and optimization scheduling, with a user‑friendly dashboard.

Meta Service (AMS)

Arctic Meta Service (AMS) is positioned as a next‑generation alternative to Hive Metastore. It handles metadata management, allocates transaction IDs, and triggers optimization jobs based on time or file‑size thresholds. It also provides resource‑aware scheduling and operational dashboards.
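As a rough illustration of threshold‑based triggering, the sketch below decides which optimize job to schedule for a table. The `TableStats` fields, the threshold constants, and the `next_optimize` function are hypothetical; the article gives only the cadence (5‑10 minutes for minor, daily for major), not AMS's actual rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableStats:
    seconds_since_minor: float
    seconds_since_major: float
    small_file_count: int
    change_store_bytes: int


# Hypothetical thresholds chosen for illustration only.
MINOR_INTERVAL_S = 10 * 60
MAJOR_INTERVAL_S = 24 * 60 * 60
SMALL_FILE_LIMIT = 100
CHANGE_BYTES_LIMIT = 1 << 30  # 1 GiB


def next_optimize(stats: TableStats) -> Optional[str]:
    """Pick the optimize job to schedule, or None if the table can wait."""
    if (stats.seconds_since_major >= MAJOR_INTERVAL_S
            or stats.change_store_bytes >= CHANGE_BYTES_LIMIT):
        return "major"
    if (stats.seconds_since_minor >= MINOR_INTERVAL_S
            or stats.small_file_count >= SMALL_FILE_LIMIT):
        return "minor"
    return None
```

Checking the major condition first avoids scheduling a minor job whose output would be rewritten again immediately by a major one.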

Overall, Arctic demonstrates a practical, production‑grade solution for unifying streaming and batch workloads in a lakehouse architecture, addressing consistency, latency, and operational challenges while offering a migration path from existing Hive ecosystems.
