
Arctic: NetEase's Streaming Lakehouse Service and Hive-Based Stream-Batch Integration Practice

Arctic, NetEase’s streaming lakehouse built on Apache Iceberg, unifies streaming and batch workloads with millisecond‑level latency, Hive compatibility, and built‑in message‑queue support, delivering CDC, upserts and OLAP without a Lambda architecture, as demonstrated by real‑time processing of 2 PB of Hive data for Cloud Music.

NetEase Cloud Music Tech Team

As big data businesses grow, Hive-based data warehouses increasingly fall short of demand: while they serve a large user base well for batch workloads, they severely lack real-time capabilities. Systems like Hudi and Iceberg bring significant improvements in transaction handling and snapshot management, but they impose substantial migration costs on existing Hive users and cannot meet the millisecond-level latency that stream processing requires.

To address the need for stream-batch unified business, NetEase Shufan developed a new generation of streaming lakehouse based on Apache Iceberg. Compared to traditional lakehouses like Hudi and Iceberg, it provides streaming updates, dimension table joins, partial upserts, and other capabilities, and integrates Hive, Iceberg, and message queues into a unified streaming lakehouse service, enabling out-of-the-box stream-batch unification.

What is Arctic

Arctic is a Streaming Lakehouse Service built on Apache Iceberg. Compared to data lakes like Iceberg, Hudi, and Delta, Arctic provides more optimized CDC, streaming updates, and OLAP capabilities. Combined with Iceberg's efficient offline processing, Arctic can serve more stream-batch hybrid scenarios. Arctic also includes self-optimizing structures, concurrent conflict resolution, and standardized lakehouse management features.

Arctic Table relies on Iceberg as the base table format but uses Iceberg as a library without invasive modifications. As a streaming lakehouse specifically designed for stream-batch unified computing, Arctic Table also encapsulates message queues as part of the table, providing lower message latency in streaming scenarios along with streaming updates and primary key uniqueness guarantees.

Stream-Batch Unified Solution

In real-time computing, Kafka and similar message queues are typically used for streaming tables due to low latency requirements. In offline computing, Hive is used as the offline table, and additional OLAP systems like Kudu are needed to support real-time data output. This is the typical Lambda architecture.

Arctic's core goal is to provide a Lambda-free data lake solution for businesses. Replacing both Kafka and Hive with Arctic achieves stream-batch unification at the storage layer.

Key features include: Message Queue encapsulation enabling different compute engines (Spark, Flink, Trino) to access without distinguishing between streaming and batch tables; millisecond-level streaming compute latency with data write and read consistency guarantees; minute-level OLAP latency through streaming writes and Merge-on-Read queries.
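The minute-level OLAP latency described above comes from Merge-on-Read: instead of waiting for a background compaction, a query merges the base snapshot with the latest change records at read time. The sketch below illustrates that merge semantics over primary-keyed data; all names are illustrative, not Arctic's actual API.

```python
# Hypothetical sketch of merge-on-read: combine a base snapshot with an
# ordered change log (CDC-style ops) so streaming writes become visible
# to queries before any compaction runs. Names are illustrative only.

def merge_on_read(base_rows, change_log):
    """base_rows: {primary_key: row}; change_log: ordered (op, key, row) tuples."""
    merged = dict(base_rows)
    for op, key, row in change_log:
        if op in ("INSERT", "UPDATE_AFTER"):
            merged[key] = row          # upsert: the latest version of a key wins
        elif op == "DELETE":
            merged.pop(key, None)      # deletes remove the key from the view
    return merged

base = {1: {"id": 1, "city": "HZ"}, 2: {"id": 2, "city": "SH"}}
changes = [
    ("UPDATE_AFTER", 1, {"id": 1, "city": "BJ"}),
    ("DELETE", 2, None),
    ("INSERT", 3, {"id": 3, "city": "GZ"}),
]
print(merge_on_read(base, changes))
```

Because the change log is replayed in order, the same primary key can be updated and deleted repeatedly and the read still returns exactly one version per live key, which is what gives the primary key uniqueness guarantee at query time.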

Table Store Architecture

Arctic Table consists of different Table Stores:

ChangeStore: an Iceberg table holding incremental data, written by Flink for near real-time consumption.

BaseStore: an Iceberg table holding historical data, produced by batch computing or merged from the ChangeStore by the Optimizer.

LogStore: an encapsulated message queue (such as Kafka) used for millisecond-level CDC data distribution.
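To make the division of labor concrete, here is a minimal sketch (all class and method names are assumptions, not Arctic's API) of how a write could fan out to the LogStore and ChangeStore, with a background Optimizer later merging deltas into the BaseStore:

```python
# Hypothetical sketch of Arctic's three Table Stores. A write lands in the
# LogStore (for millisecond-latency consumers) and the ChangeStore (Iceberg
# delta files); a background Optimizer folds deltas into the BaseStore.

class ArcticTableSketch:
    def __init__(self):
        self.log_store = []     # message queue (e.g. Kafka), ms-latency reads
        self.change_store = []  # incremental Iceberg data, near-real-time reads
        self.base_store = {}    # merged Iceberg base table, keyed by primary key

    def write(self, op, key, row):
        record = (op, key, row)
        self.log_store.append(record)     # streaming consumers tail this
        self.change_store.append(record)  # lake consumers merge-on-read this

    def optimize(self):
        # Simulates the Optimizer: merge ChangeStore deltas into BaseStore,
        # then clear the merged deltas.
        for op, key, row in self.change_store:
            if op == "DELETE":
                self.base_store.pop(key, None)
            else:
                self.base_store[key] = row
        self.change_store.clear()

table = ArcticTableSketch()
table.write("INSERT", 1, {"id": 1, "v": 1})
table.write("UPDATE_AFTER", 1, {"id": 1, "v": 2})
table.write("INSERT", 2, {"id": 2, "v": 1})
table.write("DELETE", 2, None)
table.optimize()
print(table.base_store)
```

The point of the separation is that each store serves a different latency tier from the same logical table: the LogStore for millisecond streaming, the ChangeStore for near-real-time merge-on-read, and the BaseStore for efficient batch scans.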

Hive Compatibility

Arctic provides comprehensive Hive compatibility: data access layer compatibility (data written by Arctic can be read by Hive and vice versa), metadata layer compatibility (Arctic tables registered in HMS can detect Hive DDL changes), ecosystem compatibility (can reuse Ranger for permission management), and support for upgrading existing Hive tables to Arctic with minimal cost.

The Hive-compatible Table Store divides the BaseStore into two directory spaces: a hive location holding fully merged data that native Hive readers can scan, and an iceberg location holding near real-time data managed by Iceberg.
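A minimal sketch of how a scan planner might resolve those two directory spaces per reader (paths and the function name are assumptions for illustration, not Arctic's real layout):

```python
# Hypothetical scan planning over the split BaseStore layout: native Hive
# readers only see the hive location, while Arctic-aware readers merge the
# hive and iceberg locations. Paths are illustrative placeholders.

def plan_scan(reader, table_root="/warehouse/db/tbl"):
    hive_loc = f"{table_root}/hive"        # fully merged data, Hive-readable
    iceberg_loc = f"{table_root}/iceberg"  # near-real-time data, Iceberg-managed
    if reader == "hive":
        return [hive_loc]                  # native Hive sees only merged data
    return [hive_loc, iceberg_loc]         # Arctic readers merge both spaces

print(plan_scan("hive"))
print(plan_scan("arctic"))
```

This split is what lets an unmodified Hive job keep reading the table while Arctic-aware engines additionally see the fresher, Iceberg-managed portion.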

Hive Data Synchronization

Arctic introduces Hive Syncer to identify table structure and data changes through Hive query engine, including Table Metadata Sync (comparing Arctic Table Schema with Hive Table Schema to automatically identify DDL changes) and Table Data Sync (detecting changes via transient_lastDdlTime and HDFS listDir operations).
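The two sync paths above can be sketched as follows; the function names and the schema representation are assumptions for illustration, though `transient_lastDdlTime` is a real Hive table property that Hive bumps on DDL and data-changing operations:

```python
# Hypothetical sketch of Hive Syncer's two checks: a schema diff for
# Table Metadata Sync, and a timestamp check for Table Data Sync.
# Function names and the dict-based schema model are illustrative.

def diff_schema(arctic_cols, hive_cols):
    """Compare column-name -> type maps; return DDL actions Arctic must apply."""
    actions = []
    for name, dtype in hive_cols.items():
        if name not in arctic_cols:
            actions.append(("ADD_COLUMN", name, dtype))
        elif arctic_cols[name] != dtype:
            actions.append(("ALTER_TYPE", name, dtype))
    return actions

def data_changed(arctic_last_sync, hive_transient_last_ddl_time):
    # Hive updates transient_lastDdlTime on operations such as INSERT
    # OVERWRITE; a newer timestamp flags the partition for re-sync (a real
    # implementation would then list files via HDFS listDir and reconcile).
    return hive_transient_last_ddl_time > arctic_last_sync

print(diff_schema({"id": "bigint"}, {"id": "bigint", "age": "int"}))
```

Keeping both checks cheap (a metadata compare and a timestamp compare) means the Syncer only falls back to expensive file listing when a change is actually detected.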

Practice Case: Real-Time Feature Production Engineering at NetEase Cloud Music

NetEase Cloud Music's recommendation business already had a mature Spark+Hive big data and AI development system. Using Arctic's Hive-compatible tables, the team brought real-time processing to approximately 2 PB of Hive tables without modifying existing T+1 backfill jobs or analysts' report queries. The upgrade preserved the offline development chain while adding real-time capabilities.
