
Efficient Data Update/Delete and Real‑time Processing in the Arctic Lakehouse System

This article explains the evolution from traditional data warehouses to modern lakehouse architectures, introduces the Arctic system’s dynamic hash tree for fast update/delete, describes file splitting with sequence/offset ordering, and compares copy‑on‑write versus merge‑on‑read techniques for achieving low‑latency analytics.

DataFunTalk

Since the 1980s, data analysis has progressed from enterprise data warehouses (EDW) to data lakes and now to cloud‑native lakehouse solutions that combine the strengths of both.

Enterprise warehouses store data centrally with schema‑on‑write, but they handle only structured data and suffer from high cost and limited scalability. Data lakes support structured and unstructured data at lower cost using schema‑on‑read, yet they often suffer from governance and quality issues. Lakehouse architectures aim to combine the strengths of both, offering distributed storage, low cost, ACID transactions, UPDATE/DELETE support, and near‑real‑time processing.

The concept was first introduced by Databricks in 2020, and open‑source implementations such as Delta Lake, Apache Hudi, and Apache Iceberg have followed. NetEase’s Arctic system, launched in early 2020, implements a lakehouse with efficient update/delete capabilities, hour‑level data latency, and strong cost‑effectiveness.

Efficient Update/Delete

Arctic uses a dynamic hash binary tree called the Arctic Tree to partition data by primary‑key hash, storing each node’s location metadata in file attributes. This structure allows a record to be located quickly without maintaining a full primary‑key index.
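The idea can be illustrated with a small sketch (not Arctic’s actual implementation): each leaf of the tree is identified by a depth and a value, and it owns every record whose primary‑key hash matches that value in its lowest `depth` bits. The hash function and the `(depth, value)` addressing below are assumptions for illustration only.

```python
import zlib

def key_hash(primary_key: str) -> int:
    # Stand-in hash function; Arctic's real key hash may differ.
    return zlib.crc32(primary_key.encode())

def leaf_for_key(primary_key, leaves):
    """Find the leaf (depth, value) whose low-`depth` hash bits match the key.

    `leaves` lists the current leaves of the dynamic tree; because the tree
    splits nodes as data grows, leaves may sit at different depths.
    """
    h = key_hash(primary_key)
    for depth, value in leaves:
        if h & ((1 << depth) - 1) == value:
            return (depth, value)
    raise KeyError("no leaf covers this key")

# A tree whose "even" half split once more: leaves at mixed depths.
# Together they partition the whole hash space, so every key has exactly
# one owning leaf -- no global primary-key index is needed to find it.
leaves = [(2, 0b00), (2, 0b10), (1, 0b1)]
assert leaf_for_key("user_42", leaves) in leaves
```

Because leaf membership is a pure function of the key hash, a reader can locate the file group for a key directly from the node metadata stored in file attributes.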

Data files are split into INSERT and DELETE streams. To preserve the correct order across these streams, Arctic records a File Sequence (incremented per commit) and a Record Offset (monotonically increasing within a commit). Together they form a logical timeline that guarantees global ordering for both inserts and deletes.

Real‑time Data Access

Lakehouse real‑time performance depends on how and when INSERT and DELETE files are merged. Two main strategies are used:

Copy‑on‑Write (CoW) : A background service in Arctic automatically triggers CoW based on table/file state and user configuration. It can target specific partitions to limit resource usage, but the operation is write‑heavy and may cause write amplification, making it suitable for hour‑level or longer latency.
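A toy sketch of the CoW trade‑off (illustrative only, not Arctic’s code): applying a delete means rewriting every base file that contains an affected key, so even a single deleted row rewrites its whole file. This is the write amplification that pushes CoW toward hour‑level or longer latency.

```python
def copy_on_write(base_files, deleted_keys):
    """Apply deletes by rewriting base files wholesale.

    Each base file is a list of rows; every affected file is copied in full
    with the deleted rows filtered out, and the old file is discarded --
    write cost is proportional to file size, not to the number of deletes.
    """
    deleted = set(deleted_keys)
    return [
        [row for row in f if row["key"] not in deleted]
        for f in base_files
    ]

base = [
    [{"key": "a", "v": 1}, {"key": "b", "v": 2}],  # deleting "b" rewrites this file
    [{"key": "c", "v": 3}],                        # untouched keys still get copied
]
new_files = copy_on_write(base, ["b"])
assert new_files == [[{"key": "a", "v": 1}], [{"key": "c", "v": 3}]]
```

After the rewrite, reads are cheap: queries scan the new base files directly with no merge work, which is why CoW favors read‑heavy workloads that tolerate batch‑style write latency.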

Merge‑on‑Read (MoR) : Integrated into the query engine (e.g., Spark SQL 3.0), MoR merges INSERT and DELETE streams at read time. Arctic extends Spark with a custom Catalog to distinguish Hive tables from Arctic tables and implements a DataSourceV2 that adds merge logic in the partition reader, delivering a consistent view to users.
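The reader‑side merge can be sketched as a generator (an illustration of the idea, not Arctic’s DataSourceV2 reader): writes stay cheap because delete files are only recorded, and the merge cost is paid at query time when the partition reader filters the INSERT stream against the DELETE set.

```python
def mor_reader(insert_rows, delete_keys):
    """Yield a consistent merged view at read time.

    Rows from the INSERT stream are emitted unless their key appears in the
    DELETE stream -- the merge happens per partition while reading, with no
    background rewrite of base files.
    """
    deleted = set(delete_keys)
    for row in insert_rows:
        if row["key"] not in deleted:
            yield row

inserts = [{"key": "a", "v": 1}, {"key": "b", "v": 2}, {"key": "c", "v": 3}]
view = list(mor_reader(inserts, ["b"]))
assert view == [{"key": "a", "v": 1}, {"key": "c", "v": 3}]
```

Because the merge is embedded in the reader, queries always see a consistent snapshot without waiting for compaction, at the cost of extra work on every read.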

Other Features

Arctic also supports automatic full and incremental sync from relational databases (MySQL, Oracle), file governance, access control, and ACID guarantees, making it a comprehensive lakehouse platform.

Future Outlook

The lakehouse will expand beyond OLAP to serve machine‑learning, scientific computing, and point‑lookup workloads. Arctic plans to integrate with additional query engines (Presto, etc.), improve visual modeling via the internal data platform, and add secondary indexes or external caches like Alluxio to further reduce query latency.

Tags: Big Data, Lakehouse, Copy-on-Write, Arctic, DELETE, Merge-on-Read, Data Update
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
