Big Data · 20 min read

Glacier: An Intelligent Data Lake Architecture for Real‑Time Analytics and Machine Learning

This article presents Glacier, OPPO's intelligent data lake solution that builds on Iceberg Table Format to provide real‑time data ingestion, low‑latency queries, advanced indexing, and robust multi‑version management for both structured and unstructured data, tightly integrating with machine‑learning workflows.

DataFunTalk

In recent years, the data lake ecosystem has been revitalized by concepts such as the lakehouse and the intelligent lakehouse. Data lakes store massive volumes of structured, semi-structured, and unstructured data and offer flexible schema-on-read, in contrast to the schema-on-write of traditional warehouses.

Advances in machine learning demand real-time training and recommendation, driving the convergence of streaming and batch processing and creating the need for a lakehouse that supports rapid data ingestion, real-time queries, and versioned data for ML workloads.

Among the open-source lakehouse table formats (Iceberg, Hudi, and Delta Lake), Iceberg was selected for its open API, strong engine compatibility, and robust partition handling, making it the foundation for OPPO's Glacier.

Glacier’s overall architecture adds a transparent engine layer compatible with Iceberg, a resident Glacier Service for data merging and index construction, a distributed‑memory cache (Glacier cache) for real‑time reads/writes, and Glacier Version for fine‑grained data version management.

Append Scenario (Real-time Recommendation): Flink tasks commit data to Iceberg every few seconds, generating many small files that stress storage and degrade query performance. Iceberg's native merge-on-read can compact files but cannot meet strict real-time SLAs, which prompted Glacier's cache-based solution.
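The scale of the small-file problem follows directly from the commit cadence. A back-of-the-envelope sketch (the interval and parallelism figures below are illustrative assumptions, not numbers from the article):

```python
# Illustrative estimate of small-file growth under frequent Flink commits.
# Numbers are assumptions for the sketch, not figures from the article.

def small_files_per_day(commit_interval_s: int, sink_parallelism: int) -> int:
    """Each commit typically produces one data file per writer subtask."""
    commits_per_day = 24 * 3600 // commit_interval_s
    return commits_per_day * sink_parallelism

# A 10-second commit interval with 32 writer subtasks:
n = small_files_per_day(commit_interval_s=10, sink_parallelism=32)
print(n)  # 276480 files per day for a single table
```

Hundreds of thousands of files per table per day is what makes periodic compaction alone insufficient for real-time SLAs.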

Glacier cache uses Netty-wrapped RPC with zero-copy, read-ahead, and flow control, and ensures consistency via Raft. Data is stored in Arrow format for efficient in-memory querying; once a size threshold is reached, Glacier Service merges the data and persists it to the underlying storage.
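The write path can be sketched as a threshold-triggered flush buffer. The class and method names below are hypothetical; the real system uses Arrow buffers, Netty RPC, and Raft replication, none of which is modeled here:

```python
# Minimal sketch of a threshold-triggered flush buffer, loosely mirroring
# how a write cache could hand accumulated batches to a merge/persist
# service. All names are hypothetical, not Glacier's actual API.

class WriteBuffer:
    def __init__(self, flush_threshold_bytes: int, persist_fn):
        self.flush_threshold_bytes = flush_threshold_bytes
        self.persist_fn = persist_fn        # called with the batch on flush
        self.rows, self.size = [], 0

    def append(self, row: bytes) -> None:
        self.rows.append(row)
        self.size += len(row)
        if self.size >= self.flush_threshold_bytes:
            self.flush()

    def flush(self) -> None:
        if self.rows:
            self.persist_fn(self.rows)      # merge + persist to storage
            self.rows, self.size = [], 0

flushed = []
buf = WriteBuffer(flush_threshold_bytes=10, persist_fn=flushed.append)
for r in [b"aaaa", b"bbbb", b"cccc"]:
    buf.append(r)
print(len(flushed))  # 1 — a single flush containing all three rows
```

Queries against the cache see data before it ever reaches storage, which is what closes the latency gap left by merge-on-read compaction.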

CDC Scenario (Delete Handling): Iceberg supports Position deletes (fast, but they require file offsets) and Equality deletes (slower, matched on column values). Glacier optimizes Equality deletes by inserting deleted rows into an in-memory del-map, filtering DataBlocks before they are written to disk, and using Bloom filters to route each delete only to the data files that may contain it, achieving performance comparable to Position deletes.
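The Bloom-filter routing step can be sketched as follows. The filter parameters and per-file layout are illustrative, not Glacier's actual implementation; the point is that a delete only needs to visit files whose filter reports a possible match:

```python
import hashlib

# Toy Bloom filter used to route equality deletes only to data files that
# might contain the deleted key. Parameters are illustrative assumptions.

class Bloom:
    def __init__(self, m_bits: int = 1024, k: int = 3):
        self.m, self.k, self.bits = m_bits, k, 0

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        return all(self.bits >> p & 1 for p in self._positions(key))

# One filter per persisted data file, built over its primary keys.
files = {"f1": ["a", "b"], "f2": ["c", "d"]}
blooms = {}
for name, keys in files.items():
    b = Bloom()
    for key in keys:
        b.add(key)
    blooms[name] = b

# Route a delete for key "c": only matching files are touched.
targets = [n for n, b in blooms.items() if b.might_contain("c")]
print(targets)  # very likely ["f2"]; false positives are possible but rare
```

A Bloom filter never produces false negatives, so no delete is lost; the occasional false positive merely costs one wasted file visit.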

Performance comparison (Table 3) shows Glacier’s optimized Equality delete matching Position delete speed.

Index Acceleration: Glacier Service builds indexes (Bloom filters, bitmaps, incremental Z-Order, and a primary-key inverted index) while data is synchronized, reducing I/O cost. Incremental Z-Order avoids re-sorting the full dataset by sorting only newly ingested data, delivering significant query speedups on TPC-H benchmarks.
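The core of Z-Order is bit interleaving: sorting records by an interleaved key clusters rows that are close in every indexed dimension, which is what makes multi-column range predicates cheap. A two-column sketch (Glacier applies the idea incrementally, sorting only new data):

```python
# Bit-interleaving sketch of a Z-order key for two integer columns.
# A real implementation handles more columns and non-integer types.

def z_order_key(x: int, y: int, bits: int = 16) -> int:
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits on odd positions
    return key

points = [(3, 5), (0, 0), (2, 2), (7, 1)]
print(sorted(points, key=lambda p: z_order_key(*p)))
# [(0, 0), (2, 2), (7, 1), (3, 5)]
```

Because the key of a new row is independent of existing rows, newly ingested data can be Z-sorted on its own and merged in, with no full-table rewrite.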

Primary-key indexes combine inverted indexes with Finite State Transducers (FSTs) for low-memory, highly compressed storage, yielding tens-fold query acceleration for equality and range predicates.
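The lookup behavior of such an index can be sketched with a plain sorted mapping from primary key to data-file id. This captures only the routing semantics; it models neither the FST encoding nor its compression:

```python
from bisect import bisect_left, bisect_right

# Minimal sketch of a primary-key inverted index: key -> data-file id.
# Glacier stores the mapping in an FST; a sorted dict-backed structure
# shows only the equality/range lookup behavior.

class PrimaryKeyIndex:
    def __init__(self, entries):                 # [(key, file_id), ...]
        self.keys = sorted(k for k, _ in entries)
        self.files = {k: f for k, f in entries}

    def lookup(self, key):                       # equality predicate
        return self.files.get(key)

    def range(self, lo, hi):                     # range predicate [lo, hi]
        i, j = bisect_left(self.keys, lo), bisect_right(self.keys, hi)
        return [self.files[k] for k in self.keys[i:j]]

idx = PrimaryKeyIndex([("u01", "f1"), ("u07", "f2"), ("u12", "f3")])
print(idx.lookup("u07"))        # f2
print(idx.range("u05", "u12"))  # ['f2', 'f3']
```

Point and range queries then open only the files the index names, instead of scanning every file in the partition.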

Data Multi-Version Management: Glacier Version extends Git-like operations (clone, commit, branch, merge) to both structured and unstructured data, using Merkle trees for fast metadata handling (5,000+ requests per second, <30 ms latency, PB-scale commits) and overcoming Git's limitations on large datasets.
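The Merkle-tree idea is that each commit hashes its data blocks into a tree, so two versions can be compared by root hash without scanning the data itself. A minimal sketch (the tree shape and hashing scheme are illustrative):

```python
import hashlib

# Merkle-root sketch for dataset versioning: comparing two commits reduces
# to comparing two 32-byte roots; a mismatch can then be narrowed down
# level by level to the changed blocks. Structure is illustrative.

def sha(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(blocks):
    level = [sha(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node if odd
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

v1 = [b"block-a", b"block-b", b"block-c"]
v2 = [b"block-a", b"block-B", b"block-c"]    # one block changed
print(merkle_root(v1) == merkle_root(v2))    # False: versions differ
```

Because unchanged subtrees hash identically, a commit touching one block re-hashes only one root-to-leaf path, which is what keeps PB-scale commits fast.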

Unstructured Data Management: Glacier Version converts unstructured data (e.g., images and tensors) into columnar blocks using formats such as HDF5, Zarr, or N5, attaching tags for metadata. This enables streaming reads, memory reuse via fixed-size blocks, and higher CPU/GPU utilization.

Transforming unstructured data into structured columnar blocks lets pipelined processing begin before the full transfer completes, improving transmission efficiency, memory reuse, and execution speed.
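The blocking step can be sketched as cutting a byte stream into equal-sized blocks, so a consumer can start on block 0 while block 1 is still in flight and can recycle fixed-size buffers. The block size is an assumption; real Glacier Version blocks use formats such as HDF5, Zarr, or N5 rather than raw bytes:

```python
# Sketch of fixed-size blocking for an unstructured payload. The generator
# yields blocks as they become available, enabling streaming/pipelined
# consumption instead of waiting for the whole object.

def to_blocks(payload: bytes, block_size: int):
    for off in range(0, len(payload), block_size):
        yield payload[off:off + block_size]

image = bytes(range(10)) * 100               # 1000-byte stand-in payload
blocks = list(to_blocks(image, block_size=256))
print([len(b) for b in blocks])              # [256, 256, 256, 232]
```

Uniform block sizes mean a training loader can preallocate and reuse a small pool of buffers, which is where the CPU/GPU-utilization gain comes from.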

Conclusion and Outlook: Glacier delivers a complete intelligent lakehouse platform integrating real-time data services, automated indexing, incremental Z-Order, and optimized delete handling, with robust version control for both data and models. Future work includes automated index selection, vector-data query acceleration, and deeper ML integration, complemented by OPPO's Shuttle (compute accelerator) and LakeLink (engine adapter).

Tags: Big Data · Real-time Analytics · Data Lake · Iceberg · Version Management · Glacier
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
