OPPO Smart Data Lakehouse: Architecture, Real‑time Lakehouse, and Technical Practices
This article summarizes OPPO's smart data lakehouse solution: its EB-scale architecture, the integration of batch and streaming engines, the Glacier service for table management, schema-adaptive ingestion, performance optimizations, and the future roadmap toward unified data processing.
As data volumes from online transactions, social media, and IoT devices grow explosively, traditional data-management solutions fall short. OPPO addresses this challenge with an intelligent lakehouse that combines the scalability of a data lake with the performance of a data warehouse.
The OPPO big-data platform processes petabytes of data daily, using Spark for offline jobs and Flink for real-time tasks. The architecture comprises an access layer that adapts to multiple compute engines, a compute layer with a shared external shuffle service (Shuttle), a scheduling layer optimized for cloud environments such as AWS, an operations diagnostic system that automatically detects Spark/Flink issues, and a data-lake layer (Glacier) that bridges the engine and storage layers.
The Glacier service extends Apache Iceberg with unified table management, lifecycle handling, and automatic cleanup. It also adds a distributed cache (Alluxio) and several index types (bitmap, Z-order, primary-key) to accelerate queries.
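To illustrate the idea behind a Z-order index, the sketch below interleaves the bits of two integer columns into a single Morton code and sorts rows by it, so rows that are close in both dimensions end up close together on disk and multi-column range filters touch fewer blocks. This is a minimal sketch of the general technique, not Glacier's actual implementation; the function name is hypothetical.

```python
def morton_code(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits -> even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits -> odd positions
    return code

# Clustering rows by the interleaved key keeps rows that are near
# each other in (x, y) space near each other in storage order.
rows = [(3, 5), (0, 0), (2, 2), (1, 1)]
rows.sort(key=lambda r: morton_code(*r))
```

An engine can then prune data files whose min/max Morton codes fall outside the range implied by a predicate on either column.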
Key technical innovations include:
Schema-adaptive ingestion: Flink CDC captures DDL changes, assigns a schema ID to each record, and automatically evolves the schema of downstream Iceberg tables.
Data‑source merging: multiple MySQL tables are synchronized through a single CDC task, reducing resource consumption.
Delete‑file optimization: snapshot‑based delete files and bloom filters improve delete‑query performance.
Streaming-to-batch conversion: checkpoint-dump logic ensures exactly-once semantics and eventual consistency.
Sample stitching: machine-learning workloads use upsert tables and index-driven data retrieval to achieve high-throughput processing (roughly 7,000 rows/s).
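The schema-adaptive ingestion item above can be sketched as a small registry: whenever a CDC record arrives with a column set the sink has not yet seen (e.g. after an ALTER TABLE upstream), a new schema ID is assigned and attached to the record, so the downstream table writer knows which schema version to apply. The names here (SchemaRegistry, tag) are hypothetical illustrations, not OPPO's actual API.

```python
class SchemaRegistry:
    """Assign a monotonically increasing ID to each distinct column set."""

    def __init__(self):
        self._ids = {}  # frozenset of column names -> schema ID

    def tag(self, record: dict) -> tuple[int, dict]:
        """Return (schema_id, record); register a new ID on a schema change."""
        key = frozenset(record)
        if key not in self._ids:
            self._ids[key] = len(self._ids)  # new schema observed
        return self._ids[key], record

registry = SchemaRegistry()
sid1, _ = registry.tag({"id": 1, "name": "a"})
# Same column set -> same schema ID.
sid2, _ = registry.tag({"id": 2, "name": "b"})
# An added column (simulating an upstream ALTER TABLE) -> a new schema ID.
sid3, _ = registry.tag({"id": 3, "name": "c", "age": 20})
```

In a real pipeline the downstream writer would look up the full column definitions by schema ID and evolve the Iceberg table before committing the batch.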
The roadmap focuses on deeper cache optimization, unifying batch and streaming engines under Flink, supporting in‑memory formats for large‑model training, and open‑sourcing selected components.
A short Q&A clarifies that OPPO uses an enhanced Alluxio cache for real‑time reads/writes and compares the platform with similar open‑source projects such as NetEase Arctic.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.