OPPO Smart Data Lakehouse: Architecture, Real‑time Lakehouse, and Technical Practices
This article summarizes OPPO's smart data lakehouse solution: its EB-scale architecture, the integration of batch and streaming engines, the Glacier service for table management, schema-adaptive ingestion, performance optimizations, and the future roadmap toward unified data processing.
As data volumes from online transactions, social media, and IoT devices grow explosively, traditional data-management solutions fall short. OPPO addresses this challenge with an intelligent lakehouse that combines the scalability of a data lake with the performance of a data warehouse.
The OPPO big-data platform processes petabytes of data daily, using Spark for offline jobs and Flink for real-time tasks. The architecture comprises an access layer that adapts to multiple compute engines, a compute layer with a shared external shuffle service (Shuttle), a scheduling layer optimized for cloud environments such as AWS, an operations diagnostic system that automatically detects Spark/Flink issues, and a data-lake layer (Glacier) that bridges the engine and storage layers.
The Glacier service extends Apache Iceberg with unified table management, lifecycle handling, and automatic cleanup. It also adds a distributed cache (Alluxio) and several index types (bitmap, Z-order, primary-key) to accelerate queries.
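To illustrate the idea behind a Z-order index, the sketch below interleaves the bits of two integer columns into a single Morton code and sorts rows by it, so rows that are close in both dimensions end up close together on disk and multi-column range filters touch fewer blocks. This is a minimal sketch of the general technique, not Glacier's actual implementation; the function name is hypothetical.

```python
def morton_code(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits -> even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits -> odd positions
    return code

# Clustering rows by the interleaved key keeps rows that are near
# each other in (x, y) space near each other in storage order.
rows = [(3, 5), (0, 0), (2, 2), (1, 1)]
rows.sort(key=lambda r: morton_code(*r))
```

An engine can then prune data files whose min/max Morton codes fall outside the range implied by a predicate on either column.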
Key technical innovations include:
Schema-adaptive ingestion: Flink CDC captures DDL changes, assigns a schema ID to each record, and automatically evolves the schema of downstream Iceberg tables.
Data‑source merging: multiple MySQL tables are synchronized through a single CDC task, reducing resource consumption.
Delete‑file optimization: snapshot‑based delete files and bloom filters improve delete‑query performance.
Streaming-to-batch conversion: checkpoint-dump logic ensures exactly-once semantics and eventual consistency.
Sample stitching: machine-learning workloads use upsert tables and index-driven data retrieval to achieve high-throughput processing (roughly 7,000 rows/s).
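The schema-adaptive ingestion item above can be sketched as a small registry: whenever a CDC record arrives with a column set the sink has not yet seen (e.g. after an ALTER TABLE upstream), a new schema ID is assigned and attached to the record, so the downstream table writer knows which schema version to apply. The names here (SchemaRegistry, tag) are hypothetical illustrations, not OPPO's actual API.

```python
class SchemaRegistry:
    """Assign a monotonically increasing ID to each distinct column set."""

    def __init__(self):
        self._ids = {}  # frozenset of column names -> schema ID

    def tag(self, record: dict) -> tuple[int, dict]:
        """Return (schema_id, record); register a new ID on a schema change."""
        key = frozenset(record)
        if key not in self._ids:
            self._ids[key] = len(self._ids)  # new schema observed
        return self._ids[key], record

registry = SchemaRegistry()
sid1, _ = registry.tag({"id": 1, "name": "a"})
# Same column set -> same schema ID.
sid2, _ = registry.tag({"id": 2, "name": "b"})
# An added column (simulating an upstream ALTER TABLE) -> a new schema ID.
sid3, _ = registry.tag({"id": 3, "name": "c", "age": 20})
```

In a real pipeline the downstream writer would look up the full column definitions by schema ID and evolve the Iceberg table before committing the batch.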
The roadmap focuses on deeper cache optimization, unifying batch and streaming engines under Flink, supporting in‑memory formats for large‑model training, and open‑sourcing selected components.
A short Q&A clarifies that OPPO uses an enhanced Alluxio cache for real‑time reads/writes and compares the platform with similar open‑source projects such as NetEase Arctic.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.