
How OPPO Built a Multi‑Modal Data Lake with Gravitino and Curvine

OPPO’s data‑lake team, led by David, described the move from a Hive + Spark stack to a unified multi‑modal lake, using Gravitino for cross‑engine metadata management and the open‑source Curvine cache to eliminate data silos, improve I/O performance, and support massive image, recommendation, and AI‑agent workloads.

DataFunSummit

Background

OPPO’s data‑lake infrastructure has evolved over the past five to six years from an offline Hive setup with limited Spark to a hybrid environment spanning on‑premises clusters and Alibaba Cloud. The platform now supports petabyte‑scale multimodal data generated by phone imaging, recommendation and search, and edge AI agents.

Multi‑Modal Workloads

Phone Imaging: Large image collections are required to continuously improve camera algorithms.

Multimodal Recommendation: Diverse data types are leveraged to break the performance ceiling of traditional recommendation models.

Multimodal Agents: Internal projects such as restaurant‑recommendation and daily‑management assistants on mobile devices.

These workloads demand a unified storage, management, and query layer that can handle petabyte‑scale data without silos.

Four‑Layer Architecture of the Multi‑Modal Data Lake

Compute Engine: Spark serves as the universal query engine and is extended with the open‑source Lance project for high‑dimensional vector data.

Unified Metadata Management: Gravitino provides a catalog that supports Hive tables and Lance tables in a single namespace.

Acceleration Layer: The open‑source distributed cache file system Curvine addresses I/O bottlenecks on OSS.

Platform Services: Existing data‑map, permission, and governance services are reused to expose a unified data‑asset portal.
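The layered flow above can be sketched in a few lines: the compute layer resolves a table through one unified catalog, then dispatches to the right reader by format. This is a minimal illustration, not the real Gravitino or Spark API; all names (`CATALOG`, `READERS`, `read_table`, the table identifiers) are hypothetical.

```python
# Toy model of "one namespace, many engines": catalog metadata decides
# which engine reads a given table. Names and paths are illustrative.

CATALOG = {
    "hive.ads.clicks":  {"format": "hive",  "path": "oss://lake/ads/clicks"},
    "lance.img.embeds": {"format": "lance", "path": "oss://lake/img/embeds"},
}

READERS = {  # the compute layer picks a reader from catalog metadata
    "hive":  lambda path: f"spark-read:{path}",
    "lance": lambda path: f"lance-read:{path}",
}

def read_table(name: str) -> str:
    """Look up a table in the unified namespace, then read it with its engine."""
    entry = CATALOG[name]
    return READERS[entry["format"]](entry["path"])
```

The point of the sketch is that callers never hard-code storage paths or engine choices; both live in the metadata layer, which is what lets the lower layers change without touching queries.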

Gravitino: Core Capabilities

Data‑Silo Elimination: All multimodal assets are registered in a central catalog, preventing ad‑hoc directory creation.

Unified Metadata: Supports multiple engines (Hive, Lance) with a single catalog, reducing technical debt.

Cross‑Engine Query: Enables federated SQL that can join Hive and Lance tables, facilitating joint analytics.

Key features include a multi‑cloud catalog, fine‑grained access control, and automatic lineage tracking.
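What a cross‑engine join buys you can be shown with toy data: structured rows (as a Hive table would hold) enriched with vectors (as a Lance table would hold) on a shared key. In production this would be a Spark SQL join over two Gravitino‑managed catalogs; the in‑memory dicts and the `federated_join` helper below are stand‑ins, not a real engine.

```python
# Toy stand-in for a federated Hive x Lance join. The real query would be
# Spark SQL over two catalogs in one namespace; data here is illustrative.

hive_orders = [{"id": 1, "item": "phone"}, {"id": 2, "item": "watch"}]
lance_embeds = {1: [0.1, 0.9], 2: [0.7, 0.3]}  # id -> embedding vector

def federated_join(rows, vectors):
    """Join structured rows with vector data on a shared key."""
    return [dict(r, embedding=vectors[r["id"]]) for r in rows if r["id"] in vectors]

joined = federated_join(hive_orders, lance_embeds)
```

Without a unified catalog, the same result requires exporting one side into the other system first, which is exactly the silo the architecture is built to avoid.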

Curvine: Cloud‑Native Distributed Cache

Curvine offers two operating modes:

Cache Mode: Mirrors OSS objects locally to accelerate read‑heavy workloads.

FS Mode: Provides POSIX‑compatible semantics on top of OSS, allowing applications to treat object storage as a local disk.

It supports S3 and HDFS protocols and integrates natively with Kubernetes via the CSI driver.

Typical use cases include caching LanceDB indexes and manifest metadata for fast vector search, pre‑warming hot tables to reduce repeated OSS reads, and accelerating checkpoint writes during model training.
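The pre‑warming pattern behind those use cases is simple to sketch: eagerly mirror hot objects (such as index manifests) into the cache before queries arrive, and fall back to a read‑through on a miss. `fake_oss`, `cache`, `prewarm`, and `read` below are in‑memory stand‑ins for illustration, not Curvine's actual API.

```python
# Hedged sketch of cache-mode pre-warming: hot objects are copied from
# object storage into the cache ahead of time; cold reads fall through.

fake_oss = {  # stand-in for an OSS bucket holding Lance index files
    "idx/manifest.json": b'{"fragments": 3}',
    "idx/vectors.lance": b"\x00\x01",
}
cache = {}  # stand-in for the local cache layer

def prewarm(keys):
    """Eagerly mirror a list of hot objects into the cache."""
    for k in keys:
        cache[k] = fake_oss[k]

def read(key):
    """Read-through: serve from cache, fall back to OSS on a miss."""
    if key not in cache:
        cache[key] = fake_oss[key]
    return cache[key]

prewarm(["idx/manifest.json"])    # warm the manifest before queries arrive
data = read("idx/manifest.json")  # served from cache, no OSS round trip
```

The benefit is largest for small, frequently re‑read objects like manifests, where per‑request OSS latency dominates actual transfer time.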

Performance Evaluation

Using the dbpedia-entities-openai-1M benchmark, Curvine‑accelerated LanceDB achieved query latency comparable to the commercial version of LanceDB, while direct OSS storage lagged significantly. The results demonstrate that an open‑source cache can close the performance gap with proprietary solutions.

Future Outlook

Beyond caching, Curvine plans to evolve into a data‑transformation service that will automatically convert incoming data to Lance format, generate indexes, and merge small files to alleviate the small‑file problem common in multimodal lakes. The project is hosted at https://github.com/curvineio/curvine and welcomes community contributions.
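The small‑file compaction idea mentioned above reduces to a greedy fold: accumulate small objects until a size threshold is reached, then emit one larger file. The `compact` function and the byte‑size threshold below are a toy illustration under assumed parameters; the planned service would additionally rewrite the output into Lance format and build indexes.

```python
# Illustrative greedy compaction: merge many small blobs into fewer
# chunks of at least TARGET bytes. Threshold and names are hypothetical.

TARGET = 8  # toy target output size in bytes

def compact(files):
    """Greedily merge small byte blobs into chunks of at least TARGET bytes."""
    merged, current = [], b""
    for blob in files:
        current += blob
        if len(current) >= TARGET:
            merged.append(current)
            current = b""
    if current:  # flush any undersized remainder as a final chunk
        merged.append(current)
    return merged

out = compact([b"aa", b"bb", b"cccc", b"d"])
```

Fewer, larger files mean fewer object‑store requests and less metadata pressure, which is why compaction matters for multimodal lakes that ingest millions of small images.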

Conclusion

The combination of Gravitino for unified metadata and Curvine for I/O acceleration enables OPPO to manage petabyte‑scale multimodal data efficiently, providing a single source of truth, fast query performance, and a foundation for future AI‑driven services.

[Figure: OPPO multimodal data lake overview]

Tags: big data, open source, distributed cache, multimodal, data lake, metadata management
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
