How OPPO Accelerates Multimodal Data & AI Fusion with Gravitino and Curvine

OPPO tackles explosive multimodal data growth by unifying metadata with Gravitino and relieving I/O pressure with the open‑source Curvine cache. The result is a four‑layer data‑lake architecture that addresses data islands, metadata chaos, and bandwidth bottlenecks, with vector‑query speeds close to those of the commercial LanceDB offering.

DataFunSummit

Multimodal Application Scenarios

OPPO’s multimodal workloads span three core domains: mobile imaging (massive photo datasets for algorithm improvement), multimodal recommendation/search (breaking the plateau of traditional recommendation models), and on‑device AI agents (exploratory projects such as restaurant recommendation and daily‑task assistants). These use cases drive a rapid increase in multimodal data and impose new requirements on the underlying data infrastructure.

Data‑Lake Architecture Design

OPPO’s multimodal data lake is organized into four layers:

Compute Engine: Spark is the unified query engine, extended with the open‑source Lance project for vector search.

Unified Metadata Management: Gravitino serves as the unified catalog, managing Hive tables and Lance tables in a single namespace.

Acceleration Layer: Curvine, a cloud‑native distributed cache file system that OPPO built in‑house and open‑sourced, mitigates I/O bottlenecks when reading from and writing to object storage (OSS).

Platform Product Layer: Existing data‑map, permission, and governance services are reused to provide a unified data‑asset management experience.

Why Gravitino?

Two years before the presentation, OPPO’s data team was dealing with petabyte‑scale data that had been written by ad hoc scripts and had no clear ownership or governance. The team adopted Gravitino and enforced a policy that every new directory must be registered in the catalog. Incremental data was brought under management first, and existing metadata was consolidated gradually afterwards.
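The registration policy can be pictured as a periodic audit that flags storage directories not covered by any catalog entry. The following is a minimal sketch under stated assumptions, not OPPO's implementation: the function names, the example URIs, and the idea of matching by path prefix are all hypothetical, and the registered locations would in practice come from the catalog rather than a hard-coded list.

```python
def _as_prefix(uri):
    """Normalize a directory URI so prefix checks respect path boundaries."""
    return uri.rstrip("/") + "/"


def find_unregistered(directories, registered_locations):
    """Return the directories that no registered catalog location covers.

    A directory counts as covered when it equals a registered location or
    lives underneath one. Both arguments are plain lists of URI strings
    (hypothetical inputs for this sketch).
    """
    prefixes = [_as_prefix(loc) for loc in registered_locations]
    missing = []
    for directory in directories:
        candidate = _as_prefix(directory)
        if not any(candidate.startswith(prefix) for prefix in prefixes):
            missing.append(directory)
    return missing
```

Appending a trailing slash before comparing keeps a sibling such as `images_tmp` from being treated as a child of a registered `images` location.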

Gravitino provides three core capabilities:

Unified catalog for Hive and Lance tables, enabling joint management.

Multi‑cloud support (hybrid on‑prem + Alibaba Cloud) that makes data location transparent to workloads.

Global data‑asset visibility, including ownership, daily billing, and upstream/downstream dependencies.
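To make these three capabilities concrete, here is a toy in‑memory catalog written for illustration only. It does not use the Gravitino client API; `TableEntry`, `ToyCatalog`, and every field name are assumptions made for this sketch. It shows one namespace holding both Hive and Lance tables, name‑based lookup that hides the physical location, and a simple ownership query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    name: str      # fully qualified name, e.g. "catalog.schema.table"
    fmt: str       # table format: "hive" or "lance"
    location: str  # physical URI; callers never need to hard-code it
    owner: str     # team or person accountable for the table


class ToyCatalog:
    """Hypothetical stand-in for a unified catalog (not the Gravitino API)."""

    def __init__(self):
        self._tables = {}

    def register(self, entry):
        """Add a table; duplicate names are rejected to keep one namespace."""
        if entry.name in self._tables:
            raise ValueError(f"table already registered: {entry.name}")
        self._tables[entry.name] = entry

    def resolve(self, name):
        """Look a table up by name, wherever its data physically lives."""
        return self._tables[name]

    def tables_owned_by(self, owner):
        """List table names for one owner, sorted for stable output."""
        return sorted(e.name for e in self._tables.values() if e.owner == owner)
```

A job would call `resolve("lake.images.embeddings")` and read from the returned location, so moving the data between on‑prem storage and the cloud only requires updating the catalog entry.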

Curvine: High‑Performance Distributed Cache

Curvine is OPPO’s open‑source cloud‑native cache file system. It offers two modes: a cache mode that mirrors OSS objects and an FS mode that presents OSS data with full POSIX semantics. Curvine also supports S3 and HDFS protocols and integrates natively with Kubernetes via CSI.

Typical usage patterns include:

Caching LanceDB indexes and manifest metadata for hot‑path access.

Pre‑warming frequently read tables by loading OSS data onto local cache disks.

Accelerating checkpoint writes during model training.

Speeding up vector queries on the LanceDB dataset, achieving performance comparable to the commercial LanceDB version.
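The first two patterns, hot‑path caching and pre‑warming, follow the general read‑through cache idea. The sketch below illustrates that idea in plain Python and is not Curvine's implementation or API: `ReadThroughCache`, the `backend_read` callable, and the object‑count capacity with LRU eviction are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy read-through cache with LRU eviction (illustrative only)."""

    def __init__(self, backend_read, capacity):
        self._read = backend_read    # callable: key -> bytes (e.g. an OSS GET)
        self._capacity = capacity    # maximum number of cached objects
        self._data = OrderedDict()   # insertion order doubles as LRU order
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Serve from cache when possible; otherwise fetch and remember."""
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = self._read(key)
        self._put(key, value)
        return value

    def prewarm(self, keys):
        """Load objects ahead of time so the first query is already a hit."""
        for key in keys:
            if key not in self._data:
                self._put(key, self._read(key))

    def _put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict least recently used
```

Pre‑warming index and manifest files before a query burst trades some up‑front reads for predictable latency on the hot path.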

Performance Comparison

Using the dbpedia-entities-openai-1M dataset as the benchmark, OPPO compared three setups: (1) direct OSS storage, (2) community‑edition LanceDB on OSS, and (3) Curvine‑accelerated community‑edition LanceDB. Queries on Curvine‑accelerated LanceDB ran almost on par with the commercial LanceDB offering, which suggests that an open‑source cache can recover much of the performance advantage of the proprietary product.
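A comparison like this is typically run by timing the same query set against each setup and comparing latency percentiles. The harness below is a minimal sketch of that approach, not the one OPPO used: `run_query` is a hypothetical callable that would wrap a vector search against one setup, and the nearest‑rank percentile is one of several common definitions.

```python
import math
import time


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def time_queries(run_query, queries, warmup=1):
    """Run each query once after a short warm-up; return latencies in seconds."""
    for query in queries[:warmup]:
        run_query(query)  # warm caches so cold-start cost is not measured
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Reduce raw latencies to the figures usually reported side by side."""
    return {
        "p50": percentile(latencies, 50),
        "p99": percentile(latencies, 99),
        "mean": sum(latencies) / len(latencies),
    }
```

Running `summarize(time_queries(run_query, queries))` once per setup yields comparable rows; whether to include the warm‑up pass depends on whether cold or warm cache behavior is the quantity of interest.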

Future Outlook

Curvine is evolving from a pure caching service to a broader data‑transformation middleware. Planned features include automatic conversion of incoming data to Lance format, index‑building services, and automatic small‑file merging to address the typical small‑file problem in multimodal data lakes. The project is open‑source (https://github.com/curvineio/curvine) and invites industry partners to contribute.
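Small‑file merging is commonly planned as a bin‑packing problem: group files below a size threshold so that each group approaches a target output size. The sketch below uses first‑fit decreasing as one simple heuristic. It is an illustration of the general technique, not Curvine's planned design; `plan_merges` and its parameters are hypothetical names.

```python
def plan_merges(files, target_bytes, small_threshold):
    """Group small files into merge batches of at most target_bytes each.

    files is a list of (path, size_in_bytes) tuples. Files at or above
    small_threshold are left alone. Returns a list of path groups; groups
    with a single file are dropped because merging them gains nothing.
    """
    small = sorted(
        (f for f in files if f[1] < small_threshold),
        key=lambda f: f[1],
        reverse=True,  # first-fit decreasing: place the largest files first
    )
    groups = []  # each item is [total_bytes, [paths]]
    for path, size in small:
        for group in groups:
            if group[0] + size <= target_bytes:
                group[0] += size
                group[1].append(path)
                break
        else:
            # No existing group has room, so open a new one.
            groups.append([size, [path]])
    return [paths for _total, paths in groups if len(paths) > 1]
```

Each returned group would then be rewritten as a single larger file, reducing the number of objects a query has to open.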

Overall, OPPO’s multimodal data‑lake implementation demonstrates that unified metadata (Gravitino) combined with a high‑performance cache layer (Curvine) can effectively resolve data silos, metadata chaos, and I/O bottlenecks, enabling scalable AI workloads on cloud infrastructure.
