How OPPO Accelerates Multimodal Data & AI Fusion with Gravitino and Curvine
OPPO tackles explosive multimodal data growth by unifying metadata with Gravitino and improving I/O performance with the open‑source Curvine cache. The result is a four‑layer data‑lake architecture that addresses data silos, metadata chaos, and bandwidth bottlenecks, with vector‑query performance reported as close to that of the commercial LanceDB offering.
Multimodal Application Scenarios
OPPO’s multimodal workloads span three core domains: mobile imaging (massive photo datasets for algorithm improvement), multimodal recommendation/search (breaking the plateau of traditional recommendation models), and on‑device AI agents (exploratory projects such as restaurant recommendation and daily‑task assistants). These use cases drive a rapid increase in multimodal data and impose new requirements on the underlying data infrastructure.
Data‑Lake Architecture Design
OPPO’s multimodal data lake is organized into four layers:
Compute Engine: Spark is used as the unified query engine, extended with the open‑source Lance project for vector‑search capabilities.
Unified Metadata Management: Gravitino serves as the industry‑standard catalog, supporting Hive tables and Lance tables in a single namespace.
Acceleration Layer: The internally built, cloud‑native distributed cache file system Curvine mitigates I/O bottlenecks on OSS.
Platform Product Layer: Existing data‑map, permission, and governance services are reused to provide a unified data‑asset management experience.
Why Gravitino?
Two years prior to the presentation, OPPO’s data team faced massive PB‑scale data scattered across scripts with no ownership or governance. Gravitino was adopted to enforce a policy that any new directory must be registered in the catalog, gradually consolidating both incremental and existing metadata.
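The registration policy described above can be illustrated with a minimal sketch. It assumes a toy in‑memory registry; `CatalogRegistry` and `check_new_directory` are hypothetical names for illustration and are not part of the Gravitino API. The idea is that a directory is accepted only if it, or one of its ancestors, has a registered owner.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Dict, Optional


@dataclass
class CatalogRegistry:
    """Toy in-memory stand-in for a metadata catalog (NOT the Gravitino API)."""

    # Maps a normalized directory prefix to the team that owns it.
    registered: Dict[str, str] = field(default_factory=dict)

    def register(self, prefix: str, owner: str) -> None:
        self.registered[self._normalize(prefix)] = owner

    def owner_of(self, path: str) -> Optional[str]:
        """Return the owner of the nearest registered ancestor, or None."""
        current = PurePosixPath(self._normalize(path))
        while True:
            owner = self.registered.get(str(current))
            if owner is not None:
                return owner
            if current == current.parent:  # reached the root without a match
                return None
            current = current.parent

    @staticmethod
    def _normalize(path: str) -> str:
        return str(PurePosixPath("/" + path.strip("/")))


def check_new_directory(registry: CatalogRegistry, path: str) -> str:
    """Enforce the policy: refuse directories with no registered owner."""
    owner = registry.owner_of(path)
    if owner is None:
        raise PermissionError(f"{path} is not registered in the catalog")
    return owner
```

In practice such a check would sit in whatever job‑submission or storage‑provisioning path creates directories, so that unregistered data cannot accumulate unnoticed.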
Gravitino provides three core capabilities:
Unified catalog for Hive and Lance tables, enabling joint management.
Multi‑cloud support (hybrid on‑prem + Alibaba Cloud) that makes data location transparent to workloads.
Global data‑asset visibility, including ownership, daily billing, and upstream/downstream dependencies.
Curvine: High‑Performance Distributed Cache
Curvine is OPPO’s open‑source cloud‑native cache file system. It offers two modes: a cache mode that mirrors OSS objects and an FS mode that presents OSS data with full POSIX semantics. Curvine also supports S3 and HDFS protocols and integrates natively with Kubernetes via CSI.
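To make the cache‑mode idea concrete, here is a minimal read‑through cache sketch. It is a generic LRU design under stated assumptions, not Curvine's implementation: the `fetch` callable stands in for a slow object‑store read, and `ReadThroughCache` is a hypothetical name.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """LRU read-through cache keyed by object path, bounded by total bytes."""

    def __init__(self, fetch: Callable[[str], bytes], capacity_bytes: int) -> None:
        self._fetch = fetch              # stand-in for a slow object-store read
        self._capacity = capacity_bytes
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self._used = 0
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._entries:
            self._entries.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        data = self._fetch(key)              # miss: go to the backing store
        self._store(key, data)
        return data

    def _store(self, key: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return                           # too large to cache; serve uncached
        while self._used + len(data) > self._capacity:
            _, evicted = self._entries.popitem(last=False)  # evict least recent
            self._used -= len(evicted)
        self._entries[key] = data
        self._used += len(data)
```

A distributed cache adds placement, consistency, and persistence concerns that this single‑process sketch ignores; it only shows why repeated reads of hot objects stop hitting the object store.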
Typical usage patterns include:
Caching LanceDB indexes and manifest metadata for hot‑path access.
Pre‑warming frequently read tables by loading OSS data onto local cache disks.
Accelerating checkpoint writes during model training.
Speeding up vector queries on the LanceDB dataset, achieving performance comparable to the commercial LanceDB version.
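The pre‑warming pattern in the list above raises the question of which tables to load when cache disks are smaller than the dataset. One simple approach, sketched below, is a greedy plan that favors tables with the most reads per byte. The `TableStats` fields and the `plan_prewarm` function are assumptions for illustration; the source does not describe how OPPO selects tables.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TableStats:
    name: str
    size_bytes: int
    reads_last_day: int   # hypothetical access counter from query logs


def plan_prewarm(tables: List[TableStats], cache_bytes: int) -> List[str]:
    """Greedily pick tables with the highest reads-per-byte that still fit."""
    ranked = sorted(
        tables,
        key=lambda t: t.reads_last_day / max(t.size_bytes, 1),
        reverse=True,
    )
    chosen: List[str] = []
    remaining = cache_bytes
    for table in ranked:
        # Skip tables nobody reads and tables that no longer fit.
        if table.reads_last_day > 0 and table.size_bytes <= remaining:
            chosen.append(table.name)
            remaining -= table.size_bytes
    return chosen
```

The greedy rule is not optimal in general (this is a knapsack problem), but it is cheap to compute and easy to explain to table owners.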
Performance Comparison
Using the dbpedia-entities-openai-1M benchmark, OPPO compared three setups: (1) direct OSS storage, (2) community‑edition LanceDB on OSS, and (3) Curvine‑accelerated community‑edition LanceDB. According to the reported results, queries on Curvine‑accelerated LanceDB were almost on par with the commercial LanceDB offering, which suggests that the open‑source cache recovers most of the performance advantage of the proprietary product.
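A comparison of this kind typically reduces to timing the same query set against each setup and comparing latency summaries. The sketch below shows one way to do that, using nearest‑rank percentiles; `run_query` is a placeholder for whatever client call each setup exposes, and none of the function names come from the source.

```python
import math
import statistics
import time
from typing import Callable, Dict, List, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_queries(run_query: Callable[[object], object],
                 queries: Sequence[object]) -> List[float]:
    """Run each query once and return per-query latency in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Collapse raw latencies into the figures usually reported."""
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }
```

Running `summarize(time_queries(run_query, queries))` once per setup, with a warm‑up pass beforehand, gives comparable numbers; tail percentiles matter here because cache misses show up in the tail rather than in the mean.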
Future Outlook
Curvine is evolving from a pure caching service to a broader data‑transformation middleware. Planned features include automatic conversion of incoming data to Lance format, index‑building services, and automatic small‑file merging to address the typical small‑file problem in multimodal data lakes. The project is open‑source (https://github.com/curvineio/curvine) and invites industry partners to contribute.
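The planning step behind small‑file merging can be sketched as a bin‑packing problem: group files below a size threshold into batches that approach a target output size. The following uses first‑fit decreasing as an illustrative heuristic; `plan_merges` and its parameters are hypothetical, and this is not a description of Curvine's planned implementation.

```python
from typing import Dict, List


def plan_merges(file_sizes: Dict[str, int],
                small_threshold: int,
                target_size: int) -> List[List[str]]:
    """Group files smaller than small_threshold into batches <= target_size."""
    if small_threshold > target_size:
        raise ValueError("small_threshold must not exceed target_size")

    # Largest small files first (first-fit decreasing).
    small = sorted(
        ((size, name) for name, size in file_sizes.items()
         if size < small_threshold),
        reverse=True,
    )

    batch_sizes: List[int] = []
    batches: List[List[str]] = []
    for size, name in small:
        for i, used in enumerate(batch_sizes):
            if used + size <= target_size:   # first batch with enough room
                batch_sizes[i] += size
                batches[i].append(name)
                break
        else:
            batch_sizes.append(size)         # no batch fits: open a new one
            batches.append([name])

    # A batch with a single file gains nothing from being rewritten.
    return [batch for batch in batches if len(batch) >= 2]
```

Files at or above the threshold are left alone, so the planner only rewrites data where merging actually reduces the object count.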
Overall, OPPO’s multimodal data‑lake implementation demonstrates that unified metadata (Gravitino) combined with a high‑performance cache layer (Curvine) can effectively resolve data silos, metadata chaos, and I/O bottlenecks, enabling scalable AI workloads on cloud infrastructure.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
