How Apache Gravitino Solves Data Fragmentation in the Multi‑Cloud AI Era
At a Data for AI meetup, Datastrato's VP of Engineering Shi Shaofeng explained how Apache Gravitino's metadata federation, metalake architecture, and unified access control address multi‑cloud data fragmentation, compliance, and AI‑driven governance. He also outlined the enhancements in version 1.1.0 and the roadmap for 1.2.0.
Background and Problem Statement
Modern enterprises face severe data‑asset fragmentation caused by multi‑cloud deployments aimed at avoiding vendor lock‑in and by strict data‑sovereignty regulations that force geographically distributed storage. This physical isolation complicates data discovery for users, applications, and AI agents, and creates security blind spots for personally identifiable information (PII).
Metadata Federation as a Remedy
Shi Shaofeng proposes Apache Gravitino’s Metadata Federation design, which provides a logical, unified metadata entry point without moving physical data. Gravitino catalogs not only relational tables but also unstructured Filesets, AI Models, and streaming Topics, enabling a single view of heterogeneous assets.
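To make the "single entry point" idea concrete, the following is a minimal sketch of how a client might address every asset kind under one naming hierarchy. The path layout is modeled on the general shape of Gravitino's REST API, but the layout, the host `http://localhost:8090`, and the helper name `asset_url` are assumptions for illustration and should be checked against the official API reference for the version in use.

```python
from urllib.parse import quote

# Asset kinds named in the talk, mapped to an assumed REST path segment.
ASSET_SEGMENTS = {
    "table": "tables",
    "fileset": "filesets",
    "model": "models",
    "topic": "topics",
}


def asset_url(base, metalake, catalog, schema, kind, name):
    """Build one URL for any asset kind under a single naming hierarchy."""
    if kind not in ASSET_SEGMENTS:
        raise ValueError(f"unknown asset kind: {kind}")
    parts = [
        "api", "metalakes", metalake,
        "catalogs", catalog,
        "schemas", schema,
        ASSET_SEGMENTS[kind], name,
    ]
    # Percent-encode each segment so unusual names cannot break the path.
    return base.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


if __name__ == "__main__":
    base = "http://localhost:8090"  # assumed local server address
    print(asset_url(base, "demo_lake", "hive_eu", "sales", "table", "orders"))
    print(asset_url(base, "demo_lake", "s3_us", "corpus", "fileset", "raw_text"))
```

The point of the sketch is that a table in one region and a fileset in another differ only in the catalog name and the final path segment, not in the client code that locates them.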
Technical Architecture Decomposition
Gravitino adopts a layered, decoupled architecture:
Connection Layer: pluggable adapters for diverse storage engines.
Metalake Core: defines a standard object model and schema system, supporting interchangeable metadata stores.
Interface Layer: exposes a unified REST API consumed by engines such as Spark, Trino, and Flink, so applications no longer have to handle the specifics of each storage system themselves.
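The three layers can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Gravitino's implementation: the names `Connector`, `InMemoryConnector`, `Metalake`, and `handle_request` are hypothetical, and the in-memory adapter stands in for a real storage engine.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class Connector(Protocol):
    """Connection layer: one adapter per backing system."""

    def list_tables(self, schema: str) -> List[str]: ...


@dataclass
class InMemoryConnector:
    """Stand-in adapter; a real one would call Hive, Iceberg, and so on."""
    tables: Dict[str, List[str]]

    def list_tables(self, schema: str) -> List[str]:
        return sorted(self.tables.get(schema, []))


class Metalake:
    """Core layer: a uniform object model over registered catalogs."""

    def __init__(self) -> None:
        self._catalogs: Dict[str, Connector] = {}

    def register(self, catalog: str, connector: Connector) -> None:
        self._catalogs[catalog] = connector

    def list_tables(self, catalog: str, schema: str) -> List[str]:
        # Callers use the same call regardless of the backing engine.
        names = self._catalogs[catalog].list_tables(schema)
        return [f"{catalog}.{schema}.{n}" for n in names]


def handle_request(lake: Metalake, path: str) -> List[str]:
    """Interface layer: map a toy 'catalog/schema' path onto the core."""
    catalog, schema = path.strip("/").split("/")
    return lake.list_tables(catalog, schema)


if __name__ == "__main__":
    lake = Metalake()
    lake.register("hive_eu", InMemoryConnector({"sales": ["orders", "customers"]}))
    lake.register("iceberg_us", InMemoryConnector({"sales": ["returns"]}))
    print(handle_request(lake, "/hive_eu/sales"))
    print(handle_request(lake, "/iceberg_us/sales"))
```

Adding a new storage engine in this model means writing one more adapter; the core and the interface layer stay unchanged, which is the decoupling the layered design is after.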
The system natively supports the Iceberg REST API and the Lance REST API, providing protocol‑level compatibility with existing lakehouse ecosystems. For unstructured data, Gravitino introduces a Fileset abstraction and a virtual file system (GVFS) that lets compute frameworks access globally distributed corpora as if they were local files, masking protocol differences.
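The way a virtual path can hide storage protocols is easiest to see in a small resolver. The sketch below assumes the virtual path form `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub-path>`; the lookup table, the bucket names, and the function `resolve_gvfs` are hypothetical. In a real deployment the fileset location would be looked up through the Gravitino server rather than a local dictionary.

```python
from urllib.parse import urlparse

# Hypothetical registry: (catalog, schema, fileset) -> physical storage root.
FILESET_LOCATIONS = {
    ("s3_us", "corpus", "raw_text"): "s3://us-bucket/corpus/raw_text",
    ("oss_cn", "corpus", "raw_text"): "oss://cn-bucket/corpus/raw_text",
}


def resolve_gvfs(virtual_path, locations=FILESET_LOCATIONS):
    """Translate a gvfs://fileset/... path into a physical storage URI."""
    parsed = urlparse(virtual_path)
    if parsed.scheme != "gvfs" or parsed.netloc != "fileset":
        raise ValueError(f"not a GVFS fileset path: {virtual_path}")
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) < 3:
        raise ValueError("expected /<catalog>/<schema>/<fileset>[/<sub-path>]")
    key = tuple(segments[:3])
    if key not in locations:
        raise KeyError(f"unknown fileset: {'.'.join(key)}")
    root = locations[key]
    sub_path = "/".join(segments[3:])
    return f"{root}/{sub_path}" if sub_path else root


if __name__ == "__main__":
    for path in (
        "gvfs://fileset/s3_us/corpus/raw_text/2024/part-0.txt",
        "gvfs://fileset/oss_cn/corpus/raw_text/2024/part-0.txt",
    ):
        print(path, "->", resolve_gvfs(path))
```

A training job written against the `gvfs://` paths would not need to change if a fileset moved from one object store to another; only the registry entry would.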
Unified Permission Control and Data‑AI Collaboration
Gravitino implements role‑based access control (RBAC). Permissions are either pushed down to the underlying storage system or checked in real time at the API gateway, so they no longer have to be managed separately in each heterogeneous system. In AI‑centric workflows, logical metadata sharing replaces costly physical data export: data teams publish processed datasets as Filesets, which AI frameworks (TensorFlow, PyTorch, Ray) can consume directly, without copying the data and with tighter security.
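A gateway-side permission check of the kind described can be sketched as follows. This is an illustrative model only: the class `AccessPolicy`, the privilege labels `READ_FILESET` and `WRITE_FILESET`, the dotted object names, and the rule that a grant on a parent object covers its children are all assumptions of the sketch, not a description of Gravitino's actual privilege model.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# A grant pairs a privilege with a securable object, e.g. a fileset name.
Grant = Tuple[str, str]


@dataclass
class AccessPolicy:
    role_grants: Dict[str, Set[Grant]] = field(default_factory=dict)
    user_roles: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, role: str, privilege: str, obj: str) -> None:
        self.role_grants.setdefault(role, set()).add((privilege, obj))

    def assign(self, user: str, role: str) -> None:
        self.user_roles.setdefault(user, set()).add(role)

    def is_allowed(self, user: str, privilege: str, obj: str) -> bool:
        """Allow if any role holds the privilege on obj or on an ancestor."""
        # "lake.cat.schema.fs" is covered by grants on "lake.cat.schema", etc.
        parts = obj.split(".")
        scopes = {".".join(parts[:i]) for i in range(1, len(parts) + 1)}
        for role in self.user_roles.get(user, ()):
            for granted_privilege, granted_obj in self.role_grants.get(role, ()):
                if granted_privilege == privilege and granted_obj in scopes:
                    return True
        return False


if __name__ == "__main__":
    policy = AccessPolicy()
    policy.grant("ml_reader", "READ_FILESET", "lake.s3_us.corpus")
    policy.assign("alice", "ml_reader")
    target = "lake.s3_us.corpus.raw_text"
    print(policy.is_allowed("alice", "READ_FILESET", target))
    print(policy.is_allowed("alice", "WRITE_FILESET", target))
```

Because the roles and grants live in one place, the same decision applies whether the request comes from Spark, Trino, or a training job, which is the duplication the unified model is meant to remove.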
Version 1.1.0 Highlights
Release 1.1.0 adds multimodal AI storage support and security hardening. Key features include:
Integration of Lance REST, enabling efficient vector search and large‑scale corpus handling.
A fully compliant Lance REST API for unified management of Iceberg, Hudi, Paimon, and Lance assets.
Generic Lakehouse Catalog that offers a uniform metadata interface for file‑system‑based lakehouse tables, improving cross‑backend discoverability.
Deep integration of Gravitino’s RBAC model into Iceberg’s protocol chain for fine‑grained permission checks.
Roadmap to Version 1.2.0
The upcoming 1.2.0 release will introduce:
User‑Defined Function (UDF) support across Spark and AI engines.
Automated Table Maintenance Service (TMS) with rewrite and cleanup templates for Iceberg and Paimon.
Delta Lake deep integration via the generic catalog.
Enhanced ecosystem bindings, e.g., Daft’s native GVFS support through GravitinoConfig.
These additions aim to cement Gravitino’s role as an enterprise‑grade metadata governance foundation.
Community Vision
Since its 2023 open‑source launch, Gravitino has grown rapidly, graduating from the Apache Incubator to a Top‑Level Project within five months. The community plans to integrate large‑language‑model capabilities via the Model Context Protocol (MCP), exposing metadata semantics to AI agents for autonomous classification, policy‑driven optimization, and dynamic governance. This signals a shift from static catalogs to intelligent, self‑governing data platforms.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
