How Apache Gravitino Solves Data Fragmentation in the Multi‑Cloud AI Era

At a Data for AI meetup, Shi Shaofeng, VP of Engineering at Datastrato, explains how Apache Gravitino's metadata federation, metalake architecture, and unified access control address multi‑cloud data fragmentation, compliance, and AI‑driven governance, and outlines the version 1.1.0 enhancements and the roadmap for 1.2.0.

DataFunSummit

Apr 20, 2026

Background and Problem Statement

Modern enterprises face severe data‑asset fragmentation caused by multi‑cloud deployments aimed at avoiding vendor lock‑in and by strict data‑sovereignty regulations that force geographically distributed storage. This physical isolation complicates data discovery for users, applications, and AI agents, and creates security blind spots for personally identifiable information (PII).

Metadata Federation as a Remedy

Shi Shaofeng proposes Apache Gravitino’s Metadata Federation design, which provides a logical, unified metadata entry point without moving physical data. Gravitino catalogs not only relational tables but also unstructured Filesets, AI Models, and streaming Topics, enabling a single view of heterogeneous assets.
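The federation idea can be illustrated with a minimal in-memory sketch. It is not Gravitino's implementation: the `Asset`, `Connector`, `StaticConnector`, and `Federation` names, the catalog names, and the storage locations are all hypothetical and chosen only to show that a search merges metadata from several backends while the data itself stays in place.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol


@dataclass(frozen=True)
class Asset:
    """One metadata entry; the physical data stays where it is."""
    catalog: str   # logical catalog name inside the federation
    name: str      # asset name within that catalog
    kind: str      # "table", "fileset", "model", or "topic"
    location: str  # where the data physically lives (never copied)


class Connector(Protocol):
    """Hypothetical adapter interface: a backend only lists its metadata."""
    def list_assets(self) -> Iterator[Asset]: ...


class StaticConnector:
    """Stand-in for a real backend adapter, serving a fixed list of assets."""
    def __init__(self, assets: List[Asset]) -> None:
        self._assets = assets

    def list_assets(self) -> Iterator[Asset]:
        return iter(self._assets)


class Federation:
    """Single logical entry point over independently stored catalogs."""
    def __init__(self) -> None:
        self._connectors: Dict[str, Connector] = {}

    def register(self, catalog: str, connector: Connector) -> None:
        self._connectors[catalog] = connector

    def search(self, kind: str) -> List[Asset]:
        # Fan out to every backend and merge metadata only; no data is moved.
        found: List[Asset] = []
        for connector in self._connectors.values():
            found.extend(a for a in connector.list_assets() if a.kind == kind)
        return sorted(found, key=lambda a: (a.catalog, a.name))


federation = Federation()
federation.register("eu_lake", StaticConnector([
    Asset("eu_lake", "orders", "table", "s3://eu-bucket/warehouse/orders"),
    Asset("eu_lake", "scans", "fileset", "s3://eu-bucket/raw/scans"),
]))
federation.register("us_stream", StaticConnector([
    Asset("us_stream", "clicks", "topic", "kafka://us-broker/clicks"),
    Asset("us_stream", "pdfs", "fileset", "gs://us-bucket/raw/pdfs"),
]))

filesets = federation.search("fileset")
print([f"{a.catalog}.{a.name}" for a in filesets])
```

One query returns filesets from both regions, and each result still points at its original storage location.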

Technical Architecture Decomposition

Gravitino adopts a layered, decoupled architecture:

Connection Layer: pluggable adapters for diverse storage engines.

Metalake Core: defines a standard object model and schema system, supporting interchangeable metadata stores.

Interface Layer: exposes a unified REST API consumed by engines such as Spark, Trino, and Flink, dramatically reducing application‑level storage complexity.
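To make the "unified REST API" point concrete, the sketch below builds a table URL by hand and sends no request. The path layout (`/api/metalakes/.../catalogs/.../schemas/.../tables/...`), the port `8090`, and every name passed in are assumptions for illustration; the Gravitino API reference is the authority on the real routes.

```python
from urllib.parse import quote


def table_url(base_url: str, metalake: str, catalog: str,
              schema: str, table: str) -> str:
    """Build a table URL under an assumed metalake/catalog/schema hierarchy."""
    parts = ["api", "metalakes", metalake, "catalogs", catalog,
             "schemas", schema, "tables", table]
    # Percent-encode each segment so unusual names cannot break the path.
    return base_url.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


# Hypothetical catalogs backed by different engines share one URL shape.
hive_url = table_url("http://localhost:8090", "demo_metalake",
                     "hive_prod", "sales", "orders")
iceberg_url = table_url("http://localhost:8090/", "demo_metalake",
                        "iceberg_eu", "sales", "orders")
print(hive_url)
print(iceberg_url)
```

Under this assumption, the calling engine changes only the catalog name; which storage system sits behind the catalog is the Connection Layer's concern.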

The system natively supports the Iceberg REST API and the Lance REST API, which keeps it compatible at the protocol level with existing lakehouse ecosystems. For unstructured data, Gravitino introduces a Fileset abstraction and a virtual file system (GVFS) that lets compute frameworks access globally distributed corpora as if they were local files, masking protocol differences.
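The way a virtual file system hides storage protocols can be sketched as a simple path translation. This is an illustration, not GVFS itself: the `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub_path>` layout is assumed here, and the `FILESET_LOCATIONS` registry, the `resolve` function, and all names and buckets are hypothetical.

```python
from typing import Dict, Tuple

GVFS_PREFIX = "gvfs://fileset/"

# Hypothetical registry: (catalog, schema, fileset) -> physical storage root.
FILESET_LOCATIONS: Dict[Tuple[str, str, str], str] = {
    ("corpus_catalog", "nlp", "wiki_eu"): "s3://eu-bucket/corpora/wiki",
    ("corpus_catalog", "nlp", "wiki_us"): "gs://us-bucket/corpora/wiki",
}


def resolve(virtual_path: str) -> str:
    """Translate a virtual fileset path into its physical location."""
    if not virtual_path.startswith(GVFS_PREFIX):
        raise ValueError(f"not a virtual fileset path: {virtual_path!r}")
    # Split into catalog, schema, fileset, and an optional remainder.
    parts = virtual_path[len(GVFS_PREFIX):].split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected catalog/schema/fileset in {virtual_path!r}")
    catalog, schema, fileset = parts[:3]
    sub_path = parts[3] if len(parts) == 4 else ""
    try:
        root = FILESET_LOCATIONS[(catalog, schema, fileset)]
    except KeyError:
        raise LookupError(f"unknown fileset: {catalog}.{schema}.{fileset}") from None
    return f"{root}/{sub_path}" if sub_path else root


eu_file = resolve("gvfs://fileset/corpus_catalog/nlp/wiki_eu/2024/part-0.txt")
us_root = resolve("gvfs://fileset/corpus_catalog/nlp/wiki_us")
print(eu_file)
print(us_root)
```

The caller uses one path scheme for both corpora, and only the resolver knows that one lives on S3 and the other on GCS.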

Unified Permission Control and Data‑AI Collaboration

Gravitino implements role‑based access control (RBAC), enforced either by pushing permissions down to the storage layer or by checking them in real time at the API gateway, which eliminates duplicated permission management across heterogeneous systems. In AI‑centric workflows, logical metadata sharing replaces costly physical data export: data teams publish processed datasets as Filesets, which AI frameworks (TensorFlow, PyTorch, Ray) can consume immediately, achieving zero‑copy collaboration and tighter security.
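A minimal sketch of the RBAC idea follows, in the gateway-check style: privileges are granted to roles, roles are assigned to users, and one check answers every request. The `AccessControl` class, the privilege names (`READ_FILESET`, `WRITE_FILESET`), and the users and objects are hypothetical and do not reflect Gravitino's actual privilege model.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Role:
    name: str
    # Each grant pairs a privilege with a fully qualified object name.
    grants: Set[Tuple[str, str]] = field(default_factory=set)


class AccessControl:
    """Central policy store consulted once per request."""
    def __init__(self) -> None:
        self._roles: Dict[str, Role] = {}
        self._user_roles: Dict[str, Set[str]] = {}

    def grant(self, role: str, privilege: str, obj: str) -> None:
        self._roles.setdefault(role, Role(role)).grants.add((privilege, obj))

    def assign(self, user: str, role: str) -> None:
        self._user_roles.setdefault(user, set()).add(role)

    def is_allowed(self, user: str, privilege: str, obj: str) -> bool:
        # A user is allowed if any assigned role carries the exact grant.
        for role_name in self._user_roles.get(user, set()):
            role = self._roles.get(role_name)
            if role is not None and (privilege, obj) in role.grants:
                return True
        return False


acl = AccessControl()
acl.grant("ml_engineer", "READ_FILESET", "corpus_catalog.nlp.wiki_eu")
acl.assign("alice", "ml_engineer")

print(acl.is_allowed("alice", "READ_FILESET", "corpus_catalog.nlp.wiki_eu"))
print(acl.is_allowed("alice", "WRITE_FILESET", "corpus_catalog.nlp.wiki_eu"))
print(acl.is_allowed("bob", "READ_FILESET", "corpus_catalog.nlp.wiki_eu"))
```

Because the policy lives in one place, the same grant applies whichever engine issues the request, which is the duplication the talk says is removed.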

Version 1.1.0 Highlights

Release 1.1.0 adds multimodal AI storage support and security hardening. Key features include:

Integration of Lance REST, enabling efficient vector search and large‑scale corpus handling.

Fully compliant Lance REST API for unified management of Iceberg, Hudi, Paimon, and Lance assets.

Generic Lakehouse Catalog that offers a uniform metadata interface for file‑system‑based lakehouse tables, improving cross‑backend discoverability.

Deep integration of Gravitino’s RBAC model into Iceberg’s protocol chain for fine‑grained permission checks.

Roadmap to Version 1.2.0

The upcoming 1.2.0 release will introduce:

User‑Defined Function (UDF) support across Spark and AI engines.

Automated Table Maintenance Service (TMS) with rewrite and cleanup templates for Iceberg and Paimon.

Delta Lake deep integration via the generic catalog.

Enhanced ecosystem bindings, e.g., Daft’s native GVFS support through GravitinoConfig.

These additions aim to cement Gravitino’s role as an enterprise‑grade metadata governance foundation.

Community Vision

Since its 2023 open‑source launch, Gravitino has grown rapidly, graduating from the Apache Incubator to a Top‑Level Project within five months. The community plans to integrate large‑language‑model capabilities through the Model Context Protocol (MCP), exposing metadata semantics to AI agents for autonomous classification, policy‑driven optimization, and dynamic governance. This signals a shift from static catalogs to intelligent, self‑governing data platforms.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
