Apache Gravitino: An Open‑Source Metadata Lake for Unified Data and AI Asset Management
Apache Gravitino is an open‑source metadata service platform that provides a unified, high‑performance, geographically distributed metadata lake, enabling end‑to‑end data governance, multi‑engine access, and direct management of both structured and unstructured data assets across diverse systems.
Overview
Apache Gravitino is an open‑source metadata service platform designed to simplify metadata management from diverse sources, types, and regions. It offers high performance, geographic distribution, and federation features, providing a unified interface for accessing and managing data and AI asset metadata.
Positioning
With the rapid growth of data lakes, AI data, and heightened focus on data security and governance, existing architectures struggle to deliver a unified metadata management system, leading to data silos, fragmented permissions, and complex governance. Gravitino introduces a standard interface and metadata model to break these barriers, allowing users to centrally manage and access various metadata on a single platform.
Gravitino is positioned as a "Metadata Lake". It was initiated by Datastrato, open‑sourced in 2023, and later donated to the Apache Software Foundation.
Core Capabilities
Unified Metadata Management
Gravitino defines a unified metadata model and API across different metadata sources. It supports relational metadata models for tabular data (e.g., Hive, MySQL, PostgreSQL, Apache Doris) and file-based metadata models for unstructured data (e.g., HDFS, S3, and other storage systems).
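The two model families described above can be sketched as simple data classes. This is an illustrative sketch only: the class and field names here are hypothetical and do not mirror Gravitino's actual Java or Python APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    dtype: str
    nullable: bool = True

@dataclass
class TableMetadata:
    """Relational model: tables from sources like Hive, MySQL, PostgreSQL, Doris."""
    name: str
    columns: list[Column] = field(default_factory=list)
    properties: dict[str, str] = field(default_factory=dict)

@dataclass
class FilesetMetadata:
    """File-based model: unstructured data on HDFS, S3, and similar storage."""
    name: str
    storage_location: str
    properties: dict[str, str] = field(default_factory=dict)

# Both kinds of assets are described through one consistent shape of metadata.
orders = TableMetadata("orders", [Column("id", "bigint", nullable=False)])
images = FilesetMetadata("training_images", "s3://bucket/datasets/images/")
```

The point is that tabular and file-based assets share a common envelope (name plus properties) while keeping their type-specific detail (columns versus storage location).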
End‑to‑End Data Governance
Gravitino provides a unified governance layer covering access control, auditing, discovery, and more, ensuring accuracy, completeness, security, and availability of data throughout its lifecycle.
Direct Metadata Management
Unlike traditional systems that passively collect metadata, Gravitino manages the underlying systems directly via connectors, so changes propagate in both directions and metadata stays consistent with the source, making metadata handling more efficient.
Multi‑Engine Support
Gravitino currently supports Trino, Apache Spark, and Apache Flink for querying metadata without altering existing SQL dialects, and is extending support for AI asset management (models, features, etc.).
Core Architecture
Gravitino organizes metadata using the concept of a MetaLake, which contains multiple Catalogs, each representing a specific data source type (e.g., Hive, Iceberg, Hudi, MySQL, Postgres, Doris).
Within a Catalog, users can create Schemas (analogous to databases) to logically group underlying entities. Leaf nodes may be Tables, Filesets, Models, or Topics, each storing detailed metadata such as columns, partitions, storage locations, version info, or Kafka schema.
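The four-level hierarchy above (MetaLake, Catalog, Schema, leaf entity) can be illustrated with a small helper that builds a fully qualified name. The helper itself is hypothetical, but the dot-separated four-part naming mirrors how entities are addressed in this hierarchy.

```python
# Illustrative helper for the four-level naming hierarchy:
# metalake -> catalog -> schema -> leaf entity (table, fileset, model, topic).
def qualified_name(metalake: str, catalog: str, schema: str, entity: str) -> str:
    parts = (metalake, catalog, schema, entity)
    for p in parts:
        if not p:
            raise ValueError("every level of the hierarchy must be named")
    return ".".join(parts)

name = qualified_name("prod_metalake", "hive_catalog", "sales_db", "orders")
# "prod_metalake.hive_catalog.sales_db.orders"
```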
All metadata is stored in a backend store (MySQL, PostgreSQL, in‑memory, or KV store) and accessed via a unified RESTful API.
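A minimal sketch of calling that unified RESTful API is shown below. The host, port, and exact endpoint path are assumptions about a local deployment (Gravitino's server documentation uses port 8090 by default); check your own deployment before use. The request is only constructed here, not sent.

```python
import urllib.request

# Assumed base URL for a local Gravitino server; adjust for your deployment.
BASE = "http://localhost:8090/api"

def list_catalogs_request(metalake: str) -> urllib.request.Request:
    """Build (but do not send) a request listing catalogs in a metalake."""
    url = f"{BASE}/metalakes/{metalake}/catalogs"
    return urllib.request.Request(url, headers={"Accept": "application/json"})

req = list_catalogs_request("prod_metalake")
# Sending it requires a running server:
# import json
# with urllib.request.urlopen(req) as resp:
#     catalogs = json.load(resp)
```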
Gravitino also implements the Iceberg REST catalog open API, allowing clients to manage Iceberg tables using the standard protocol.
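Because the endpoint speaks the standard Iceberg REST catalog protocol, any compliant client can point at it. Below is a hedged sketch of such a client configuration; the host, port, and path are assumptions about a local deployment, not guaranteed defaults.

```python
# Assumed connection settings for Gravitino's Iceberg REST catalog endpoint;
# consult your server configuration for the real host, port, and path.
iceberg_rest_config = {
    "type": "rest",
    "uri": "http://localhost:9001/iceberg/",
}

# With a live server, a standard Iceberg REST client could then consume it, e.g.:
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("gravitino", **iceberg_rest_config)
```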
Function Layer: Provides APIs for standard metadata CRUD operations, access control, discovery, etc.
Interface Layer: Exposes a standard REST API; future support includes Thrift and JDBC.
Core Object Model: Defines a universal metadata model for heterogeneous sources.
Connection Layer: Offers connectors for various sources such as Hive, MySQL, PostgreSQL, and other non‑relational metadata.
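The connection layer's role can be sketched as a small contract that every connector fulfills: adapt one source (Hive, MySQL, PostgreSQL, ...) to the core object model. The interface below is illustrative, not Gravitino's actual SPI.

```python
from abc import ABC, abstractmethod

class CatalogConnector(ABC):
    """Hypothetical connector contract; method names are illustrative."""

    @abstractmethod
    def list_schemas(self) -> list[str]: ...

    @abstractmethod
    def load_table(self, schema: str, table: str) -> dict: ...

class InMemoryConnector(CatalogConnector):
    """Toy connector backed by a dict, standing in for a real source."""

    def __init__(self, data: dict[str, dict[str, dict]]):
        self._data = data

    def list_schemas(self) -> list[str]:
        return sorted(self._data)

    def load_table(self, schema: str, table: str) -> dict:
        return self._data[schema][table]

conn = InMemoryConnector({"sales_db": {"orders": {"columns": ["id", "amount"]}}})
```

The upper layers (function, interface, core object model) only ever see this contract, which is what lets heterogeneous sources plug in uniformly.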
Production Practice
Apache Gravitino has been adopted by many companies, including Xiaomi, Tencent, Bilibili, Flywheel, NetEase Games, Vipshop, and Beike.
Case Study: Bilibili
Bilibili faced high coupling between business services and heterogeneous data sources, limited governance capabilities, lack of management for semi‑structured/unstructured sources, and high costs for cross‑source schema maintenance.
Developed the OneMeta platform on top of Gravitino to unify metadata management.
Simplified data access by letting upper‑layer engines query through OneMeta.
Implemented governance via tags and TTL policies.
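TTL-tag-driven governance of this kind can be illustrated with a tiny policy check: an asset tagged with a retention period becomes eligible for cleanup once its last update is older than the TTL. The tag format and policy below are assumptions for illustration, not OneMeta's actual implementation.

```python
from datetime import datetime, timedelta, timezone

def parse_ttl(tag: str) -> timedelta:
    """Parse a hypothetical tag like 'ttl=30d' into a timedelta."""
    assert tag.startswith("ttl=") and tag.endswith("d")
    return timedelta(days=int(tag[4:-1]))

def is_expired(last_updated: datetime, ttl_tag: str, now: datetime) -> bool:
    """An asset is cleanup-eligible once older than its tagged TTL."""
    return now - last_updated > parse_ttl(ttl_tag)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
stale = datetime(2024, 4, 1, tzinfo=timezone.utc)
print(is_expired(stale, "ttl=30d", now))  # True: 61 days old exceeds a 30-day TTL
```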
Results
Simplified integration layers and enhanced governance.
Reduced storage costs dramatically (≈100 PB saved via HDFS erasure coding and ≈300 PB reclaimed via TTL-based cleanup).
Improved data access efficiency and lowered system maintenance costs.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.