Apache Gravitino: An Open‑Source Metadata Lake for Unified Data and AI Asset Management
Apache Gravitino is an open‑source metadata service platform that provides a unified, high‑performance, geographically distributed metadata lake, enabling end‑to‑end data governance, multi‑engine access, and direct management of both structured and unstructured data assets across diverse systems.
Overview
Apache Gravitino is an open‑source metadata service platform designed to simplify metadata management from diverse sources, types, and regions. It offers high performance, geographic distribution, and federation features, providing a unified interface for accessing and managing data and AI asset metadata.
Positioning
With the rapid growth of data lakes, AI data, and heightened focus on data security and governance, existing architectures struggle to deliver a unified metadata management system, leading to data silos, fragmented permissions, and complex governance. Gravitino introduces a standard interface and metadata model to break these barriers, allowing users to centrally manage and access various metadata on a single platform.
Gravitino is positioned as a "Metadata Lake". It was initiated by Datastrato, open‑sourced in 2023, and later donated to the Apache Software Foundation.
Core Capabilities
Unified Metadata Management
Gravitino defines a unified metadata model and API across different metadata sources. It supports relational metadata models for tabular data (e.g., Hive, MySQL, PostgreSQL, Apache Doris) and file-based metadata models for unstructured data (e.g., HDFS, S3, and other storage systems).
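The two model families described above can be sketched as simple data classes. This is an illustrative sketch only: the class and field names here are hypothetical and do not mirror Gravitino's actual Java or Python APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    dtype: str
    nullable: bool = True

@dataclass
class TableMetadata:
    """Relational model: tables from sources like Hive, MySQL, PostgreSQL, Doris."""
    name: str
    columns: list[Column] = field(default_factory=list)
    properties: dict[str, str] = field(default_factory=dict)

@dataclass
class FilesetMetadata:
    """File-based model: unstructured data on HDFS, S3, and similar storage."""
    name: str
    storage_location: str
    properties: dict[str, str] = field(default_factory=dict)

# Both kinds of assets are described through one consistent shape of metadata.
orders = TableMetadata("orders", [Column("id", "bigint", nullable=False)])
images = FilesetMetadata("training_images", "s3://bucket/datasets/images/")
```

The point is that tabular and file-based assets share a common envelope (name plus properties) while keeping their type-specific detail (columns versus storage location).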
End‑to‑End Data Governance
Gravitino provides a unified governance layer covering access control, auditing, discovery, and more, ensuring accuracy, completeness, security, and availability of data throughout its lifecycle.
Direct Metadata Management
Unlike traditional systems that passively collect metadata, Gravitino manages the underlying systems directly via connectors, so changes propagate in both directions and metadata stays consistent with the source, making metadata handling more efficient.
Multi‑Engine Support
Gravitino currently supports Trino, Apache Spark, and Apache Flink for querying metadata without altering existing SQL dialects, and is extending support for AI asset management (models, features, etc.).
Core Architecture
Gravitino organizes metadata using the concept of a MetaLake, which contains multiple Catalogs, each representing a specific data source type (e.g., Hive, Iceberg, Hudi, MySQL, Postgres, Doris).
Within a Catalog, users can create Schemas (analogous to databases) to logically group underlying entities. Leaf nodes may be Tables, Filesets, Models, or Topics, each storing detailed metadata such as columns, partitions, storage locations, version info, or Kafka schema.
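The four-level hierarchy above (MetaLake, Catalog, Schema, leaf entity) can be illustrated with a small helper that builds a fully qualified name. The helper itself is hypothetical, but the dot-separated four-part naming mirrors how entities are addressed in this hierarchy.

```python
# Illustrative helper for the four-level naming hierarchy:
# metalake -> catalog -> schema -> leaf entity (table, fileset, model, topic).
def qualified_name(metalake: str, catalog: str, schema: str, entity: str) -> str:
    parts = (metalake, catalog, schema, entity)
    for p in parts:
        if not p:
            raise ValueError("every level of the hierarchy must be named")
    return ".".join(parts)

name = qualified_name("prod_metalake", "hive_catalog", "sales_db", "orders")
# "prod_metalake.hive_catalog.sales_db.orders"
```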
All metadata is stored in a backend store (MySQL, PostgreSQL, in‑memory, or KV store) and accessed via a unified RESTful API.
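A minimal sketch of calling that unified RESTful API is shown below. The host, port, and exact endpoint path are assumptions about a local deployment (Gravitino's server documentation uses port 8090 by default); check your own deployment before use. The request is only constructed here, not sent.

```python
import urllib.request

# Assumed base URL for a local Gravitino server; adjust for your deployment.
BASE = "http://localhost:8090/api"

def list_catalogs_request(metalake: str) -> urllib.request.Request:
    """Build (but do not send) a request listing catalogs in a metalake."""
    url = f"{BASE}/metalakes/{metalake}/catalogs"
    return urllib.request.Request(url, headers={"Accept": "application/json"})

req = list_catalogs_request("prod_metalake")
# Sending it requires a running server:
# import json
# with urllib.request.urlopen(req) as resp:
#     catalogs = json.load(resp)
```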
Gravitino also implements the Iceberg REST catalog open API, allowing clients to manage Iceberg tables using the standard protocol.
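Because the endpoint speaks the standard Iceberg REST catalog protocol, any compliant client can point at it. Below is a hedged sketch of such a client configuration; the host, port, and path are assumptions about a local deployment, not guaranteed defaults.

```python
# Assumed connection settings for Gravitino's Iceberg REST catalog endpoint;
# consult your server configuration for the real host, port, and path.
iceberg_rest_config = {
    "type": "rest",
    "uri": "http://localhost:9001/iceberg/",
}

# With a live server, a standard Iceberg REST client could then consume it, e.g.:
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("gravitino", **iceberg_rest_config)
```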
Function Layer: Provides APIs for standard metadata CRUD operations, access control, discovery, etc.
Interface Layer: Exposes a standard REST API; future support includes Thrift and JDBC.
Core Object Model: Defines a universal metadata model for heterogeneous sources.
Connection Layer: Offers connectors for various sources such as Hive, MySQL, PostgreSQL, and other non‑relational metadata.
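The connection layer's role can be sketched as a small contract that every connector fulfills: adapt one source (Hive, MySQL, PostgreSQL, ...) to the core object model. The interface below is illustrative, not Gravitino's actual SPI.

```python
from abc import ABC, abstractmethod

class CatalogConnector(ABC):
    """Hypothetical connector contract; method names are illustrative."""

    @abstractmethod
    def list_schemas(self) -> list[str]: ...

    @abstractmethod
    def load_table(self, schema: str, table: str) -> dict: ...

class InMemoryConnector(CatalogConnector):
    """Toy connector backed by a dict, standing in for a real source."""

    def __init__(self, data: dict[str, dict[str, dict]]):
        self._data = data

    def list_schemas(self) -> list[str]:
        return sorted(self._data)

    def load_table(self, schema: str, table: str) -> dict:
        return self._data[schema][table]

conn = InMemoryConnector({"sales_db": {"orders": {"columns": ["id", "amount"]}}})
```

The upper layers (function, interface, core object model) only ever see this contract, which is what lets heterogeneous sources plug in uniformly.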
Production Practice
Apache Gravitino has been adopted by many companies, including Xiaomi, Tencent, Bilibili, Flywheel, NetEase Games, Vipshop, and Beike.
Case Study: Bilibili
Bilibili faced high coupling between business services and heterogeneous data sources, limited governance capabilities, lack of management for semi‑structured/unstructured sources, and high costs for cross‑source schema maintenance.
Developed the OneMeta platform on top of Gravitino to unify metadata management.
Simplified data access by letting upper‑layer engines query through OneMeta.
Implemented governance via tags and TTL policies.
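TTL-tag-driven governance of this kind can be illustrated with a tiny policy check: an asset tagged with a retention period becomes eligible for cleanup once its last update is older than the TTL. The tag format and policy below are assumptions for illustration, not OneMeta's actual implementation.

```python
from datetime import datetime, timedelta, timezone

def parse_ttl(tag: str) -> timedelta:
    """Parse a hypothetical tag like 'ttl=30d' into a timedelta."""
    assert tag.startswith("ttl=") and tag.endswith("d")
    return timedelta(days=int(tag[4:-1]))

def is_expired(last_updated: datetime, ttl_tag: str, now: datetime) -> bool:
    """An asset is cleanup-eligible once older than its tagged TTL."""
    return now - last_updated > parse_ttl(ttl_tag)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
stale = datetime(2024, 4, 1, tzinfo=timezone.utc)
print(is_expired(stale, "ttl=30d", now))  # True: 61 days old exceeds a 30-day TTL
```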
Results
Simplified integration layers and enhanced governance.
Reduced storage costs dramatically (≈100 PB saved via HDFS erasure coding and ≈300 PB reclaimed via TTL-based cleanup).
Improved data access efficiency and lowered system maintenance costs.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.