
How Xiaomi Built a Scalable Metadata Platform for Data Governance

This article details Xiaomi's end‑to‑end metadata platform: its three‑layer architecture, the move to full‑domain metadata, real‑time lineage, and precise measurement. It then shows how these capabilities support the data map, data governance, cost control, and data quality applications that underpin future business enablement.

dbaplus Community

Metadata Platform Architecture

The platform is organized into three logical layers:

Source layer : collects raw metadata and logs from Hive, Doris, Kudu, Iceberg, Elasticsearch, MySQL, and the Talos MQ bus.

Integration layer : Metacat normalizes metadata, handles both daily snapshots (T+1) and incremental changes, and forwards lineage events through the MQ to downstream services via a unified API.

Storage layer : core entity information is persisted in MySQL; snapshot tables are stored in Hive; lineage graphs are kept in JanusGraph; search and access‑control queries are served by Elasticsearch.
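To make the relationship between daily snapshots and incremental changes concrete, the following is a minimal sketch of how two normalized snapshots could be compared to derive incremental changes. The `TableEntity` record, its fields, and the `diff_snapshots` helper are illustrative assumptions, not Metacat's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntity:
    """Hypothetical unified table record; field names are assumptions."""
    source: str      # e.g. "hive", "doris", "mysql"
    database: str
    name: str
    columns: tuple   # tuple of (column_name, column_type) pairs

    @property
    def qualified_name(self) -> str:
        # One naming scheme across all sources keeps lookups uniform.
        return f"{self.source}.{self.database}.{self.name}"


def diff_snapshots(previous, current):
    """Compare two snapshots and report added, removed, and changed tables."""
    prev = {t.qualified_name: t for t in previous}
    curr = {t.qualified_name: t for t in current}
    return {
        "added": sorted(curr.keys() - prev.keys()),
        "removed": sorted(prev.keys() - curr.keys()),
        "changed": sorted(
            k for k in curr.keys() & prev.keys() if curr[k] != prev[k]
        ),
    }
```

Because the records are immutable and compared by value, a schema change such as a new column shows up as a "changed" entry without any extra bookkeeping.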

Metadata platform architecture

Full‑Domain Metadata

Beyond the original Hive‑centric view, the platform now ingests metadata from upstream business databases, messaging systems, and downstream stores (Doris, Kudu, Elasticsearch, Iceberg, MySQL). All sources are unified under a single Hive Metastore view, enabling end‑to‑end data lineage and governance across the entire data ecosystem.

Full‑domain metadata implementation

Real‑Time Lineage

The initial approach, parsing HDFS logs, produced inaccurate lineage because the logs contain many open operations. The platform switched to tracing instrumentation embedded in each compute engine (Hive, Flink, Spark, Presto, DistCp), with events delivered through the MQ pipeline. SQL proxy logs are merged with engine‑side traces to produce precise, real‑time lineage information.
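The merge step can be illustrated with a small sketch. It assumes each event, whether from an engine‑side trace or a SQL proxy log, has already been reduced to lists of input and output tables; the event shape and the `merge_lineage` function are hypothetical and are not taken from Xiaomi's implementation.

```python
def merge_lineage(engine_events, proxy_events):
    """Merge lineage edges from two event streams.

    Returns a dict mapping (source_table, target_table) to the set of
    origins ("engine", "proxy") that reported the edge.
    """
    edges = {}
    streams = (("engine", engine_events), ("proxy", proxy_events))
    for origin, events in streams:
        for event in events:
            # Every input of a job is treated as an upstream of every output.
            for src in event["inputs"]:
                for dst in event["outputs"]:
                    edges.setdefault((src, dst), set()).add(origin)
    return edges


engine = [{"inputs": ["hive.ods.a"], "outputs": ["hive.dw.b"]}]
proxy = [{"inputs": ["hive.ods.a", "hive.ods.c"], "outputs": ["hive.dw.b"]}]
edges = merge_lineage(engine, proxy)
```

Recording which stream reported each edge makes it possible to distinguish edges confirmed by both sources from edges seen by only one, which is one plausible way to reason about lineage precision.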

Real‑time lineage architecture

Precise Measurement

To overcome binary "zero‑or‑one" access counts, the system tags data accesses at the source, aggregates HDFS‑Image and HDFS‑Log metrics, and reconciles them with SQL audit logs. This yields accurate per‑table and per‑field access frequencies for downstream cost and quality analyses.
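As a rough illustration of per‑table and per‑field access frequencies, the sketch below counts accesses from simplified audit records. The record layout (`table` and `fields` keys) is an assumption made for this example; real SQL audit logs and HDFS metrics would need parsing and reconciliation first.

```python
from collections import Counter


def access_frequencies(audit_records):
    """Count accesses per table and per (table, field) pair."""
    table_counts = Counter()
    field_counts = Counter()
    for record in audit_records:
        table = record["table"]
        table_counts[table] += 1
        # A field referenced several times in one query counts once.
        for field in set(record.get("fields", ())):
            field_counts[(table, field)] += 1
    return table_counts, field_counts


records = [
    {"table": "hive.dw.orders", "fields": ["id", "amount", "id"]},
    {"table": "hive.dw.orders", "fields": ["id"]},
    {"table": "hive.dw.users"},
]
tables, fields = access_frequencies(records)
```

Counts of this kind, rather than a simple accessed or not accessed flag, are what allow cold tables and unused fields to be ranked for cost and quality analysis.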

Precise measurement diagram

Metadata‑Driven Applications

Data Map

Provides full‑domain search (tables, fields, descriptions, layers, classifications, tags, departments) across Hive, Talos, Doris, Kudu, Iceberg, Elasticsearch, and MySQL. Lineage visualization links source objects (e.g., MySQL → MQ → Hive → Iceberg → Doris) to downstream applications, enabling root‑cause tracing.
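Root‑cause tracing over such a chain amounts to walking the lineage graph upstream from the affected object. The following sketch uses a breadth‑first traversal over an in‑memory edge list; the object names are invented for illustration, and a production system would query the graph store rather than a Python list.

```python
from collections import deque


def trace_upstream(edges, target):
    """Return all upstream objects of `target`, nearest first."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)

    seen = {target}
    order = []
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:  # guard against cycles and diamonds
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


edges = [
    ("mysql.shop.orders", "talos.orders_topic"),
    ("talos.orders_topic", "hive.ods.orders"),
    ("hive.ods.orders", "iceberg.dw.orders"),
    ("iceberg.dw.orders", "doris.ads.orders_daily"),
]
upstream = trace_upstream(edges, "doris.ads.orders_daily")
```

Returning the nearest upstream objects first matches how an investigation usually proceeds: check the immediate producer, then work back toward the source database.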

Data map search results
Data map lineage

Data Governance

Measures two dimensions:

Modeling compliance : naming conventions, layer classification, and tag completeness.

Completeness : cross‑layer reference coverage and query‑coverage ratios.
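The modeling‑compliance dimension can be pictured as a set of ratios over the table catalog. The sketch below is only an assumption about how such a score might be computed: the layer prefixes in `NAME_PATTERN` and the `layer` and `tags` keys are hypothetical and do not come from the article.

```python
import re

# Hypothetical naming convention: a layer prefix followed by snake_case.
NAME_PATTERN = re.compile(r"^(ods|dwd|dws|ads)_[a-z0-9_]+$")


def modeling_compliance(tables):
    """Return the share of tables that pass each compliance check."""
    total = len(tables)
    if total == 0:
        return {"naming": 0.0, "layer": 0.0, "tags": 0.0}
    naming = sum(1 for t in tables if NAME_PATTERN.match(t["name"]))
    layer = sum(1 for t in tables if t.get("layer"))
    tags = sum(1 for t in tables if t.get("tags"))
    return {
        "naming": naming / total,
        "layer": layer / total,
        "tags": tags / total,
    }


catalog = [
    {"name": "dwd_order_detail", "layer": "dwd", "tags": ["trade"]},
    {"name": "TmpOrders", "layer": None, "tags": []},
]
scores = modeling_compliance(catalog)
```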

Governance metrics

Cost Governance

Implements a closed‑loop process: observe current spend, identify waste, apply optimizations, and measure savings. Key techniques include:

Single‑standard storage accounting using HDFS‑Image to align logical volume with physical storage.

Daily billing for timely cost feedback.

User‑based attribution to map costs to data owners.

Real‑time cost estimation for any data operation.
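Daily billing combined with user‑based attribution can be sketched as a simple aggregation of physical storage per owner. The unit price, the `physical_bytes` and `owner` keys, and the `daily_storage_bill` function are assumptions for illustration; the article does not describe the actual pricing model.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def daily_storage_bill(tables, price_per_tib_day):
    """Sum one day's storage cost per data owner."""
    bill = {}
    for table in tables:
        cost = table["physical_bytes"] / TIB * price_per_tib_day
        owner = table["owner"]
        bill[owner] = bill.get(owner, 0.0) + cost
    return bill


inventory = [
    {"name": "dw.orders", "owner": "alice", "physical_bytes": 1 * TIB},
    {"name": "dw.order_items", "owner": "alice", "physical_bytes": 2 * TIB},
    {"name": "dw.users", "owner": "bob", "physical_bytes": TIB // 2},
]
bill = daily_storage_bill(inventory, price_per_tib_day=2.0)
```

Billing on physical rather than logical bytes reflects the point above about aligning logical volume with what is actually stored, since replication and file layout change the real footprint.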

Cost governance results

Data Quality

Quality checks cover timeliness, uniqueness, accuracy, completeness, and consistency. The architecture uses:

Event‑triggered checks : after DAG execution, workers consume MQ events to validate newly produced tables.

Time‑triggered checks : scheduled workers query HDFS, Presto, Spark, and Doris to enforce static rules.

Stateless, horizontally scalable workers enable extensible rule sets.
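One way to picture a stateless, extensible worker is a registry of pure rule functions applied to whichever table an event names. The sketch below covers only uniqueness and completeness; the event format, the `RULES` registry, and the `load_rows` callback are hypothetical and stand in for the real MQ consumer and query engines.

```python
def check_uniqueness(rows, column):
    """True when no two rows share the same value in `column`."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def check_completeness(rows, column):
    """True when every row has a non-null value in `column`."""
    return all(row.get(column) is not None for row in rows)


# New rule types are added by registering another pure function.
RULES = {
    "uniqueness": check_uniqueness,
    "completeness": check_completeness,
}


def handle_event(event, load_rows):
    """Run the rules listed in an event; the worker keeps no state."""
    rows = load_rows(event["table"])
    return {
        f"{rule['type']}:{rule['column']}": RULES[rule["type"]](rows, rule["column"])
        for rule in event["rules"]
    }


sample = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}]
event = {
    "table": "dw.orders",
    "rules": [
        {"type": "uniqueness", "column": "id"},
        {"type": "completeness", "column": "amount"},
    ],
}
report = handle_event(event, load_rows=lambda table: sample)
```

Because each call depends only on the event and the data it loads, any number of identical workers can consume events in parallel.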

Quality architecture

Future Roadmap

The next phase focuses on three pillars:

Production‑grade resource scheduling : integrate baseline management, job execution, monitoring, and YARN resource orchestration.

Long‑term metadata roadmap : define data‑health metrics, establish health‑assessment processes, and evolve governance mechanisms.

Business enablement : provide tools for cost analysis, quality assurance, and efficiency improvement to encourage data‑driven product adoption.

Future planning diagram