How Xiaomi Built a Scalable Metadata Platform for Data Governance

This article details Xiaomi's end‑to‑end metadata platform: its three‑layer architecture, the move to full‑domain metadata, real‑time lineage, and precise measurement. It then shows how these capabilities support the data map, data governance, cost control, and data quality applications, and outlines the roadmap for future business enablement.

dbaplus Community

Dec 22, 2021

Metadata Platform Architecture

The platform is organized into three logical layers:

Source layer: collects raw metadata and logs from Hive, Doris, Kudu, Iceberg, Elasticsearch, MySQL, and the Talos MQ bus.

Integration layer: Metacat normalizes metadata, handles both daily snapshots (T+1) and incremental changes, and forwards lineage events through the MQ to downstream services via a unified API.

Storage layer: core entity information is persisted in MySQL; snapshot tables are stored in Hive; lineage graphs are kept in JanusGraph; search and access‑control queries are served by Elasticsearch.
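As a rough illustration of what the integration layer does, the sketch below normalizes source-specific payloads into one entity shape and derives incremental changes by comparing two snapshots. It is a minimal sketch, not Metacat's API: the `TableEntity` class, the raw payload keys (`db`, `name`, `cols`), and the `source://database/table` naming scheme are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntity:
    """Hypothetical normalized record for one table, whatever its source system."""
    source: str        # e.g. "hive", "doris", "mysql"
    database: str
    table: str
    columns: tuple = ()

    @property
    def qualified_name(self) -> str:
        # Assumed naming scheme, used only to key entities in this sketch.
        return f"{self.source}://{self.database}/{self.table}"


def normalize(source: str, raw: dict) -> TableEntity:
    """Map a raw payload (assumed keys: db, name, cols) onto the common shape."""
    return TableEntity(
        source=source.lower(),
        database=raw["db"],
        table=raw["name"],
        columns=tuple(c.lower() for c in raw.get("cols", [])),
    )


def diff_snapshots(previous: list, current: list) -> dict:
    """Derive incremental changes between yesterday's and today's snapshots."""
    prev = {e.qualified_name: e for e in previous}
    curr = {e.qualified_name: e for e in current}
    return {
        "added": sorted(curr.keys() - prev.keys()),
        "removed": sorted(prev.keys() - curr.keys()),
        "changed": sorted(k for k in curr.keys() & prev.keys() if curr[k] != prev[k]),
    }
```

Keying every entity by one qualified name is what lets snapshot and incremental paths share the same downstream consumers in this sketch; a production system would also need to handle renames, which a pure name-based diff reports as a removal plus an addition.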

Full‑Domain Metadata

Beyond the original Hive‑centric view, the platform now ingests metadata from upstream business databases, messaging systems, and downstream stores (Doris, Kudu, Elasticsearch, Iceberg, MySQL). All sources are unified under a single Hive Metastore view, enabling end‑to‑end data lineage and governance across the entire data ecosystem.

Real‑Time Lineage

The initial approach parsed HDFS logs, but the many open operations recorded there made the derived lineage inaccurate. The platform switched to instrumentation (tracking points) embedded in each compute engine (Hive, Flink, Spark, Presto, Distcp), with the resulting events delivered through the MQ pipeline. SQL proxy logs are merged with these engine‑side traces to produce precise, real‑time lineage information.
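The merge step can be pictured as a join on a shared job identifier. The following is a minimal sketch under that assumption; the record fields (`job_id`, `engine`, `inputs`, `outputs`, `user`) are hypothetical and the article does not describe the actual event schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One table-to-table dependency observed for a single job (hypothetical shape)."""
    upstream: str
    downstream: str
    engine: str
    submitted_by: str


def merge_lineage(engine_traces: list, proxy_logs: list) -> set:
    """Join engine-side traces with SQL proxy logs on an assumed job_id field."""
    users = {log["job_id"]: log["user"] for log in proxy_logs}
    edges = set()
    for trace in engine_traces:
        # Traces without a matching proxy log are kept, attributed to "unknown".
        user = users.get(trace["job_id"], "unknown")
        for src in trace["inputs"]:
            for dst in trace["outputs"]:
                edges.add(LineageEdge(src, dst, trace["engine"], user))
    return edges
```

Returning a set deduplicates edges when the same job reports the same input and output pair more than once.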

Precise Measurement

To overcome binary "zero‑or‑one" access counts, the system tags data accesses at the source, aggregates HDFS‑Image and HDFS‑Log metrics, and reconciles them with SQL audit logs. This yields accurate per‑table and per‑field access frequencies for downstream cost and quality analyses.
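A small sketch of the counting and reconciliation idea follows. The article does not state the reconciliation rule, so this example simply reports both counts per table and flags disagreement; the event fields (`table`, `fields`) are assumptions.

```python
from collections import Counter


def count_by_table(events: list) -> Counter:
    """Count accesses per table from events that carry an assumed 'table' tag."""
    return Counter(e["table"] for e in events)


def count_by_field(audit_records: list) -> Counter:
    """Count accesses per (table, field) from SQL audit records."""
    return Counter((r["table"], f) for r in audit_records for f in r["fields"])


def reconcile(storage_counts: Counter, audit_counts: Counter) -> dict:
    """Report both counts per table and flag tables where the sources disagree."""
    report = {}
    for table in sorted(storage_counts.keys() | audit_counts.keys()):
        s = storage_counts.get(table, 0)
        a = audit_counts.get(table, 0)
        report[table] = {"storage": s, "audit": a, "consistent": s == a}
    return report
```

Tables flagged as inconsistent are the ones read outside the SQL path (or missed by one of the collectors), which is where a binary accessed-or-not signal would have hidden the discrepancy.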

Metadata‑Driven Applications

Data Map

Provides full‑domain search (tables, fields, descriptions, layers, classifications, tags, departments) across Hive, Talos, Doris, Kudu, Iceberg, Elasticsearch, and MySQL. Lineage visualization links source objects (e.g., MySQL → MQ → Hive → Iceberg → Doris) to downstream applications, enabling root‑cause tracing.
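Root‑cause tracing over lineage amounts to walking the graph upstream from the affected object. The sketch below does this with a breadth‑first search over an in‑memory edge list; the platform itself keeps lineage in JanusGraph, so the edge list and the object names here are illustrative assumptions only.

```python
from collections import deque


def upstream_of(node: str, edges: list) -> list:
    """Return all upstream objects of `node`, nearest first (breadth-first)."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)

    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:  # guards against cycles and shared ancestors
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical chain mirroring MySQL -> MQ -> Hive -> Iceberg -> Doris.
EXAMPLE_EDGES = [
    ("mysql.shop.orders", "talos.orders_topic"),
    ("talos.orders_topic", "hive.dw.orders"),
    ("hive.dw.orders", "iceberg.dw.orders"),
    ("iceberg.dw.orders", "doris.rpt.orders"),
]
```

Calling `upstream_of("doris.rpt.orders", EXAMPLE_EDGES)` walks the chain back to the MySQL source, which is the order in which an engineer would inspect candidates when a report looks wrong.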

Data Governance

Measures two dimensions:

Modeling compliance: naming conventions, layer classification, and tag completeness.

Completeness: cross‑layer reference coverage and query‑coverage ratios.
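The modeling‑compliance dimension can be expressed as simple ratios over the table catalog. This is a minimal sketch: the naming pattern (layer prefixes `ods`, `dwd`, `dws`, `ads`) and the table record fields are assumptions chosen for illustration, not Xiaomi's actual conventions.

```python
import re

# Assumed convention: <layer>_<snake_case_name>; the real rules are not given.
NAME_PATTERN = re.compile(r"^(ods|dwd|dws|ads)_[a-z][a-z0-9_]*$")


def modeling_compliance(tables: list) -> dict:
    """Return the share of tables passing each modeling check."""
    total = len(tables)
    if total == 0:
        return {"naming": 0.0, "layer": 0.0, "tags": 0.0}
    naming = sum(1 for t in tables if NAME_PATTERN.match(t["name"]))
    layer = sum(1 for t in tables if t.get("layer"))
    tags = sum(1 for t in tables if t.get("tags"))
    return {
        "naming": naming / total,
        "layer": layer / total,
        "tags": tags / total,
    }
```

Keeping the three ratios separate, rather than collapsing them into one score, makes it clear which convention a team needs to fix.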

Cost Governance

Implements a closed‑loop process: observe current spend, identify waste, apply optimizations, and measure savings. Key techniques include:

Unified storage accounting (a single measurement standard) using HDFS‑Image to align logical volume with physical storage.

Daily billing for timely cost feedback.

User‑based attribution to map costs to data owners.

Real‑time cost estimation for any data operation.
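User‑based attribution and daily billing can be sketched as a group‑by over per‑table storage figures. The record fields (`owner`, `physical_bytes`) and the flat price per TB per day are hypothetical; the article does not describe the actual pricing model.

```python
from collections import defaultdict

TB = 1024 ** 4  # bytes per tebibyte


def daily_bill(tables: list, price_per_tb_day: float) -> dict:
    """Attribute one day's storage cost to each data owner."""
    bytes_by_owner = defaultdict(int)
    for t in tables:
        # physical_bytes is assumed to already include replication overhead.
        bytes_by_owner[t["owner"]] += t["physical_bytes"]
    return {
        owner: used / TB * price_per_tb_day
        for owner, used in bytes_by_owner.items()
    }
```

Running this once per day against the latest storage snapshot gives each owner a cost figure while the cause of any change is still recent.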

Data Quality

Quality checks cover timeliness, uniqueness, accuracy, completeness, and consistency. The architecture uses:

Event‑triggered checks: after DAG execution, workers consume MQ events to validate newly produced tables.

Time‑triggered checks: scheduled workers query HDFS, Presto, Spark, and Doris to enforce static rules.

Stateless, horizontally scalable workers enable extensible rule sets.
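An event‑triggered check can be modeled as a pure function of the incoming event and a rule table, which is what keeps the worker stateless. In the sketch below the event shape, the rule registry, and the `fetch_rows` callback are all assumptions; a real worker would read events from the MQ and query an engine such as Presto instead.

```python
def check_uniqueness(rows: list, key: str) -> bool:
    """Pass when no two rows share the same value for `key`."""
    values = [r[key] for r in rows]
    return len(values) == len(set(values))


def check_completeness(rows: list, column: str) -> bool:
    """Pass when `column` is present and non-null in every row."""
    return all(r.get(column) is not None for r in rows)


def handle_event(event: dict, fetch_rows, rules: dict) -> dict:
    """Run every rule registered for the table named in the event.

    The worker holds no state between events: everything it needs arrives
    through its arguments, so more workers can be added freely.
    """
    table = event["table"]
    rows = fetch_rows(table)
    return {name: rule(rows) for name, rule in rules.get(table, [])}


# Hypothetical rule registry: table name -> list of (rule name, check function).
EXAMPLE_RULES = {
    "dw.orders": [
        ("unique_id", lambda rows: check_uniqueness(rows, "id")),
        ("amount_present", lambda rows: check_completeness(rows, "amount")),
    ],
}
```

Adding a new rule means adding an entry to the registry, with no change to the worker loop.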

Future Roadmap

The next phase focuses on three pillars:

Production‑grade resource scheduling: integrate baseline management, job execution, monitoring, and Yarn resource orchestration.

Long‑term metadata roadmap: define data‑health metrics, establish health‑assessment processes, and evolve governance mechanisms.

Business enablement: provide tools for cost analysis, quality assurance, and efficiency improvement to encourage data‑driven product adoption.
