How Xiaomi Built a Scalable Metadata Platform for Data Governance
This article details Xiaomi's end‑to‑end metadata platform: its three‑layer architecture, the move to full‑domain metadata, real‑time lineage, and precise measurement. It then shows how these capabilities support the data map, data governance, cost control, and data quality applications, and closes with the roadmap for business enablement.
Metadata Platform Architecture
The platform is organized into three logical layers:
Source layer: collects raw metadata and logs from Hive, Doris, Kudu, Iceberg, Elasticsearch, MySQL, and the Talos MQ bus.
Integration layer: Metacat normalizes metadata, handles both daily snapshots (T+1) and incremental changes, and forwards lineage events through the MQ to downstream services via a unified API.
Storage layer: core entity information is persisted in MySQL; snapshot tables are stored in Hive; lineage graphs are kept in JanusGraph; search and access‑control queries are served by Elasticsearch.
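As a rough illustration of how a daily snapshot and incremental changes can be combined in the integration layer, the sketch below applies an ordered list of change events on top of a T+1 snapshot. The `TableMeta` record, the `upsert`/`drop` operation names, and the table names are assumptions made for this example; they are not Metacat's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    source: str     # hypothetical source label, e.g. "hive" or "doris"
    name: str       # qualified table name
    columns: tuple  # column names


def apply_changes(snapshot, changes):
    """Return a new catalog: the T+1 snapshot with incremental changes applied in order."""
    catalog = dict(snapshot)  # copy so the snapshot itself stays untouched
    for op, meta in changes:
        key = (meta.source, meta.name)
        if op == "upsert":
            catalog[key] = meta
        elif op == "drop":
            catalog.pop(key, None)  # dropping an unknown table is a no-op
        else:
            raise ValueError(f"unknown change op: {op}")
    return catalog


snapshot = {("hive", "dw.orders"): TableMeta("hive", "dw.orders", ("id", "amount"))}
changes = [
    ("upsert", TableMeta("hive", "dw.orders", ("id", "amount", "currency"))),
    ("upsert", TableMeta("doris", "ads.order_daily", ("dt", "total"))),
    ("drop", TableMeta("hive", "dw.tmp_orders", ())),
]
catalog = apply_changes(snapshot, changes)
print(sorted(catalog))
```

Keeping the snapshot immutable and replaying changes on a copy means the same snapshot can be reused if the incremental stream has to be reprocessed.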
Full‑Domain Metadata
Beyond the original Hive‑centric view, the platform now ingests metadata from upstream business databases, messaging systems, and downstream stores (Doris, Kudu, Elasticsearch, Iceberg, MySQL). All sources are unified under a single Hive Metastore view, enabling end‑to‑end data lineage and governance across the entire data ecosystem.
Real‑Time Lineage
The first approach parsed HDFS logs, but the large number of open operations in those logs did not map cleanly to table‑level reads and writes, so the resulting lineage was inaccurate. The platform switched to instrumentation points embedded in each compute engine (Hive, Flink, Spark, Presto, DistCp), which report lineage events through the MQ pipeline. SQL proxy logs are merged with these engine‑side traces to produce precise, real‑time lineage information.
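The merge step can be pictured as a join between the two event streams. The following is a minimal sketch that assumes both SQL proxy logs and engine‑side traces carry a shared `query_id`; the field names (`query_id`, `user`, `inputs`, `outputs`) are hypothetical and are not taken from the platform.

```python
from collections import defaultdict


def merge_lineage(proxy_logs, engine_traces):
    """Join proxy records and engine traces on query_id; return table-level edges."""
    submitted_by = {rec["query_id"]: rec["user"] for rec in proxy_logs}
    edges = defaultdict(set)  # (input_table, output_table) -> users who produced the edge
    for trace in engine_traces:
        user = submitted_by.get(trace["query_id"], "unknown")
        for src in trace["inputs"]:
            for dst in trace["outputs"]:
                edges[(src, dst)].add(user)
    return dict(edges)


proxy_logs = [{"query_id": "q1", "user": "alice"}]
engine_traces = [
    {"query_id": "q1", "engine": "spark",
     "inputs": ["hive.ods_orders"], "outputs": ["iceberg.dwd_orders"]},
    {"query_id": "q2", "engine": "flink",
     "inputs": ["talos.orders_topic"], "outputs": ["hive.ods_orders"]},
]
lineage = merge_lineage(proxy_logs, engine_traces)
print(lineage)
```

Traces with no matching proxy record are kept and attributed to `"unknown"` rather than dropped, so the lineage graph stays complete even when the proxy log is missing.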
Precise Measurement
To overcome binary "zero‑or‑one" access counts, the system tags data accesses at the source, aggregates HDFS‑Image and HDFS‑Log metrics, and reconciles them with SQL audit logs. This yields accurate per‑table and per‑field access frequencies for downstream cost and quality analyses.
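To make the difference between a binary "accessed or not" flag and a real frequency concrete, here is a small sketch that counts accesses per table and per field from audit records. The record layout (`table`, `fields`) is an assumption for illustration; the reconciliation with HDFS‑Image and HDFS‑Log metrics described above is not modeled.

```python
from collections import Counter


def access_frequencies(audit_records):
    """Count how often each table and each (table, field) pair is accessed."""
    table_hits = Counter()
    field_hits = Counter()
    for rec in audit_records:
        table_hits[rec["table"]] += 1
        for field in set(rec["fields"]):  # count a field once per query
            field_hits[(rec["table"], field)] += 1
    return table_hits, field_hits


audit_records = [
    {"table": "dwd_orders", "fields": ["id", "amount"]},
    {"table": "dwd_orders", "fields": ["id", "id"]},
    {"table": "ods_users", "fields": ["name"]},
]
table_hits, field_hits = access_frequencies(audit_records)
print(table_hits.most_common())
```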
Metadata‑Driven Applications
Data Map
Provides full‑domain search (tables, fields, descriptions, layers, classifications, tags, departments) across Hive, Talos, Doris, Kudu, Iceberg, Elasticsearch, and MySQL. Lineage visualization links source objects (e.g., MySQL → MQ → Hive → Iceberg → Doris) to downstream applications, enabling root‑cause tracing.
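Root‑cause tracing over a lineage graph amounts to walking upstream from the affected object. The sketch below does this with a breadth‑first search over a plain edge list; the object names mirror the example chain above but are hypothetical, and a production system would issue the equivalent traversal against the graph store rather than an in‑memory list.

```python
from collections import deque


def upstream(edges, target):
    """Return all upstream objects of target, nearest first (breadth-first)."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)
    seen, order, queue = {target}, [], deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:  # guard against cycles and shared ancestors
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


edges = [
    ("mysql.orders", "talos.orders_topic"),
    ("talos.orders_topic", "hive.ods_orders"),
    ("hive.ods_orders", "iceberg.dwd_orders"),
    ("iceberg.dwd_orders", "doris.ads_orders"),
]
print(upstream(edges, "doris.ads_orders"))
```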
Data Governance
Measures two dimensions:
Modeling compliance: naming conventions, layer classification, and tag completeness.
Completeness: cross‑layer reference coverage and query‑coverage ratios.
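The modeling‑compliance dimension can be expressed as simple ratios over the table catalog. The sketch below is one possible scoring, under assumptions: the naming rule (a layer prefix such as `ods_`, `dwd_`, `dws_`, or `ads_`) and the record fields are invented for the example and are not Xiaomi's actual conventions.

```python
import re

# Hypothetical naming convention: layer prefix, then lowercase snake_case.
NAME_RE = re.compile(r"^(ods|dwd|dws|ads)_[a-z0-9_]+$")


def compliance(tables):
    """Return the share of tables that pass each modeling-compliance check."""
    total = len(tables)
    if total == 0:
        return {"naming": 0.0, "layer": 0.0, "tags": 0.0}
    naming = sum(bool(NAME_RE.match(t["name"])) for t in tables)
    layered = sum(bool(t.get("layer")) for t in tables)
    tagged = sum(bool(t.get("tags")) for t in tables)
    return {"naming": naming / total, "layer": layered / total, "tags": tagged / total}


tables = [
    {"name": "dwd_orders", "layer": "dwd", "tags": ["trade"]},
    {"name": "ads_order_daily", "layer": "ads", "tags": []},
    {"name": "TmpOrders", "layer": None, "tags": []},
    {"name": "ods_users", "layer": "ods", "tags": ["user"]},
]
scores = compliance(tables)
print(scores)
```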
Cost Governance
Implements a closed‑loop process: observe current spend, identify waste, apply optimizations, and measure savings. Key techniques include:
Single‑standard storage accounting (one consistent measurement basis) using HDFS‑Image to align logical volume with physical storage.
Daily billing for timely cost feedback.
User‑based attribution to map costs to data owners.
Real‑time cost estimation for any data operation.
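Daily billing with user‑based attribution can be sketched as a per‑owner aggregation of one day's storage usage. The unit price, the usage tuples, and the owner names below are all hypothetical; `Decimal` is used so that monetary sums do not pick up floating‑point error.

```python
from collections import defaultdict
from decimal import Decimal

PRICE_PER_GB_DAY = Decimal("0.01")  # hypothetical unit price, not a real tariff


def daily_bill(usage):
    """Sum one day's storage cost per data owner from (owner, table, gb) rows."""
    bill = defaultdict(Decimal)  # Decimal() starts each owner at 0
    for owner, _table, gb in usage:
        bill[owner] += PRICE_PER_GB_DAY * gb
    return dict(bill)


usage = [
    ("alice", "dwd_orders", 250),
    ("alice", "ads_order_daily", 50),
    ("bob", "ods_users", 100),
]
bill = daily_bill(usage)
print(bill)
```

Running the same aggregation on consecutive days gives the before and after figures needed to measure savings in the closed loop described above.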
Data Quality
Quality checks cover timeliness, uniqueness, accuracy, completeness, and consistency. The architecture uses:
Event‑triggered checks: after DAG execution, workers consume MQ events to validate newly produced tables.
Time‑triggered checks: scheduled workers query HDFS, Presto, Spark, and Doris to enforce static rules.
Stateless, horizontally scalable workers enable extensible rule sets.
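A stateless worker can be modeled as a pure function of the incoming event and a rule registry, which is what lets more workers be added without coordination. The sketch below assumes a hypothetical event shape (`table`, `checks`) and an injected `load_rows` function standing in for a query against the real engine; new rules are added by registering another function in `RULES`.

```python
def check_unique(rows, column):
    """Uniqueness: no value in the column appears more than once."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def check_not_null(rows, column):
    """Completeness: the column has no missing values."""
    return all(row[column] is not None for row in rows)


# Rule registry; extending the rule set means adding an entry here.
RULES = {"unique": check_unique, "not_null": check_not_null}


def handle_event(event, load_rows):
    """Validate the table named in the event; keeps no state between calls."""
    rows = load_rows(event["table"])
    return {f"{kind}({col})": RULES[kind](rows, col) for kind, col in event["checks"]}


# Hypothetical in-memory data in place of a real query engine.
DATA = {"dwd_orders": [{"id": 1, "amount": 10}, {"id": 1, "amount": 5}]}
event = {"table": "dwd_orders", "checks": [("unique", "id"), ("not_null", "amount")]}
result = handle_event(event, DATA.__getitem__)
print(result)
```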
Future Roadmap
The next phase focuses on three pillars:
Production‑grade resource scheduling: integrate baseline management, job execution, monitoring, and YARN resource orchestration.
Long‑term metadata roadmap: define data‑health metrics, establish health‑assessment processes, and evolve governance mechanisms.
Business enablement: provide tools for cost analysis, quality assurance, and efficiency improvement to encourage data‑driven product adoption.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
