Xiaomi Data Governance Evolution: Cost Governance Practices for HDFS and HBase
The article outlines Xiaomi's data governance journey, focusing on storage‑service cost governance, describing the transition from simple cost‑centered governance to big‑data‑driven asset management, and detailing concrete HDFS and HBase practices that achieved significant resource and cost reductions.
Data governance is a broad topic; this article mainly introduces storage‑service related work, focusing on cost governance. It is divided into two major parts and four subsections.
Part 1: Evolution of Xiaomi's data governance – From a naive view where data governance equated to cost governance, to using big data to govern big data, achieving data assetization and measurability.
Naive data governance (2018) – Early efforts treated data governance simply as cost control, with limited value and little observability. As business grew, resource waste became noticeable, prompting dedicated effort.
Cost governance follows a "grab the big, drop the small" principle: identify the highest‑cost services and clusters, assign owners, and drive optimization.
Benefits include clear, simple, and efficient goals, limited manpower, and rapid results.
Part 2: Big‑data‑driven governance – Beyond cost, governance now addresses data quality, timeliness, and security, aiming for unified control through a three‑step process.
Step 1 – Build a metadata warehouse (meta‑store) that ingests metadata from Yarn, Hive, HBase, hosts, clusters, etc., providing a single source of truth for cost and utilization.
Step 2 – Define features (rules) to flag unreasonable usage such as unused tables, duplicate data, improper lifecycle settings, or missing owners. Features are co‑defined by service and business owners and run daily against the meta‑store.
Step 3 – Productize: calculate a health score for each dataset, display it on a web portal for stakeholders, and provide remediation suggestions (e.g., enable automatic cold backup).
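Steps 2 and 3 can be sketched together: rule features run against table metadata from the meta-store, and violations are aggregated into a health score. The specific rules, weights, and metadata fields below are illustrative assumptions, not the production rule set.

```python
# Hedged sketch: each feature is (weight, predicate); the health score
# starts at 100 and loses the weight of every rule the table violates.
from datetime import datetime, timedelta

RULES = {
    "unused_90d": (30, lambda t: t["last_access"] < datetime.now() - timedelta(days=90)),
    "no_owner":   (25, lambda t: not t.get("owner")),
    "no_ttl":     (25, lambda t: t.get("ttl_days") is None),
    "oversized":  (20, lambda t: t["size_tb"] > 100),
}

def health_score(table_meta):
    """Return (score, violated rule names) for one table's metadata."""
    score, hits = 100, []
    for name, (weight, pred) in RULES.items():
        if pred(table_meta):
            score -= weight
            hits.append(name)
    return max(score, 0), hits

meta = {"last_access": datetime.now() - timedelta(days=200),
        "owner": "", "ttl_days": None, "size_tb": 12}
score, violations = health_score(meta)
print(score, violations)  # → 20 ['unused_90d', 'no_owner', 'no_ttl']
```

The per-rule hit list doubles as the remediation suggestions shown on the portal (e.g. "set an owner", "enable a TTL or automatic cold backup").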
Applying this to the storage platform resulted in a 23.8% reduction in host count and a 38.9% drop in host cost.
HDFS Governance Practice
Strategy: tiered storage (hot‑cold) using object storage for cold data. For overseas clusters, object storage is the cheapest option; a unified tiering solution is applied globally.
Implementation steps:
(1) Introduce object files in HDFS that store a list of object URIs instead of blocks.
(2) New files are initially written as ordinary block files on DataNodes (DNs).
(3) A governance service scans for files to tier, marks them on the NameNode (NN), and a Spark job rewrites them as object files.
(4) The NN creates a new INode for the object file, replacing the original; the old INode is kept temporarily in a safe box.
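The four steps above can be sketched against a toy in-memory namespace. `BlockFileINode`, `ObjectFileINode`, `SafeBox`, and the coldness predicate are illustrative assumptions; in the real system the scan and INode swap happen in the governance service and NameNode, and the rewrite is a Spark job.

```python
# Hedged sketch of the tiering transform, not the actual NN code.
import time

class BlockFileINode:
    """Ordinary HDFS file: a list of block IDs (step 2)."""
    def __init__(self, path, blocks, mtime):
        self.path, self.blocks, self.mtime = path, blocks, mtime

class ObjectFileINode:
    """Object file: stores object-store URIs instead of blocks (step 1)."""
    def __init__(self, path, object_uris):
        self.path, self.object_uris = path, object_uris

def transform(namespace, safe_box, cold_after_secs, rewrite):
    """Scan for cold block files, rewrite to object storage, swap INodes."""
    now = time.time()
    for path, inode in list(namespace.items()):
        if isinstance(inode, BlockFileINode) and now - inode.mtime > cold_after_secs:
            uris = rewrite(inode)              # step 3: rewrite job uploads data
            safe_box[path] = inode             # step 4: keep old INode in safe box
            namespace[path] = ObjectFileINode(path, uris)

ns = {"/warehouse/t1": BlockFileINode("/warehouse/t1", ["blk_1", "blk_2"], time.time() - 10_000),
      "/warehouse/t2": BlockFileINode("/warehouse/t2", ["blk_3"], time.time())}
box = {}
transform(ns, box, cold_after_secs=3600,
          rewrite=lambda f: [f"os://bucket{f.path}/part-{i}" for i in range(len(f.blocks))])
print(type(ns["/warehouse/t1"]).__name__, sorted(box))  # t1 tiered, t2 untouched
```

Keeping the old INode in the safe box is what makes the swap reversible until the object copy is verified.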
Read path: normal reads go from NN to DN; tiered reads obtain a fake block ID and a proxy DN address, the proxy DN resolves the object URI and reads from object storage, applying bandwidth controls for domestic deployments.
Special cases such as transform failures, short‑circuit reads, and cache invalidation are handled with fallback logic.
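The read path with its fallback can be sketched as follows. This is a hedged illustration: `proxy_read`, `block_read`, and the safe-box lookup are assumed interfaces standing in for the proxy DN's object resolution and the normal DN read, respectively.

```python
# Hedged sketch: tiered reads go through a proxy that resolves object
# URIs; on failure, fall back to the safe-boxed original block copy.
from types import SimpleNamespace

def read_file(inode, proxy_read, block_read, safe_box):
    """Read an HDFS file, tiered or not, with fallback on object failure."""
    if getattr(inode, "object_uris", None):          # tiered: object file
        try:
            return b"".join(proxy_read(uri) for uri in inode.object_uris)
        except IOError:
            old = safe_box.get(inode.path)           # fallback: original blocks
            if old is not None:
                return block_read(old)
            raise
    return block_read(inode)                          # normal NN -> DN read

tiered = SimpleNamespace(path="/t1", object_uris=["os://b/t1/part-0"])
safe_box = {"/t1": SimpleNamespace(path="/t1", object_uris=None, blocks=["blk_1"])}

def failing_proxy(uri):
    raise IOError("object store unreachable")

data = read_file(tiered, failing_proxy, lambda f: b"blockdata", safe_box)
print(data)  # → b'blockdata' (served from the safe-boxed copy)
```

The same fallback shape covers the other special cases the text mentions: a failed transform or an invalidated cache simply means the object path raises and the block path answers.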
Governance focuses on structured tables; unstructured data is only governed if it can be treated as a table.
Table‑level policy: tables created within the past 93 days are exempt; older tables are evaluated on partitions and usage and classified as renewable or non‑renewable. Based on TTV (target access period) and TTL (lifetime), tables are assigned hot, warm, or cold status and moved accordingly.
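The table-level policy can be sketched as a small classifier. The 93-day exemption comes from the text; the particular TTV/TTL values in the demo are illustrative assumptions.

```python
# Hedged sketch: classify a table as exempt / hot / warm / cold from its
# creation date, last access, TTV (target access period) and TTL (lifetime).
from datetime import date

def classify(created, last_access, ttv_days, ttl_days, today=None):
    today = today or date.today()
    if (today - created).days < 93:
        return "exempt"                  # too young to govern
    idle = (today - last_access).days
    if ttl_days is not None and idle > ttl_days:
        return "cold"                    # past its lifetime: tier or archive
    if idle > ttv_days:
        return "warm"                    # past its expected access window
    return "hot"

today = date(2024, 6, 1)
print(classify(date(2024, 5, 1), date(2024, 5, 30), 30, 365, today))  # exempt
print(classify(date(2023, 1, 1), date(2024, 5, 30), 30, 365, today))  # hot
print(classify(date(2023, 1, 1), date(2024, 3, 1), 30, 365, today))   # warm
print(classify(date(2023, 1, 1), date(2023, 2, 1), 30, 365, today))   # cold
```

Hot tables stay on block storage, warm and cold tables become candidates for the object-file transform described above.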
HBase Governance Practice
HBase is widely used for online workloads requiring low latency, typically on SSDs. Governance also follows hot‑cold separation, with cheaper storage options (HDD, tiering, EC, high‑density machines) for cold data.
Five scenarios are addressed:
Scenario 1 – High‑consistency backup and offline clusters: use tiering for HFile storage, keeping WAL with three replicas.
Scenario 2 – High‑availability backup: use erasure coding (EC) to retain performance comparable to the primary cluster.
Scenario 3 – Online tables (including time‑series): apply time‑based hot‑cold partitioning; use HDD domestically and tiering overseas.
Scenarios 4 and 5 – Migration to offline or archival deletion: migrate write‑only tables to offline clusters with tiering; delete long‑inactive tables after archiving.
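The scenario-to-strategy mapping above amounts to a small dispatch table. This sketch paraphrases the strategies from the text; the scenario keys and the `region` parameter are illustrative assumptions.

```python
# Hedged sketch: route each HBase governance scenario to its
# cold-data storage strategy, with region-dependent handling for
# online time-series tables (HDD domestically, tiering overseas).
def cold_strategy(scenario, region="domestic"):
    table = {
        "consistency_backup":  "tiered HFiles + 3-replica WAL",
        "availability_backup": "erasure coding (EC)",
        "online_timeseries":   "HDD" if region == "domestic" else "tiering",
        "write_only":          "migrate to offline cluster with tiering",
        "inactive":            "archive, then delete",
    }
    return table[scenario]

print(cold_strategy("online_timeseries", region="overseas"))  # → tiering
print(cold_strategy("availability_backup"))                   # → erasure coding (EC)
```

Keeping the WAL at three replicas in the backup scenario preserves write durability while only the immutable HFiles move to cheaper storage.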
The HBase governance resulted in a 16.6% reduction in machines.
Overall, the presentation demonstrates Xiaomi's data governance evolution, the practical cost‑optimization techniques applied to HDFS and HBase, and the measurable resource savings achieved.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.