Xiaomi Data Governance Evolution: Cost Governance Practices for HDFS and HBase
The article outlines Xiaomi's data governance journey, focusing on storage‑service cost governance, describing the transition from simple cost‑centered governance to big‑data‑driven asset management, and detailing concrete HDFS and HBase practices that achieved significant resource and cost reductions.
Data governance is a broad topic; this article mainly introduces storage‑service related work, focusing on cost governance. It is divided into two major parts and four subsections.
Part 1: Evolution of Xiaomi's data governance – From a naive view where data governance equated to cost governance, to using big data to govern big data, achieving data assetization and measurability.
Naive data governance (2018) – Early efforts treated data governance simply as cost control, with limited value and little observability. As business grew, resource waste became noticeable, prompting dedicated effort.
Cost governance follows a "grab the big, drop the small" principle: identify the highest‑cost services and clusters, assign owners, and drive optimization.
Benefits include clear, simple, and efficient goals, limited manpower, and rapid results.
Part 2: Big‑data‑driven governance – Beyond cost, governance now addresses data quality, timeliness, and security, aiming for unified control through a three‑step process.
Step 1 – Build a metadata warehouse (meta‑store) that ingests metadata from Yarn, Hive, HBase, hosts, clusters, etc., providing a single source of truth for cost and utilization.
Step 2 – Define features (rules) to flag unreasonable usage such as unused tables, duplicate data, improper lifecycle settings, or missing owners. Features are co‑defined by service and business owners and run daily against the meta‑store.
Step 3 – Productize: calculate a health score for each dataset, display it on a web portal for stakeholders, and provide remediation suggestions (e.g., enable automatic cold backup).
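Steps 2 and 3 can be sketched together: rule features run against table metadata from the meta-store, and violations are aggregated into a health score. The specific rules, weights, and metadata fields below are illustrative assumptions, not the production rule set.

```python
# Hedged sketch: each feature is (weight, predicate); the health score
# starts at 100 and loses the weight of every rule the table violates.
from datetime import datetime, timedelta

RULES = {
    "unused_90d": (30, lambda t: t["last_access"] < datetime.now() - timedelta(days=90)),
    "no_owner":   (25, lambda t: not t.get("owner")),
    "no_ttl":     (25, lambda t: t.get("ttl_days") is None),
    "oversized":  (20, lambda t: t["size_tb"] > 100),
}

def health_score(table_meta):
    """Return (score, violated rule names) for one table's metadata."""
    score, hits = 100, []
    for name, (weight, pred) in RULES.items():
        if pred(table_meta):
            score -= weight
            hits.append(name)
    return max(score, 0), hits

meta = {"last_access": datetime.now() - timedelta(days=200),
        "owner": "", "ttl_days": None, "size_tb": 12}
score, violations = health_score(meta)
print(score, violations)  # → 20 ['unused_90d', 'no_owner', 'no_ttl']
```

The per-rule hit list doubles as the remediation suggestions shown on the portal (e.g. "set an owner", "enable a TTL or automatic cold backup").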
Applying this to the storage platform resulted in a 23.8% reduction in host count and a 38.9% drop in host cost.
HDFS Governance Practice
Strategy: tiered storage (hot‑cold) using object storage for cold data. For overseas clusters, object storage is the cheapest option; a unified tiering solution is applied globally.
Implementation steps:
(1) Introduce object files in HDFS that store a list of object URIs instead of blocks.
(2) New files are initially written as ordinary block files on DataNodes (DNs).
(3) A governance service scans for files to tier, marks them on the NameNode (NN), and a Spark job rewrites them as object files.
(4) The NN creates a new INode for the object file, replacing the original; the old INode is kept temporarily in a safe box.
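The four steps above can be sketched against a toy in-memory namespace. `BlockFileINode`, `ObjectFileINode`, `SafeBox`, and the coldness predicate are illustrative assumptions; in the real system the scan and INode swap happen in the governance service and NameNode, and the rewrite is a Spark job.

```python
# Hedged sketch of the tiering transform, not the actual NN code.
import time

class BlockFileINode:
    """Ordinary HDFS file: a list of block IDs (step 2)."""
    def __init__(self, path, blocks, mtime):
        self.path, self.blocks, self.mtime = path, blocks, mtime

class ObjectFileINode:
    """Object file: stores object-store URIs instead of blocks (step 1)."""
    def __init__(self, path, object_uris):
        self.path, self.object_uris = path, object_uris

def transform(namespace, safe_box, cold_after_secs, rewrite):
    """Scan for cold block files, rewrite to object storage, swap INodes."""
    now = time.time()
    for path, inode in list(namespace.items()):
        if isinstance(inode, BlockFileINode) and now - inode.mtime > cold_after_secs:
            uris = rewrite(inode)              # step 3: rewrite job uploads data
            safe_box[path] = inode             # step 4: keep old INode in safe box
            namespace[path] = ObjectFileINode(path, uris)

ns = {"/warehouse/t1": BlockFileINode("/warehouse/t1", ["blk_1", "blk_2"], time.time() - 10_000),
      "/warehouse/t2": BlockFileINode("/warehouse/t2", ["blk_3"], time.time())}
box = {}
transform(ns, box, cold_after_secs=3600,
          rewrite=lambda f: [f"os://bucket{f.path}/part-{i}" for i in range(len(f.blocks))])
print(type(ns["/warehouse/t1"]).__name__, sorted(box))  # t1 tiered, t2 untouched
```

Keeping the old INode in the safe box is what makes the swap reversible until the object copy is verified.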
Read path: normal reads go from NN to DN; tiered reads obtain a fake block ID and a proxy DN address, the proxy DN resolves the object URI and reads from object storage, applying bandwidth controls for domestic deployments.
Special cases such as transform failures, short‑circuit reads, and cache invalidation are handled with fallback logic.
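The read path with its fallback can be sketched as follows. This is a hedged illustration: `proxy_read`, `block_read`, and the safe-box lookup are assumed interfaces standing in for the proxy DN's object resolution and the normal DN read, respectively.

```python
# Hedged sketch: tiered reads go through a proxy that resolves object
# URIs; on failure, fall back to the safe-boxed original block copy.
from types import SimpleNamespace

def read_file(inode, proxy_read, block_read, safe_box):
    """Read an HDFS file, tiered or not, with fallback on object failure."""
    if getattr(inode, "object_uris", None):          # tiered: object file
        try:
            return b"".join(proxy_read(uri) for uri in inode.object_uris)
        except IOError:
            old = safe_box.get(inode.path)           # fallback: original blocks
            if old is not None:
                return block_read(old)
            raise
    return block_read(inode)                          # normal NN -> DN read

tiered = SimpleNamespace(path="/t1", object_uris=["os://b/t1/part-0"])
safe_box = {"/t1": SimpleNamespace(path="/t1", object_uris=None, blocks=["blk_1"])}

def failing_proxy(uri):
    raise IOError("object store unreachable")

data = read_file(tiered, failing_proxy, lambda f: b"blockdata", safe_box)
print(data)  # → b'blockdata' (served from the safe-boxed copy)
```

The same fallback shape covers the other special cases the text mentions: a failed transform or an invalidated cache simply means the object path raises and the block path answers.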
Governance focuses on structured tables; unstructured data is only governed if it can be treated as a table.
Table‑level policy: tables created within the past 93 days are exempt; older tables are evaluated on partitions and usage and classified as renewable or non‑renewable. Based on TTV (target access period) and TTL (lifetime), tables are assigned hot, warm, or cold status and moved accordingly.
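The table-level policy can be sketched as a small classifier. The 93-day exemption comes from the text; the particular TTV/TTL values in the demo are illustrative assumptions.

```python
# Hedged sketch: classify a table as exempt / hot / warm / cold from its
# creation date, last access, TTV (target access period) and TTL (lifetime).
from datetime import date

def classify(created, last_access, ttv_days, ttl_days, today=None):
    today = today or date.today()
    if (today - created).days < 93:
        return "exempt"                  # too young to govern
    idle = (today - last_access).days
    if ttl_days is not None and idle > ttl_days:
        return "cold"                    # past its lifetime: tier or archive
    if idle > ttv_days:
        return "warm"                    # past its expected access window
    return "hot"

today = date(2024, 6, 1)
print(classify(date(2024, 5, 1), date(2024, 5, 30), 30, 365, today))  # exempt
print(classify(date(2023, 1, 1), date(2024, 5, 30), 30, 365, today))  # hot
print(classify(date(2023, 1, 1), date(2024, 3, 1), 30, 365, today))   # warm
print(classify(date(2023, 1, 1), date(2023, 2, 1), 30, 365, today))   # cold
```

Hot tables stay on block storage, warm and cold tables become candidates for the object-file transform described above.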
HBase Governance Practice
HBase is widely used for online workloads requiring low latency, typically on SSDs. Governance also follows hot‑cold separation, with cheaper storage options (HDD, tiering, EC, high‑density machines) for cold data.
Five scenarios are addressed:
Scenario 1 – High‑consistency backup and offline clusters: use tiering for HFile storage, keeping WAL with three replicas.
Scenario 2 – High‑availability backup: use erasure coding (EC) to retain performance comparable to the primary cluster.
Scenario 3 – Online tables (including time‑series): apply time‑based hot‑cold partitioning; use HDD domestically and tiering overseas.
Scenarios 4 and 5 – Migration to offline or archival deletion: migrate write‑only tables to offline clusters with tiering; delete long‑inactive tables after archiving.
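The scenario-to-strategy mapping above amounts to a small dispatch table. This sketch paraphrases the strategies from the text; the scenario keys and the `region` parameter are illustrative assumptions.

```python
# Hedged sketch: route each HBase governance scenario to its
# cold-data storage strategy, with region-dependent handling for
# online time-series tables (HDD domestically, tiering overseas).
def cold_strategy(scenario, region="domestic"):
    table = {
        "consistency_backup":  "tiered HFiles + 3-replica WAL",
        "availability_backup": "erasure coding (EC)",
        "online_timeseries":   "HDD" if region == "domestic" else "tiering",
        "write_only":          "migrate to offline cluster with tiering",
        "inactive":            "archive, then delete",
    }
    return table[scenario]

print(cold_strategy("online_timeseries", region="overseas"))  # → tiering
print(cold_strategy("availability_backup"))                   # → erasure coding (EC)
```

Keeping the WAL at three replicas in the backup scenario preserves write durability while only the immutable HFiles move to cheaper storage.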
The HBase governance resulted in a 16.6% reduction in machines.
Overall, the presentation demonstrates Xiaomi's data governance evolution, the practical cost‑optimization techniques applied to HDFS and HBase, and the measurable resource savings achieved.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.