Didi's Big Data Asset Governance Practices: Hadoop and Elasticsearch Governance
This article details Didi's comprehensive big‑data asset governance platform, covering its architectural layers, Hadoop and Elasticsearch governance practices, health‑score models, lifecycle recommendations, and future plans for automated and intelligent data governance to reduce cost and manual effort.
Data asset governance is a critical component of big‑data applications, enabling cost reduction, efficiency gains, and higher data utilization; platform‑based tools are essential for effective governance.
The presentation is organized into five parts: (1) Didi's big‑data asset management platform, (2) Hadoop governance practice, (3) Elasticsearch (ES) governance practice, (4) future planning, and (5) a Q&A session.
The Didi data system has three layers: a data foundation layer (storage, compute, real‑time query, messaging, scheduling); a middle "Data Dream Factory" layer offering one‑stop development tools; and an upper data service/application layer. The asset management platform serves as a unified hub for managing storage (Hive, HDFS) and compute (Spark) resources.
Platform functions are divided into cost management, asset management, and asset governance, the latter covering storage, compute, quality, and security governance, supported by tools such as a governance workbench, automated governance, and permission recycling.
Hadoop governance focuses on storage objects (Hive tables, HDFS paths) and compute objects (Spark and MapReduce tasks). Its architecture includes a metadata layer, a data‑model layer, and a governance‑application layer, providing health scores and governance items such as data skew, brute‑force scanning, and lifecycle issues.
Data skew is identified by parsing engine logs, computing a skew rate, and checking it against a threshold; typical mitigations include handling hotspot keys and using broadcast joins for large‑to‑small table joins. Brute‑force scan detection relies on the number of partitions scanned and the data volume read, with recommendations to tighten partition filters or avoid implicit joins.
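The talk does not give the exact skew‑rate formula, but the detection step can be sketched as comparing the largest task's input against the median task input from engine‑log metrics. The 5x threshold and the max/median ratio below are illustrative assumptions, not Didi's published values.

```python
# Hypothetical skew detection from per-task metrics parsed out of engine logs.
# skew_rate and the threshold are assumptions for illustration.
from statistics import median

def skew_rate(task_input_bytes):
    """Ratio of the largest task's input size to the median task's input size."""
    med = median(task_input_bytes)
    return max(task_input_bytes) / med if med else float("inf")

def is_skewed(task_input_bytes, threshold=5.0):
    # Flag the stage when one task reads far more data than a typical task.
    return skew_rate(task_input_bytes) >= threshold

# A stage where one task reads ~10x the median input is flagged as skewed:
print(is_skewed([100, 110, 95, 105, 1000]))  # True
print(is_skewed([100, 110, 95, 105]))        # False
```

A flagged job would then surface as a governance item along with the suggested fix (hotspot‑key handling or a broadcast join).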
Health scores aggregate deductions from governance items, producing separate storage and compute scores; deductions are weighted by impact on cost, and influence factors (e.g., recent compute consumption) adjust scores across dimensions such as personal, project, and account views.
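The scoring mechanics can be sketched as a baseline score minus weighted deductions, scaled by an influence factor; the baseline of 100, the specific weights, and the factor value are assumptions for illustration.

```python
# Minimal sketch of a health score: start at 100, subtract a cost-weighted
# deduction per triggered governance item, then scale by an influence factor
# (e.g. recent compute consumption). All constants here are illustrative.
def health_score(deductions, influence_factor=1.0):
    """deductions: list of (points, cost_weight) pairs, one per governance item."""
    total = sum(points * weight for points, weight in deductions)
    score = 100 - total * influence_factor
    return max(0.0, min(100.0, score))  # clamp to the 0-100 range

# Two hypothetical storage items: a lifecycle issue and a small-files issue.
storage_items = [(10, 1.0), (5, 0.5)]
print(health_score(storage_items, influence_factor=1.2))  # 85.0
```

Computing the same aggregate over a user's, a project's, or an account's items yields the personal, project, and account views the platform exposes.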
ES governance mirrors the Hadoop practice but targets index templates. Governance items include empty templates, unreasonable lifecycles, field optimization, and unused fields. Lifecycle recommendations are derived from analyzing each template's access span over a 33‑day window, while field optimization disables the forward (doc_values) and inverted indexes of fields that access logs show are never queried.
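The 33‑day window comes from the talk, but the recommendation rule itself is not spelled out; a plausible sketch is to find the oldest day within the window on which the template was actually read and suggest a retention just past that span. The padding value and the zero‑retention treatment of never‑read templates are assumptions.

```python
# Hypothetical lifecycle recommendation for an ES index template, based on
# which days within the last 33 it was actually accessed. The 3-day padding
# is an illustrative assumption.
from datetime import date, timedelta

def recommend_lifecycle_days(access_dates, today, window=33, padding=3):
    cutoff = today - timedelta(days=window)
    recent = [d for d in access_dates if d >= cutoff]
    if not recent:
        return 0  # never read within the window: candidate for deletion
    span = (today - min(recent)).days  # days back to the oldest access
    return min(window, span + padding)

today = date(2023, 6, 30)
hits = [date(2023, 6, 29), date(2023, 6, 20)]
print(recommend_lifecycle_days(hits, today))  # 13
```

A template whose recommended retention is far below its configured one would then surface as an "unreasonable lifecycle" governance item.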
The platform presents governance items, associated actions, and health‑score dashboards, enabling users to prioritize high‑impact tasks.
Future plans aim to further automate governance, improve recommendation accuracy, and introduce budget‑control mechanisms to shift governance upstream and reduce manual effort.
The session concludes with a Q&A covering topics such as skew‑rate constants, lifecycle recommendation logic, entry‑level data‑governance guidance, deletion processes, formula applicability, user satisfaction, governance factor calculation, and strategies for organizations with limited technical resources.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.