Didi's Big Data Asset Governance Practices: Hadoop and Elasticsearch Governance
This article details Didi's comprehensive big‑data asset governance platform, covering its architectural layers, Hadoop and Elasticsearch governance practices, health‑score models, lifecycle recommendations, and future plans for automated and intelligent data governance to reduce cost and manual effort.
Data asset governance is a critical component of big‑data applications, enabling cost reduction, efficiency gains, and higher data utilization; platform‑based tools are essential for effective governance.
The presentation is organized into five parts: (1) Didi's big‑data asset management platform, (2) Hadoop governance practice, (3) Elasticsearch (ES) governance practice, (4) future planning, and (5) a Q&A session.
The Didi data system has three layers: a data foundation layer (storage, compute, real‑time query, messaging, scheduling); a middle "Data Dream Factory" layer offering one‑stop development tools; and an upper data service/application layer. The asset management platform serves as a unified hub for managing storage (Hive, HDFS) and compute (Spark) resources.
Platform functions are divided into cost management, asset management, and asset governance, the latter covering storage, compute, quality, and security governance, supported by tools such as a governance workbench, automated governance, and permission recycling.
Hadoop governance focuses on storage objects (Hive tables, HDFS paths) and compute objects (Spark and MapReduce tasks). Its architecture includes a metadata layer, a data‑model layer, and a governance‑application layer, providing health scores and governance items such as data skew, brute‑force scanning, and lifecycle issues.
Data skew is identified by parsing engine logs, computing a skew rate, and checking it against a threshold; typical mitigations include handling hotspot keys and using broadcast joins for large‑to‑small table joins. Brute‑force scan detection relies on the number of partitions scanned and the data volume read, with recommendations to tighten partition filters or avoid implicit joins.
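The talk does not give the exact skew‑rate formula, but the detection step can be sketched as comparing the largest task's input against the median task input from engine‑log metrics. The 5x threshold and the max/median ratio below are illustrative assumptions, not Didi's published values.

```python
# Hypothetical skew detection from per-task metrics parsed out of engine logs.
# skew_rate and the threshold are assumptions for illustration.
from statistics import median

def skew_rate(task_input_bytes):
    """Ratio of the largest task's input size to the median task's input size."""
    med = median(task_input_bytes)
    return max(task_input_bytes) / med if med else float("inf")

def is_skewed(task_input_bytes, threshold=5.0):
    # Flag the stage when one task reads far more data than a typical task.
    return skew_rate(task_input_bytes) >= threshold

# A stage where one task reads ~10x the median input is flagged as skewed:
print(is_skewed([100, 110, 95, 105, 1000]))  # True
print(is_skewed([100, 110, 95, 105]))        # False
```

A flagged job would then surface as a governance item along with the suggested fix (hotspot‑key handling or a broadcast join).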
Health scores aggregate deductions from governance items, producing separate storage and compute scores; deductions are weighted by impact on cost, and influence factors (e.g., recent compute consumption) adjust scores across dimensions such as personal, project, and account views.
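The scoring mechanics can be sketched as a baseline score minus weighted deductions, scaled by an influence factor; the baseline of 100, the specific weights, and the factor value are assumptions for illustration.

```python
# Minimal sketch of a health score: start at 100, subtract a cost-weighted
# deduction per triggered governance item, then scale by an influence factor
# (e.g. recent compute consumption). All constants here are illustrative.
def health_score(deductions, influence_factor=1.0):
    """deductions: list of (points, cost_weight) pairs, one per governance item."""
    total = sum(points * weight for points, weight in deductions)
    score = 100 - total * influence_factor
    return max(0.0, min(100.0, score))  # clamp to the 0-100 range

# Two hypothetical storage items: a lifecycle issue and a small-files issue.
storage_items = [(10, 1.0), (5, 0.5)]
print(health_score(storage_items, influence_factor=1.2))  # 85.0
```

Computing the same aggregate over a user's, a project's, or an account's items yields the personal, project, and account views the platform exposes.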
ES governance mirrors the Hadoop practice but targets index templates. Governance items include empty templates, unreasonable lifecycles, field optimization, and unused fields. Lifecycle recommendations are derived from analyzing each template's access span over a 33‑day window, while field optimization disables the forward (doc_values) and inverted indexes of fields that access logs show are never queried.
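The 33‑day window comes from the talk, but the recommendation rule itself is not spelled out; a plausible sketch is to find the oldest day within the window on which the template was actually read and suggest a retention just past that span. The padding value and the zero‑retention treatment of never‑read templates are assumptions.

```python
# Hypothetical lifecycle recommendation for an ES index template, based on
# which days within the last 33 it was actually accessed. The 3-day padding
# is an illustrative assumption.
from datetime import date, timedelta

def recommend_lifecycle_days(access_dates, today, window=33, padding=3):
    cutoff = today - timedelta(days=window)
    recent = [d for d in access_dates if d >= cutoff]
    if not recent:
        return 0  # never read within the window: candidate for deletion
    span = (today - min(recent)).days  # days back to the oldest access
    return min(window, span + padding)

today = date(2023, 6, 30)
hits = [date(2023, 6, 29), date(2023, 6, 20)]
print(recommend_lifecycle_days(hits, today))  # 13
```

A template whose recommended retention is far below its configured one would then surface as an "unreasonable lifecycle" governance item.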
The platform presents governance items, associated actions, and health‑score dashboards, enabling users to prioritize high‑impact tasks.
Future plans aim to further automate governance, improve recommendation accuracy, and introduce budget‑control mechanisms to shift governance upstream and reduce manual effort.
The session concludes with a Q&A covering topics such as skew‑rate constants, lifecycle recommendation logic, entry‑level data‑governance guidance, deletion processes, formula applicability, user satisfaction, governance factor calculation, and strategies for organizations with limited technical resources.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.