
Building an Intelligent Data Governance System with Active Metadata: JD Retail Experience

This article describes how JD Retail built an intelligent data governance system on active metadata. It covers the data management challenges, the design of a comprehensive governance framework, the lifecycle evaluation model, and an automated data backfill solution, all aimed at improving efficiency, cost control, and operational safety.

JD Retail Technology

01 Data Management Challenges

JD Retail faces multiple data management challenges: rapid data growth leads to inefficient and redundant data models, high maintenance costs, and poor data quality; shared accounts for data development and management cause change‑management risks; and the increasing number of tables and storage size intensifies compute and storage consumption.

02 Data Governance System Construction

2.1 Governance Approach

The overall approach addresses four areas (data standards, architecture, development, and cost) and uses technology to improve efficiency across the end-to-end data pipeline.

Define and systematize data standards, certify assets.

Upgrade the data architecture for agility so that it supports business goals.

Isolate development and production for data safety.

Build storage‑compute governance to reduce operational costs.

2.2 Governance Framework

Standard governance establishes a unified retail data language, describing models with elements such as business domain, process, attributes, update frequency, and granularity. Assets are certified and high‑value models are cataloged, while low‑quality models are decommissioned.
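The descriptive elements listed above can be sketched as a small record type with a completeness check. This is only an illustration: the class name `ModelDescriptor`, the helper `missing_elements`, and the sample table names are hypothetical and do not come from JD Retail's platform.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class ModelDescriptor:
    """Hypothetical standard description of one data model."""
    name: str
    business_domain: str = ""
    business_process: str = ""
    attributes: Tuple[str, ...] = ()
    update_frequency: str = ""
    granularity: str = ""


def missing_elements(model: ModelDescriptor) -> List[str]:
    """Return the names of the standard elements that are still empty."""
    return [f.name for f in fields(model) if not getattr(model, f.name)]


# Illustrative models; the names and values are made up.
orders = ModelDescriptor(
    name="dwd_order_detail",
    business_domain="trade",
    business_process="order placement",
    attributes=("order_id", "sku_id", "amount"),
    update_frequency="daily",
    granularity="one row per order line",
)
draft = ModelDescriptor(name="tmp_orders_copy", business_domain="trade")

print(missing_elements(orders))  # fully described model
print(missing_elements(draft))   # candidate for follow-up before certification
```

A check like this could feed certification: a model with no missing elements is a candidate for the asset catalog, while one with many gaps is flagged for review.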

3 Architecture Governance

Logical virtual tables replace wide physical tables, abstracting models into dimensions and metrics for agility. Intelligent materialization, driven by history-based (HBO), cost-based (CBO), and rule-based (RBO) optimization, automatically decides which tables to pre-materialize, reducing manual effort and resource consumption.
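To make the history-based part of the materialization decision concrete, here is a minimal sketch. It assumes a query log reduced to (dimensions, metric) pairs and a simple frequency threshold. The function `pick_materializations`, the `min_hits` parameter, and the sample log are hypothetical; the article does not describe the actual decision logic, which also involves cost-based and rule-based signals.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# One logged request against a logical table: (dimensions, metric).
Query = Tuple[Tuple[str, ...], str]


def pick_materializations(history: Iterable[Query], min_hits: int) -> List[Query]:
    """Return the dimension/metric combinations queried at least min_hits times.

    Dimension order is normalized so ("date", "category") and
    ("category", "date") count as the same combination.
    """
    counts = Counter((tuple(sorted(dims)), metric) for dims, metric in history)
    return sorted(q for q, n in counts.items() if n >= min_hits)


# Made-up query history for illustration.
history = [
    (("date", "category"), "gmv"),
    (("category", "date"), "gmv"),
    (("date",), "orders"),
    (("date", "category"), "gmv"),
]
to_materialize = pick_materializations(history, min_hits=3)
print(to_materialize)
```

Only the combination requested three times passes the threshold here; rarely used combinations stay virtual and are computed on demand.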

4 Development Governance

Production‑development isolation secures data pipelines.

5 Resource Governance

Storage governance includes table lifecycle, invalid table identification, and data compression; compute governance optimizes task execution, reduces idle resources, and supports peak‑shaving.

Active metadata is mined to build governance models and visual dashboards, enabling data‑driven, objective, and automated recommendations.

03 Active Metadata Governance Practices

1. What is active metadata? It is metadata that is continuously accessed, processed, and analyzed, and that is generated and updated automatically to support intelligent analysis and decision-making.

2. Core capabilities of active metadata tools include clustering analysis, resource diagnosis, alerting, and recommendation (as described by Gartner).

2. Storage Governance Challenges

Lack of data support for partition consumption and cost.

High evaluation cost for >200k models.

Reluctance to govern due to risk and effort.

Historical data must be retained, limiting deletion.

Solution: a cost‑based intelligent lifecycle evaluation system that quantifies storage and compute costs, recommends optimal lifecycles, and visualizes recommendations.

3 Intelligent Lifecycle Evaluation System

The lifecycle is defined as the time from when data is written to when it is deleted. A cost model balances storage cost against compute cost to find the equilibrium point for each model, taking into account the model's layer, selection, certification, task level, and processing time.
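The storage-versus-compute trade-off can be illustrated with a toy cost function. This sketch assumes that keeping a partition costs a fixed storage price per day, and that reading a partition older than the retention period forces a recomputation at a fixed cost. The function names, parameters, and all numbers are hypothetical, and the sketch ignores the other factors the real model considers (layer, certification, task level, processing time).

```python
from typing import Dict, Iterable


def daily_cost(retention_days: int, partition_tb: float, storage_price: float,
               reads_by_age: Dict[int, float], recompute_cost: float) -> float:
    """Estimated cost per day of keeping `retention_days` daily partitions."""
    # Steady state: one partition of partition_tb is stored per retained day.
    storage = retention_days * partition_tb * storage_price
    # Reads that hit partitions older than the retention window need a recompute.
    expired_reads = sum(n for age, n in reads_by_age.items() if age > retention_days)
    return storage + expired_reads * recompute_cost


def recommend_lifecycle(candidates: Iterable[int], partition_tb: float,
                        storage_price: float, reads_by_age: Dict[int, float],
                        recompute_cost: float) -> int:
    """Pick the candidate retention period with the lowest estimated daily cost."""
    return min(candidates, key=lambda days: daily_cost(
        days, partition_tb, storage_price, reads_by_age, recompute_cost))


# Made-up inputs: average reads per day, keyed by partition age in days.
reads_by_age = {1: 10, 7: 5, 30: 2, 90: 0.1, 365: 0.01}
best = recommend_lifecycle([7, 30, 90, 365], partition_tb=1.0, storage_price=0.5,
                           reads_by_age=reads_by_age, recompute_cost=20.0)
print(best)  # 30 under these made-up inputs
```

With these inputs a 7-day lifecycle is cheap to store but expensive to recompute, a 365-day lifecycle is the reverse, and 30 days is where the two costs balance.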

4 Intelligent Lifecycle Productization

Automatic recommendation for tens of thousands of models, identifying hundreds of PB of governance space.

High acceptance (>70%) and >100 PB of storage governance completed.

Integrated into the big‑data platform for enterprise‑wide use.

5 Data Backfill Challenges

Manual backfill requires extensive coordination and consumes about 18% of compute resources. An automated solution aims to detect missing partitions, orchestrate the backfill topology using data lineage, execute the work in batches, and coordinate resources dynamically.
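Detecting missing partitions can be as simple as comparing the partitions a table should have with those that exist. The sketch below assumes daily partitions keyed by date; the function name `missing_partitions` and the sample dates are hypothetical, and a real system would read the existing partitions from the metastore.

```python
from datetime import date, timedelta
from typing import List, Set


def missing_partitions(start: date, end: date, existing: Set[date]) -> List[date]:
    """Return the daily partitions expected in [start, end] that do not exist."""
    total_days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(total_days))
    return [day for day in expected if day not in existing]


# Made-up example: partitions for Jan 3 and Jan 5 were never written.
existing = {date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)}
gaps = missing_partitions(date(2024, 1, 1), date(2024, 1, 5), existing)
print(gaps)
```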

6 Smart Backfill Architecture

The architecture uses production lineage (table and task dependencies) to sense missing partitions, plan the backfill order, and execute optimized batches, reducing manual effort and resource usage. Rollout is expected in Q2.
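Planning the backfill order from lineage amounts to a topological sort over the affected tables. The sketch below uses Python's standard `graphlib` module (Python 3.9+) and groups tables into batches that can run in parallel. The function `backfill_batches`, the lineage dictionary, and the table names are hypothetical; the article does not describe the production scheduler, which also coordinates resources dynamically.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def backfill_batches(upstream: Dict[str, Set[str]], broken: Set[str]) -> List[List[str]]:
    """Group tables needing backfill into ordered batches.

    `upstream` maps each table to the tables it reads from. A table is
    affected if it is broken or reads (directly or transitively) from an
    affected table. Each batch depends only on earlier batches.
    """
    affected = set(broken)
    changed = True
    while changed:  # propagate the damage downstream until nothing new is added
        changed = False
        for table, parents in upstream.items():
            if table not in affected and parents & affected:
                affected.add(table)
                changed = True

    # Keep only dependencies between affected tables; healthy inputs need no rerun.
    graph = {t: upstream.get(t, set()) & affected for t in affected}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the lineage has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Made-up lineage for illustration.
lineage = {
    "ods_orders": set(),
    "dwd_orders": {"ods_orders"},
    "dws_shop_day": {"dwd_orders"},
    "dws_user_day": {"dwd_orders"},
    "ads_report": {"dws_shop_day", "dws_user_day"},
    "dim_sku": set(),
}
plan = backfill_batches(lineage, broken={"dwd_orders"})
print(plan)
```

Here the two `dws_` tables land in the same batch because neither depends on the other, while `ods_orders` and `dim_sku` are left out because they are not affected.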

04 Summary and Future Outlook

The presentation covered three parts: (1) data-fabric governance driven by active metadata, (2) lineage-based intelligent backfill, and (3) logical modeling with smart materialization. Future work includes more automation, AI-driven task optimization, semantic entity recognition for asset certification, and turning governance experience into systematic, AI-enhanced capabilities.

Q&A

Q1: What role does metadata play in governance, and what are the biggest challenges for developers? A1: Metadata provides insight into cost, execution, and usage. The main challenges for developers are a lack of guidance on lifecycle settings and limited time.

Q2: What are the constraints of active metadata? A2: Accuracy is critical; active metadata must be trustworthy before it can drive recommendations.

