
JD.com Data Governance: Architecture, Key Technologies, and Future Directions

JD.com’s data‑governance framework is built around a health‑score‑driven, automated platform that cross‑verifies audit logs, builds full‑link and operator‑level lineage, and introduces standard fields. It also raises resource utilization through resource mixing, task staggering, and cross‑datacenter scheduling, and targets real‑time, AI‑enhanced detection and full automation.

JD Retail Technology

Introduction

The author, Jia Jiankun, presented at DataFun Summit 2024, sharing JD.com’s exploration and practice in large‑scale data governance. The talk addressed the essential conditions for effective governance, the transition from ad‑hoc, campaign‑style governance to a normalized, continuous model, and ways to reduce governance costs.

01 Background and Solution

In today’s data‑driven era, data is a critical production factor. JD.com’s data infrastructure spans tens of thousands of servers, exabytes of storage, millions of data models, and millions of daily tasks, with annual costs in the double‑digit millions. This cost pressure makes systematic governance a necessity.

Key challenges include:

Complex scenarios and constantly evolving control rules. Some legacy jobs bypass the platform tools and work directly against HDFS or Spark, which makes auditing and lineage tracking difficult.

Large numbers of platform users with low awareness of governance costs and limited willingness to engage in proactive governance.

High manual governance cost and risk, where incorrect human judgment can cause production incidents.

Governance Approach

JD.com introduced a “health score” and a monetized billing model to quantify governance benefits, helping users perceive the impact of governance. An automated governance platform was built to discover issues, notify users, execute one‑click remediation, and evaluate benefits through quantitative metrics, thereby improving governance efficiency.
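As a rough illustration of how a health score and a monetized bill could be derived from open governance issues, consider the sketch below. The penalty points, unit prices, and the `Issue` fields are all assumptions made for this example; the talk does not describe JD.com’s actual scoring formula or billing rates.

```python
from dataclasses import dataclass

# Hypothetical unit prices; the source does not state real billing rates.
STORAGE_PRICE_PER_TB = 10.0
COMPUTE_PRICE_PER_CORE_HOUR = 0.05


@dataclass
class Issue:
    name: str
    penalty: int              # points deducted from the health score (assumed)
    wasted_tb: float          # storage that remediation would free
    wasted_core_hours: float  # compute that remediation would free


def health_score(issues):
    """Start at 100 and deduct each open issue's penalty, floored at 0."""
    return max(0, 100 - sum(issue.penalty for issue in issues))


def monthly_saving(issues):
    """Express open issues as the bill reduction that fixing them would bring."""
    return sum(
        issue.wasted_tb * STORAGE_PRICE_PER_TB
        + issue.wasted_core_hours * COMPUTE_PRICE_PER_CORE_HOUR
        for issue in issues
    )


issues = [
    Issue("table without lifecycle", penalty=15, wasted_tb=120.0, wasted_core_hours=0.0),
    Issue("task with no downstream readers", penalty=10, wasted_tb=5.0, wasted_core_hours=2000.0),
]
print("health score:", health_score(issues))
print("potential monthly saving:", monthly_saving(issues))
```

Showing both numbers side by side is the point of the design: the score tells a user how healthy their assets are, while the monetary figure makes the benefit of acting on an issue concrete.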

The platform addresses governance from several angles:

Cross‑verification of multiple data sources (HDFS audit logs, Hive audit logs, HDFS metadata, lineage data) to avoid false positives.

Multi‑stage validation that aggregates diagnostic results over consecutive days to filter out abnormal spikes.

Real‑time job validation that adds a second check for tasks selected for governance, mitigating delays from offline models.

Reversible operations with automatic backups, enabling one‑click rollback.

Governance mechanisms such as dedicated data‑management teams and clear role responsibilities.

Clear, hierarchical goals broken down by business unit, department, quarter, and month, tracked through regular meetings.

Incentive and penalty mechanisms to encourage good governance practices.
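The cross‑verification and multi‑stage validation ideas in the list above can be sketched as follows. This is a minimal illustration, not the platform’s implementation: the source names, the shape of the input (a per‑day set of accessed tables for each source), and the default window length are assumptions.

```python
from datetime import date, timedelta


def is_unused(table, day, sources):
    """A table counts as unused on a day only if every source confirms it."""
    for accessed_by_day in sources.values():
        if day not in accessed_by_day:
            # The log for this day is missing: we cannot confirm anything,
            # so stay conservative and treat the table as possibly in use.
            return False
        if table in accessed_by_day[day]:
            return False
    return True


def governance_candidates(tables, sources, end_day, window_days=7):
    """Keep only tables that were unused on every day of the window."""
    days = [end_day - timedelta(days=i) for i in range(window_days)]
    return [t for t in tables if all(is_unused(t, d, sources) for d in days)]


end_day = date(2024, 5, 3)
days = [end_day - timedelta(days=i) for i in range(3)]
sources = {
    "hive_audit": {d: set() for d in days},
    "hdfs_audit": {d: set() for d in days},
}
sources["hdfs_audit"][days[1]].add("b")  # job that bypassed Hive and read HDFS directly
sources["hive_audit"][days[0]].add("c")  # ordinary Hive access

candidates = governance_candidates(["a", "b", "c"], sources, end_day, window_days=3)
print(candidates)
```

Table `b` is the interesting case: it looks idle in the Hive audit log alone, and only the second source reveals that it is still being read.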

The current system covers cost, stability, security, and quality across dozens of governance items. Examples include lifecycle management for tables based on actual access patterns, dependency‑missing checks to prevent task failures, security‑level tagging, and metadata quality monitoring.

02 Key Technologies

Audit Logs

Audit logs record who accessed which data, when, where, and how, and they form the foundation of security governance. To identify downstream usage, task IDs must be captured, and responsible owners must be traceable. JD.com customized the underlying APIs to attach task source and ID information, and parsed the content of raw metastore logs to distinguish read, write, and DDL operations. By combining Hive audit logs (table‑level) with HDFS audit logs (partition‑level), JD.com can infer partition access frequency and recommend appropriate data lifecycles.
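One way to turn partition‑level access records into a lifecycle recommendation is sketched below. The input format (pairs of partition date and access date), the coverage percentage, and the safety margin are assumptions for illustration; the talk does not specify how JD.com computes its recommendations.

```python
import math
from datetime import date


def recommend_retention_days(accesses, coverage_pct=100, safety_margin_days=7):
    """Recommend a retention period from (partition_date, access_date) pairs.

    The age of a partition at access time is access_date - partition_date.
    We keep enough days to cover `coverage_pct` percent of observed accesses
    (nearest-rank percentile) and add a safety margin on top.
    """
    ages = sorted((access - part).days for part, access in accesses if access >= part)
    if not ages:
        return None  # no evidence either way; leave the decision to a human
    rank = max(1, math.ceil(coverage_pct * len(ages) / 100))
    return ages[rank - 1] + safety_margin_days


accesses = [
    (date(2024, 5, 1), date(2024, 5, 1)),
    (date(2024, 5, 1), date(2024, 5, 2)),
    (date(2024, 4, 30), date(2024, 5, 2)),
    (date(2024, 4, 29), date(2024, 5, 2)),
    (date(2024, 1, 1), date(2024, 5, 2)),  # a single read of a very old partition
]
print(recommend_retention_days(accesses, coverage_pct=80))
print(recommend_retention_days(accesses, coverage_pct=100))
```

Lowering the coverage trades a small risk of deleting a rarely read partition for a much shorter retention period, which is why a reversible delete with backup matters.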

Full‑Link Lineage

The platform builds a complete data flow graph covering JDQ (Kafka‑based messaging), JRC (Flink‑based real‑time processing), DTS (data integration), and plumber in/out (data import/export). This graph enables impact analysis, link optimization, and operator‑level lineage. While table‑level lineage is useful, operator‑level (field‑level) lineage provides finer granularity, distinguishing direct field references from transformed expressions, which is essential for tasks such as duplicate storage detection.

Operator‑level lineage is derived from the optimized logical execution plan, captured through a Hive hook, and presented to users in a raw but readable form.
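To show why the distinction between direct references and transformed expressions matters for duplicate‑storage detection, here is a small sketch. The edge format, the `direct` and `transform` labels, and the table names are hypothetical; they stand in for whatever the real lineage service emits.

```python
from collections import defaultdict


def duplicate_storage_candidates(edges):
    """Find target tables whose fields are all direct copies from one source table.

    edges: iterable of (source "table.field", target "table.field", kind),
    where kind is "direct" for a plain column reference and "transform"
    for any computed expression.
    """
    sources_by_target = defaultdict(set)  # target table -> source tables feeding it
    transformed = set()                   # target tables with at least one computed field
    for src, dst, kind in edges:
        src_table = src.rsplit(".", 1)[0]
        dst_table = dst.rsplit(".", 1)[0]
        sources_by_target[dst_table].add(src_table)
        if kind != "direct":
            transformed.add(dst_table)
    return {
        dst: next(iter(srcs))
        for dst, srcs in sources_by_target.items()
        if len(srcs) == 1 and dst not in transformed
    }


edges = [
    ("ods_orders.order_id", "dwd_orders_copy.order_id", "direct"),
    ("ods_orders.amount", "dwd_orders_copy.amount", "direct"),
    ("ods_orders.order_id", "dwd_orders.order_id", "direct"),
    ("ods_orders.amount", "dwd_orders.amount_cny", "transform"),
]
duplicates = duplicate_storage_candidates(edges)
print(duplicates)
```

Table‑level lineage alone would report both targets as depending on `ods_orders`; only the field‑level labels separate the pure copy from the table that adds value.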

Standard Field Concept

To simplify data discovery, JD.com introduced “standard fields” as abstract representations of business fields. Metadata such as calculation logic, usage instructions, enumeration values, and format constraints is attached to each standard field. Once standard fields are linked to physical table fields, users can locate data by field semantics rather than by table name alone. The system automatically validates real data against these definitions and reports anomalies without manual configuration.
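A standard field with attached constraints, bound to physical columns and used for validation, might look like the sketch below. The `StandardField` class, the field definitions, and the column names are illustrative assumptions rather than JD.com’s actual metadata model.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class StandardField:
    name: str
    description: str
    pattern: Optional[str] = None                    # format constraint as a regex
    allowed_values: Optional[FrozenSet[str]] = None  # enumeration values


def find_violations(std, values):
    """Return the sampled values that break the standard field's definition."""
    bad = []
    for value in values:
        if std.allowed_values is not None and value not in std.allowed_values:
            bad.append(value)
        elif std.pattern is not None and re.fullmatch(std.pattern, value) is None:
            bad.append(value)
    return bad


ORDER_STATUS = StandardField(
    "order_status",
    "Lifecycle state of an order",
    allowed_values=frozenset({"CREATED", "PAID", "SHIPPED"}),
)
MOBILE_PHONE = StandardField("mobile_phone", "11-digit mobile number", pattern=r"\d{11}")

# Binding physical columns to standard fields is what enables semantic search
# and lets validation rules follow the field instead of being set up per table.
bindings = {"dwd_orders.status": ORDER_STATUS, "dim_user.phone": MOBILE_PHONE}
samples = {
    "dwd_orders.status": ["PAID", "paid", "SHIPPED"],
    "dim_user.phone": ["13800000000", "1380000"],
}
anomalies = {col: find_violations(bindings[col], vals) for col, vals in samples.items()}
print(anomalies)
```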

03 From “Throttling” to “Open Resource Utilization”

Beyond cost‑saving (“throttling”), JD.com pursues “open” strategies to maximize resource utilization without additional expense. Three main techniques are employed:

Resource Mixing: During peak online‑service periods (e.g., Double‑11, 618), online resources are scarce, while offline batch workloads run at 70‑80% utilization year‑round. After peak hours, online resources can be borrowed by offline jobs, and vice versa during off‑peak nights.

Task Staggering: Millions of daily jobs exhibit temporal clustering (e.g., 30% of jobs run between 03:00‑05:00). By predicting queue load and task importance, JD.com dynamically adjusts execution windows to smooth demand, improving both utilization and latency.

Cross‑Datacenter Scheduling: To achieve high availability, JD.com operates a “two‑site‑three‑center” architecture. By dynamically relocating tasks to less‑busy datacenters, compute and storage loads are balanced, while accounting for inter‑datacenter bandwidth and storage constraints.
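The task‑staggering idea can be illustrated with a simple greedy scheduler. This is a sketch under stated assumptions: the hourly capacity, the task fields (`priority`, `preferred_hour`, `latest_hour`, `cores`), and the greedy strategy itself are chosen for the example, and the load prediction step mentioned above is taken as given input rather than modeled.

```python
from collections import defaultdict


def stagger(tasks, capacity_per_hour):
    """Assign each task a start hour, most important tasks first.

    A task runs in the earliest hour of its allowed window that still has
    room. If no hour has room, it goes to the least-loaded hour in the window.
    """
    load = defaultdict(int)  # hour -> cores already scheduled
    schedule = {}
    for task in sorted(tasks, key=lambda t: -t["priority"]):
        window = range(task["preferred_hour"], task["latest_hour"] + 1)
        hour = next(
            (h for h in window if load[h] + task["cores"] <= capacity_per_hour),
            None,
        )
        if hour is None:
            hour = min(window, key=lambda h: load[h])
        load[hour] += task["cores"]
        schedule[task["name"]] = hour
    return schedule


tasks = [
    {"name": "backfill", "priority": 1, "preferred_hour": 3, "latest_hour": 6, "cores": 50},
    {"name": "report", "priority": 3, "preferred_hour": 3, "latest_hour": 3, "cores": 80},
    {"name": "features", "priority": 2, "preferred_hour": 3, "latest_hour": 5, "cores": 60},
]
schedule = stagger(tasks, capacity_per_hour=100)
print(schedule)
```

All three tasks ask for 03:00, but only the highest‑priority one with no slack keeps that slot; the others move later inside their own deadlines, which flattens the peak.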

Key enablers for these techniques include compute‑storage separation, containerization of offline services, and multi‑layer resource isolation (CPU, network, etc.).
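For the cross‑datacenter case, a placement decision that respects bandwidth and storage limits could be sketched like this. The datacenter names, the capacity fields, and the per‑link transfer budget are hypothetical; a real scheduler would work from live metrics and a richer cost model.

```python
def pick_datacenter(task, datacenters, link_budget_gb):
    """Return the feasible datacenter with the most free cores, or None."""
    feasible = []
    for name, dc in datacenters.items():
        if dc["free_cores"] < task["cores"]:
            continue
        if name != task["data_dc"]:
            # Running remotely means moving the input data first, so the link
            # and the destination storage must both be able to absorb it.
            budget = link_budget_gb.get((task["data_dc"], name), 0)
            if budget < task["input_gb"] or dc["free_storage_gb"] < task["input_gb"]:
                continue
        feasible.append(name)
    if not feasible:
        return None
    # Prefer more free cores; on a tie, prefer the site that already holds the data.
    return max(
        feasible,
        key=lambda n: (datacenters[n]["free_cores"], n == task["data_dc"]),
    )


datacenters = {
    "dc_a": {"free_cores": 10, "free_storage_gb": 5000},
    "dc_b": {"free_cores": 200, "free_storage_gb": 1000},
    "dc_c": {"free_cores": 400, "free_storage_gb": 100},
}
link_budget_gb = {("dc_a", "dc_b"): 1000, ("dc_a", "dc_c"): 1000}
task = {"name": "daily_report", "cores": 50, "input_gb": 500, "data_dc": "dc_a"}

chosen = pick_datacenter(task, datacenters, link_budget_gb)
print(chosen)
```

Here `dc_c` has the most idle compute but cannot hold the input data, so the task lands on `dc_b`, which shows why load alone is not a sufficient placement signal.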

04 Future Outlook

JD.com plans to advance data governance in three directions:

Real‑time detection and remediation, moving from offline models to proactive, pre‑deployment interception.

Intelligent governance, leveraging AI to improve problem identification accuracy.

Full automation, aiming for a managed, unattended governance model.

The presentation concluded with references to related technical articles and a call to join JD Retail’s technical community.
