
How JD’s Data Lake Uses Hudi LSM‑Tree to Power Near‑Real‑Time Data Assets

The article details JD’s data lake architecture, its 500 PB scale, self‑developed Hudi extensions—including LSM‑Tree‑based MoR tables, custom indexing, IO optimizations, Flink stream scheduling, and NativeIO SDK—along with benchmarks, community contributions, and future roadmap for real‑time big‑data processing.

JD Retail Technology

Data Lake Overview

JD’s data lake stores more than 500 PB of ingested data on distributed storage systems such as HDFS and ChubaoFS. It relies on Apache Hudi 0.13.1 for high‑performance read/write and file organization, ingests streams, binlogs, Hive tables and other sources, and covers the full table lifecycle: creation, ingestion, development, scheduling, query, quality control and table‑to‑Hive replacement.

Self‑Developed Technical Features

Organization Protocol Layer

Partition‑Level Bucket Index: JD adds a bucket index per partition to limit the number of buckets, mitigating data skew and reducing the list‑operation cost during log‑file writes.
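
As a rough illustration, a hash‑based bucket index scoped to a partition routes a record key to a stable bucket without listing files; the class and field names below are hypothetical, not JD's or Hudi's API.

```java
import java.util.Map;

public class PartitionBucketIndex {
    // Hypothetical per-partition bucket counts, tuned to partition volume.
    private final Map<String, Integer> bucketsPerPartition;
    private final int defaultBuckets;

    PartitionBucketIndex(Map<String, Integer> bucketsPerPartition, int defaultBuckets) {
        this.bucketsPerPartition = bucketsPerPartition;
        this.defaultBuckets = defaultBuckets;
    }

    // A record key always lands in the same bucket of its partition,
    // so writers can locate the target file group without a list call.
    int bucketFor(String partition, String recordKey) {
        int n = bucketsPerPartition.getOrDefault(partition, defaultBuckets);
        return (recordKey.hashCode() & Integer.MAX_VALUE) % n;
    }
}
```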

Foreign‑Key Index: A custom index that maps foreign‑key values to primary‑key values, enabling stream‑based foreign‑key joins without materializing large state.

Multiple‑Ts Merge Engine: Extends the payload‑based merge logic with multi‑field (Multiple‑Ts) support and enhances Partial‑Update handling.
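
A minimal sketch of what a multi‑field merge might look like, assuming each field carries its own ordering timestamp and newer values win; the types and layout are illustrative, not JD's engine.

```java
import java.util.HashMap;
import java.util.Map;

public class MultiTsMerge {
    static final class FieldValue {
        final Object value;
        final long ts; // per-field ordering value, not one table-wide timestamp
        FieldValue(Object value, long ts) { this.value = value; this.ts = ts; }
    }

    // Partial update: only fields whose ordering value is newer overwrite,
    // so two writers touching different fields never clobber each other.
    static Map<String, FieldValue> merge(Map<String, FieldValue> older,
                                         Map<String, FieldValue> newer) {
        Map<String, FieldValue> out = new HashMap<>(older);
        newer.forEach((field, fv) -> out.merge(field, fv,
                (oldFv, newFv) -> newFv.ts >= oldFv.ts ? newFv : oldFv));
        return out;
    }
}
```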

LSM‑Tree‑Based MoR Format: Replaces the community Avro log files with an LSM‑Tree organization where all log files are stored as Parquet. This yields 2‑10× read/write speed improvements.

Lock‑Free Concurrency Control: The LSM‑Tree protocol provides lock‑free concurrent updates, avoiding OCC/MVCC overhead.

Hybrid File Layout: Data is written first to a fast‑access buffer layer and later merged into the persistent layer (e.g., HDFS or ChubaoFS), presenting a unified view to readers.
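
Conceptually, the layout behaves like the sketch below: writes land in the buffer layer, a background merge moves them into the persistent layer, and reads consult the buffer first so both layers appear as one table. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HybridView {
    private final Map<String, String> bufferLayer = new LinkedHashMap<>();
    private final Map<String, String> persistentLayer = new LinkedHashMap<>();

    void write(String key, String value) {
        bufferLayer.put(key, value); // fast path: the buffer absorbs new writes
    }

    void merge() {
        persistentLayer.putAll(bufferLayer); // background merge into durable storage
        bufferLayer.clear();
    }

    String read(String key) {
        // Unified view: the newer buffer layer shadows the persistent layer.
        String v = bufferLayer.get(key);
        return v != null ? v : persistentLayer.get(key);
    }
}
```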

Incremental Table Service: Table‑service logic is incremental, preventing latency growth as partition count increases.

IO Transmission Layer

Serialization overhead is reduced by copying binary streams directly during Clustering, bypassing row‑column conversion via engine‑native formats, and applying ZSTD compression to lower CPU cost.
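
For the compression step, here is a sketch using the zstd-jni bindings; the article does not name the library JD actually uses, so treat this as one plausible way to apply ZSTD to a copied binary stream.

```java
import com.github.luben.zstd.Zstd;
import java.nio.charset.StandardCharsets;

public class ZstdExample {
    public static void main(String[] args) {
        // Stand-in for a binary stream copied without row-column conversion.
        byte[] raw = "raw binary block".getBytes(StandardCharsets.UTF_8);

        byte[] packed = Zstd.compress(raw);                   // cheap on CPU
        byte[] restored = Zstd.decompress(packed, raw.length); // needs original size

        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```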

LSM‑Tree Design Principles

Each FileGroup is organised as a two‑level LSM‑Tree. New writes go to L0; a fast Minor Compaction merges small L0 files, while a periodic Major Compaction consolidates all files into a single L1 file, reducing merge frequency.

Minor Compaction: Quickly merges small L0 files to control file count.

Major Compaction: Periodically creates a single L1 file for global consistency.
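
A toy planner illustrating the two policies; the L0 file‑count threshold and the major‑compaction trigger below are invented for illustration, not JD's actual heuristics.

```java
import java.util.List;

public class CompactionPlanner {
    enum Plan { NONE, MINOR, MAJOR }

    static Plan plan(List<String> l0Files, boolean majorIntervalElapsed) {
        if (majorIntervalElapsed) return Plan.MAJOR; // consolidate all files into one L1 file
        if (l0Files.size() > 8) return Plan.MINOR;   // bound the L0 file count (threshold assumed)
        return Plan.NONE;
    }
}
```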

Read Path Enhancements

A state‑machine‑driven variant of the loser tree performs the multi‑way merge. It avoids duplicate tree adjustments, batches identical keys and eliminates deep copies, delivering roughly a 15% read‑performance gain.
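
The key‑batching idea can be sketched as below. A PriorityQueue stands in for the state‑machine‑driven loser tree, so this shows the batching effect only, not JD's actual merge structure.

```java
import java.util.*;

public class MultiWayMerge {
    record Entry(String key, String value, Iterator<Map.Entry<String, String>> src) {}

    static void advance(PriorityQueue<Entry> heap, Entry e) {
        if (e.src().hasNext()) {
            Map.Entry<String, String> n = e.src().next();
            heap.add(new Entry(n.getKey(), n.getValue(), e.src()));
        }
    }

    static List<String> merge(List<SortedMap<String, String>> sortedRuns) {
        PriorityQueue<Entry> heap = new PriorityQueue<>(Comparator.comparing(Entry::key));
        for (SortedMap<String, String> run : sortedRuns) {
            Iterator<Map.Entry<String, String>> it = run.entrySet().iterator();
            if (it.hasNext()) {
                Map.Entry<String, String> first = it.next();
                heap.add(new Entry(first.getKey(), first.getValue(), it));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Entry top = heap.poll();
            String key = top.key();
            String merged = top.value();
            advance(heap, top);
            // Batch all runs positioned on the same key so duplicates are
            // merged in one pass instead of one heap adjustment per record.
            while (!heap.isEmpty() && heap.peek().key().equals(key)) {
                Entry dup = heap.poll();
                merged = dup.value(); // last duplicate wins here, purely illustrative
                advance(heap, dup);
            }
            out.add(key + "=" + merged);
        }
        return out;
    }
}
```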

Write Path Design

Flink streaming writes consist of four stages: repartition, sorting, deduplication and IO. A Remote Partitioner balances bucket distribution across tasks, and a Disruptor‑based asynchronous ring buffer decouples production from consumption, improving throughput.
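
A minimal sketch of that decoupling with the LMAX Disruptor library: the producer publishes records to the ring buffer while the IO stage drains them on its own thread. The event type and IO stub are assumptions, not JD's writer code.

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class WriteBufferSketch {
    // Mutable event reused by the ring buffer to avoid per-record allocation.
    static final class RecordEvent {
        byte[] payload;
    }

    public static void main(String[] args) {
        Disruptor<RecordEvent> disruptor = new Disruptor<>(
                RecordEvent::new,
                1024, // ring size, must be a power of two
                DaemonThreadFactory.INSTANCE);

        // Consumer side: the IO stage drains events asynchronously.
        disruptor.handleEventsWith((event, sequence, endOfBatch) -> {
            // flushToLog(event.payload); // hypothetical IO call
        });
        RingBuffer<RecordEvent> ring = disruptor.start();

        // Producer side: upstream stages publish without blocking on IO.
        byte[] record = "k1,v1".getBytes();
        ring.publishEvent((event, seq, src) -> event.payload = src, record);

        disruptor.shutdown();
    }
}
```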

Compaction Optimizations

Incremental Compaction schedules only partitions that received new data, tracks missing partitions and instants, and skips unnecessary scans. Flink stream scheduling moves compaction logic into a dedicated TaskManager operator, removing the JobManager bottleneck.
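
A sketch of that incremental planning, assuming the scheduler tracks the last compacted instant per partition; the map‑based bookkeeping is illustrative, not Hudi's timeline API.

```java
import java.util.*;

public class IncrementalCompactionPlanner {
    private final Map<String, Long> lastCompactedInstant = new HashMap<>();

    // Schedule only partitions whose newest write instant is newer than
    // their last compaction, so planning cost stops growing with the
    // total partition count.
    List<String> plan(Map<String, Long> latestWriteInstant) {
        List<String> toCompact = new ArrayList<>();
        latestWriteInstant.forEach((partition, instant) -> {
            if (instant > lastCompactedInstant.getOrDefault(partition, Long.MIN_VALUE)) {
                toCompact.add(partition);
            }
        });
        return toCompact;
    }

    void markCompacted(String partition, long instant) {
        lastCompactedInstant.put(partition, instant);
    }
}
```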

Benchmark

Evaluations on TPC‑DS and Nexmark compare three table formats: JD‑Hudi MoR‑LSM, JD‑Hudi MoR‑Avro (baseline) and Paimon PK tables. JD‑Hudi MoR‑LSM shows lower task execution time and faster data consumption, confirming the effectiveness of the LSM‑Tree and compaction improvements.

Partial‑Update Foreign‑Key Join

SKU‑wide tables are updated via a foreign‑key index that provides fast primary‑key lookup for non‑primary‑key joins. This enables real‑time partial updates without storing massive state.
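
Put together, the join step might look like the sketch below: an event keyed by a foreign key is resolved to its SKU primary key and re‑emitted as a partial update, with no wide join state held. Every name here is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class ForeignKeyJoin {
    // Foreign-key value -> primary-key value, maintained by the index.
    private final Map<String, String> fkToPk = new HashMap<>();

    void indexRecord(String fkValue, String pkValue) {
        fkToPk.put(fkValue, pkValue);
    }

    // Returns a partial update addressed by primary key, or null if the
    // foreign key is not (yet) indexed.
    Map.Entry<String, String> toPartialUpdate(String fkValue, String payload) {
        String pk = fkToPk.get(fkValue);
        return pk == null ? null : Map.entry(pk, payload);
    }
}
```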

Data Lake + AI: Hudi NativeIO SDK

The NativeIO SDK bypasses Spark‑style serialization, reads Parquet files directly from Hudi lake tables, and reduces I/O amplification. In sample‑read benchmarks it achieves roughly a 2× speedup over Spark vectorized reads, facilitating efficient training‑engine data access.
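
To illustrate reading Parquet directly without a Spark session in the path, here is a sketch using the standard parquet-mr reader rather than the NativeIO SDK itself; it demonstrates only the direct‑access pattern, not JD's implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class DirectParquetScan {
    public static void main(String[] args) throws Exception {
        Path file = new Path(args[0]); // a Parquet file from a Hudi lake table
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(file, new Configuration()))) {
            PageReadStore rowGroup;
            while ((rowGroup = reader.readNextRowGroup()) != null) {
                // Hand the raw pages to the training engine with no
                // Spark-style row<->column conversion in between.
                System.out.println("rows in group: " + rowGroup.getRowCount());
            }
        }
    }
}
```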

Business Practice

The traffic data warehouse was upgraded to a unified real‑time lake. The custom LSM‑Tree format, remote partitioner and foreign‑key index solved bucket skew, compaction bottlenecks and SKU consistency problems, delivering stable high‑throughput ingestion and low‑latency queries.

Community Contributions & Future Plans

JD’s Hudi team has contributed 109 pull requests, including Incremental Table Service, Partition‑Level Bucket Index and Spark 3.0 integration. The team holds PMC and Committer roles in the Apache Hudi project.

Future work includes:

Adopting the latest Apache Hudi releases and back‑porting JD‑specific enhancements.

Adding multimodal storage and vector indexes to support AI workloads.

Developing a Rust + Arrow based NativeIO layer.

Exploring unified lake‑stream architectures.

Figures: Data Lake Overview · Capability Panorama · Benchmark Results · Foreign‑Key Index Flow · NativeIO SDK Architecture
Tags: big data · real-time processing · LSM‑Tree · data lake · Hudi
Written by JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.