
How JD.com Transformed Its Traffic Data Pipeline from Lambda to a Lakehouse Architecture

This article examines JD.com's migration of its massive traffic data processing from a dual Lambda architecture to an integrated lakehouse solution, detailing the challenges, innovative optimizations with Flink and Hudi, performance gains, cost reductions, and future directions for real‑time data handling.

JD Retail Technology

Background and Pain Points

JD.com originally used a Lambda architecture with separate offline and real‑time processing pipelines. Real‑time data was collected, parsed, and sent to a message queue before being processed by Flink, while offline data was stored in object storage and processed with Hive. This dual‑chain design caused duplicated compute resources, high storage costs, and data inconsistency between real‑time streams and offline tables.

Challenges and Optimizations

Multi‑IO Capability Optimization: JD.com introduced a custom multi‑IO layer that abstracts storage, allowing lake tables to use a buffer layer (HDFS) and a persistent layer (Hudi). This design improved write throughput by 104%, reduced checkpoint latency by 95%, and stabilized jobs, cutting failures by 97%.
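
A minimal sketch of this buffered layering is shown below, assuming hypothetical BufferStore and LakeStore interfaces (the names are illustrative, not JD.com's actual classes): records land in a fast buffer on the hot path, and the accumulated batch is committed to the lake table only at checkpoint time.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of a multi-IO abstraction; BufferStore, LakeStore and
// BufferedLakeWriter are illustrative names, not JD.com's actual classes.
public class BufferedLakeWriter {

    /** Fast buffer layer, e.g. plain HDFS files: cheap appends, durable on flush. */
    public interface BufferStore {
        void append(byte[] record) throws IOException;
        void flush() throws IOException;
    }

    /** Persistent lake layer, e.g. a Hudi table: batched commits, queryable. */
    public interface LakeStore {
        void commit(List<byte[]> batch) throws IOException;
    }

    private final BufferStore buffer;
    private final LakeStore lake;
    private final List<byte[]> pending = new ArrayList<>();

    public BufferedLakeWriter(BufferStore buffer, LakeStore lake) {
        this.buffer = buffer;
        this.lake = lake;
    }

    // Hot path: records only touch the buffer layer, keeping write latency low.
    public void write(byte[] record) throws IOException {
        buffer.append(record);
        pending.add(record);
    }

    // Checkpoint path: flush the buffer, then commit the accumulated batch
    // to the lake table in one go, which keeps checkpoints short.
    public void onCheckpoint() throws IOException {
        buffer.flush();
        lake.commit(new ArrayList<>(pending));
        pending.clear();
    }
}
```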

Dynamic Partition Strategy: To address severe data skew (up to 730×), a custom Partitioner routes large partitions to dedicated subtasks, cutting the number of files per commit from ~108,000 to 6,000 (94% reduction) and more than doubling write performance.
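
A skew‑aware Flink Partitioner could look roughly like the sketch below; the hot‑partition key and the number of reserved subtasks are assumptions for illustration, not JD.com's real configuration.

```java
import org.apache.flink.api.common.functions.Partitioner;

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative skew-aware partitioner: hot partitions get a dedicated block of
// subtasks, everything else hashes into the remaining ones.
public class SkewAwarePartitioner implements Partitioner<String> {

    // partition value -> number of subtasks reserved for it (hypothetical weights)
    private final Map<String, Integer> hotPartitionSlots = Map.of("dt=2024-11-11", 8);

    @Override
    public int partition(String partitionValue, int numPartitions) {
        Integer slots = hotPartitionSlots.get(partitionValue);
        if (slots != null) {
            // spread the hot partition's records across its reserved subtasks
            return ThreadLocalRandom.current().nextInt(slots);
        }
        // normal partitions hash into the subtasks that are not reserved
        int remaining = Math.max(1, numPartitions - reservedSlots());
        return reservedSlots() + Math.abs(partitionValue.hashCode() % remaining);
    }

    private int reservedSlots() {
        return hotPartitionSlots.values().stream().mapToInt(Integer::intValue).sum();
    }
}
```

Such a partitioner would be wired in with DataStream#partitionCustom, keyed by the record's partition field, so that the skewed traffic no longer lands on a single writer subtask.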

Concurrent Append Writes: By splitting large partitions into separate Flink append tasks and managing per‑task metadata, JD.com achieved lightweight concurrent appends, eliminating checkpoint conflicts and improving real‑time write performance.
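
One way to picture the per‑task metadata is each append writer owning its own file prefix and sequence counter, so two concurrent tasks writing the same partition never contend on the same file or commit entry. The class below is a sketch under that assumption; it is not a Hudi or Flink API.

```java
// Hypothetical per-task append metadata; names and the path layout are illustrative only.
public class AppendTaskMeta {

    private final String tableBasePath;
    private final int writerId;   // unique per Flink append task
    private long sequence = 0L;   // monotonically increasing within this writer

    public AppendTaskMeta(String tableBasePath, int writerId) {
        this.tableBasePath = tableBasePath;
        this.writerId = writerId;
    }

    // Each writer emits files under its own prefix, so checkpoints of different
    // append tasks commit disjoint file sets and never conflict.
    public String nextFilePath(String partition) {
        return String.format("%s/%s/writer-%d/part-%06d.parquet",
                tableBasePath, partition, writerId, sequence++);
    }
}
```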

Deserialization Optimization: JSON payloads carrying 120 extra fields drove up row‑level deserialization cost. JD.com moved deserialization into a dedicated lookup table operator and used Kafka headers to filter out unneeded fields before parsing, dramatically lowering overhead.
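
A hedged sketch of header‑based filtering with Flink's Kafka source is below; the header name ("event_type") and the accepted value are assumptions, and the heavy JSON parsing is deferred to a downstream operator that extracts only the fields it needs.

```java
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

import java.nio.charset.StandardCharsets;

// Sketch: decide from a Kafka header whether a record is needed at all,
// and emit the raw bytes without parsing the JSON body here.
public class HeaderFilteringSchema implements KafkaRecordDeserializationSchema<byte[]> {

    @Override
    public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<byte[]> out) {
        Header type = record.headers().lastHeader("event_type"); // assumed header name
        if (type != null && "exposure".equals(new String(type.value(), StandardCharsets.UTF_8))) {
            // keep only the records downstream cares about; full deserialization
            // happens later in a dedicated operator
            out.collect(record.value());
        }
    }

    @Override
    public TypeInformation<byte[]> getProducedType() {
        return TypeInformation.of(byte[].class);
    }
}
```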

Batch‑to‑Stream Scheduling: A flow‑batch mechanism monitors lake table watermarks; when the minimum partition watermark passes the T‑1 threshold and clustering is complete, downstream batch jobs are triggered, ensuring accurate and timely offline processing.
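
The trigger condition could look roughly like the following; the per‑partition watermarks, the clustering‑complete flag, and the T‑1 threshold computation are assumptions for illustration rather than JD.com's actual scheduler interface.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.Collection;

// Sketch of the flow-batch readiness check: every partition watermark must have
// passed the end of day T-1 (i.e. reached today's midnight) and clustering must
// be finished before downstream batch jobs are kicked off.
public class FlowBatchTrigger {

    public boolean readyForBatch(Collection<LocalDateTime> partitionWatermarks,
                                 boolean clusteringDone) {
        LocalDateTime threshold = LocalDate.now().atStartOfDay(); // T-1 boundary
        boolean allPartitionsPassed = partitionWatermarks.stream()
                .allMatch(wm -> !wm.isBefore(threshold));
        return allPartitionsPassed && clusteringDone;
    }
}
```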

Results and Impact

Data latency improved from T+1 to a 15‑minute SLA, moving readiness from 02:30 to 00:30. Storage usage dropped from 120 TB to 50 TB, cutting costs by 58.3% and saving over ¥2 million annually. During the Double‑11 promotion, traffic exposure models grew 80% YoY, with the new architecture delivering stable, zero‑delay processing and advancing data readiness by 5 hours.

Future Outlook

JD.com plans to scale large‑scale stream reads in production, enhance metric monitoring, and achieve sub‑second Hudi processing by leveraging high‑speed storage (Kafka, HBase, Redis) alongside lake table metadata.

data engineering · big data · real-time processing · Flink · lakehouse · Hudi
Written by JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.