
JD Retail Data Lake Architecture: Challenges, Optimizations, and Future Plans

This article presents JD Retail's data lake architecture overhaul, detailing the shortcomings of the Lambda model, the migration to Flink‑Hudi‑Spark pipelines, performance gains, storage savings, unified APIs, and upcoming improvements for resilience and automation.

JD Retail Technology

Background and Pain Points

JD Retail originally used a Lambda architecture that ensured data completeness but suffered from high complexity, duplicated systems, and latency issues, especially when reconciling real‑time and batch data.

Iterative Optimizations

1. Architecture Changes – Real‑time topics from production databases were switched from CFS to direct Kafka topics; offline MapReduce jobs were replaced with Flink streaming jobs; Flink writes were directed to Hudi tables, enabling incremental processing, indexing, and transactional guarantees.
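The key shift here is from periodic full recomputation to applying change events incrementally against a keyed table. A minimal sketch of that upsert path, with illustrative data and a simplified table (a plain dict keyed by primary key, standing in for a Hudi table's record index):

```python
# Sketch of incremental change application: binlog-style events are
# upserted/deleted against a primary-key-indexed table, instead of
# rebuilding a full snapshot each batch cycle. All names are illustrative.

def apply_changes(table: dict, events: list) -> dict:
    """Apply insert/update/delete change events to a keyed table."""
    for ev in events:
        if ev["op"] == "delete":
            table.pop(ev["pk"], None)
        else:  # insert or update both overwrite the row for that key
            table[ev["pk"]] = ev["row"]
    return table

table = {}
batch1 = [
    {"op": "insert", "pk": 1, "row": {"sku": "A", "qty": 5}},
    {"op": "insert", "pk": 2, "row": {"sku": "B", "qty": 3}},
]
batch2 = [
    {"op": "update", "pk": 1, "row": {"sku": "A", "qty": 7}},
    {"op": "delete", "pk": 2},
]
apply_changes(table, batch1)
apply_changes(table, batch2)
print(table)  # {1: {'sku': 'A', 'qty': 7}}
```

In the real pipeline, Flink consumes the Kafka change stream and Hudi supplies the indexing and transactional guarantees that this toy dict glosses over.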

2. Multi‑Stream Merging – Various business streams (self‑operated, POP, book, etc.) are ingested via binlog, transformed into Hudi BDM tables, and then merged into unified GDM/RDDM models using partitioned, MOR, and bucketed storage to improve performance and reduce small files.
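Two ideas from this step can be sketched compactly: merging partial records from several business streams into one wide row per key, and routing each key to a fixed bucket so its writes always land in the same file group (which is what keeps small files in check). The stream names and fields below are invented for illustration:

```python
# Illustrative sketch: merge several business streams into one wide record
# per key, and assign each key a stable bucket, as bucketed storage does.
import zlib

NUM_BUCKETS = 4

def bucket_of(key: str) -> int:
    # Stable hash so a given key always maps to the same bucket/file group.
    return zlib.crc32(key.encode()) % NUM_BUCKETS

def merge_streams(*streams):
    """Merge partial records from multiple streams into one row per key."""
    merged = {}
    for stream in streams:
        for rec in stream:
            merged.setdefault(rec["sku"], {"sku": rec["sku"]}).update(rec)
    return merged

self_operated = [{"sku": "S1", "price": 99}]
pop = [{"sku": "S1", "vendor": "shopX"}, {"sku": "S2", "vendor": "shopY"}]
wide = merge_streams(self_operated, pop)
print(wide["S1"])  # {'sku': 'S1', 'price': 99, 'vendor': 'shopX'}
```

In the actual architecture this merge happens across Hudi BDM tables into GDM/RDDM models, with partitioning and MOR storage handling the write amplification that a naive wide-table rebuild would incur.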

3. Cost Reduction – Resource reuse across tables, automated DMS table creation, and unified monitoring lower both infrastructure and operational expenses.

4. Data Consistency – Primary‑key hash partitioning preserves order; Hudi's heartbeat and pre‑combine mechanisms ensure data integrity and timely updates.
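Both mechanisms are easy to show in miniature: hash the primary key to pick a partition (so all updates to one key travel through the same partition, in order), and pre-combine duplicates by keeping the record with the largest ordering field. A sketch under those assumptions, with `ts` as a stand-in for the pre-combine field:

```python
# Sketch of the two consistency mechanisms: primary-key hash partitioning
# (per-key ordering) and pre-combine (keep the latest version per key).
import zlib

def partition_for(pk: str, num_partitions: int = 8) -> int:
    # Same key -> same partition, so updates to one key stay ordered.
    return zlib.crc32(pk.encode()) % num_partitions

def pre_combine(records):
    """Keep only the record with the largest `ts` per primary key."""
    latest = {}
    for rec in records:
        cur = latest.get(rec["pk"])
        if cur is None or rec["ts"] > cur["ts"]:
            latest[rec["pk"]] = rec
    return list(latest.values())

dupes = [
    {"pk": "A", "ts": 1, "qty": 5},
    {"pk": "A", "ts": 3, "qty": 9},
    {"pk": "A", "ts": 2, "qty": 7},
]
print(pre_combine(dupes))  # [{'pk': 'A', 'ts': 3, 'qty': 9}]
```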

5. Sustainability – Monitoring for backlog, task failures, and checkpoints; metadata updates for schema changes; resource isolation for stable batch runs.
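The monitoring side of this can be sketched as a set of threshold checks over pipeline metrics. The metric names and thresholds below are assumptions for illustration, not JD's actual configuration:

```python
# Illustrative health checks mirroring the sustainability measures:
# alert on consumer backlog, stalled checkpoints, and task failures.
# Metric names and thresholds are assumed, not JD's real config.

def health_checks(metrics: dict) -> list:
    alerts = []
    if metrics["backlog_records"] > 1_000_000:
        alerts.append("backlog too high")
    if metrics["seconds_since_last_checkpoint"] > 600:
        alerts.append("checkpoint stalled")
    if metrics["failed_tasks"] > 0:
        alerts.append("task failures detected")
    return alerts

print(health_checks({"backlog_records": 2_000_000,
                     "seconds_since_last_checkpoint": 120,
                     "failed_tasks": 0}))
# ['backlog too high']
```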

Effects and Benefits

1. Timeliness – Near‑real‑time processing reduced job duration from 3‑4 hours to about 20 minutes.

2. Job Efficiency – Atomic data modifications and reduced wide‑table construction cut resource usage; only changed data is rewritten, saving thousands of compute cycles.

3. Storage Savings – Switching from full snapshots to incremental storage with Hudi's time‑travel and savepoint features cut storage consumption by roughly 90% for petabyte‑scale product data.
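The order of magnitude is easy to sanity-check with back-of-envelope arithmetic: retaining a full snapshot per day versus one base snapshot plus daily deltas. The daily change rate below is an illustrative assumption, not a figure from the article:

```python
# Back-of-envelope check on the ~90% saving: daily full snapshots vs one
# base snapshot plus incremental changes. Change rate is assumed.

snapshot_tb = 1000          # full table size, illustrative (~1 PB)
days = 30                   # retention window
daily_change_rate = 0.03    # assume ~3% of rows change per day

full_snapshots = snapshot_tb * days
incremental = snapshot_tb + snapshot_tb * daily_change_rate * days

saving = 1 - incremental / full_snapshots
print(f"{saving:.0%}")  # 94%
```

With low daily churn relative to table size, the saving naturally lands in the ~90% range the article reports; Hudi's time-travel and savepoints make the incremental history queryable rather than just an archive.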

4. Unified API and Consistent Metrics – A unified stream‑batch pipeline provides a single query API, eliminating Lambda‑style duplication and halving maintenance overhead.
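The essence of the unified API is that batch and real-time consumers hit one entry point over one table: a snapshot read returns the latest value per key, while an incremental read returns only commits after a given point. A purely illustrative sketch (real Hudi exposes this through its query types rather than a function like this):

```python
# Sketch of one query entry point replacing Lambda's two code paths:
# snapshot reads and incremental reads over the same commit timeline,
# so both consumer types share one definition of each metric.

COMMITS = [  # (commit_ts, rows written in that commit) -- illustrative
    (1, [{"pk": "A", "qty": 5}]),
    (2, [{"pk": "B", "qty": 3}]),
    (3, [{"pk": "A", "qty": 9}]),
]

def read(mode: str, since_ts: int = 0):
    if mode == "snapshot":        # latest value per key
        latest = {}
        for _, rows in COMMITS:
            for r in rows:
                latest[r["pk"]] = r
        return list(latest.values())
    if mode == "incremental":     # only rows committed after since_ts
        return [r for ts, rows in COMMITS if ts > since_ts for r in rows]
    raise ValueError(mode)

print(read("snapshot"))                 # [{'pk': 'A', 'qty': 9}, {'pk': 'B', 'qty': 3}]
print(read("incremental", since_ts=2))  # [{'pk': 'A', 'qty': 9}]
```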

5. Query Layering – Integration with Trino, ClickHouse, and StarRocks adds indexing and query acceleration, enabling a layered query architecture.

Future Outlook and Plans

Upcoming work includes disaster recovery mechanisms, resource isolation for elastic scaling, automated compaction for Hudi small‑file mitigation, a data immunity system, and enhanced self‑management of Hudi tables.

For example, a compaction can be triggered manually from Spark SQL via Hudi's `run_compaction` procedure:

```sql
call run_compaction(op => 'run', path => '{path}');
```

Tags: Big Data, Real-time Processing, Flink, Data Lake, Spark, Hudi
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
