Real‑time Data Warehouse Evolution with Data Lake: Architecture, Challenges, and Solutions
This article presents a comprehensive overview of JD Tech's real‑time data warehouse evolution, detailing the legacy Lambda‑based design, its shortcomings, the transition to a data‑lake‑integrated architecture, the iterative improvements that followed, the technical and non‑technical issues encountered along the way, and the future outlook.
The presentation introduces the concept of a real‑time data warehouse, explaining its purpose and how it differs from traditional offline warehouses.
It first describes the previous Lambda‑style real‑time warehouse architecture used at JD Tech, including its data link, computation layer, and data layer, and outlines limitations such as tight coupling, unclear positioning, and operational complexity.
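To make the coupling concrete, here is a minimal, illustrative sketch of the Lambda pattern the legacy architecture followed, not JD Tech's actual code: the same metric is computed twice, once in a batch layer over full history and once in a speed layer over recent events, with a serving layer merging the two views. All field and function names are hypothetical.

```python
# Illustrative Lambda-pattern sketch: the same metric (order counts per
# user) is implemented twice and merged at query time. The duplicated
# per-layer logic is the coupling and maintenance burden the article
# attributes to the legacy design. Names are hypothetical.
from collections import Counter

def batch_view(historical_events):
    """Batch layer: periodic full recompute over all historical events."""
    return Counter(e["user"] for e in historical_events)

def speed_view(recent_events):
    """Speed layer: incremental counts over events not yet in the batch view."""
    return Counter(e["user"] for e in recent_events)

def serving_layer(batch, speed):
    """Serving layer: merge batch and speed views at query time."""
    return batch + speed

history = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
recent = [{"user": "a"}]
merged = serving_layer(batch_view(history), speed_view(recent))
print(merged["a"])  # 3
```

Any change to the metric definition must be applied to both layers in lockstep, which is exactly the batch‑stream integration cost discussed below.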
The speaker then discusses the problems with the older architecture, highlighting both technical drawbacks (e.g., batch‑stream integration costs, metadata governance debt) and non‑technical challenges (e.g., high migration cost, iteration overhead).
Next, a data‑lake‑based solution is introduced, showing how the lake replaces the batch‑stream hybrid layer, simplifies the service layer, and separates real‑time and product stores, thereby reducing complexity and improving delivery speed.
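The core simplification is that one primary‑key lake table absorbs streaming upserts and also serves batch reads, replacing the separate batch and speed pipelines. The sketch below mimics, in plain Python, the kind of upsert‑then‑snapshot behavior a lake table format such as Paimon provides; the `LakeTable` class and its API are hypothetical, not any real library's interface.

```python
# Sketch of the lake-centric idea: a single primary-key table takes
# streaming upserts and serves consistent batch snapshots, so real-time
# consumers and offline jobs read the same store. LakeTable is a
# hypothetical stand-in for a lake table format's semantics.
class LakeTable:
    def __init__(self, key_field):
        self.key_field = key_field
        self._rows = {}  # latest row version per primary key

    def upsert(self, row):
        """Streaming write path: last write per key wins."""
        self._rows[row[self.key_field]] = row

    def snapshot(self):
        """Batch read path: consistent view of the latest row versions."""
        return list(self._rows.values())

orders = LakeTable(key_field="order_id")
orders.upsert({"order_id": 1, "status": "created"})
orders.upsert({"order_id": 2, "status": "created"})
orders.upsert({"order_id": 1, "status": "paid"})  # an update, not an append

snapshot = orders.snapshot()
```

With one store behind both paths, there is no batch/speed merge step and no duplicated pipeline logic, which is the complexity reduction the article credits to the lake.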
The talk walks through three iterative upgrades of the new architecture: the first round (syncing the mart layer to the database), the second round (merging the real‑time and mart layers), and the third round (further decoupling and standardization), noting the practical difficulties encountered in each phase.
Finally, the speaker summarizes the benefits of lake‑warehouse integration: shorter development cycles, clearer boundaries between offline, real‑time, and near‑real‑time processing, and a more scalable roadmap. Remaining gaps and expectations for future work are also acknowledged.
A Q&A session follows, covering topics such as the use of ClickHouse with OSS, the adoption status of Paimon, comparisons between ClickHouse and Doris, and the trade‑offs of moving batch processing from Spark to Flink.
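The Spark‑to‑Flink trade‑off raised in the Q&A is largely a "one engine, one codebase" argument: the same transformation logic runs over a bounded batch and, incrementally, over a stream. The sketch below illustrates only that idea in plain Python; the function names are hypothetical and this does not reproduce either engine's API.

```python
# Sketch of batch-stream unification: identical business logic executes
# in one pass over a bounded dataset (batch) and one event at a time
# over an unbounded-style iterator (stream). Names are hypothetical.
def enrich(event):
    """Shared business logic, written once for both execution modes."""
    return {**event, "amount_cents": int(event["amount"] * 100)}

def run_batch(events):
    """Batch mode: process a bounded dataset in one pass."""
    return [enrich(e) for e in events]

def run_stream(event_iter):
    """Streaming mode: process events incrementally as they arrive."""
    for e in event_iter:
        yield enrich(e)

events = [{"amount": 1.5}, {"amount": 2.0}]
batch_out = run_batch(events)
stream_out = list(run_stream(iter(events)))
assert batch_out == stream_out  # identical semantics from one codebase
```

The trade‑off discussed in the Q&A is that this consistency comes at the cost of migrating mature Spark batch jobs to a second engine.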
DataFunSummit