Real‑time Data Warehouse Evolution with Data Lake: Architecture, Challenges, and Solutions
This article presents a comprehensive overview of JD Tech's real‑time data warehouse evolution, detailing the legacy Lambda‑based design, its shortcomings, the transition to a data‑lake‑integrated architecture, the iterative improvements that followed, the technical and non‑technical issues encountered along the way, and the future outlook.
The presentation introduces the concept of a real‑time data warehouse, explaining its purpose and how it differs from traditional offline warehouses.
It first describes the previous Lambda‑style real‑time warehouse architecture used at JD Tech, including its data link, computation layer, and data layer, and outlines limitations such as tight coupling, unclear positioning, and operational complexity.
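To make the coupling concrete, here is a minimal, illustrative sketch of the Lambda pattern the legacy architecture followed, not JD Tech's actual code: the same metric is computed twice, once in a batch layer over full history and once in a speed layer over recent events, with a serving layer merging the two views. All field and function names are hypothetical.

```python
# Illustrative Lambda-pattern sketch: the same metric (order counts per
# user) is implemented twice and merged at query time. The duplicated
# per-layer logic is the coupling and maintenance burden the article
# attributes to the legacy design. Names are hypothetical.
from collections import Counter

def batch_view(historical_events):
    """Batch layer: periodic full recompute over all historical events."""
    return Counter(e["user"] for e in historical_events)

def speed_view(recent_events):
    """Speed layer: incremental counts over events not yet in the batch view."""
    return Counter(e["user"] for e in recent_events)

def serving_layer(batch, speed):
    """Serving layer: merge batch and speed views at query time."""
    return batch + speed

history = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
recent = [{"user": "a"}]
merged = serving_layer(batch_view(history), speed_view(recent))
print(merged["a"])  # 3
```

Any change to the metric definition must be applied to both layers in lockstep, which is exactly the batch‑stream integration cost discussed below.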
The speaker then discusses the problems with the older architecture, highlighting both technical drawbacks (e.g., batch‑stream integration costs, metadata governance debt) and non‑technical challenges (e.g., high migration cost, iteration overhead).
Next, a data‑lake‑based solution is introduced, showing how the lake replaces the batch‑stream hybrid layer, simplifies the service layer, and separates real‑time and product stores, thereby reducing complexity and improving delivery speed.
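The core simplification is that one primary‑key lake table absorbs streaming upserts and also serves batch reads, replacing the separate batch and speed pipelines. The sketch below mimics, in plain Python, the kind of upsert‑then‑snapshot behavior a lake table format such as Paimon provides; the `LakeTable` class and its API are hypothetical, not any real library's interface.

```python
# Sketch of the lake-centric idea: a single primary-key table takes
# streaming upserts and serves consistent batch snapshots, so real-time
# consumers and offline jobs read the same store. LakeTable is a
# hypothetical stand-in for a lake table format's semantics.
class LakeTable:
    def __init__(self, key_field):
        self.key_field = key_field
        self._rows = {}  # latest row version per primary key

    def upsert(self, row):
        """Streaming write path: last write per key wins."""
        self._rows[row[self.key_field]] = row

    def snapshot(self):
        """Batch read path: consistent view of the latest row versions."""
        return list(self._rows.values())

orders = LakeTable(key_field="order_id")
orders.upsert({"order_id": 1, "status": "created"})
orders.upsert({"order_id": 2, "status": "created"})
orders.upsert({"order_id": 1, "status": "paid"})  # an update, not an append

snapshot = orders.snapshot()
```

With one store behind both paths, there is no batch/speed merge step and no duplicated pipeline logic, which is the complexity reduction the article credits to the lake.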
The talk walks through three iterative upgrades of the new architecture: the first round (syncing the mart layer to the database), the second round (merging the real‑time and mart layers), and the third round (further decoupling and standardization), noting the practical difficulties encountered in each phase.
Finally, the speaker summarizes the benefits of lake‑warehouse integration: shorter development cycles, clearer boundaries between offline, real‑time, and near‑real‑time processing, and a more scalable roadmap. Remaining gaps and expectations for future work are also acknowledged.
A Q&A session follows, covering topics such as the use of ClickHouse with OSS, the adoption status of Paimon, comparisons between ClickHouse and Doris, and the trade‑offs of moving batch processing from Spark to Flink.
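The Spark‑to‑Flink trade‑off raised in the Q&A is largely a "one engine, one codebase" argument: the same transformation logic runs over a bounded batch and, incrementally, over a stream. The sketch below illustrates only that idea in plain Python; the function names are hypothetical and this does not reproduce either engine's API.

```python
# Sketch of batch-stream unification: identical business logic executes
# in one pass over a bounded dataset (batch) and one event at a time
# over an unbounded-style iterator (stream). Names are hypothetical.
def enrich(event):
    """Shared business logic, written once for both execution modes."""
    return {**event, "amount_cents": int(event["amount"] * 100)}

def run_batch(events):
    """Batch mode: process a bounded dataset in one pass."""
    return [enrich(e) for e in events]

def run_stream(event_iter):
    """Streaming mode: process events incrementally as they arrive."""
    for e in event_iter:
        yield enrich(e)

events = [{"amount": 1.5}, {"amount": 2.0}]
batch_out = run_batch(events)
stream_out = list(run_stream(iter(events)))
assert batch_out == stream_out  # identical semantics from one codebase
```

The trade‑off discussed in the Q&A is that this consistency comes at the cost of migrating mature Spark batch jobs to a second engine.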
DataFunSummit