
Real-time Data Warehouse Evolution with Data Lake: Architecture, Challenges, and Solutions

This article presents a comprehensive overview of the evolution from traditional Lambda‑based real‑time data warehouse solutions to a data‑lake‑integrated architecture, detailing the shortcomings of legacy designs, the iterative improvements made at JD Technology, and the technical and operational challenges encountered during implementation.

DataFunTalk

The presentation begins with an introduction to real‑time data warehouses, explaining their purpose and how they differ from traditional offline warehouses, followed by a five‑part agenda covering legacy architectures, issues, new designs, and a Q&A session.

It then describes the previous Lambda‑style real‑time warehouse used at JD Technology, outlining its data link, computation layer, and storage layer, and highlighting problems such as tight coupling of services, limited visibility, and user expectation mismatches.
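As a rough illustration of the Lambda pattern referred to above, the following Python sketch shows a serving step that merges an offline (batch) view with a real-time (speed) view. It is a generic, simplified model and not JD Technology's implementation; the event fields (`key`, `ts`, `amount`), the `CUTOFF` boundary, and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical event shape: {"key": str, "ts": int, "amount": int}

def batch_view(events, cutoff):
    """Offline path: aggregate events strictly before the cutoff."""
    totals = defaultdict(int)
    for e in events:
        if e["ts"] < cutoff:
            totals[e["key"]] += e["amount"]
    return dict(totals)

def speed_view(events, cutoff):
    """Real-time path: aggregate events at or after the cutoff."""
    totals = defaultdict(int)
    for e in events:
        if e["ts"] >= cutoff:
            totals[e["key"]] += e["amount"]
    return dict(totals)

def serve(batch, speed):
    """Serving step: combine both views into one answer per key."""
    merged = dict(batch)
    for key, value in speed.items():
        merged[key] = merged.get(key, 0) + value
    return merged

events = [
    {"key": "order_total", "ts": 1, "amount": 10},
    {"key": "order_total", "ts": 2, "amount": 5},
    {"key": "order_total", "ts": 3, "amount": 7},
    {"key": "refund_total", "ts": 3, "amount": 2},
]
CUTOFF = 3  # assumed boundary between the offline and real-time paths

result = serve(batch_view(events, CUTOFF), speed_view(events, CUTOFF))
print(result)
```

Note that the aggregation logic appears twice, once per path. Keeping the two copies consistent is a commonly cited source of complexity in Lambda designs.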

The next section weighs the trade-offs of the older architecture: the advantage of independent development on one side, and on the other, disadvantages such as the complexity inherited from Lambda and the ambiguous positioning of the real‑time warehouse.

Subsequently, the new data‑lake‑based solution is introduced, detailing how the data layer now leverages binlog and lake tables, the separation of real‑time and product layers, and the iterative rollout across three major versions, each addressing specific technical and non‑technical challenges.
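To make the relationship between binlog and lake tables more concrete, here is a minimal Python sketch of replaying change events against a table keyed by primary key. It is only a conceptual model: the event shape (`op`, `row`), the `id` key, and the `apply_binlog` helper are hypothetical, real binlog formats differ, and a lake table format would normally handle this merging internally.

```python
def apply_binlog(table, events, pk="id"):
    """Replay change events in order against a dict keyed by primary key."""
    for event in events:
        op, row = event["op"], event["row"]
        if op in ("insert", "update"):
            # Upsert: the latest row for a key replaces any earlier one.
            table[row[pk]] = row
        elif op == "delete":
            # Deleting a missing key is treated as a no-op.
            table.pop(row[pk], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return table

# Hypothetical change log for an orders table.
binlog = [
    {"op": "insert", "row": {"id": 1, "status": "created"}},
    {"op": "insert", "row": {"id": 2, "status": "created"}},
    {"op": "update", "row": {"id": 1, "status": "paid"}},
    {"op": "delete", "row": {"id": 2}},
]

snapshot = apply_binlog({}, binlog)
print(snapshot)
```

Because the events are applied in order, the resulting snapshot reflects the latest state of each key, which is the property that lets a single table serve both fresh and historical reads.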

Further discussion covers the benefits achieved by adopting the lake‑warehouse integration, such as reduced complexity, faster delivery cycles, clearer boundaries between offline, real‑time, and near‑real‑time processing, and improved alignment with business needs.

The article concludes with a summary of the new architecture’s issues, future expectations, and a Q&A segment addressing topics like CK vs. OSS, Paimon adoption, and component choices between Spark and Flink.

Tags: architecture, Big Data, Streaming, real-time data warehouse, data lake, Lambda architecture
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
