
Real‑time Data Warehouse Evolution with Data Lake: Challenges, Solutions, and Future Outlook

This article presents a comprehensive overview of JD Tech's real‑time data warehouse evolution, detailing the legacy Lambda architecture, its shortcomings, the integration of a data‑lake‑based solution, iterative redesigns, technical trade‑offs, and future directions for real‑time analytics.

DataFunTalk

Introduction

In this session, Chen Weiqiang, head of JD Tech's real‑time data warehouse, introduces the concept of a real‑time data warehouse (RTDW), distinguishes it from the traditional offline warehouse, and walks through five parts: the legacy architecture, its problems, the lake‑based redesign, open issues in the new architecture, and a Q&A.

1. Real‑time Data Warehouse before the Data Lake

RTDW is described as a solution that goes beyond the offline warehouse and often acts as a superset of it. JD Tech previously used a Lambda‑style architecture in which batch data was extracted from databases and incremental data came from binlog and logs. The compute layer was largely separate from the offline side, with Flink rarely used. The storage layer combined independent offline and real‑time datasets, held in ClickHouse (CK), Redis, and, to a limited extent, MySQL.
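
The Lambda‑style split described above can be illustrated with a minimal sketch of the serving step, where a batch result is combined with increments that arrived after the batch cutoff. The function name, metric keys, and numbers are hypothetical and are not taken from JD Tech's system.

```python
from typing import Dict


def merge_lambda_views(batch_view: Dict[str, int],
                       realtime_view: Dict[str, int]) -> Dict[str, int]:
    """Add real-time increments on top of the last batch aggregate."""
    merged = dict(batch_view)  # copy so the batch result is left untouched
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


# Hypothetical counts: batch covers data up to the cutoff,
# realtime covers events seen since then.
batch_view = {"order_count:shop_a": 120, "order_count:shop_b": 75}
realtime_view = {"order_count:shop_a": 3, "order_count:shop_c": 1}

serving_view = merge_lambda_views(batch_view, realtime_view)
print(serving_view)
```

Keeping two independent datasets in step like this is the part of a Lambda design that tends to become costly, since both paths have to agree on keys and cutoffs.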

Data integration relied on registering tables in a query service, manually configuring Flink jobs, and exposing a JDBC‑like interface that internally intercepted calls for view transformation and additional services such as caching.
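
As a rough illustration of the interception idea, the sketch below registers a logical table, rewrites queries against it into a physical view, and caches results. It uses an in‑memory SQLite database as a stand‑in backend. The `QueryService` class, the table names, and the naive string rewrite are all assumptions made for the example, not the actual JD Tech interface.

```python
import sqlite3
from typing import Dict, List, Tuple


class QueryService:
    """Hypothetical query facade: view rewrite plus a result cache."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._views: Dict[str, str] = {}         # logical name -> physical SQL
        self._cache: Dict[str, List[Tuple]] = {}  # original SQL -> rows
        self.backend_calls = 0

    def register_table(self, logical_name: str, physical_sql: str) -> None:
        self._views[logical_name] = physical_sql

    def execute(self, sql: str) -> List[Tuple]:
        if sql in self._cache:  # serve repeated queries from the cache
            return self._cache[sql]
        rewritten = sql
        for name, physical in self._views.items():
            # Naive text substitution; a real service would parse the SQL.
            rewritten = rewritten.replace(name, f"({physical}) AS {name}")
        self.backend_calls += 1
        rows = self._conn.execute(rewritten).fetchall()
        self._cache[sql] = rows
        return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ods_orders (shop TEXT, amount INTEGER, is_deleted INTEGER)")
conn.executemany("INSERT INTO ods_orders VALUES (?, ?, ?)",
                 [("a", 10, 0), ("a", 5, 1), ("b", 7, 0)])

service = QueryService(conn)
service.register_table(
    "orders_rt", "SELECT shop, amount FROM ods_orders WHERE is_deleted = 0")

query = "SELECT shop, SUM(amount) FROM orders_rt GROUP BY shop ORDER BY shop"
first = service.execute(query)
second = service.execute(query)  # answered from the cache
print(first, service.backend_calls)
```

Callers only see the logical name `orders_rt`, so the physical layout can change behind the registry. The cost is that every custom behaviour ends up inside the service layer.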

2. Problems of the Old Architecture

Advantages: Independent construction without introducing new tech stacks, allowing iterative development.

Disadvantages: It inherited the usual Lambda‑related issues, and the positioning of the RTDW was unclear. This led to tightly coupled service layers and made transactions, blacklists and whitelists, and metadata governance difficult to handle.

The technical issues were grouped by whether they arose in push‑based or pull‑based solutions. They covered database transaction problems, dimension handling, snapshot consistency, and metadata debt.

3. Combining Data Lake with Real‑time Data Warehouse

The new lake‑based design replaces the mixed stream‑batch storage with a data lake (using Hudi on OSS) and shifts the data source to binlog. The service layer is streamlined, and query services are unified through a platform‑level data catalog.

Real‑time libraries are split between warehouse‑side and product‑side, reducing customizations. CK remains for certain workloads, while Hudi provides upsert and snapshot capabilities.
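
To make the upsert and snapshot capabilities concrete, here is a small in‑memory sketch of applying binlog‑style change events by primary key and reading the table as of an earlier commit. It only imitates the semantics and does not use Hudi. The `UpsertTable` class, the event fields (`op`, `id`, `ts`, `row`), and the full‑copy‑per‑commit storage are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class UpsertTable:
    """Toy keyed table: upserts by primary key, one snapshot per commit."""

    def __init__(self) -> None:
        # Full table state after each commit (naive, for clarity only).
        self._commits: List[Dict[int, Row]] = []

    def commit(self, events: List[Dict[str, Any]]) -> int:
        state = dict(self._commits[-1]) if self._commits else {}
        for ev in events:
            key = ev["id"]
            current = state.get(key)
            # Skip events older than what is stored (ordering-field check).
            if current is not None and current["ts"] > ev["ts"]:
                continue
            if ev["op"] == "delete":
                state.pop(key, None)
            else:  # insert and update are both handled as an upsert
                state[key] = {"ts": ev["ts"], **ev["row"]}
        self._commits.append(state)
        return len(self._commits) - 1  # commit id

    def snapshot(self, commit_id: Optional[int] = None) -> Dict[int, Row]:
        """Return the table as of a commit (latest commit by default)."""
        if not self._commits:
            return {}
        index = len(self._commits) - 1 if commit_id is None else commit_id
        return dict(self._commits[index])


table = UpsertTable()
c0 = table.commit([
    {"op": "insert", "id": 1, "ts": 1, "row": {"status": "created"}},
    {"op": "insert", "id": 2, "ts": 2, "row": {"status": "created"}},
])
c1 = table.commit([
    {"op": "update", "id": 1, "ts": 5, "row": {"status": "paid"}},
    {"op": "update", "id": 1, "ts": 3, "row": {"status": "created"}},  # late event
    {"op": "delete", "id": 2, "ts": 6, "row": {}},
])

print(table.snapshot(c0))
print(table.snapshot())
```

The sketch keeps no tombstones, so a late update that arrives after a delete would re‑create the row. A production table format has to handle that case explicitly.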

4. New Architecture Issues and Future Outlook

Warehouse‑side challenges include meeting a 10‑minute latency target with Hudi, extensive public‑layer refactoring, and source‑layer CDC adoption for MySQL. Non‑technical challenges involve high development effort and iteration costs.
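
One simple way to track a latency target like the 10‑minute figure above is to compare the newest event time visible in a table with the current time. The sketch below assumes that definition of latency. The function names and timestamps are hypothetical, and the talk does not say how JD Tech measures it.

```python
from datetime import datetime, timedelta

LATENCY_TARGET = timedelta(minutes=10)  # target mentioned in the talk


def freshness_lag(latest_event_time: datetime, now: datetime) -> timedelta:
    """How far the newest visible record trails the current time."""
    return now - latest_event_time


def meets_target(latest_event_time: datetime, now: datetime,
                 target: timedelta = LATENCY_TARGET) -> bool:
    return freshness_lag(latest_event_time, now) <= target


# Hypothetical timestamps for two tables checked at the same moment.
now = datetime(2024, 1, 1, 12, 0)
fresh_table = datetime(2024, 1, 1, 11, 53)  # 7 minutes behind
stale_table = datetime(2024, 1, 1, 11, 45)  # 15 minutes behind

print(meets_target(fresh_table, now), meets_target(stale_table, now))
```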

Future work focuses on improving latency (e.g., allowing dirty reads), tighter integration of lake and real‑time libraries, and evolving the public layer to balance stability with rapid iteration.

5. Q&A

Key questions addressed include the use of CK + OSS versus pure lake storage, the production readiness of Paimon, comparisons between CK and Doris, and the trade‑offs of moving batch processing from Spark to Flink.

Overall, the adoption of a lake‑warehouse integration has reduced complexity, shortened delivery cycles, and clarified the boundaries between offline, real‑time, and near‑real‑time processing.
