Evolution of a Real‑Time Data Warehouse Architecture and Practical Lessons
This article recounts the author’s journey building a real‑time data warehouse using Flink, Kafka, Redis, and ClickHouse, describing the initial batch‑oriented setup, successive architectural evolutions, challenges with wide tables and dimension data, and the final OLAP‑centric solution with secondary caching.
1. Background of Real‑Time Data Warehouse Architecture
Traditional offline warehouses provide data with a T+1 delay, which is insufficient for scenarios like recommendation, risk control, and performance assessment that require immediate data. Early solutions used Flink or Spark Streaming for metric calculation, storing intermediate results in Redis for real‑time dashboards.
The sheer number of metrics and the steady stream of new business requirements made faster development cycles a necessity, prompting a shift from pure streaming SQL to a wide‑table approach.
Real‑time warehouses should support external services and ad‑hoc OLAP queries.
2. Architectural Evolution
2.1 Initial Stage
Initially, the company relied on Greenplum for a quasi‑real‑time warehouse, pulling data from business databases and analytics systems every 15 minutes, which proved slow and caused redundant calculations as metric count grew.
2.2 Real‑Time Warehouse 0.1
The author learned Flink within a week, built a Flink job for an online analysis requirement, and deployed the first pipeline, but soon faced scalability issues as metric count increased.
2.3 Real‑Time Warehouse 1.0
To simplify development, the team decided to build wide tables: dimension data was stored in Redis so that Flink’s map function could look it up quickly and attach it to each fact record. Data was ingested from PostgreSQL via triggers into Kafka, then processed by Flink.
Dimension tables (store, category, city, product, promotion) were updated during off‑peak hours, and the widened sales table was stored in ClickHouse after evaluating TiDB, Doris, Druid, and ClickHouse.
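The widening step can be sketched roughly as follows. This is a minimal illustration in plain Python, not the team's actual Flink job: a dictionary stands in for Redis, and the key layout (`dim:store:<id>`), the field names, and the `lookup` and `widen_sale` functions are all hypothetical.

```python
from typing import Any, Dict, Optional

# Stand-in for Redis: key -> dimension attributes. A real job would use a
# Redis client; the key layout below is an assumption for illustration only.
DIM_CACHE: Dict[str, Dict[str, str]] = {
    "dim:store:s1": {"store_name": "Downtown", "city_id": "c1"},
    "dim:city:c1": {"city_name": "Shanghai"},
    "dim:product:p9": {"product_name": "Green Tea", "category_id": "k3"},
}


def lookup(dim: str, key: Optional[Any]) -> Dict[str, str]:
    """Return the attributes for one dimension key, or {} if it is missing."""
    if key is None:
        return {}
    return DIM_CACHE.get(f"dim:{dim}:{key}", {})


def widen_sale(sale: Dict[str, Any]) -> Dict[str, Any]:
    """Join a raw sales event with its dimensions to form one wide row."""
    wide = dict(sale)  # copy so the input event is left untouched
    store = lookup("store", sale.get("store_id"))
    wide.update(store)
    # The city is reached through the store's city_id (hypothetical schema).
    wide.update(lookup("city", store.get("city_id")))
    wide.update(lookup("product", sale.get("product_id")))
    return wide


if __name__ == "__main__":
    event = {"order_id": 1, "store_id": "s1", "product_id": "p9", "amount": 12.5}
    print(widen_sale(event))
```

In this sketch a missing dimension key simply leaves the row without those columns. Whether a production pipeline should instead drop, retry, or flag such records is a design decision the source does not describe.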
As business complexity grew, additional wide tables (inventory, coupons, membership, promotion) were added, leading to large Flink‑ClickHouse writes; the team introduced Waterdrop to simplify the pipeline.
3. Summary
The core idea is to let Flink handle data widening while delegating heavy computation to an OLAP engine, achieving decoupling.
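To illustrate why this decoupling speeds up development, the sketch below treats a list of already-widened rows as a stand-in for the OLAP table. Under that assumption, a new metric becomes a query-time aggregation rather than a new streaming job. The `aggregate` helper, the sample rows, and the column names are hypothetical; in practice the equivalent would be a GROUP BY query against ClickHouse.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple

# Hypothetical wide rows, as they might look after Flink has joined
# sales events with their dimensions.
WIDE_ROWS = [
    {"city_name": "Shanghai", "category": "tea", "amount": 10.0},
    {"city_name": "Shanghai", "category": "snacks", "amount": 20.0},
    {"city_name": "Beijing", "category": "tea", "amount": 5.0},
]


def aggregate(
    rows: Iterable[Dict[str, Any]],
    group_by: Tuple[str, ...],
    measure: str,
) -> Dict[Tuple[Any, ...], float]:
    """Sum one measure per combination of grouping columns."""
    totals: Dict[Tuple[Any, ...], float] = defaultdict(float)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        totals[key] += float(row[measure])
    return dict(totals)


if __name__ == "__main__":
    # Two different metrics from the same wide rows, with no new pipeline.
    print(aggregate(WIDE_ROWS, ("city_name",), "amount"))
    print(aggregate(WIDE_ROWS, ("category",), "amount"))
```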
Challenges such as oversized dimension tables were mitigated by adding a secondary cache (HBase) to support real‑time queries for the past three months.
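A two-tier lookup of this kind might look like the sketch below. It is an assumption-laden illustration, not the author's implementation: plain dictionaries stand in for Redis (primary) and HBase (secondary), and the `TieredDimLookup` class, its counter, and the sample keys are hypothetical.

```python
from typing import Dict, Optional

Row = Dict[str, str]


class TieredDimLookup:
    """Look up a dimension row in a small primary store, then a larger one."""

    def __init__(self, primary: Dict[str, Row], secondary: Dict[str, Row]) -> None:
        self.primary = primary        # stand-in for Redis (hot, small)
        self.secondary = secondary    # stand-in for HBase (large, slower)
        self.secondary_reads = 0      # how often the fallback was needed

    def get(self, key: str) -> Optional[Row]:
        row = self.primary.get(key)
        if row is not None:
            return row
        # Primary miss: fall back to the secondary store.
        self.secondary_reads += 1
        return self.secondary.get(key)


if __name__ == "__main__":
    lookup = TieredDimLookup(
        primary={"product:p1": {"product_name": "Green Tea"}},
        secondary={"product:p2": {"product_name": "Oolong"}},
    )
    print(lookup.get("product:p1"))  # served from the primary store
    print(lookup.get("product:p2"))  # served from the secondary store
    print(lookup.get("product:p3"))  # found in neither store
```

The sketch does not copy secondary hits back into the primary store. Doing so would be a reasonable extension, but it would need an eviction policy to keep the primary store from growing oversized again.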
The final architecture combines Flink for stream processing, Redis for fast dimension lookups, ClickHouse for OLAP storage, and HBase as a secondary cache.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
