Evolution of a Real‑Time Data Warehouse Architecture and Practical Lessons

This article recounts the author’s journey building a real‑time data warehouse using Flink, Kafka, Redis, and ClickHouse, describing the initial batch‑oriented setup, successive architectural evolutions, challenges with wide tables and dimension data, and the final OLAP‑centric solution with secondary caching.

Big Data Technology & Architecture

Jan 11, 2021

1. Background of Real‑Time Data Warehouse Architecture

Traditional offline warehouses provide data with a T+1 delay, which is insufficient for scenarios like recommendation, risk control, and performance assessment that require immediate data. Early solutions used Flink or Spark Streaming for metric calculation, storing intermediate results in Redis for real‑time dashboards.

As the number of metrics grew and business requests kept arriving, writing a separate streaming SQL job for each metric became too slow, which prompted a shift to a wide-table approach with faster development cycles.

A real-time warehouse should also serve external services and support ad-hoc OLAP queries.

2. Architectural Evolution

2.1 Initial Stage

Initially, the company relied on Greenplum as a quasi-real-time warehouse, pulling data from business databases and analytics systems every 15 minutes. This proved slow and caused redundant calculations as the number of metrics grew.

2.2 Real‑Time Warehouse 0.1

The author learned Flink within a week, built a Flink job for an online analysis requirement, and deployed the first pipeline, but soon ran into scalability issues as the number of metrics increased.

2.3 Real‑Time Warehouse 1.0

To simplify development, the team decided to build wide tables: dimension data was stored in Redis so that Flink's map function could look it up quickly and attach it to each record. Data was captured from PostgreSQL via triggers, sent to Kafka, and then processed by Flink.
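The widening step can be illustrated with a minimal Python sketch. It is not the author's Flink code: a plain dictionary stands in for Redis, and the key layout (`store:<id>`, `product:<id>`), the field names, and the `widen_sale` function are all hypothetical.

```python
from typing import Mapping

# Hypothetical stand-in for Redis: key -> dimension attributes.
# In a real job this lookup would go through a Redis client.
DimStore = Mapping[str, Mapping[str, str]]


def widen_sale(sale: dict, dims: DimStore) -> dict:
    """Return a copy of the sale event with dimension attributes attached."""
    wide = dict(sale)
    # Each pair: (foreign-key field on the fact row, key prefix in the store).
    for fk_field, prefix in (("store_id", "store"), ("product_id", "product")):
        attrs = dims.get(f"{prefix}:{sale[fk_field]}")
        if attrs is None:
            # Dimension not found: keep the row and flag it rather than drop it.
            wide[f"{prefix}_missing"] = True
            continue
        for name, value in attrs.items():
            wide[f"{prefix}_{name}"] = value
    return wide


dims = {
    "store:7": {"name": "Downtown", "city": "Hangzhou"},
    "product:42": {"name": "Green tea", "category": "Beverage"},
}
sale = {"order_id": 1001, "store_id": 7, "product_id": 42, "amount": 12.5}
wide_row = widen_sale(sale, dims)
```

Flagging a missing dimension instead of discarding the row is one possible policy, chosen here so that late-arriving dimension data does not silently remove sales from downstream totals.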

Dimension tables (store, category, city, product, promotion) were updated during off‑peak hours. After evaluating TiDB, Doris, Druid, and ClickHouse, the team chose ClickHouse to store the widened sales table.

As business complexity grew, additional wide tables (inventory, coupons, membership, promotion) were added, which led to a large volume of writes from Flink to ClickHouse. The team introduced Waterdrop to simplify the pipeline.
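The article does not say how these writes were organized. ClickHouse generally handles fewer, larger inserts better than many small ones, so one common approach is to buffer rows and flush them in batches. The sketch below shows that idea only; `BatchingSink`, `flush_fn`, and `max_rows` are hypothetical names, and a list stands in for the ClickHouse insert call.

```python
from typing import Callable, List


class BatchingSink:
    """Buffer rows and hand them to `flush_fn` in batches of up to `max_rows`."""

    def __init__(self, flush_fn: Callable[[List[dict]], None], max_rows: int):
        self.flush_fn = flush_fn  # assumed to perform one bulk insert
        self.max_rows = max_rows
        self.buffer: List[dict] = []

    def write(self, row: dict) -> None:
        self.buffer.append(row)
        if len(self.buffer) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []  # start a fresh list; the flushed batch is untouched


batches: List[List[dict]] = []
sink = BatchingSink(batches.append, max_rows=3)
for i in range(7):
    sink.write({"order_id": i})
sink.flush()  # flush the remainder, as a job might on checkpoint or shutdown
```

A production sink would usually also flush on a timer so that rows do not wait indefinitely during quiet periods.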

3. Summary

The core idea is to let Flink handle data widening while delegating heavy computation to an OLAP engine, achieving decoupling.
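A small sketch of what this decoupling buys: once rows are widened, new metrics become queries over the same rows rather than new streaming jobs. Plain Python stands in for the OLAP engine here, and the rows, field names, and `sum_by` helper are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def sum_by(rows: Iterable[dict], key: str, value: str) -> dict:
    """Group rows by `key` and sum `value`, like a GROUP BY in the OLAP engine."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Widened rows as the streaming job might emit them (illustrative data only).
wide_rows = [
    {"store_city": "Hangzhou", "product_category": "Beverage", "amount": 12.5},
    {"store_city": "Hangzhou", "product_category": "Snack", "amount": 4.0},
    {"store_city": "Shanghai", "product_category": "Beverage", "amount": 8.0},
]

# Two different metrics from the same rows, with no change to the streaming job.
sales_by_city = sum_by(wide_rows, "store_city", "amount")
sales_by_category = sum_by(wide_rows, "product_category", "amount")
```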

Challenges such as oversized dimension tables were mitigated by adding a secondary cache (HBase) to support real‑time queries for the past three months.
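A two-tier lookup of this kind can be sketched as follows. The article does not describe the lookup logic, so the details are assumptions: dictionaries stand in for Redis (primary) and HBase (secondary), the `TwoTierLookup` class is hypothetical, and promoting a secondary hit into the primary tier is one possible policy rather than the author's stated design.

```python
from typing import Mapping, MutableMapping, Optional


class TwoTierLookup:
    """Check a small fast store first, then fall back to a larger slower one."""

    def __init__(self, primary: MutableMapping[str, dict],
                 secondary: Mapping[str, dict]):
        self.primary = primary      # stand-in for Redis
        self.secondary = secondary  # stand-in for HBase

    def get(self, key: str) -> Optional[dict]:
        value = self.primary.get(key)
        if value is not None:
            return value
        value = self.secondary.get(key)
        if value is not None:
            # Assumed policy: copy the hit into the fast tier for next time.
            self.primary[key] = value
        return value


hot = {"product:42": {"name": "Green tea"}}
cold = {"product:42": {"name": "Green tea"}, "product:7": {"name": "Oolong"}}
lookup = TwoTierLookup(hot, cold)
found = lookup.get("product:7")    # misses the primary, served by the secondary
absent = lookup.get("product:999")  # in neither tier
```

A real deployment would also need an eviction or expiry rule for the primary tier, otherwise promotion gradually recreates the oversized cache the second tier was meant to avoid.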

The final architecture combines Flink for stream processing, Redis for fast dimension lookups, ClickHouse for OLAP storage, and HBase as a secondary cache.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
