
Real-time Click Stream Data Warehouse with Flink and ClickHouse: Architecture, Layered Design, and Practical Tips

This article explains how to build a real‑time click‑stream data warehouse using Flink for stream processing and ClickHouse for near‑real‑time OLAP, covering click‑stream characteristics, dimensional modeling, layered warehouse design, async dimension joins, sink implementation, and data rebalancing strategies.

Architect

Flink and ClickHouse are leading open‑source frameworks for real‑time computation and near‑real‑time OLAP, respectively. Many large enterprises combine them to build high‑performance real‑time platforms.

A click stream is the trace data users leave behind when they visit websites or apps, typically stored as access logs and event logs. A medium‑size e‑commerce platform can generate about 200 GB of raw logs per day, amounting to billions of records that span over 100 event types and more than 50 dimensions.

Following Kimball’s dimensional modeling, the click‑stream warehouse adopts a classic star schema, with dimensions stored in a MySQL mirror (DIM layer) and fact tables built on top.

The warehouse is organized into four layers. DIM holds dimension data in MySQL. ODS holds the raw data that Flume delivers to Kafka. DWD is the detail layer, where Flink performs ETL and real‑time dimension joins, then writes to ClickHouse for analytics and to Hive for backup. DWS is the service layer, which writes real‑time aggregates to Redis and maintains materialized views in ClickHouse for reports and ad‑hoc BI queries.
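To make the flow between layers concrete, here is a minimal, framework‑free sketch of the ODS to DWD to DWS path. It is not the article's Flink job: the field names (user_id, event_type, product_id), the DIM_PRODUCT lookup table, and all function names are assumptions made up for illustration.

```python
import json
from collections import Counter

# Hypothetical DIM-layer snapshot; in the article this data lives in MySQL.
DIM_PRODUCT = {101: {"category": "books"}, 102: {"category": "toys"}}


def ods_parse(raw_line):
    """ODS -> DWD, step 1: parse one raw JSON log line as read from Kafka."""
    return json.loads(raw_line)


def dwd_enrich(event):
    """ODS -> DWD, step 2: attach dimension attributes to the fact row."""
    dim = DIM_PRODUCT.get(event["product_id"], {})
    return {**event, "category": dim.get("category", "unknown")}


def dws_aggregate(detail_rows):
    """DWD -> DWS: count events per (event_type, category)."""
    return Counter((r["event_type"], r["category"]) for r in detail_rows)


raw_lines = [
    '{"user_id": 1, "event_type": "click", "product_id": 101}',
    '{"user_id": 2, "event_type": "click", "product_id": 102}',
    '{"user_id": 1, "event_type": "view", "product_id": 101}',
    '{"user_id": 3, "event_type": "click", "product_id": 999}',
]
detail = [dwd_enrich(ods_parse(line)) for line in raw_lines]
summary = dws_aggregate(detail)
```

In this toy version the detail rows stand in for what would be written to ClickHouse and Hive, and the summary counter stands in for the aggregates served from Redis or a materialized view.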

For real‑time dimension joins in Flink, use an asynchronous MySQL client such as Vert.x MySQL Client, add an in‑memory cache (e.g., Guava or Caffeine) with proper eviction, and limit joins to slowly changing dimensions like geographic or product data.
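The pattern can be sketched outside of Flink with Python's asyncio. This is an illustration under stated assumptions, not the article's implementation: fake_fetch is a hypothetical stand‑in for the asynchronous MySQL client, TtlLruCache plays the role that Guava or Caffeine would play on the JVM, and the event fields are invented.

```python
import asyncio
import time
from collections import OrderedDict


class TtlLruCache:
    """Small in-memory cache with size-based (LRU) and time-based eviction."""

    def __init__(self, max_size, ttl_seconds):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: force a fresh lookup
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (time.monotonic() + self.ttl_seconds, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry


class AsyncDimensionJoiner:
    def __init__(self, fetch_dimension, cache):
        self._fetch = fetch_dimension  # hypothetical async lookup, e.g. a MySQL query
        self._cache = cache

    async def enrich(self, event):
        key = event["product_id"]
        dim = self._cache.get(key)
        if dim is None:
            dim = await self._fetch(key)  # awaiting yields control instead of blocking
            self._cache.put(key, dim)
        return {**event, **dim}


calls = []


async def fake_fetch(product_id):
    """Stand-in for the async database client; records each call it receives."""
    calls.append(product_id)
    await asyncio.sleep(0.01)
    return {"category": "books" if product_id == 101 else "other"}


async def main():
    cache = TtlLruCache(max_size=1000, ttl_seconds=60)
    joiner = AsyncDimensionJoiner(fake_fetch, cache)
    first = await joiner.enrich({"user_id": 1, "product_id": 101})
    second = await joiner.enrich({"user_id": 2, "product_id": 101})
    return first, second


first, second = asyncio.run(main())
```

The second event is served from the cache, so the backing store is queried only once. The time‑to‑live bounds how stale a cached dimension row can get, which is why the approach suits slowly changing dimensions. Note that this sketch does not deduplicate concurrent misses for the same key.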

The Flink‑ClickHouse sink is built on the clickhouse‑jdbc BalancedClickhouseDataSource. Rows are buffered and flushed in batches (every 10 000 rows or every 15 s, whichever comes first) to balance merge pressure on ClickHouse against latency. Retry logic is configurable, and a batch falls back to file storage when all retries fail. Note that ClickHouse lacks transaction support.
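The batching, retry, and fallback behavior can be sketched as a small standalone class. This is a simplified illustration, not the article's sink: BatchingSink, write_batch, and the JSON‑lines fallback format are assumptions, and the time threshold is only checked when a row arrives, whereas a production sink would also flush from a timer.

```python
import json
import os
import tempfile
import time


class BatchingSink:
    def __init__(self, write_batch, fallback_path, max_rows=10_000,
                 max_interval_s=15.0, max_retries=3, clock=time.monotonic):
        self._write_batch = write_batch      # hypothetical callable that inserts one batch
        self._fallback_path = fallback_path  # where batches go when every retry fails
        self._max_rows = max_rows
        self._max_interval_s = max_interval_s
        self._max_retries = max_retries
        self._clock = clock
        self._buffer = []
        self._last_flush = clock()

    def add(self, row):
        self._buffer.append(row)
        too_many = len(self._buffer) >= self._max_rows
        too_old = self._clock() - self._last_flush >= self._max_interval_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        batch, self._buffer = self._buffer, []
        self._last_flush = self._clock()
        if not batch:
            return
        for _attempt in range(self._max_retries):
            try:
                self._write_batch(batch)
                return
            except Exception:
                continue  # a real sink would log the error and back off here
        # All retries failed: keep the rows on disk so they can be replayed later.
        with open(self._fallback_path, "a", encoding="utf-8") as f:
            for row in batch:
                f.write(json.dumps(row) + "\n")


# Healthy target: batches of three rows are handed to the writer.
written = []
ok_sink = BatchingSink(written.append, fallback_path=os.devnull, max_rows=3)
for i in range(7):
    ok_sink.add({"id": i})


# Failing target: the batch ends up in the fallback file.
def always_fail(batch):
    raise ConnectionError("ClickHouse unreachable (simulated)")


fallback_file = os.path.join(tempfile.mkdtemp(), "failed_rows.jsonl")
bad_sink = BatchingSink(always_fail, fallback_file, max_rows=2)
bad_sink.add({"id": 1})
bad_sink.add({"id": 2})
with open(fallback_file, encoding="utf-8") as f:
    fallback_rows = [json.loads(line) for line in f]
```

Because there is no transaction to roll back, a retry after a partially applied insert could write some rows twice, so a design like this gives at‑least‑once rather than exactly‑once delivery.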

Rebalancing data after expanding a ClickHouse cluster is challenging. The simple approach of adjusting shard weights can cause hotspots. Instead, the authors rename the original table, create a new table with the same schema on all nodes, stream new data into the new table, and use clickhouse‑copier to migrate the historical data. The trade‑off is temporary service downtime.
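A little arithmetic shows why weight adjustment leads to hotspots. A ClickHouse Distributed table sends each shard a share of inserted rows proportional to its weight. The shard counts and weights below are invented for illustration.

```python
from fractions import Fraction


def write_shares(weights):
    """Fraction of newly inserted rows each shard receives under weighted sharding."""
    total = sum(weights)
    return [Fraction(w, total) for w in weights]


# Three existing shards plus one new, empty shard (last position).
even = write_shares([1, 1, 1, 1])    # new shard gets 1/4 of writes and never catches up
skewed = write_shares([1, 1, 1, 9])  # new shard catches up sooner but takes 3/4 of writes
```

With equal weights the new shard stays behind the old ones in stored volume indefinitely, while a weight high enough to close the gap concentrates most of the write load on a single node.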

