Practical Experience Building a Real‑Time Clickstream Data Warehouse with Flink and ClickHouse
This article shares practical insights on designing and operating a real‑time clickstream data warehouse using Flink for streaming processing and ClickHouse for near‑real‑time OLAP, covering dimensional modeling, layered architecture, Flink‑ClickHouse sink implementation, and data rebalancing strategies.
Preface
Flink and ClickHouse are leading open‑source frameworks in the real‑time computing and near‑real‑time OLAP domains, respectively, and many large companies combine them to build efficient real‑time platforms. This article briefly introduces our team’s practical experience with a clickstream real‑time data warehouse.
Clickstream and Dimensional Modeling
Clickstream refers to the trace data left by users when they visit websites or apps, forming the basis for traffic and user‑behavior analysis. Typically stored as access logs and event logs, clickstream data is massive and rich in dimensions; for a medium‑size e‑commerce platform we handle over 200 GB and roughly one billion raw logs daily, with more than 100 events and over 50 dimensions.
Following Kimball’s dimensional modeling theory, the clickstream warehouse adopts a classic star schema, as shown in the diagram.
Clickstream Data Warehouse Layered Design
The layered design of a real‑time clickstream warehouse can still borrow from traditional warehouse solutions, keeping the architecture flat to minimize data transfer latency. The diagram below illustrates the layers.
DIM layer: dimension layer, MySQL mirror store containing all dimension data.
ODS layer: source‑level layer, raw data ingested directly from Flume into corresponding Kafka topics.
DWD layer: detail layer, where Flink performs necessary ETL and real‑time dimension joins, writes normalized detail data back to Kafka for downstream consumption, and also writes the data to ClickHouse and Hive as wide tables (ClickHouse for query/analysis, Hive for backup and data‑quality assurance).
DWS layer: service layer, some metrics are aggregated in real time by Flink into Redis for dashboard use, while others are periodically aggregated via ClickHouse materialized views to produce reports and heatmaps. Detailed data in this layer also supports ad‑hoc queries such as funnels, retention, and user paths, showcasing ClickHouse’s superiority over other OLAP engines.
Key Points and Considerations
Flink Real‑time Dimensional Join
Flink’s asynchronous I/O mechanism greatly simplifies accessing external storage within streaming jobs. For our scenario, three points deserve attention:
Use an async MySQL client such as Vert.x MySQL Client, add an in‑memory cache (e.g., Guava Cache, Caffeine) inside the AsyncFunction, and configure a proper eviction policy to avoid excessive MySQL requests. Real‑time dimension joins are suitable for slowly changing dimensions (e.g., geographic, product, category), while fast‑changing dimensions (e.g., user profiles) are better handled by mapping MySQL tables directly to ClickHouse, which supports heterogeneous queries and small‑scale dimension joins. In the future we may use MaterializedMySQL engine (still unreleased) to mirror some dimension tables via binlog to ClickHouse.
Flink‑ClickHouse Sink Design
Writing to ClickHouse via JDBC (flink‑connector‑jdbc) is possible but inflexible. The clickhouse‑jdbc project provides a BalancedClickhouseDataSource component for cluster‑aware writes; we built a Flink‑ClickHouse sink based on it, focusing on three aspects:
Write to local tables rather than distributed tables.
Control write frequency by batch size and batch interval (we use 10,000 rows or 15 seconds, whichever comes first) to balance merge pressure and latency.
BalancedClickhouseDataSource achieves load balancing via random routing and health checks, but lacks failover; if a node becomes unreachable, data may be lost. Therefore we added a retry mechanism with configurable attempts and intervals, and after exhausting retries the failed batch is persisted to a configured path and an alarm is raised.
Currently we only implement the DataStream API style sink; a future SQL‑style sink is planned for the Flink‑SQL wave and will be contributed back to the community. We also intend to add round‑robin and sharding‑key‑hash routing options.
ClickHouse does not support transactions, so exactly‑once semantics via 2PC are unnecessary. If the Flink‑to‑ClickHouse pipeline fails and the job restarts, it will resume from the latest Kafka offset, and any lost data can be back‑filled from Hive.
ClickHouse Data Rebalancing
After expanding a ClickHouse cluster, rebalancing data (resharding) is cumbersome because there is no built‑in tool like HDFS Balancer. A naïve approach is to adjust shard weights in the configuration so new shards receive more writes until balance is achieved, but this creates hot‑spot issues and only works for writes to distributed tables.
We adopt a more involved method: rename the original table, create a new table with the same schema on all nodes, write real‑time data to the new table, use the clickhouse‑copier tool to migrate historical data, and finally drop the old table. During migration the rebalanced table is unavailable, which is not ideal.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
