Big Data 18 min read

How Flink + ClickHouse Power Real‑Time Analytics at Scale

This article explains how FunTouTiao builds a high‑performance real‑time analytics pipeline using Flink, Hive, and ClickHouse, covering business scenarios, hour‑level and second‑level Flink‑to‑Hive architectures, streaming file sink mechanics, multi‑user permissions, ClickHouse performance tricks, and future roadmap for unified stream‑batch storage.

dbaplus Community
dbaplus Community
dbaplus Community
How Flink + ClickHouse Power Real‑Time Analytics at Scale

Business Scenario and Current State

FunTouTiao uses a combination of Flink and ClickHouse to provide real‑time data reports, ad‑hoc queries, event analysis, funnel and retention analysis, achieving 80% of queries within one second and greatly improving user experience.

Flink‑to‑Hive Hour‑Level Scenario

The hour‑level pipeline extracts binlog data and log server data into Kafka, then Flink writes to HDFS. A monitoring program watches Flink consumption; when the event time reaches the target hour, Flink triggers Hive partition creation (add partition, alert table, etc.).

Implementation relies on Flink’s StreamingFileSink with features such as:

forBulkFormat supporting Avro and Parquet.

withBucketAssigner for event‑time based bucketing.

OnCheckpointRollingPolicy to control file roll‑over by checkpoint interval or size.

Exactly‑Once semantics via Flink’s checkpointing mechanism.

Exactly‑Once is realized through a two‑phase commit: the coordinator sends a prepare message, participants acknowledge, then a commit is issued. Flink’s source receives checkpoint barriers, snapshots state, and notifies completion, matching the classic two‑phase commit protocol.

Flink’s StreamingFileSink implements CheckpointedFunction (initializeState, snapshotState) and CheckpointListener (notifyCheckpointComplete). During a checkpoint, files transition from in‑progress to pending , then to finished via rename after the checkpoint completes.

Cross‑Cluster Nameservices

Real‑time and offline clusters are separate. To avoid mixing HA configurations, the Flink job injects a final tag into HDFS XML configuration, marking the current cluster as either stream (real‑time) or date (offline), allowing each side to keep its own nameservices without mutual modification.

Multi‑User Write Permissions

Because Flink jobs run under a single user while offline Hive may have many users, an API extension withBucketUser was added. It sets the HDFS user for each write, leveraging Hadoop’s ugi.doAs to impersonate the target user, enabling per‑user writes and mitigating small‑file proliferation through periodic merges.

Flink‑to‑Hive Second‑Level (Sub‑Second) Scenario

For sub‑second latency, Flink streams data to ClickHouse after Hive ingestion. ClickHouse stores data locally on SSDs, enabling fast OLAP queries. The pipeline supports both batch and ad‑hoc queries via Horizon, QE, and user‑profile services.

Why Flink + ClickHouse?

SQL‑based metric definition replaces StormSQL, offering richer UDF support.

Independent deployment of new metrics avoids interfering with existing jobs.

Data back‑traceability aids root‑cause analysis of anomalies.

High‑throughput computation finishes all metric calculations within minutes.

Simplified operations with separate streaming and batch deployments.

Why ClickHouse Is Fast

Columnar storage with LZ4/ZSTD compression.

Compute‑storage locality and vectorized execution.

LSM‑style MergeTree with automatic background merges and indexing.

SIMD + LLVM optimizations.

Rich SQL syntax and UDFs for complex windowed analytics.

ClickHouse provides two table types: local (for writes) and distributed (for reads). Writes are batched (5‑10 w rows per batch) with a typical 5‑second cycle.

Connector Implementations

RoundRobinClickHouseDataSource (custom implementation).

BalancedClickHouseDataSource (official API usage).

Both handle connection testing (testOnBorrow, testOnReturn, testWhileIdle) and periodic health checks (scheduleActualization, scheduleConnectionsCleaning) to maintain a healthy pool of ClickHouse nodes.

Backfill and Fault Tolerance

Flink and ClickHouse each provide hour‑level fault tolerance. When failures occur, the pipeline resumes from the latest offset, using Hive as a fallback to re‑import data into ClickHouse after the cluster recovers.

Future Directions

SQL‑driven connectors to describe Flink‑to‑Hive and Flink‑to‑ClickHouse flows.

Integration of Delta Lake‑style unified storage to enable true stream‑batch processing without duplicated data stores.

Overall, the architecture demonstrates how Flink’s streaming capabilities combined with ClickHouse’s fast columnar engine can deliver sub‑second analytics at petabyte scale.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Real-TimeBig Datadata pipelineFlinkStreamingclickhouse
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.