How Flink + ClickHouse Power Real‑Time Analytics at Scale
This article explains how FunTouTiao builds a high‑performance real‑time analytics pipeline using Flink, Hive, and ClickHouse. It covers the business scenarios, the hour‑level Flink‑to‑Hive and second‑level Flink‑to‑ClickHouse architectures, streaming file sink mechanics, multi‑user permissions, ClickHouse performance tricks, and the roadmap toward unified stream‑batch storage.
Business Scenario and Current State
FunTouTiao combines Flink and ClickHouse to serve real‑time data reports, ad‑hoc queries, event analysis, and funnel and retention analysis. About 80% of queries return within one second, which greatly improves the user experience.
Flink‑to‑Hive Hour‑Level Scenario
The hour‑level pipeline sends database binlog data and log‑server data to Kafka, and Flink writes the stream to HDFS. A monitoring program tracks Flink's consumption progress; once the consumed event time reaches the end of the target hour, Hive partition creation is triggered (add partition, alter table, etc.).
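The hour‑completion check and the resulting Hive statement can be sketched with the JDK alone. This is only an illustration under assumptions: the class name, the dt/hour partition layout, the table name, and the HDFS path are hypothetical and do not come from the original pipeline.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: decides when an hour is complete and builds the Hive DDL for it.
final class HourlyPartitionTrigger {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("HH");

    /** True once the consumed event time has reached the end of the given hour. */
    static boolean hourIsComplete(long consumedEventTimeMillis, ZonedDateTime hourStart) {
        long hourEndMillis = hourStart.plusHours(1).toInstant().toEpochMilli();
        return consumedEventTimeMillis >= hourEndMillis;
    }

    /** Builds an ADD PARTITION statement; the dt/hour layout is an assumption. */
    static String addPartitionDdl(String table, ZonedDateTime hourStart, String basePath) {
        String dt = hourStart.format(DAY);
        String hour = hourStart.format(HOUR);
        return String.format(
            "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (dt='%s', hour='%s') "
                + "LOCATION '%s/dt=%s/hour=%s'",
            table, dt, hour, basePath, dt, hour);
    }

    public static void main(String[] args) {
        ZonedDateTime hour = ZonedDateTime.of(2020, 6, 1, 13, 0, 0, 0, ZoneOffset.UTC);
        long consumed = hour.plusHours(1).plusSeconds(5).toInstant().toEpochMilli();
        if (hourIsComplete(consumed, hour)) {
            // Table name and path are placeholders.
            System.out.println(addPartitionDdl("ods.app_log", hour, "hdfs://stream/warehouse/app_log"));
        }
    }
}
```

Using IF NOT EXISTS keeps the statement safe to re‑run if the trigger fires more than once for the same hour.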
Implementation relies on Flink’s StreamingFileSink with features such as:
forBulkFormat supporting Avro and Parquet.
withBucketAssigner for event‑time based bucketing.
OnCheckpointRollingPolicy to roll part files on every checkpoint, so the checkpoint interval determines how often files are rolled.
Exactly‑Once semantics via Flink’s checkpointing mechanism.
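Event‑time bucketing boils down to mapping each record's timestamp to a partition path. The sketch below shows that mapping with the JDK only; the class name and the dt=/hour= layout are assumptions. In a real job, logic of this kind would typically sit inside the getBucketId method of a Flink BucketAssigner.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: derives an hourly bucket path from a record's event time.
final class EventTimeBuckets {
    // Quoted segments are literals, so the result looks like dt=2020-06-01/hour=13.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("'dt='yyyy-MM-dd'/hour='HH");

    /** Buckets by the event's own timestamp, not by the time it was processed. */
    static String bucketId(long eventTimeMillis, ZoneId zone) {
        return FORMAT.format(Instant.ofEpochMilli(eventTimeMillis).atZone(zone));
    }

    public static void main(String[] args) {
        long eventTime = Instant.parse("2020-06-01T13:59:59Z").toEpochMilli();
        System.out.println(bucketId(eventTime, ZoneId.of("UTC")));
    }
}
```

Because the bucket depends only on the event timestamp, a late record still lands in the hour it belongs to rather than the hour it arrived in.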
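Exactly‑Once is realized through a two‑phase commit. In the classic protocol, the coordinator sends a prepare message, participants acknowledge, and the coordinator then issues a commit. Flink's checkpointing maps onto this: the JobManager's checkpoint coordinator injects checkpoint barriers at the sources, each operator snapshots its state when the barrier arrives and acknowledges it (the prepare phase), and once all acknowledgements are in, the coordinator notifies the operators that the checkpoint is complete (the commit phase).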
Flink’s StreamingFileSink implements CheckpointedFunction (initializeState, snapshotState) and CheckpointListener (notifyCheckpointComplete). During a checkpoint, files transition from in‑progress to pending, then to finished via rename after the checkpoint completes.
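The in‑progress, pending, finished lifecycle can be modeled with plain java.nio files. This is a simplified sketch, not Flink's actual implementation: TwoPhaseFileWriter, the .inprogress suffix, and the part‑N naming are hypothetical, and only the method names snapshotState and notifyCheckpointComplete mirror the interfaces named above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of a checkpoint-driven file sink (not Flink code).
final class TwoPhaseFileWriter {
    private static final String SUFFIX = ".inprogress";
    private final Path dir;
    private final TreeMap<Long, List<Path>> pendingByCheckpoint = new TreeMap<>();
    private Path inProgress;   // file currently receiving records, or null
    private int partCounter;

    TwoPhaseFileWriter(Path dir) {
        this.dir = dir;
    }

    /** Appends a record to the hidden in-progress file, creating it on first use. */
    void write(String record) throws IOException {
        if (inProgress == null) {
            inProgress = dir.resolve(".part-" + partCounter++ + SUFFIX);
        }
        Files.write(inProgress, (record + "\n").getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Phase 1 (prepare): stop writing to the current file and mark it pending. */
    void snapshotState(long checkpointId) {
        if (inProgress != null) {
            pendingByCheckpoint.computeIfAbsent(checkpointId, k -> new ArrayList<>()).add(inProgress);
            inProgress = null;
        }
    }

    /** Phase 2 (commit): publish every file pending for this or an earlier checkpoint. */
    void notifyCheckpointComplete(long checkpointId) throws IOException {
        NavigableMap<Long, List<Path>> committable = pendingByCheckpoint.headMap(checkpointId, true);
        for (List<Path> files : committable.values()) {
            for (Path pending : files) {
                String name = pending.getFileName().toString();
                // ".part-0.inprogress" becomes "part-0"
                String finalName = name.substring(1, name.length() - SUFFIX.length());
                Files.move(pending, dir.resolve(finalName), StandardCopyOption.ATOMIC_MOVE);
            }
        }
        committable.clear();   // the head-map view removes the entries from the backing map
    }
}
```

Readers of the directory only ever see renamed files, so data written after the last completed checkpoint stays invisible and can be discarded on recovery.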
Cross‑Cluster Nameservices
Real‑time and offline clusters are separate. To avoid mixing HA configurations, the Flink job injects a final tag into HDFS XML configuration, marking the current cluster as either stream (real‑time) or date (offline), allowing each side to keep its own nameservices without mutual modification.
Multi‑User Write Permissions
Because Flink jobs run under a single user while offline Hive may have many users, an API extension withBucketUser was added. It sets the HDFS user for each write, leveraging Hadoop’s ugi.doAs to impersonate the target user, enabling per‑user writes and mitigating small‑file proliferation through periodic merges.
Flink‑to‑ClickHouse Second‑Level Scenario
For second‑level latency, Flink streams data to ClickHouse in addition to the Hive ingestion. ClickHouse stores data locally on SSDs, enabling fast OLAP queries. The pipeline supports both batch and ad‑hoc queries via Horizon, QE, and user‑profile services.
Why Flink + ClickHouse?
SQL‑based metric definition replaces StormSQL, offering richer UDF support.
Independent deployment of new metrics avoids interfering with existing jobs.
Data back‑traceability aids root‑cause analysis of anomalies.
High‑throughput computation finishes all metric calculations within minutes.
Simplified operations with separate streaming and batch deployments.
Why ClickHouse Is Fast
Columnar storage with LZ4/ZSTD compression.
Compute‑storage locality and vectorized execution.
LSM‑style MergeTree with automatic background merges and indexing.
SIMD + LLVM optimizations.
Rich SQL syntax and UDFs for complex windowed analytics.
The pipeline uses two kinds of ClickHouse tables: local tables, which receive the writes, and distributed tables, which serve the reads. Writes are batched, with 50,000 to 100,000 rows per batch and a typical flush cycle of 5 seconds.
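A count‑or‑time batching buffer of the kind described above might look like the following sketch. BatchBuffer and its parameters are hypothetical; the flusher callback stands in for whatever actually writes a batch to a ClickHouse local table, and the caller passes in the current time so the logic is easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical buffer: flushes when it is full or when its oldest row is too old.
final class BatchBuffer<T> {
    private final int maxRows;
    private final long maxAgeMillis;
    private final Consumer<List<T>> flusher;   // stand-in for the real batch insert
    private List<T> rows = new ArrayList<>();
    private long firstRowAtMillis;

    BatchBuffer(int maxRows, long maxAgeMillis, Consumer<List<T>> flusher) {
        this.maxRows = maxRows;
        this.maxAgeMillis = maxAgeMillis;
        this.flusher = flusher;
    }

    /** Adds a row and flushes immediately once the row-count threshold is hit. */
    void add(T row, long nowMillis) {
        if (rows.isEmpty()) {
            firstRowAtMillis = nowMillis;
        }
        rows.add(row);
        if (rows.size() >= maxRows) {
            flush();
        }
    }

    /** Meant to be called periodically; flushes a partial batch that has waited long enough. */
    void onTimer(long nowMillis) {
        if (!rows.isEmpty() && nowMillis - firstRowAtMillis >= maxAgeMillis) {
            flush();
        }
    }

    /** Hands the buffered rows to the flusher and starts a fresh batch. */
    void flush() {
        if (rows.isEmpty()) {
            return;
        }
        List<T> batch = rows;
        rows = new ArrayList<>();
        flusher.accept(batch);
    }
}
```

With the figures from the text, such a buffer would be created with a row limit somewhere between 50,000 and 100,000 and a maximum age of 5,000 ms. Large, infrequent inserts suit MergeTree tables because each insert creates a new data part that background merges must later combine.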
Connector Implementations
RoundRobinClickHouseDataSource (custom implementation).
BalancedClickHouseDataSource (official API usage).
Both handle connection testing (testOnBorrow, testOnReturn, testWhileIdle) and periodic health checks (scheduleActualization, scheduleConnectionsCleaning) to maintain a healthy pool of ClickHouse nodes.
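The idea behind a round‑robin data source with periodic health checks can be illustrated without any ClickHouse driver. The sketch below is not the RoundRobinClickHouseDataSource from the article: RoundRobinNodes and the ping predicate are hypothetical, and the actualize method only borrows its name from the scheduled actualization mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical node selector: rotates over nodes that passed the last health check.
final class RoundRobinNodes {
    private final List<String> all;
    private final Predicate<String> ping;   // stand-in for a real connectivity test
    private final AtomicInteger next = new AtomicInteger();
    private volatile List<String> healthy = Collections.emptyList();

    RoundRobinNodes(List<String> urls, Predicate<String> ping) {
        this.all = new ArrayList<>(urls);
        this.ping = ping;
        actualize();
    }

    /** Re-probes every node and swaps in the new healthy list; intended to run on a schedule. */
    void actualize() {
        List<String> alive = new ArrayList<>();
        for (String url : all) {
            if (ping.test(url)) {
                alive.add(url);
            }
        }
        healthy = Collections.unmodifiableList(alive);
    }

    /** Returns the next healthy node in rotation. */
    String pick() {
        List<String> snapshot = healthy;   // read once so size and get see the same list
        if (snapshot.isEmpty()) {
            throw new IllegalStateException("no healthy ClickHouse node available");
        }
        // floorMod keeps the index non-negative even after the counter overflows.
        return snapshot.get(Math.floorMod(next.getAndIncrement(), snapshot.size()));
    }
}
```

Rotating over nodes spreads batch inserts evenly across the local tables, and rebuilding the healthy list on a timer keeps a failed node out of rotation until it answers again.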
Backfill and Fault Tolerance
Flink and ClickHouse each provide hour‑level fault tolerance. When failures occur, the pipeline resumes from the latest offset, using Hive as a fallback to re‑import data into ClickHouse after the cluster recovers.
Future Directions
SQL‑driven connectors to describe Flink‑to‑Hive and Flink‑to‑ClickHouse flows.
Integration of Delta Lake‑style unified storage to enable true stream‑batch processing without duplicated data stores.
Overall, the architecture demonstrates how Flink’s streaming capabilities combined with ClickHouse’s fast columnar engine can deliver sub‑second analytics at petabyte scale.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
