Real-time Data Processing at QuTouTiao: Flink + ClickHouse Architecture and Practices
QuTouTiao combines Flink and ClickHouse into a real-time analytics platform built around two pipelines: an hourly Flink-to-Hive pipeline and a sub-second Flink-to-ClickHouse pipeline. The platform relies on streaming ingestion, exactly-once semantics, multi-cluster coordination, and tuned ClickHouse storage and connector designs.
QuTouTiao uses large-scale data analysis to drive business development. The Flink + ClickHouse solution serves real-time reporting, ad-hoc queries, event analysis, and funnel and retention analytics, with 80% of queries answered within one second.
The presentation covers four main topics: business scenario analysis, Flink‑to‑Hive hourly pipelines, Flink‑to‑ClickHouse sub‑second pipelines, and future plans.
Hourly Flink-to-Hive pipeline: Binlog and log-server data are streamed into Kafka, and Flink then writes the data to HDFS. A checkpoint-driven process tracks event-time progress and creates the corresponding Hive partitions (hourly, half-hourly, or minute-level). Records are routed to partitions by a custom bucket assigner, and StreamingFileSink provides exactly-once semantics.
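The bucket-assignment step can be illustrated outside Flink. The following is a minimal Python sketch, not the talk's Java implementation: the function name `partition_path`, the `dt=/hour=/minute=` layout, and the UTC+8 default are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def partition_path(event_time_ms: int,
                   granularity_minutes: int = 60,
                   tz_offset_hours: int = 8) -> str:
    """Map an event timestamp (epoch millis) to a Hive-style partition path.

    Hypothetical layout: dt=YYYY-MM-DD/hour=HH[/minute=MM].
    """
    if granularity_minutes < 1 or 60 % granularity_minutes != 0:
        raise ValueError("granularity must divide 60 evenly")
    tz = timezone(timedelta(hours=tz_offset_hours))
    ts = datetime.fromtimestamp(event_time_ms / 1000, tz=tz)
    path = f"dt={ts:%Y-%m-%d}/hour={ts:%H}"
    if granularity_minutes < 60:
        # Floor the minute to the start of its bucket (e.g. 47 -> 30 for 30-minute buckets).
        floored = (ts.minute // granularity_minutes) * granularity_minutes
        path += f"/minute={floored:02d}"
    return path
```

Because the path is derived from event time rather than processing time, a late record still lands in the partition it belongs to.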
The StreamingFileSink is configured with forBulkFormat (Avro/Parquet), withBucketAssigner, and OnCheckpointRollingPolicy. It implements a two-phase commit protocol through CheckpointedFunction and CheckpointListener to keep the written data consistent.
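The two-phase commit idea can be shown with a small in-memory model. This is a simplified Python sketch and not Flink's actual code: the class `TwoPhaseFileSink` and its fields are hypothetical, and the method names only loosely mirror the snapshot and checkpoint-complete callbacks of CheckpointedFunction and CheckpointListener.

```python
class TwoPhaseFileSink:
    """Toy model: data becomes visible only after its checkpoint completes."""

    def __init__(self):
        self.in_progress = []  # records written since the last checkpoint
        self.pending = {}      # checkpoint id -> records awaiting commit
        self.committed = []    # records visible to downstream readers

    def write(self, record):
        self.in_progress.append(record)

    def snapshot_state(self, checkpoint_id):
        # Phase 1 (pre-commit): roll the in-progress data into a pending unit.
        if self.in_progress:
            self.pending[checkpoint_id] = self.in_progress
            self.in_progress = []

    def notify_checkpoint_complete(self, checkpoint_id):
        # Phase 2 (commit): publish every pending unit up to this checkpoint.
        for cid in sorted(c for c in self.pending if c <= checkpoint_id):
            self.committed.extend(self.pending.pop(cid))

    def recover(self):
        # After a failure, uncheckpointed data is dropped; the source replays it.
        self.in_progress = []
```

In this model a record written after the last checkpoint is never visible twice: it is discarded on recovery and only committed once the replayed copy passes through a completed checkpoint.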
Sub-second Flink-to-ClickHouse pipeline: Flink computes real-time metrics and writes them to ClickHouse, supporting five-minute or even one-minute windows for dashboards, ad-hoc queries, and user profiling. High throughput comes from ClickHouse's columnar storage, LZ4/ZSTD compression, vectorized execution, LSM-style MergeTree engine, and SIMD/LLVM optimizations, together with connector data sources that distribute connections across nodes (RoundRobinClickHouseDataSource, BalancedClickHouseDataSource).
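The five-minute and one-minute windows mentioned above are, in the simplest case, tumbling windows keyed by a metric. Here is a rough Python sketch of that aggregation. It stands in for what a Flink window operator would compute; the function name and the `(timestamp_ms, key)` event shape are assumptions.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_minutes=5):
    """Count events per (window start, key) for fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_ms, key) pairs (hypothetical shape).
    """
    window_ms = window_minutes * 60_000
    counts = defaultdict(int)
    for ts_ms, key in events:
        # Align each timestamp to the start of the window that contains it.
        window_start = ts_ms - ts_ms % window_ms
        counts[(window_start, key)] += 1
    return dict(counts)
```

Each resulting `(window_start, key, count)` triple corresponds to one row that would be written to ClickHouse for a dashboard to query.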
ClickHouse is fast mainly because of columnar storage with compression, data kept local to each node, vectorized execution, LSM-style MergeTree indexing, and SIMD/LLVM optimizations. On the write path, the system writes to local tables first, inserts in batches, and relies on periodic merges to avoid small-file issues.
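Batch inserts are typically driven by a row-count threshold and a time threshold. The sketch below shows one way to buffer rows before handing them to an insert function. The class `BatchingWriter`, its parameters, and the default thresholds are illustrative assumptions, not values from the talk.

```python
import time


class BatchingWriter:
    """Buffer rows and flush them in batches to limit the number of inserts."""

    def __init__(self, flush_fn, max_rows=10_000, max_wait_seconds=5.0,
                 clock=time.monotonic):
        self.flush_fn = flush_fn          # e.g. a function issuing one bulk INSERT
        self.max_rows = max_rows
        self.max_wait_seconds = max_wait_seconds
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def add(self, row):
        self.buffer.append(row)
        too_many = len(self.buffer) >= self.max_rows
        too_old = self.clock() - self.last_flush >= self.max_wait_seconds
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []              # start a fresh list; the batch is handed off
        self.last_flush = self.clock()
```

Note that the time threshold is only checked when a row arrives. In a Flink sink, `flush()` would also be called on every checkpoint so that no buffered rows are lost on failure.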
Multi-cluster nameservice handling keeps the real-time and offline clusters separate: properties in the HDFS XML configuration are marked with the final tag so that they cannot be overridden, which avoids cross-cluster interference. Multi-user write permissions are managed through a custom withBucketUser API that sets the Hadoop UGI (UserGroupInformation) for each write.
Backfill and fault‑tolerance mechanisms include hourly Flink and ClickHouse checkpointing, Hive as a fallback store, and a round‑robin ClickHouse datasource with connection testing and idle‑connection cleaning.
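A round-robin data source with connection testing can be sketched as follows. This is an illustrative Python model, not the RoundRobinClickHouseDataSource from the talk: the class name `RoundRobinDataSource`, the `is_alive` callback, and the host names are assumptions.

```python
import itertools


class RoundRobinDataSource:
    """Hand out hosts in rotation, skipping any that fail a health check."""

    def __init__(self, hosts, is_alive):
        self.hosts = list(hosts)
        self.is_alive = is_alive          # callable: host -> bool (e.g. a ping query)
        self._cycle = itertools.cycle(self.hosts)

    def next_host(self):
        # Try each host at most once per call before giving up.
        for _ in range(len(self.hosts)):
            host = next(self._cycle)
            if self.is_alive(host):
                return host
        raise RuntimeError("no healthy ClickHouse host available")


down = {"ch2"}
ds = RoundRobinDataSource(["ch1", "ch2", "ch3"], lambda h: h not in down)
picked = [ds.next_host() for _ in range(4)]  # ch2 is skipped while it is down
```

Testing the host at selection time keeps writes away from failed nodes. A production version would also cache health results and clean up idle connections, as the talk describes.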
Future directions focus on SQL‑based connectors (Flink‑to‑Hive, Flink‑to‑ClickHouse), Delta‑Lake‑style unified storage for stream‑batch processing, and further integration of KV stores (HBase, Kudu, Redis) for real‑time enrichment.
The speaker, Wang Jinhai, has extensive data-platform experience from companies such as Vipshop and Ele.me, and now leads QuTouTiao's data center platform.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.