Using ClickHouse for High‑Performance Keyword Hit Statistics
This article presents the background and challenges of large‑scale keyword hit statistics, explains why traditional MySQL solutions struggle, and details how ClickHouse’s columnar storage, vectorized execution, and distributed architecture provide fast, scalable analytics, including cluster setup, table schema, queries, and migration lessons.
1. Background
1.1 Business Requirement
Audit systems need daily statistics of keyword hit counts, with filtering and sorting based on trigger times and deletion rates.
1.2 Problem Difficulty
Keyword table contains ~700,000 rows; daily statistics generate ~21,000,000 rows.
Deletion rate = delete_times / trigger_times, requiring aggregation of all trigger and delete events before filtering and sorting.
1.2.1 Traditional MySQL Solution
Data is sharded by month; each shard holds up to 21 million rows. Queries must identify relevant shards, filter, aggregate, sort, and merge results, which is costly.
1.2.2 MySQL‑Java Solution
Aggregation and sorting steps are moved to JVM memory to avoid overloading MySQL nodes, but indexing offers little benefit for the required filters.
1.2.3 Issues with Traditional Approaches
Flexible filter conditions lead to many indexes, hurting write performance.
Aggregated data is filtered and sorted, making indexes ineffective.
Large data volume overwhelms the database and degrades concurrency.
2. ClickHouse Solution
2.1 ClickHouse Overview
ClickHouse is an open‑source high‑performance analytical SQL database (OLAP) developed by Yandex in 2016. It stores data column‑wise, uses vectorized execution, and provides fast compression and high‑throughput queries.
Key advantages:
Columnar storage with vectorized processing for efficient CPU usage.
High compression ratios (10×+), reducing I/O and improving cache effectiveness.
Non‑B‑tree indexes that do not require left‑most prefix.
Very fast writes (50‑200 M rows/s).
2.2 How ClickHouse Solves the Problems
2.2.1 Columnar Storage – Flexible Filters & Fast Response
Only columns needed for a query are read, dramatically lowering I/O. Same‑type column data compresses well, yielding small on‑disk sizes and better cache utilization.
2.2.2 Multi‑Core Parallelism & Vectorized Execution
Data is partitioned and processed concurrently across CPU cores. Vectorized execution runs SIMD instructions on batches of column values, reducing function‑call overhead and cache misses.
2.2.3 Data Sharding & Distributed Computing
ClickHouse distributes data across shards and replicas. Queries are automatically split into tasks and executed in parallel on multiple nodes, then results are merged.
2.2.4 OLAP‑Oriented Technology Selection
Compared with TiDB and DorisDB, ClickHouse offers superior query performance for large‑scale aggregation, better support for array‑type columns, and simpler update handling for the keyword use‑case.
3. Using ClickHouse for Keyword Hit Statistics
3.1 Cluster Creation
A cluster named cluster_3shards_2replicas with 6 nodes (3 shards, 2 replicas) is deployed across three physical machines.
3.2 Table Design
Two tables are created: a keyword metadata table and a daily hit‑statistics table.
3.2.1 Keyword Table DDL
create table monitor_banword_local on cluster cluster_3shards_2replicas (
id Int32 COMMENT '关键词ID',
type UInt8 COMMENT '关键词类型',
name String COMMENT '关键词',
product Array(UInt8) COMMENT '适用产品',
parent_id UInt8 COMMENT '级别',
is_core UInt8 COMMENT '是否核心词:0:否;1:是',
status UInt8
) ENGINE = MergeTree
PARTITION BY type
ORDER BY type3.2.2 Hit Statistics Table DDL
CREATE TABLE banword_hit_statistics_mt_2021_local on cluster cluster_3shards_2replicas (
id Int32 COMMENT '统计ID',
hit_banword_id Int32 COMMENT '命中关键词ID',
trigger_times Int32 COMMENT '触发次数',
delete_times Int32 COMMENT '删除次数',
delete_rate Decimal(9,6) COMMENT '删除率',
snapshot_date Date COMMENT '数据快照时间',
update_time Date COMMENT '更新时间'
) ENGINE = MergeTree
PARTITION BY toYYYYMM(snapshot_date)
ORDER BY (snapshot_date, hit_banword_id)3.2.3 Distributed Views
CREATE TABLE monitor_banword_all ON CLUSTER cluster_3shards_2replicas (
id Int32 COMMENT '关键词ID',
type UInt8 COMMENT '关键词类型',
name String COMMENT '关键词',
product Array(UInt8) COMMENT '适用产品',
parent_id UInt8 COMMENT '级别',
is_core UInt8 COMMENT '是否核心词:0:否;1:是',
status UInt8
) ENGINE = Distributed(cluster_3shards_2replicas, default, monitor_banword_local, intHash64(type));
CREATE TABLE banword_hit_statistics_mt_2021_all ON CLUSTER cluster_3shards_2replicas (
id Int32 COMMENT '统计ID',
hit_banword_id Int32 COMMENT '命中关键词ID',
trigger_times Int32 COMMENT '触发次数',
delete_times Int32 COMMENT '删除次数',
delete_rate Decimal(9,6) COMMENT '删除率',
snapshot_date Date COMMENT '数据快照时间',
update_time Date COMMENT '更新时间'
) ENGINE = Distributed(cluster_3shards_2replicas, default, banword_hit_statistics_mt_2021_local, toMonth(snapshot_date));3.2.4 Sample Query
select
hit_banword_id,
sum(trigger_times) as triggerTimes,
sum(delete_times) as deleteTimes,
if(deleteTimes>0, divide(deleteTimes, triggerTimes), 0) as deleteRate
from default.banword_hit_statistics_mt_2021_all
where hit_banword_id GLOBAL IN (
select id from monitor_banword_all where parent_id = 2
)
and snapshot_date between toDate('2022-01-01') and toDate('2022-07-01')
group by hit_banword_id
having triggerTimes > 1
order by deleteRate desc
limit 5000;The query processes 120 million rows in about 1.8 seconds, demonstrating ClickHouse’s speed.
3.3 Migration Pitfalls
Too many partitions per INSERT block – increase max_partitions_per_insert_block or set it to 0.
SQL syntax nuances: use backticks for column names, specify the database explicitly, and use ALTER … ON CLUSTER … DELETE/UPDATE for distributed tables.
3.4 Comparison with Traditional Solutions
MySQL
MySQL‑Java
ClickHouse
1 day (0.7 M rows)
5 s
12 min
350 ms
1 month (21 M rows)
2 min
3.7 h
<2 s
6 months (120 M rows)
>20 min
>12 h
<2 s
ClickHouse consistently delivers sub‑second response times even on massive datasets, while MySQL‑based approaches become impractically slow.
4 Summary & Outlook
ClickHouse’s columnar storage, high compression, vectorized execution, and distributed architecture make it ideal for large‑scale statistical workloads. It offers simple JDBC connectivity and rich SQL support, though it lacks strong concurrency for mixed workloads and does not provide transactions.
Future work includes extending the approach to other structured data analytics such as AI‑generated text detection statistics and historical audit record analysis.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.