How to Diagnose and Fix ClickHouse CPU Spikes in Minutes
This guide walks you through a step‑by‑step process for quickly identifying the cause of high CPU usage in ClickHouse, from emergency triage and precise diagnosis using system tables to practical optimization techniques and a ready‑to‑run monitoring script.
Emergency Triage – 3 Quick Steps
Step 1 – Confirm the Process
When ClickHouse CPU spikes, first verify that the clickhouse-server process itself is the one consuming CPU.
# Method 1: top for the ClickHouse process(es) — pgrep -d',' joins multiple PIDs for top
top -p $(pgrep -d',' clickhouse-server)
# Method 2: htop (install first)
htop -p $(pgrep -d',' clickhouse-server)
# Method 3: detailed process info
ps aux | grep clickhouse-server | grep -v grep
# Method 4: per‑CPU usage over 5 seconds
mpstat -P ALL 1 5
Normal: CPU < 70%
Warning: 70%–90%
Danger: > 90%
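The thresholds above can be captured in a small helper for scripting; `cpu_status` is a hypothetical function, not part of any ClickHouse tooling:

```shell
#!/bin/bash
# Classify an integer CPU-usage percentage (e.g. from top/mpstat)
# against the triage thresholds: <70 normal, 70–90 warning, >90 danger.
cpu_status() {
  local usage=$1
  if   [ "$usage" -lt 70 ]; then echo "Normal"
  elif [ "$usage" -le 90 ]; then echo "Warning"
  else                           echo "Danger"
  fi
}

cpu_status 45   # → Normal
cpu_status 85   # → Warning
cpu_status 95   # → Danger
```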
Step 2 – Check System Load
Compare system load with the number of CPU cores.
# Quick load check
uptime
# Detailed load averages (1, 5, 15 min)
cat /proc/loadavg
# Real‑time monitoring
watch -n 1 'uptime'
For a 16-core server, load < 16 is normal, 16–32 indicates light queuing, and > 32 is dangerous.
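The rule of thumb above (load relative to core count) can be sketched as a shell helper; `load_status` is a hypothetical name, not a standard tool:

```shell
#!/bin/bash
# Compare a 1-minute load average against the core count:
# load < cores: normal; load < 2x cores: light queuing; otherwise dangerous.
load_status() {
  local cores=$1 load=$2
  # awk handles the fractional load value
  awk -v c="$cores" -v l="$load" 'BEGIN {
    if      (l < c)     print "normal"
    else if (l < 2 * c) print "light queuing"
    else                print "dangerous"
  }'
}

load_status 16 9.50    # normal on a 16-core box
load_status 16 20.10   # light queuing
load_status 16 40.00   # dangerous
```

In practice the load value would come from `cut -d' ' -f1 /proc/loadavg` and the core count from `nproc`.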
Step 3 – Identify Source
Check concurrent queries and active background tasks.
# Concurrent queries count
clickhouse-client --query "SELECT count() AS concurrent_queries FROM system.processes WHERE query != ''"
# Active merges and mutations
clickhouse-client --query "SELECT 'Merges' AS task_type, count() AS active_count FROM system.merges UNION ALL SELECT 'Mutations' AS task_type, count() AS active_count FROM system.mutations WHERE is_done=0"
If concurrent queries exceed the number of CPU cores, query concurrency is the problem.
If merges > 15 or mutations > 3, background tasks are stealing CPU.
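The decision rules above can be sketched as one function. In practice the counts would come from clickhouse-client; here they are passed as arguments so the logic is testable offline, and `diagnose_source` is a hypothetical helper:

```shell
#!/bin/bash
# Turn the triage counts into a verdict:
# queries > cores -> concurrency problem; merges > 15 or mutations > 3 -> background tasks.
diagnose_source() {
  local queries=$1 cores=$2 merges=$3 mutations=$4
  if [ "$queries" -gt "$cores" ]; then
    echo "query concurrency"
  elif [ "$merges" -gt 15 ] || [ "$mutations" -gt 3 ]; then
    echo "background tasks"
  else
    echo "inconclusive"
  fi
}

diagnose_source 40 16 3 0    # → query concurrency
diagnose_source 8 16 20 0    # → background tasks
diagnose_source 8 16 3 0     # → inconclusive
```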
Precise Diagnosis – 4 Dimensions (covers ~90% of cases)
Dimension 1 – Running Queries
clickhouse-client --query "SELECT query_id, user, address, query, elapsed, read_rows, formatReadableSize(read_bytes) AS read_bytes, formatReadableSize(memory_usage) AS memory FROM system.processes WHERE query != '' ORDER BY elapsed DESC FORMAT Vertical"
Filter long-running queries (> 10 s):
clickhouse-client --query "SELECT query_id, user, query, elapsed, read_rows, formatReadableSize(read_bytes) AS read_bytes FROM system.processes WHERE query != '' AND elapsed > 10 ORDER BY elapsed DESC"
Kill a problematic query:
clickhouse-client --query "KILL QUERY WHERE query_id='YOUR_QUERY_ID'"
Dimension 2 – Historical Slow Queries
clickhouse-client --query "SELECT query_id, user, query, query_start_time, query_duration_ms, read_rows, formatReadableSize(read_bytes) AS read_bytes FROM system.query_log WHERE type = 2 AND query_duration_ms > 1000 ORDER BY query_duration_ms DESC LIMIT 100"
Group similar slow queries to find hot patterns:
clickhouse-client --query "SELECT normalizeQuery(query) AS query_pattern, count() AS query_count, sum(query_duration_ms) AS total_duration_ms, sum(read_rows) AS total_read_rows FROM system.query_log WHERE type = 2 AND event_time > now() - INTERVAL 24 HOUR AND query_duration_ms > 1000 GROUP BY query_pattern ORDER BY total_duration_ms DESC LIMIT 20"
Dimension 3 – Background Merge & Mutation Tasks
Active merges:
clickhouse-client --query "SELECT database, table, elapsed, progress, merge_type, formatReadableSize(total_bytes_to_merge) AS total_size FROM system.merges ORDER BY elapsed DESC FORMAT Vertical"
< 5 merges – normal
5–15 merges – warning
> 15 merges – dangerous
Stuck mutations:
clickhouse-client --query "SELECT database, table, mutation_id, command, create_time, latest_fail_reason FROM system.mutations WHERE is_done = 0 ORDER BY create_time DESC FORMAT Vertical"
Active mutation count: 0 is normal; > 3 requires attention.
Dimension 4 – Table Partitioning & Compression Settings
Check partition distribution (many small partitions or huge partitions increase CPU):
clickhouse-client --query "SELECT database, table, partition, count() AS parts_count, sum(rows) AS total_rows, formatReadableSize(sum(bytes_on_disk)) AS total_size FROM system.parts WHERE active = 1 GROUP BY database, table, partition ORDER BY sum(bytes_on_disk) DESC LIMIT 50"
Inspect compression algorithms and ratios:
clickhouse-client --query "SELECT database, table, default_compression_codec, round(sum(data_uncompressed_size) / sum(bytes_on_disk), 2) AS compression_ratio FROM system.parts WHERE active = 1 GROUP BY database, table, default_compression_codec ORDER BY sum(bytes_on_disk) DESC LIMIT 30"
LZ4 – fast, low CPU, moderate compression (good for query-heavy workloads).
ZSTD – high compression but CPU‑intensive; use low level (e.g., ZSTD(1)) only for storage‑heavy, query‑light workloads.
Optimization Solutions – 6 Practical Fixes
Solution 1 – Limit Query Concurrency
# Session-level limit (temporary): cap threads per query
# (max_concurrent_queries is server-wide and cannot be SET per session)
clickhouse-client --query "SET max_threads=8" # e.g. half the CPU cores
# Permanent change at the top level of config.xml (requires restart)
<max_concurrent_queries>16</max_concurrent_queries>
<max_thread_pool_size>16</max_thread_pool_size>
Set a query timeout to avoid long-running queries:
clickhouse-client --query "SET max_execution_time=60" # 60 s
Solution 2 – Optimize Slow Queries
Run EXPLAIN to detect full table scans.
Ensure query filters hit a prefix of the table's ORDER BY (primary) key.
Use PREWHERE to filter early.
Avoid SELECT *; select only needed columns.
Apply sensible LIMIT clauses.
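Two of the checklist items above (no `SELECT *`, always a `LIMIT`) are easy to lint before a query is run. A minimal sketch, where `lint_query` is a hypothetical helper and the regex checks are rough (a real linter would need a SQL parser):

```shell
#!/bin/bash
# Flag SELECT * and missing LIMIT in a query string.
lint_query() {
  local q=$1 warnings=0
  if echo "$q" | grep -qiE 'select[[:space:]]+\*'; then
    echo "warn: SELECT * reads every column"
    warnings=$((warnings + 1))
  fi
  if ! echo "$q" | grep -qiE '\blimit\b'; then
    echo "warn: no LIMIT clause"
    warnings=$((warnings + 1))
  fi
  if [ "$warnings" -eq 0 ]; then echo "ok"; fi
}

lint_query "SELECT * FROM events WHERE dt = today()"          # prints both warnings
lint_query "SELECT user_id FROM events WHERE dt = today() LIMIT 100"  # → ok
```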
Solution 3 – Adjust Merge Settings
Reduce merge aggressiveness during peak hours by editing config.xml (inside <merge_tree> section):
<merge_tree>
<max_bytes_to_merge_at_max_space_in_pool>16106127360</max_bytes_to_merge_at_max_space_in_pool> <!-- 15 GiB -->
<background_pool_size>16</background_pool_size> <!-- background merge threads -->
</merge_tree>
Never run OPTIMIZE TABLE during business peaks.
Solution 4 – Clean Stuck Mutation Tasks
# Kill a specific mutation
clickhouse-client --query "KILL MUTATION WHERE database='db' AND table='tbl' AND mutation_id='ID'"
Prefer partition-level deletes (DROP PARTITION) over row-level DELETE/UPDATE to reduce CPU load.
Solution 5 – Switch Compression to LZ4
-- Create a new table with per-column LZ4 codecs
CREATE TABLE new_tbl (col1 Type CODEC(LZ4), col2 Type CODEC(LZ4)) ENGINE = MergeTree() ORDER BY pk;
-- Alter an existing column (re-specify the type; applies to newly written parts)
ALTER TABLE old_tbl MODIFY COLUMN col Type CODEC(LZ4);
Solution 6 – Batch Data Ingestion
Combine multiple rows into a single INSERT to avoid many tiny partitions:
# Bad: many single‑row inserts
INSERT INTO tbl VALUES (1);
INSERT INTO tbl VALUES (2);
# Good: batch insert
INSERT INTO tbl VALUES (1), (2), (3);
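The batching pattern above can be mechanized when values arrive one at a time. A minimal sketch, where `build_batch_insert` is a hypothetical helper and the output would be piped to clickhouse-client in real use:

```shell
#!/bin/bash
# Collapse individual values into a single multi-row INSERT statement.
build_batch_insert() {
  local table=$1; shift
  local values=""
  for v in "$@"; do
    # append "(v)", comma-separating after the first value
    values="${values:+$values, }($v)"
  done
  echo "INSERT INTO $table VALUES $values;"
}

build_batch_insert tbl 1 2 3
# → INSERT INTO tbl VALUES (1), (2), (3);
```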
# Reduce insert thread count and enable asynchronous inserts when appropriate
# (there is no INSERT ... ASYNC syntax; async inserts are enabled via a setting)
clickhouse-client --query "SET max_insert_threads=4"
clickhouse-client --query "SET async_insert=1" # server buffers and batches small inserts
Long-Term Monitoring – One-Click Script
The following Bash script aggregates all the checks above and can be scheduled via cron to provide a periodic health report.
#!/bin/bash
CLICKHOUSE_CLIENT="clickhouse-client"
# Verify clickhouse-client availability
if ! command -v $CLICKHOUSE_CLIENT &>/dev/null; then
echo "⚠️ clickhouse-client not found – monitoring disabled"
exit 1
fi
echo "=================================================="
echo "📊 ClickHouse CPU monitoring report $(date '+%Y-%m-%d %H:%M:%S')"
echo "=================================================="
# 1. System CPU usage
echo "1. System CPU usage:"
top -bn1 | grep "Cpu(s)" | awk '{print " User: "$2"% | System: "$4"% | Idle: "$8"%"}'
echo
# 2. ClickHouse process resource usage
echo "2. ClickHouse process resource usage:"
CH_PID=$(pgrep clickhouse-server | head -1)
if [ -n "$CH_PID" ]; then
top -bn1 -p $CH_PID | tail -1 | awk '{print " PID: "$1" | CPU: "$9"% | MEM: "$10"%"}'
else
echo " ❌ ClickHouse process not running"
fi
echo
# 3. Current concurrent queries
echo "3. Current concurrent queries:"
$CLICKHOUSE_CLIENT --query "SELECT count() AS concurrent_queries FROM system.processes WHERE query != ''"
echo
# 4. Active background tasks (merges & mutations)
echo "4. Active background tasks:"
$CLICKHOUSE_CLIENT --query "SELECT 'Merges' AS task_type, count() AS active_count FROM system.merges UNION ALL SELECT 'Mutations' AS task_type, count() AS active_count FROM system.mutations WHERE is_done=0"
echo
# 5. Slow queries in the last hour (>1 s)
echo "5. Slow queries in the last hour (>1 s):"
$CLICKHOUSE_CLIENT --query "SELECT count() AS slow_queries FROM system.query_log WHERE type=2 AND event_time > now() - INTERVAL 1 HOUR AND query_duration_ms > 1000"
echo "=================================================="
echo "📋 Monitoring finished"
echo "=================================================="
Make the script executable and add a cron entry, e.g.:
* * * * * /path/to/clickhouse_cpu_monitor.sh >> /var/log/clickhouse_cpu_monitor.log 2>&1
Key Takeaways
When CPU spikes, first "stop the bleed" by killing heavy queries or limiting concurrency.
~90% of issues stem from query patterns; optimize SQL before tweaking server configs.
Deploy continuous monitoring to catch problems early rather than firefighting.
Xiao Liu Lab
An operations lab passionate about server tinkering 🔬 Sharing automation scripts, high-availability architecture, alert optimization, and incident reviews. Using technology to reduce overtime and experience to avoid major pitfalls. Follow me for easier, more reliable operations!