How to Diagnose and Fix ClickHouse CPU Spikes in Minutes
This guide walks you through a step‑by‑step process for quickly identifying the cause of high CPU usage in ClickHouse, from emergency triage and precise diagnosis using system tables to practical optimization techniques and a ready‑to‑run monitoring script.
Emergency Triage – 3 Quick Steps
Step 1 – Confirm the Process
When ClickHouse CPU spikes, first verify that the clickhouse-server process itself is the one consuming CPU.
# Method 1: top for the ClickHouse process(es) — pgrep -d',' joins multiple PIDs for top
top -p $(pgrep -d',' clickhouse-server)
# Method 2: htop (install first)
htop -p $(pgrep -d',' clickhouse-server)
# Method 3: detailed process info
ps aux | grep clickhouse-server | grep -v grep
# Method 4: per‑CPU usage over 5 seconds
mpstat -P ALL 1 5
Normal: CPU < 70%
Warning: 70%–90%
Danger: > 90%
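The thresholds above can be captured in a small helper for scripting; `cpu_status` is a hypothetical function, not part of any ClickHouse tooling:

```shell
#!/bin/bash
# Classify an integer CPU-usage percentage (e.g. from top/mpstat)
# against the triage thresholds: <70 normal, 70–90 warning, >90 danger.
cpu_status() {
  local usage=$1
  if   [ "$usage" -lt 70 ]; then echo "Normal"
  elif [ "$usage" -le 90 ]; then echo "Warning"
  else                           echo "Danger"
  fi
}

cpu_status 45   # → Normal
cpu_status 85   # → Warning
cpu_status 95   # → Danger
```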
Step 2 – Check System Load
Compare system load with the number of CPU cores.
# Quick load check
uptime
# Detailed load averages (1, 5, 15 min)
cat /proc/loadavg
# Real‑time monitoring
watch -n 1 'uptime'
For a 16-core server, load < 16 is normal, 16–32 indicates light queuing, and > 32 is dangerous.
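The rule of thumb above (load relative to core count) can be sketched as a shell helper; `load_status` is a hypothetical name, not a standard tool:

```shell
#!/bin/bash
# Compare a 1-minute load average against the core count:
# load < cores: normal; load < 2x cores: light queuing; otherwise dangerous.
load_status() {
  local cores=$1 load=$2
  # awk handles the fractional load value
  awk -v c="$cores" -v l="$load" 'BEGIN {
    if      (l < c)     print "normal"
    else if (l < 2 * c) print "light queuing"
    else                print "dangerous"
  }'
}

load_status 16 9.50    # normal on a 16-core box
load_status 16 20.10   # light queuing
load_status 16 40.00   # dangerous
```

In practice the load value would come from `cut -d' ' -f1 /proc/loadavg` and the core count from `nproc`.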
Step 3 – Identify Source
Check concurrent queries and active background tasks.
# Concurrent queries count
clickhouse-client --query "SELECT count() AS concurrent_queries FROM system.processes WHERE query != ''"
# Active merges and mutations
clickhouse-client --query "SELECT 'Merges' AS task_type, count() AS active_count FROM system.merges UNION ALL SELECT 'Mutations' AS task_type, count() AS active_count FROM system.mutations WHERE is_done=0"
If concurrent queries exceed the number of CPU cores, query concurrency is the problem.
If merges > 15 or mutations > 3, background tasks are stealing CPU.
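The decision rules above can be sketched as one function. In practice the counts would come from clickhouse-client; here they are passed as arguments so the logic is testable offline, and `diagnose_source` is a hypothetical helper:

```shell
#!/bin/bash
# Turn the triage counts into a verdict:
# queries > cores -> concurrency problem; merges > 15 or mutations > 3 -> background tasks.
diagnose_source() {
  local queries=$1 cores=$2 merges=$3 mutations=$4
  if [ "$queries" -gt "$cores" ]; then
    echo "query concurrency"
  elif [ "$merges" -gt 15 ] || [ "$mutations" -gt 3 ]; then
    echo "background tasks"
  else
    echo "inconclusive"
  fi
}

diagnose_source 40 16 3 0    # → query concurrency
diagnose_source 8 16 20 0    # → background tasks
diagnose_source 8 16 3 0     # → inconclusive
```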
Precise Diagnosis – 4 Dimensions (covers ~90% of cases)
Dimension 1 – Running Queries
clickhouse-client --query "SELECT query_id, user, address, query, elapsed, read_rows, formatReadableSize(read_bytes) AS read_bytes, formatReadableSize(memory_usage) AS memory FROM system.processes WHERE query != '' ORDER BY elapsed DESC FORMAT Vertical"
Filter long-running queries (> 10 s):
clickhouse-client --query "SELECT query_id, user, query, elapsed, read_rows, formatReadableSize(read_bytes) AS read_bytes FROM system.processes WHERE query != '' AND elapsed > 10 ORDER BY elapsed DESC"
Kill a problematic query:
clickhouse-client --query "KILL QUERY WHERE query_id='YOUR_QUERY_ID'"
Dimension 2 – Historical Slow Queries
clickhouse-client --query "SELECT query_id, user, query, query_start_time, query_duration_ms, read_rows, formatReadableSize(read_bytes) AS read_bytes FROM system.query_log WHERE type = 2 AND query_duration_ms > 1000 ORDER BY query_duration_ms DESC LIMIT 100"
Group similar slow queries to find hot patterns:
clickhouse-client --query "SELECT normalizeQuery(query) AS query_pattern, count() AS query_count, sum(query_duration_ms) AS total_duration_ms, sum(read_rows) AS total_read_rows FROM system.query_log WHERE type = 2 AND event_time > now() - INTERVAL 24 HOUR AND query_duration_ms > 1000 GROUP BY query_pattern ORDER BY total_duration_ms DESC LIMIT 20"
Dimension 3 – Background Merge & Mutation Tasks
Active merges:
clickhouse-client --query "SELECT database, table, elapsed, progress, merge_type, formatReadableSize(total_bytes_to_merge) AS total_size FROM system.merges ORDER BY elapsed DESC FORMAT Vertical"
< 5 merges – normal
5–15 merges – warning
> 15 merges – dangerous
Stuck mutations:
clickhouse-client --query "SELECT database, table, mutation_id, command, create_time, latest_fail_reason FROM system.mutations WHERE is_done = 0 ORDER BY create_time DESC FORMAT Vertical"
Active mutation count: 0 is normal; > 3 requires attention.
Dimension 4 – Table Partitioning & Compression Settings
Check partition distribution (many small partitions or huge partitions increase CPU):
clickhouse-client --query "SELECT database, table, partition, count() AS parts_count, sum(rows) AS total_rows, formatReadableSize(sum(bytes_on_disk)) AS total_size FROM system.parts WHERE active = 1 GROUP BY database, table, partition ORDER BY sum(bytes_on_disk) DESC LIMIT 50"
Inspect compression algorithms and ratios:
clickhouse-client --query "SELECT database, table, default_compression_codec, round(sum(data_uncompressed_size) / sum(bytes_on_disk), 2) AS compression_ratio FROM system.parts WHERE active = 1 GROUP BY database, table, default_compression_codec ORDER BY sum(bytes_on_disk) DESC LIMIT 30"
LZ4 – fast, low CPU, moderate compression (good for query-heavy workloads).
ZSTD – high compression but CPU‑intensive; use low level (e.g., ZSTD(1)) only for storage‑heavy, query‑light workloads.
Optimization Solutions – 6 Practical Fixes
Solution 1 – Limit Query Concurrency
# Session-level limit (temporary): cap threads per query
# (max_concurrent_queries is server-wide and cannot be SET per session)
clickhouse-client --query "SET max_threads=8" # e.g. half the CPU cores
# Permanent change at the top level of config.xml (requires restart)
<max_concurrent_queries>16</max_concurrent_queries>
<max_thread_pool_size>16</max_thread_pool_size>
Set a query timeout to avoid long-running queries:
clickhouse-client --query "SET max_execution_time=60" # 60 s
Solution 2 – Optimize Slow Queries
Run EXPLAIN to detect full table scans.
Ensure query filters hit a prefix of the table's ORDER BY (primary) key.
Use PREWHERE to filter early.
Avoid SELECT *; select only needed columns.
Apply sensible LIMIT clauses.
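Two of the checklist items above (no `SELECT *`, always a `LIMIT`) are easy to lint before a query is run. A minimal sketch, where `lint_query` is a hypothetical helper and the regex checks are rough (a real linter would need a SQL parser):

```shell
#!/bin/bash
# Flag SELECT * and missing LIMIT in a query string.
lint_query() {
  local q=$1 warnings=0
  if echo "$q" | grep -qiE 'select[[:space:]]+\*'; then
    echo "warn: SELECT * reads every column"
    warnings=$((warnings + 1))
  fi
  if ! echo "$q" | grep -qiE '\blimit\b'; then
    echo "warn: no LIMIT clause"
    warnings=$((warnings + 1))
  fi
  if [ "$warnings" -eq 0 ]; then echo "ok"; fi
}

lint_query "SELECT * FROM events WHERE dt = today()"          # prints both warnings
lint_query "SELECT user_id FROM events WHERE dt = today() LIMIT 100"  # → ok
```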
Solution 3 – Adjust Merge Settings
Reduce merge aggressiveness during peak hours by editing config.xml (inside <merge_tree> section):
<merge_tree>
<max_bytes_to_merge_at_max_space_in_pool>16106127360</max_bytes_to_merge_at_max_space_in_pool> <!-- 15 GiB -->
<background_pool_size>16</background_pool_size> <!-- background merge threads -->
</merge_tree>
Never run OPTIMIZE TABLE during business peaks.
Solution 4 – Clean Stuck Mutation Tasks
# Kill a specific mutation
clickhouse-client --query "KILL MUTATION WHERE database='db' AND table='tbl' AND mutation_id='ID'"
Prefer partition-level deletes (DROP PARTITION) over row-level DELETE/UPDATE to reduce CPU load.
Solution 5 – Switch Compression to LZ4
-- Create a new table with per-column LZ4 codecs
CREATE TABLE new_tbl (col1 Type CODEC(LZ4), col2 Type CODEC(LZ4)) ENGINE = MergeTree() ORDER BY pk;
-- Alter an existing column (re-specify the type; applies to newly written parts)
ALTER TABLE old_tbl MODIFY COLUMN col Type CODEC(LZ4);
Solution 6 – Batch Data Ingestion
Combine multiple rows into a single INSERT to avoid many tiny partitions:
# Bad: many single‑row inserts
INSERT INTO tbl VALUES (1);
INSERT INTO tbl VALUES (2);
# Good: batch insert
INSERT INTO tbl VALUES (1), (2), (3);
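The batching pattern above can be mechanized when values arrive one at a time. A minimal sketch, where `build_batch_insert` is a hypothetical helper and the output would be piped to clickhouse-client in real use:

```shell
#!/bin/bash
# Collapse individual values into a single multi-row INSERT statement.
build_batch_insert() {
  local table=$1; shift
  local values=""
  for v in "$@"; do
    # append "(v)", comma-separating after the first value
    values="${values:+$values, }($v)"
  done
  echo "INSERT INTO $table VALUES $values;"
}

build_batch_insert tbl 1 2 3
# → INSERT INTO tbl VALUES (1), (2), (3);
```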
# Reduce insert thread count and enable asynchronous inserts when appropriate
# (there is no INSERT ... ASYNC syntax; async inserts are enabled via a setting)
clickhouse-client --query "SET max_insert_threads=4"
clickhouse-client --query "SET async_insert=1" # server buffers and batches small inserts
Long-Term Monitoring – One-Click Script
The following Bash script aggregates all the checks above and can be scheduled via cron to provide a periodic health report.
#!/bin/bash
CLICKHOUSE_CLIENT="clickhouse-client"
# Verify clickhouse-client availability
if ! command -v $CLICKHOUSE_CLIENT &>/dev/null; then
echo "⚠️ clickhouse-client not found – monitoring disabled"
exit 1
fi
echo "=================================================="
echo "📊 ClickHouse CPU monitoring report $(date '+%Y-%m-%d %H:%M:%S')"
echo "=================================================="
# 1. System CPU usage
echo "1. System CPU usage:"
top -bn1 | grep "Cpu(s)" | awk '{print " User: "$2"% | System: "$4"% | Idle: "$8"%"}'
echo
# 2. ClickHouse process resource usage
echo "2. ClickHouse process resource usage:"
CH_PID=$(pgrep clickhouse-server | head -1)
if [ -n "$CH_PID" ]; then
top -bn1 -p $CH_PID | tail -1 | awk '{print " PID: "$1" | CPU: "$9"% | MEM: "$10"%"}'
else
echo " ❌ ClickHouse process not running"
fi
echo
# 3. Current concurrent queries
echo "3. Current concurrent queries:"
$CLICKHOUSE_CLIENT --query "SELECT count() AS concurrent_queries FROM system.processes WHERE query != ''"
echo
# 4. Active background tasks (merges & mutations)
echo "4. Active background tasks:"
$CLICKHOUSE_CLIENT --query "SELECT 'Merges' AS task_type, count() AS active_count FROM system.merges UNION ALL SELECT 'Mutations' AS task_type, count() AS active_count FROM system.mutations WHERE is_done=0"
echo
# 5. Slow queries in the last hour (>1 s)
echo "5. Slow queries in the last hour (>1 s):"
$CLICKHOUSE_CLIENT --query "SELECT count() AS slow_queries FROM system.query_log WHERE type=2 AND event_time > now() - INTERVAL 1 HOUR AND query_duration_ms > 1000"
echo "=================================================="
echo "📋 Monitoring finished"
echo "=================================================="
Make the script executable and add a cron entry, e.g.:
* * * * * /path/to/clickhouse_cpu_monitor.sh >> /var/log/clickhouse_cpu_monitor.log 2>&1
Key Takeaways
When CPU spikes, first "stop the bleed" by killing heavy queries or limiting concurrency.
~90% of issues stem from query patterns; optimize SQL before tweaking server configs.
Deploy continuous monitoring to catch problems early rather than firefighting.
Xiao Liu Lab
An operations lab passionate about server tinkering 🔬 Sharing automation scripts, high-availability architecture, alert optimization, and incident reviews. Using technology to reduce overtime and experience to avoid major pitfalls. Follow me for easier, more reliable operations!