How to Diagnose and Fix Extreme ClickHouse Load Spikes in Production
A production ClickHouse cluster's dashboards suddenly went black as the load on one node soared above 2,700. This guide walks through step-by-step diagnostics using system tables, a simple query to spot heavy SQL, and practical remediation actions to restore normal load levels.
Problem Overview
During a first-stage acceptance test, the monitoring dashboard for a project went completely black, indicating that the backend could no longer supply data. The root cause was a massive CPU load on a ClickHouse (CK) server: the load average climbed above 2,700 on a 64-core machine, roughly 42 times the core count (a fully busy 64-core machine sits at a load of about 64).
The high load was not caused by I/O wait (the wa metric was near zero), so it was pure CPU consumption.
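As a rough check of that figure, the load can be divided by the core count. The sketch below is illustrative only: load_per_core is a hypothetical helper name, and it assumes the reported number is a Unix load average (runnable tasks), which matches the later "stabilized around 30" reading.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return the average number of runnable tasks competing for each core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


if __name__ == "__main__":
    # Figures taken from the incident described above.
    print(load_per_core(2700, 64))  # load during the spike
    print(load_per_core(30, 64))    # load after remediation

    # Live reading on the current host (Unix only): 1-minute load average.
    one_minute, _, _ = os.getloadavg()
    print(load_per_core(one_minute, os.cpu_count() or 1))
```

A value well above 1.0 with near-zero I/O wait suggests that tasks are queuing for CPU rather than for disk.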
Identifying the Culprit
Typical troubleshooting follows the chain: microservice → database → processing job → upstream data source. After confirming the microservice was healthy, the database tables were inspected and the overloaded CK node was identified.
ClickHouse provides two system tables that record query activity:
system.processes
system.query_log
Complex joins on these tables often run slowly on an already overloaded node, so the solution is to use a very simple query that directly lists the most resource‑intensive SQL statements without any sorting or additional joins.
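The article does not reproduce its query, so the following is only a sketch of what such a "very simple" query might look like. It scans system.processes with a single filter and a LIMIT, with no ORDER BY or JOIN. The helper names, the 10-second threshold, the row limit, and the localhost URL are all assumptions; the columns used (query_id, user, elapsed, read_rows, memory_usage, query) are standard columns of system.processes.

```python
import json
import urllib.request
from typing import Dict, List


def heavy_query_sql(min_elapsed_s: float = 10.0, limit: int = 20) -> str:
    """Build a flat scan of system.processes: one filter, no sorting, no joins."""
    return (
        "SELECT query_id, user, elapsed, read_rows, memory_usage, query "
        "FROM system.processes "
        f"WHERE elapsed > {float(min_elapsed_s)} "
        f"LIMIT {int(limit)} "
        "FORMAT JSONEachRow"
    )


def run_query(sql: str, url: str = "http://localhost:8123/") -> List[Dict]:
    """POST the statement to the ClickHouse HTTP interface; parse one JSON row per line.

    The URL is an assumption: 8123 is the default HTTP port, but the host and
    any authentication depend on the deployment.
    """
    request = urllib.request.Request(url, data=sql.encode("utf-8"), method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    for row in run_query(heavy_query_sql()):
        print(row["query_id"], row["user"], row["elapsed"], row["query"][:80])
```

Any ranking of the returned rows can then be done on the client, which keeps extra work off the overloaded node.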
Remediation Steps
Step 1: Identify the user who submitted the heavy SQL and review the query for inefficiencies such as missing indexes or unnecessary columns.
Step 2: Kill the offending queries using their query_id obtained from the simple query.
Step 3: If killed queries reappear due to scheduled jobs, temporarily disable the affected tables ("take them offline") to prevent immediate re‑execution.
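Steps 1 and 2 can be sketched as follows, assuming the rows have already been fetched from system.processes as dictionaries. The helper names and the sample rows are hypothetical. KILL QUERY WHERE query_id = '...' is ClickHouse's statement for terminating a running query, but whether the original author issued it in exactly this form is not stated in the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def memory_by_user(rows: Iterable[Dict]) -> List[Tuple[str, int]]:
    """Sum memory_usage per user on the client, largest first."""
    totals = defaultdict(int)
    for row in rows:
        # 64-bit integers may arrive as JSON strings, so convert explicitly.
        totals[row["user"]] += int(row["memory_usage"])
    # Sorting happens here, on the operator's machine, not on the busy server.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def kill_query_sql(query_id: str) -> str:
    """Build a KILL QUERY statement for one query_id, escaping backslashes and quotes."""
    escaped = query_id.replace("\\", "\\\\").replace("'", "\\'")
    return f"KILL QUERY WHERE query_id = '{escaped}'"


if __name__ == "__main__":
    # Hypothetical sample rows standing in for real system.processes output.
    sample = [
        {"query_id": "q-1", "user": "report_job", "memory_usage": "4096"},
        {"query_id": "q-2", "user": "analyst", "memory_usage": 1024},
        {"query_id": "q-3", "user": "report_job", "memory_usage": "2048"},
    ]
    print(memory_by_user(sample))
    for row in sample:
        if row["user"] == "report_job":
            print(kill_query_sql(row["query_id"]))
```

Printing the statements instead of executing them leaves a review step before anything is actually killed.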
Long‑Term Optimizations
Beyond the immediate fix, the article recommends three lasting measures:
Rewrite inefficient SQL, add appropriate indexes, and avoid selecting unnecessary columns.
Convert frequently accessed large tables to sharded tables so the load is distributed across multiple nodes.
Redistribute small “local” tables that were previously stored on a single server to multiple machines, preventing a single point of overload.
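One way to express the sharding measure in ClickHouse is a Distributed table placed over per-shard local tables. The sketch below only builds the DDL text. The function name, the "_all" suffix, the example cluster, database, and table names, and the rand() sharding key are all assumptions, and it presumes the local table already exists on every shard. The source does not say which engine or sharding key was actually used.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def distributed_table_ddl(
    cluster: str,
    database: str,
    local_table: str,
    sharding_key: str = "rand()",
) -> str:
    """Build DDL for a Distributed table that fans queries out to local tables."""
    for name in (cluster, database, local_table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"CREATE TABLE {database}.{local_table}_all ON CLUSTER {cluster} "
        f"AS {database}.{local_table} "
        f"ENGINE = Distributed({cluster}, {database}, {local_table}, {sharding_key})"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(distributed_table_ddl("my_cluster", "metrics", "events"))
```

Reads against the "_all" table are then spread across the shards instead of landing on a single server.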
After applying these steps, the server’s load stabilized around 30, returning to normal operation.
The case illustrates that while many database performance issues follow similar patterns, ClickHouse’s cluster architecture can introduce slightly higher operational overhead.
dbaplus Community
