How to Diagnose and Fix Extreme ClickHouse Load Spikes in Production

A production ClickHouse cluster's dashboards suddenly went black when CPU load soared above 2,700%. This guide walks through step-by-step diagnostics using system tables, a simple query to spot heavy SQL, and practical remediation actions to restore normal load levels.

dbaplus Community

Problem Overview

During a first-stage acceptance test, the monitoring dashboard for a project went completely black, indicating that the backend could no longer supply data. The root cause was a massive CPU load on a ClickHouse (CK) server: the load metric jumped to over 2,700% on a 64-core machine. Read as a load average of about 2,700, which is consistent with the value of around 30 seen after the fix, that is roughly 42 times the core count.

The high load was not caused by I/O wait (the wa metric was near zero), so it was pure CPU consumption.
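To put the severity in perspective, a load figure can be normalized by the number of cores. The sketch below is only an illustration: it assumes the reported figure is a load average of about 2,700, and the helper name `load_per_core` is hypothetical.

```python
import os


def load_per_core(load_average: float, cores: int) -> float:
    """Return the load average divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


# Figures from the incident, assuming "2,700" is a load average on 64 cores.
incident_ratio = load_per_core(2700, 64)
print(f"incident load per core: {incident_ratio:.1f}")

# On a Unix host, compute the same ratio for the live 1-minute load average.
try:
    one_minute, _, _ = os.getloadavg()
    live_ratio = load_per_core(one_minute, os.cpu_count() or 1)
    print(f"current load per core: {live_ratio:.2f}")
except (AttributeError, OSError):
    # os.getloadavg is unavailable on some platforms (for example Windows).
    print("load average not available on this platform")
```

A ratio near or below 1 means the cores are keeping up. A ratio in the tens, as in this incident, means far more tasks are runnable than the machine can service.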

Identifying the Culprit

Typical troubleshooting follows the chain: microservice → database → processing job → upstream data source. After confirming the microservice was healthy, the database tables were inspected and the overloaded CK node was identified.

ClickHouse provides two system tables that record query activity:

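system.processes, which lists the queries that are currently executing

system.query_log, which records the history of executed queries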

Complex joins on these tables often run slowly on an already overloaded node. The better approach is a very simple query that lists the running SQL statements together with their resource usage, without any sorting or additional joins, so that finding the heaviest statements adds almost no work to the server.
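The source does not reproduce the query itself, so the following is only a sketch of what such a query might look like. It assumes the standard `system.processes` columns `query_id`, `user`, `elapsed`, `read_rows`, `memory_usage` and `query`. Ranking happens on the client, so the busy server only performs a plain scan. The helper `heaviest` and the sample rows are hypothetical.

```python
# Plain scan of system.processes: no ORDER BY and no JOIN,
# so the overloaded server does as little extra work as possible.
RUNNING_QUERIES_SQL = (
    "SELECT query_id, user, elapsed, read_rows, memory_usage, query "
    "FROM system.processes"
)


def heaviest(rows, key="elapsed", limit=5):
    """Rank rows client-side by one numeric column, largest first."""
    return sorted(rows, key=lambda row: row[key], reverse=True)[:limit]


# Hypothetical rows, shaped like the result of RUNNING_QUERIES_SQL.
sample = [
    {"query_id": "q-1", "user": "etl", "elapsed": 812.4,
     "read_rows": 9_000_000_000, "memory_usage": 42_000_000_000,
     "query": "SELECT * FROM events_all"},
    {"query_id": "q-2", "user": "dashboard", "elapsed": 1.2,
     "read_rows": 50_000, "memory_usage": 8_000_000,
     "query": "SELECT count() FROM events_all"},
    {"query_id": "q-3", "user": "etl", "elapsed": 455.0,
     "read_rows": 4_000_000_000, "memory_usage": 17_000_000_000,
     "query": "SELECT * FROM sessions_all"},
]

for row in heaviest(sample, limit=2):
    print(row["query_id"], row["user"], row["elapsed"])
```

Passing `key="memory_usage"` or `key="read_rows"` ranks the same rows by a different resource without sending another query to the server.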

Remediation Steps

Step 1: Identify the user who submitted the heavy SQL and review the query for inefficiencies such as missing indexes or unnecessary columns.

Step 2: Kill the offending queries using their query_id obtained from the simple query.

Step 3: If killed queries reappear due to scheduled jobs, temporarily disable the affected tables ("take them offline") to prevent immediate re‑execution.
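The three steps above can be sketched as small helpers that group running queries by user and build the statements to run. This is an illustration under assumptions, not the author's exact procedure: `KILL QUERY`, `DETACH TABLE` and `ATTACH TABLE` are real ClickHouse statements, but the source does not say how the tables were taken offline, so using `DETACH TABLE` here is an assumption. All function names and sample values are hypothetical.

```python
import re
from collections import Counter

# Conservative patterns so the helpers never build SQL from unexpected input.
_SAFE_ID = re.compile(r"[A-Za-z0-9_.:-]+")
_SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def queries_per_user(rows):
    """Step 1: count running queries per submitting user."""
    return Counter(row["user"] for row in rows)


def kill_statement(query_id: str) -> str:
    """Step 2: build a KILL QUERY statement for one query_id."""
    if not _SAFE_ID.fullmatch(query_id):
        raise ValueError(f"unexpected characters in query_id: {query_id!r}")
    return f"KILL QUERY WHERE query_id = '{query_id}'"


def _qualified(database: str, table: str) -> str:
    for name in (database, table):
        if not _SAFE_NAME.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return f"{database}.{table}"


def detach_statement(database: str, table: str) -> str:
    """Step 3 (assumed approach): make a table unavailable to queries."""
    return f"DETACH TABLE {_qualified(database, table)}"


def attach_statement(database: str, table: str) -> str:
    """Undo detach_statement once the scheduled job has been fixed."""
    return f"ATTACH TABLE {_qualified(database, table)}"


# Hypothetical rows and names, for illustration only.
running = [
    {"query_id": "q-1", "user": "etl"},
    {"query_id": "q-2", "user": "dashboard"},
    {"query_id": "q-3", "user": "etl"},
]
print(queries_per_user(running).most_common(1))
print(kill_statement("q-1"))
print(detach_statement("analytics", "events_local"))
print(attach_statement("analytics", "events_local"))
```

Detaching does not delete data or metadata, and a table detached without `PERMANENTLY` comes back after a server restart, so this is a stopgap while the scheduled job is corrected.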

Long‑Term Optimizations

Beyond the immediate fix, the article recommends three lasting measures:

Rewrite inefficient SQL, add appropriate indexes, and avoid selecting unnecessary columns.

Convert frequently accessed large tables to sharded tables so the load is distributed across multiple nodes.

Redistribute small “local” tables that were previously stored on a single server to multiple machines, preventing a single point of overload.
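For the sharding measure, one common ClickHouse pattern is a `Distributed` table that fans queries out to a local table on every shard. The source does not show its DDL, so the sketch below is an assumption about how such a table might be declared. The cluster, database and table names, the `_all` suffix and the helper name are all hypothetical.

```python
import re
from typing import Optional

_SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def distributed_table_ddl(cluster: str, database: str, local_table: str,
                          shard_column: Optional[str] = None) -> str:
    """Build DDL for a Distributed table over an existing local table.

    Without shard_column, rows are spread randomly with rand();
    with it, rows are placed by a hash of that column.
    """
    names = [cluster, database, local_table]
    if shard_column is not None:
        names.append(shard_column)
    for name in names:
        if not _SAFE_NAME.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    sharding_key = "rand()" if shard_column is None else f"cityHash64({shard_column})"
    return (
        f"CREATE TABLE {database}.{local_table}_all "
        f"AS {database}.{local_table} "
        f"ENGINE = Distributed({cluster}, {database}, {local_table}, {sharding_key})"
    )


# Hypothetical names, for illustration only.
print(distributed_table_ddl("my_cluster", "analytics", "events_local"))
print(distributed_table_ddl("my_cluster", "analytics", "events_local", "user_id"))
```

Hashing on a column keeps all rows for one key on the same shard, which helps queries that filter or group by that key. `rand()` spreads rows evenly when no such access pattern exists.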

After applying these steps, the server’s load stabilized around 30, returning to normal operation.

The case illustrates that while many database performance issues follow similar patterns, ClickHouse’s cluster architecture can introduce slightly higher operational overhead.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
