How Didi Cut ClickHouse CPU Usage by 90% with a Simple Thread Check Fix
This article walks through how Didi identified excessive CPU consumption by ClickHouse background move threads, diagnosed the root cause using top and pstack, and applied a lightweight code guard that reduced CPU load from 30% to under 5%, improving overall cluster performance.
01 Discover Issue
Didi observed high CPU load on online ClickHouse nodes and used top to pinpoint the most CPU‑intensive processes, finding that several BackgrProcPool and HTTPHandler threads were at the top of the list.
The BackgrProcPool threads handle ReplicatedMergeTree merges and mutations, while HTTPHandler threads process client queries.
Further inspection with top -Hp <pid> showed eight BgMoveProcPool threads constantly consuming CPU, despite disk usage being only around 80% (the alert threshold is 90%).
02 Confirm Issue
To understand what the BgMoveProcPool threads were doing, Didi captured their stack traces with pstack <pid>. The stacks repeatedly pointed to the MergeTreePartsMover::selectPartsForMove method.
#0 0x00000000100111a4 in DB::MergeTreePartsMover::selectPartsForMove(...)
#1 0x000000000ff6ef5a in DB::MergeTreeData::selectPartsForMove()
#2 0x000000000ff86096 in DB::MergeTreeData::selectPartsAndMove()
#3 ... (additional stack frames omitted for brevity)

Querying system.part_log for recent MovePart events returned no rows, confirming that the threads were burning CPU without ever moving a single part.
SELECT *
FROM system.part_log
WHERE event_time > now() - toIntervalDay(1)
  AND event_type = 'MovePart';

Reviewing the source of selectPartsForMove revealed three main steps: (1) collect the disks whose usage exceeds a threshold, (2) iterate over all parts to find those that must move (per MOVE TTL rules) and those that are candidates for rebalancing, and (3) reserve space for the candidate parts. With more than 300,000 parts on a node, the second step became the performance bottleneck.
bool MergeTreePartsMover::selectPartsForMove(
    MergeTreeMovingParts & parts_to_move,
    const AllowedMovingPredicate & can_move,
    const std::lock_guard<std::mutex> & moving_parts_lock)
{
    /// 1. Find disks over the usage threshold.
    /// 2. Scan all parts, checking MOVE TTL and candidate conditions.
    /// 3. Reserve space for the candidate parts.
}

03 Solve Issue
Since the cluster had ample disk space and no hot‑cold tiering configured, the expensive scan was unnecessary. Didi added a guard at the start of selectPartsForMove to skip the whole routine when there is nothing to move.
if (need_to_move.empty() && !metadata_snapshot->hasAnyMoveTTL())
    return false;

04 Actual Effect
After deploying the change to the public cluster, the eight BgMoveProcPool threads dropped from ~30% CPU usage to below 4%.
Overall node CPU utilization fell from around 20% to 10%, and CPU spikes were significantly reduced.
05 Future Thoughts
The experience highlights that code which works under low load can become a bottleneck at scale; developers should anticipate high data volume and concurrency. Didi will continue to improve ClickHouse for massive PB‑level log retrieval, aiming for stable, low‑cost, high‑throughput performance.
Xiao Lou's Tech Notes
Backend technology sharing, architecture design, performance optimization, source code reading, troubleshooting, and pitfall practices