How Didi Cut ClickHouse CPU Usage by 90% with a Simple Thread Check Fix
This article walks through how Didi identified excessive CPU consumption by ClickHouse background move threads, diagnosed the root cause using top and pstack, and applied a lightweight code guard that reduced CPU load from 30% to under 5%, improving overall cluster performance.
01 Discover Issue
Didi observed high CPU load on online ClickHouse nodes and used top to pinpoint the most CPU‑intensive processes, finding that several BackgrProcPool and HTTPHandler threads were at the top of the list.
The BackgrProcPool threads handle ReplicatedMergeTree merges and mutations, while HTTPHandler threads process client queries.
Further inspection with top -Hp <pid> showed eight BgMoveProcPool threads constantly consuming CPU, despite disk usage being only around 80% (the alert threshold is 90%).
02 Confirm Issue
To understand what the BgMoveProcPool threads were doing, Didi captured their stack traces with pstack <pid>. The stacks repeatedly pointed to the MergeTreePartsMover::selectPartsForMove method.
#0 0x00000000100111a4 in DB::MergeTreePartsMover::selectPartsForMove(...)
#1 0x000000000ff6ef5a in DB::MergeTreeData::selectPartsForMove()
#2 0x000000000ff86096 in DB::MergeTreeData::selectPartsAndMove()
#3 ... (additional stack frames omitted for brevity)

Querying system.part_log for recent MovePart events returned no rows, confirming that the threads were burning CPU without ever moving a single part.
SELECT *
FROM system.part_log
WHERE event_time > now() - toIntervalDay(1)
  AND event_type = 'MovePart';

Reviewing the source of selectPartsForMove revealed three main steps: (1) collect the disks whose usage exceeds a threshold, (2) iterate over all parts to find those that must move (per MOVE TTL rules) and those that are candidates for rebalancing, and (3) reserve space for the candidate parts. With more than 300,000 parts on a node, the second step became the performance bottleneck.
bool MergeTreePartsMover::selectPartsForMove(
    MergeTreeMovingParts & parts_to_move,
    const AllowedMovingPredicate & can_move,
    const std::lock_guard<std::mutex> & moving_parts_lock)
{
    /// 1. Find disks over the usage threshold.
    /// 2. Scan all parts, checking MOVE TTL and candidate conditions.
    /// 3. Reserve space for the candidate parts.
}

03 Solve Issue
Since the cluster had ample disk space and no hot‑cold tiering configured, the expensive scan was unnecessary. Didi added a guard at the start of selectPartsForMove to skip the whole routine when there is nothing to move.
if (need_to_move.empty() && !metadata_snapshot->hasAnyMoveTTL())
    return false;

04 Actual Effect
After deploying the change to the public cluster, the eight BgMoveProcPool threads dropped from ~30% CPU usage to below 4%.
Overall node CPU utilization fell from around 20% to 10%, and CPU spikes were significantly reduced.
05 Future Thoughts
The experience highlights that code which works under low load can become a bottleneck at scale; developers should anticipate high data volume and concurrency. Didi will continue to improve ClickHouse for massive PB‑level log retrieval, aiming for stable, low‑cost, high‑throughput performance.
Xiao Lou's Tech Notes
Backend technology sharing, architecture design, performance optimization, source code reading, troubleshooting, and pitfall practices