Diagnosing High CPU Load Caused by Frequent Short‑Lived Processes in a MongoDB Environment Using execsnoop
The article describes how a MongoDB test environment on a single VM experienced persistent high CPU load despite low visible QPS, how the root cause was traced to thousands of short‑lived processes spawned by Zabbix monitoring, and how execsnoop was used to identify and eliminate the offending processes.
Background
A test environment on a single virtual machine runs one MongoDB cluster: a mongos, three config nodes, and a single shard with three replica set members, for seven MongoDB instances in total. The VM runs CentOS 7.9 and MongoDB 4.2.19.
After testing finished, CPU usage stayed around 50% even though MongoDB QPS had dropped to zero. Stopping all MongoDB instances immediately restored normal CPU usage, and restarting them brought the load back, which showed that the issue was tied to the MongoDB instances.
Diagnosis
Running top showed user CPU at about 40%, but the sum of %CPU of the top processes was far lower than the total load.
Checking mongos QPS confirmed no user commands were being executed.
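dstat and perf (perf record -ag -- sleep 10 && perf report) showed nothing abnormal in most metrics. That pointed to many short-lived processes: top only samples at each refresh, so a process that starts and exits between two refreshes never appears in the list even though its CPU time still counts toward the total.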
Running sar -w 1 revealed that more than 80 new processes were being created each second.
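The same rate can be cross-checked without sar. The following is a minimal sketch, assuming a Linux host where /proc/stat exposes the cumulative fork counter on its "processes" line; the function names are hypothetical and not part of the original article.

```shell
#!/bin/sh
# Read the cumulative fork counter (the "processes" line) from a stat file.
# Defaults to /proc/stat; the optional path argument only exists for testing.
read_forks() {
    awk '$1 == "processes" { print $2 }' "${1:-/proc/stat}"
}

# Average number of processes created per second between two readings.
# Arguments: before after seconds (integer division, so the result is rounded down)
fork_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Sample the counter twice, $1 seconds apart (default 1), and print the rate.
sample_fork_rate() {
    secs="${1:-1}"
    before=$(read_forks)
    sleep "$secs"
    after=$(read_forks)
    fork_rate "$before" "$after" "$secs"
}
```

Calling sample_fork_rate 10 averages over ten seconds. A result in the range the article reports (more than 80 per second) on an otherwise idle host would suggest the same kind of process churn.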
To capture these fleeting processes, the article recommends execsnoop, an ftrace-based tool that logs each exec() call with its PID, PPID, and command-line arguments.
# Download execsnoop
cd /usr/bin
wget https://raw.githubusercontent.com/brendangregg/perf-tools/master/execsnoop
chmod 755 execsnoop
Running execsnoop for 10 seconds produced over 400 records, all of them short-lived processes that connected to MongoDB and then ran grep on the output.
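With hundreds of records, grouping them makes the source easier to spot. The sketch below assumes the default perf-tools execsnoop column layout (PID, PPID, ARGS); the helper names are hypothetical, and lines that do not start with two numeric columns (such as the header) are skipped.

```shell
#!/bin/sh
# Count exec records per parent PID from execsnoop output on stdin.
# Assumes the default column layout: PID PPID ARGS.
count_by_ppid() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { n[$2]++ }
         END { for (p in n) print n[p], p }' | sort -rn
}

# Count exec records per command, taken as the first word of ARGS.
count_by_command() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { n[$3]++ }
         END { for (c in n) print n[c], c }' | sort -rn
}
```

A possible invocation, run as root, is execsnoop -d 10 | count_by_ppid | head, where -d 10 limits tracing to 10 seconds. The PPID with the highest count can then be inspected with ps -o pid,ppid,cmd -p followed by that PPID to see which parent is spawning the processes.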
Stopping the Zabbix process instantly normalized CPU usage, identifying it as the culprit. The VM runs seven MongoDB instances and Zabbix monitors each one, so every monitoring task was spawned seven times on a 4-core VM, which amplified the overhead.
The issue was resolved by disabling Zabbix monitoring for this node and planning to optimise monitoring logic to reduce database connection frequency and grep call chains.
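One way such an optimisation could look is sketched below. It is an illustration, not the article's actual fix: a collector command runs at most once per interval, its output is saved to a cache file, and each metric is read from that file with a single awk call rather than a new database connection plus a chain of grep processes. The function names, the cache layout, and the "key value" output format are all assumptions.

```shell
#!/bin/sh
# Run a command at most once per TTL seconds and reuse its saved output.
# Arguments: ttl_seconds cache_file command [args...]
cached_output() {
    ttl=$1; cache=$2; shift 2
    now=$(date +%s)
    if [ -f "$cache" ]; then
        mtime=$(stat -c %Y "$cache")   # GNU stat, as shipped with CentOS
    else
        mtime=0
    fi
    if [ $(( now - mtime )) -ge "$ttl" ]; then
        tmp=$(mktemp "${cache}.XXXXXX") || return 1
        # Replace the cache only if the collector command succeeded.
        if "$@" > "$tmp"; then
            mv "$tmp" "$cache"
        else
            rm -f "$tmp"
            return 1
        fi
    fi
    cat "$cache"
}

# Print the value for one key from "key value" lines on stdin,
# using a single awk process.
metric() {
    awk -v k="$1" '$1 == k { print $2 }'
}
```

With a hypothetical collector script that prints one "key value" pair per line, a monitoring item could call cached_output 60 /tmp/mongo_27017.cache ./collect_mongo.sh 27017 | metric connections. Several items polled within the same minute would then share one database connection.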
Conclusion
When CPU load remains high but top cannot reveal the responsible processes, execsnoop can capture short-lived processes and help pinpoint the root cause. Its sibling tools iosnoop and opensnoop trace block I/O and file opens in the same way.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.