Troubleshooting High System Load Caused by Stuck NFS Processes and Zabbix Monitoring
This article details a step‑by‑step analysis of a server experiencing sustained high load, uncovering a stuck NFS mount and an overactive Zabbix monitoring job, and explains how targeted process termination and monitoring adjustments reduced the load from 85 to normal levels.
1 Fault Phenomenon
The machine's load average hovered around 85 for an extended period. The host has 10 physical cores with 2 threads each (20 logical CPUs), so a load of 85 means roughly four runnable tasks per logical CPU. Short-term monitoring showed no obvious load fluctuations.
Extending the monitoring window to three months revealed a slow upward trend ending in an inflection point; the load dropped sharply after the issue was resolved.
CPU idle rate remained stable around 60%.
Disk I/O busy rate was consistently high without obvious variation.
Memory usage was low and stable, with a slight decrease after the fault was fixed.
2 Fault Analysis
Monitoring graphs showed no clear issues, and the DB layer appeared normal. Logging into the machine and running top revealed no process with unusually high resource consumption, but the following anomalies were discovered:
System CPU Usage High
System CPU usage jumped to about 20% after May 14 and remained elevated until the optimization, after which it dropped back.
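To confirm a system-CPU anomaly like this without a monitoring dashboard, kernel CPU time can be sampled directly from /proc/stat. The sketch below is an illustrative measurement over a 1-second window, not part of the original diagnosis; it relies on the documented field layout of the `cpu` line (user nice system idle iowait irq softirq steal ...):

```shell
# Sketch: compute %sys over a 1-second window from /proc/stat.
# Field layout of the "cpu" line: user nice system idle iowait irq softirq steal ...
read -r _ u1 n1 s1 id1 io1 irq1 sirq1 st1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 id2 io2 irq2 sirq2 st2 _ < /proc/stat
total=$(( (u2-u1)+(n2-n1)+(s2-s1)+(id2-id1)+(io2-io1)+(irq2-irq1)+(sirq2-sirq1)+(st2-st1) ))
pct=$(( 100 * (s2 - s1) / total ))
echo "sys% = $pct"
```

A healthy server typically shows a low single-digit value here; in this incident the equivalent figure was around 20%.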
Abnormal Processes
In the top output, a df process was observed. Normally df returns almost instantly and should never linger long enough to appear in top. Running df manually confirmed the problem: it hung and never exited.
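A safe way to reproduce this check without risking a hung terminal is to wrap df in a timeout. This is a sketch rather than the exact command used in the incident; the 5-second limit is an arbitrary choice:

```shell
# Sketch: run df under a timeout so a dead NFS server cannot hang the shell.
if timeout 5 df -h > /tmp/df.out 2>&1; then
    echo "df returned normally"
else
    echo "df hung or failed -- suspect a stale network mount"
fi
```

If the timeout fires, df was almost certainly blocked statting an unreachable network filesystem.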
Based on experience, this pointed to an NFS file system whose server was unreachable: df blocks while statting the dead mount. Checking /etc/fstab showed an NFS mount at /backup. A plain umount failed with "device is busy", and fuser -m -v revealed many processes still holding the mount.
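The fstab check can be scripted. This is a hedged sketch, not the article's exact commands; it lists NFS entries both configured in /etc/fstab and currently mounted, and either grep may legitimately find nothing:

```shell
# Sketch: find NFS entries configured in fstab and NFS filesystems
# actually mounted right now (the pattern also matches nfs4).
grep -E '[[:space:]]nfs' /etc/fstab 2>/dev/null || echo "no NFS entries in /etc/fstab"
grep -E '[[:space:]]nfs' /proc/mounts || echo "no NFS filesystems currently mounted"
```

Comparing the two outputs shows whether a configured NFS mount is actually attached, which is the mount you then probe with fuser.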
Several of the processes could not be terminated even with kill -9, typically because they were stuck in uninterruptible sleep (D state) waiting on the dead NFS server. umount -l performed a lazy unmount: the mount point is detached from the filesystem tree immediately and cleaned up once it is no longer busy. After that, the stuck processes cleared and the load dropped from roughly 80 to 10.
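The recovery steps above can be sketched as a short script. The mount point /backup comes from the article; the fallback to a lazy unmount mirrors what was done, but treat this as an illustration (it needs root, and fuser is from the psmisc package):

```shell
MNT=/backup   # the stuck NFS mount point from /etc/fstab (per the article)
if mountpoint -q "$MNT"; then
    fuser -m -v "$MNT"   # list the processes still holding the mount
    # Try a normal unmount first; fall back to a lazy unmount, which
    # detaches the mount immediately and cleans up when no longer busy.
    umount "$MNT" 2>/dev/null \
        || umount -l "$MNT" 2>/dev/null \
        || echo "umount failed; root privileges may be required"
else
    echo "$MNT is not mounted; nothing to do"
fi
```

Note that a lazy unmount hides the mount but does not unblock processes already sleeping in D state; those clear only when the NFS server responds again or their outstanding requests are abandoned.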
The remaining question was why system CPU usage reached 20%.
Using atop at its default 10-second interval, the #exit counter reached about 20,000 per interval, indicating a very large number of short-lived processes and heavy system-call activity. Many of these processes ran as the zabbix user; since the DB layer was healthy, Zabbix became the prime suspect. A simple loop was used to trace the Zabbix processes:
```shell
while true; do ps -ef | grep zabbix; sleep 2; done
```

This revealed that one auto-discovery monitoring item spawned more than 1,000 concurrent script executions every 30 seconds, overwhelming the system. Disabling that monitoring item reduced system CPU usage from 20% to below 2%.
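When eyeballing raw ps output is too noisy, a bounded counting variant of the same loop is easier to reason about. This is a sketch, not the command from the incident; the '[z]abbix' bracket trick stops grep from matching its own process:

```shell
# Sketch: take three samples of the zabbix-owned process count,
# one per second, instead of streaming full ps output.
for i in 1 2 3; do
    ps -ef | grep -c '[z]abbix' || true   # grep -c still prints 0 on no match
    sleep 1
done
```

A count that repeatedly spikes into the hundreds between samples is the signature of a monitoring item forking far too many short-lived scripts.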
3 Summary
The high system load was accompanied by an unexpected rise in SYS CPU usage. Unlike typical cases, the load increased gradually without obvious CPU, memory, or I/O spikes, requiring careful observation with multiple monitoring tools to pinpoint the root cause.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.