Troubleshooting High System Load Caused by Stuck NFS Processes and Zabbix Monitoring
This article details a step‑by‑step analysis of a server experiencing sustained high load, uncovering a stuck NFS mount and an overactive Zabbix monitoring job, and explains how targeted process termination and monitoring adjustments reduced the load from 85 to normal levels.
1 Fault Phenomenon
The machine's load average hovered around 85 for an extended period. The host has 10 physical cores with 2 threads each (20 logical CPUs), so a load of 85 means roughly four runnable tasks per logical CPU. Short-term monitoring showed no obvious load fluctuations.
Extending the monitoring window to three months revealed a slow upward trend ending in an inflection point; the load dropped sharply after the issue was resolved.
CPU idle rate remained stable around 60%.
Disk I/O busy rate was consistently high without obvious variation.
Memory usage was low and stable, with a slight decrease after the fault was fixed.
2 Fault Analysis
Monitoring graphs showed no clear issues, and the DB layer appeared normal. Logging into the machine and running top revealed no process with unusually high resource consumption, but the following anomalies were discovered:
System CPU Usage High
System CPU usage jumped to about 20% after May 14 and remained elevated until the optimization, after which it dropped back.
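To confirm a system-CPU anomaly like this without a monitoring dashboard, kernel CPU time can be sampled directly from /proc/stat. The sketch below is an illustrative measurement over a 1-second window, not part of the original diagnosis; it relies on the documented field layout of the `cpu` line (user nice system idle iowait irq softirq steal ...):

```shell
# Sketch: compute %sys over a 1-second window from /proc/stat.
# Field layout of the "cpu" line: user nice system idle iowait irq softirq steal ...
read -r _ u1 n1 s1 id1 io1 irq1 sirq1 st1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 id2 io2 irq2 sirq2 st2 _ < /proc/stat
total=$(( (u2-u1)+(n2-n1)+(s2-s1)+(id2-id1)+(io2-io1)+(irq2-irq1)+(sirq2-sirq1)+(st2-st1) ))
pct=$(( 100 * (s2 - s1) / total ))
echo "sys% = $pct"
```

A healthy server typically shows a low single-digit value here; in this incident the equivalent figure was around 20%.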
Abnormal Processes
In the top output, a df process was observed. Normally df returns almost instantly and should never linger long enough to appear in top. Running df manually confirmed the problem: it hung and never exited.
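A safe way to reproduce this check without risking a hung terminal is to wrap df in a timeout. This is a sketch rather than the exact command used in the incident; the 5-second limit is an arbitrary choice:

```shell
# Sketch: run df under a timeout so a dead NFS server cannot hang the shell.
if timeout 5 df -h > /tmp/df.out 2>&1; then
    echo "df returned normally"
else
    echo "df hung or failed -- suspect a stale network mount"
fi
```

If the timeout fires, df was almost certainly blocked statting an unreachable network filesystem.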
Based on experience, this pointed to an NFS file system whose server was unreachable: df blocks while statting the dead mount. Checking /etc/fstab showed an NFS mount at /backup. A plain umount failed with "device is busy", and fuser -m -v revealed many processes still holding the mount.
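The fstab check can be scripted. This is a hedged sketch, not the article's exact commands; it lists NFS entries both configured in /etc/fstab and currently mounted, and either grep may legitimately find nothing:

```shell
# Sketch: find NFS entries configured in fstab and NFS filesystems
# actually mounted right now (the pattern also matches nfs4).
grep -E '[[:space:]]nfs' /etc/fstab 2>/dev/null || echo "no NFS entries in /etc/fstab"
grep -E '[[:space:]]nfs' /proc/mounts || echo "no NFS filesystems currently mounted"
```

Comparing the two outputs shows whether a configured NFS mount is actually attached, which is the mount you then probe with fuser.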
Several of the processes could not be terminated even with kill -9, typically because they were stuck in uninterruptible sleep (D state) waiting on the dead NFS server. umount -l performed a lazy unmount: the mount point is detached from the filesystem tree immediately and cleaned up once it is no longer busy. After that, the stuck processes cleared and the load dropped from roughly 80 to 10.
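The recovery steps above can be sketched as a short script. The mount point /backup comes from the article; the fallback to a lazy unmount mirrors what was done, but treat this as an illustration (it needs root, and fuser is from the psmisc package):

```shell
MNT=/backup   # the stuck NFS mount point from /etc/fstab (per the article)
if mountpoint -q "$MNT"; then
    fuser -m -v "$MNT"   # list the processes still holding the mount
    # Try a normal unmount first; fall back to a lazy unmount, which
    # detaches the mount immediately and cleans up when no longer busy.
    umount "$MNT" 2>/dev/null \
        || umount -l "$MNT" 2>/dev/null \
        || echo "umount failed; root privileges may be required"
else
    echo "$MNT is not mounted; nothing to do"
fi
```

Note that a lazy unmount hides the mount but does not unblock processes already sleeping in D state; those clear only when the NFS server responds again or their outstanding requests are abandoned.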
The remaining question was why system CPU usage reached 20%.
Using atop at its default 10-second interval, the #exit counter reached about 20,000 per interval, indicating a very large number of short-lived processes and heavy system-call activity. Many of these processes ran as the zabbix user; since the DB layer was healthy, Zabbix became the prime suspect. A simple loop was used to trace the Zabbix processes:
```shell
while true; do ps -ef | grep zabbix; sleep 2; done
```

This revealed that one auto-discovery monitoring item spawned more than 1,000 concurrent script executions every 30 seconds, overwhelming the system. Disabling that monitoring item reduced system CPU usage from 20% to below 2%.
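When eyeballing raw ps output is too noisy, a bounded counting variant of the same loop is easier to reason about. This is a sketch, not the command from the incident; the '[z]abbix' bracket trick stops grep from matching its own process:

```shell
# Sketch: take three samples of the zabbix-owned process count,
# one per second, instead of streaming full ps output.
for i in 1 2 3; do
    ps -ef | grep -c '[z]abbix' || true   # grep -c still prints 0 on no match
    sleep 1
done
```

A count that repeatedly spikes into the hundreds between samples is the signature of a monitoring item forking far too many short-lived scripts.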
3 Summary
The high system load was accompanied by an unexpected rise in SYS CPU usage. Unlike typical cases, the load increased gradually without obvious CPU, memory, or I/O spikes, requiring careful observation with multiple monitoring tools to pinpoint the root cause.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.