Master Server Monitoring: Diagnose CPU, Memory, Disk & TCP Alerts
This guide explains how to identify and resolve common server monitoring alerts (CPU, memory, disk space, disk I/O, and TCP connections) using Linux commands such as top, df, du, iotop, and netstat, with practical remediation steps for each.
Server Monitoring Metrics
Routine server inspections surface alerts from many servers and monitoring tools (e.g., Zabbix, Prometheus + Grafana). This article focuses on resource‑related alerts and outlines common ways to handle them.
CPU Alerts
Run top, press Shift+P to sort by CPU usage, and examine the %CPU column. By default top reports %CPU relative to a single core, so 100% means one core is fully loaded and a process on a 4‑core server can reach 400%.
# top
Tasks: 197 total, 1 running, 196 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.2 us, 1.3 sy, 0.0 ni, 97.3 id, 0.2 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 8008984 total, 1046216 free, 4712336 used, 2250432 buff/cache
KiB Swap: 7208956 total, 4409068 free, 2799888 used. 2373196 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1456 root 20 0 10.5g 361648 242164 S 3.0 4.5 12461:08 clickhouse-server --config-file=/etc/clickhouse-+
1089 root 20 0 5755452 238580 2644 S 1.7 3.0 4330:47 java -jar V2XRealtimeServer.jar
1086 root 20 0 5822324 319628 3028 S 1.3 4.0 4161:58 java -jar V2XRawDataServer.jar
10174 root 20 0 5819584 963512 4420 S 1.3 12.0 3619:07 java -jar V2XWebSocketServer.jar
2105 mysql 20 0 3205688 907124 7584 S 0.7 11.3 1462:50 /usr/sbin/mysqld --daemonize --pid-file=/var/run+
1090 root 20 0 148952 4648 780 S 0.3 0.1 420:01.32 /usr/local/redis/bin/redis-server 0.0.0.0:7379 [+]
17013 root 20 0 162128 2344 1600 R 0.3 0.0 0:00.04 top
1 root 20 0 125516 2636 1492 S 0.0 0.0 133:31.76 /usr/lib/systemd/systemd --switched-root --syste+
Typical scenarios:
Continuous alerts for compute‑intensive applications (e.g., data cleaning, transformation).
Transient alerts below 70% CPU usage that do not affect system responsiveness.
Increasing frequency of alerts, possibly due to bugs or vulnerabilities.
Time‑bound spikes often linked to business traffic peaks.
Common remedies:
Adjust application configuration (thread count, concurrency) to limit resource consumption.
Patch known vulnerabilities or upgrade the component.
Balance traffic via clustering, caching, load balancing, or schedule adjustments.
Scale up CPU resources or migrate services to higher‑performance servers.
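Because top reports %CPU per core, a raw value can overstate how loaded the whole machine is. The following is a minimal shell sketch, not part of the original procedure, that converts a per‑core figure into a share of total host capacity and takes a one‑off batch snapshot of the busiest processes. The function names host_cpu_share and top_cpu_snapshot are hypothetical helpers introduced for illustration, and the snapshot assumes a procps‑ng top that supports the -o sort option.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not from the original article.

# host_cpu_share PCT CORES
# Converts a per-core percentage (top's default %CPU scale) into a
# percentage of the total capacity of a host with CORES cores.
host_cpu_share() {
    awk -v pct="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", pct / cores }'
}

# top_cpu_snapshot [N]
# Prints one batch-mode top sample sorted by %CPU, truncated to the
# N busiest processes. The first 7 lines of batch output are the
# summary area, a blank line, and the column header.
top_cpu_snapshot() {
    local n="${1:-5}"
    top -b -n 1 -o %CPU | head -n "$((7 + n))"
}
```

For example, host_cpu_share 200 "$(nproc)" prints 50.0 on a 4‑core server. A single batch sample can be unrepresentative, so it is worth comparing several snapshots before acting on one.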
Memory Alerts
Run top and press Shift+M to sort by memory usage, focusing on the RES and %MEM columns.
# top
Tasks: 195 total, 1 running, 194 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.3 us, 1.1 sy, 0.0 ni, 97.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8008984 total, 969272 free, 4721960 used, 2317752 buff/cache
KiB Swap: 7208956 total, 4409068 free, 2799888 used. 2363556 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10174 root 20 0 5819584 963512 4420 S 1.3 12.0 3619:52 java -jar V2XWebSocketServer.jar
10166 root 20 0 5768092 921932 4252 S 0.0 11.5 364:51.16 java -jar V2XStatisticsServer.jar
2105 mysql 20 0 3205688 907124 7584 S 0.0 11.3 1463:03 /usr/sbin/mysqld --daemonize --pid-file=/var/run+
1087 root 20 0 5809328 449920 2736 S 0.0 5.6 226:25.74 java -jar V2XApiServer.jar
1456 root 20 0 10.5g 369520 242164 S 3.0 4.6 12463:01 clickhouse-server --config-file=/etc/clickhouse-+
1086 root 20 0 5822324 319628 3028 S 1.3 4.0 4162:45 java -jar V2XRawDataServer.jar
1064 root 20 0 5702928 286440 2272 S 0.3 3.6 721:06.60 java -jar msbus.jar
1089 root 20 0 5755452 238580 2644 S 1.7 3.0 4331:30 java -jar V2XRealtimeServer.jar
27891 root 20 0 1111052 25192 2324 S 0.0 0.3 4:21.71 /usr/bin/dockerd -H fd:// --containerd=/run/cont+
Typical remedies:
Tune application parameters to limit memory usage, cache size, queue length, etc.
Expand server memory or migrate services to machines with higher memory capacity.
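When judging whether a memory alert reflects real pressure, the avail Mem figure in the top summary is often more telling than free, since buff/cache can usually be reclaimed. As a rough sketch under that assumption, the hypothetical helper mem_used_pct below derives a used percentage from the MemTotal and MemAvailable fields of /proc/meminfo (MemAvailable requires a reasonably recent kernel).

```shell
#!/usr/bin/env bash
# Sketch only: mem_used_pct is a hypothetical helper name.

# mem_used_pct
# Reads /proc/meminfo-style text on stdin and prints the percentage of
# memory that is not available, i.e. (MemTotal - MemAvailable) / MemTotal.
mem_used_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%.1f\n", (total - avail) * 100 / total
        }'
}
```

Typical usage would be mem_used_pct < /proc/meminfo, with the result compared against whatever threshold the monitoring tool uses.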
Disk Space Alerts
Use df -h to view partition usage (focus on Use% and Mounted on), then du -sh to locate directories consuming the most space.
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/klas_host--10--169--183--49-root 95G 9.6G 86G 11% /
/dev/vda2 1014M 217M 798M 22% /boot
/dev/vda1 200M 5.8M 195M 3% /boot/efi
/dev/mapper/vgdata-lvdata 100G 56G 45G 56% /data
# du -sh /data/*
4.6M /data/h5
40M /data/ioc-guanai
242M /data/jdk
54G /data/jnpf
5.2M /data/redis
952M /data/soft
Common solutions:
Rotate or delete large log files (e.g., using logrotate).
For data‑disk saturation, limit data retention time, enable compression, or adjust application parameters.
If the system partition is full, move applications or logs to the data disk, or reconfigure Docker storage.
Scale disk capacity by adding or expanding dedicated data disks.
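The df and du checks above can be scripted for routine inspections. The sketch below is illustrative only: full_mounts and large_files are hypothetical helper names, the 80% and 100M defaults are arbitrary assumptions, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and default thresholds are assumptions.

# full_mounts [THRESHOLD]
# Reads `df -P` output on stdin and prints each mount point whose usage
# is at or above THRESHOLD percent (default 80).
full_mounts() {
    local threshold="${1:-80}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # strip the trailing percent sign
        if (use + 0 >= t) print $6, use "%"
    }'
}

# large_files [DIR] [SIZE]
# Lists regular files under DIR larger than SIZE (find syntax, e.g. 100M)
# without crossing into other filesystems.
large_files() {
    find "${1:-.}" -xdev -type f -size "+${2:-100M}" -print
}
```

For example, df -P | full_mounts 80 lists partitions that are nearly full, and large_files /var/log 100M points at log files worth rotating. With GNU coreutils, du -xh --max-depth=1 /data | sort -rh | head gives the same per‑directory view as du -sh, sorted by size.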
Disk I/O Alerts
Install iotop (via yum on CentOS) and run iotop -o as root to list only the processes currently performing I/O. Columns of interest: SWAPIN (percentage of time the process spent swapping in) and IO> (percentage of time it spent waiting on I/O).
# iotop -o
Total DISK READ : 0.00 B/s | Total DISK WRITE : 388.00 K/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 633.68 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
518 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.16 % [xfsaild/dm-0]
20271 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/3:2]
2178 be/4 root 0.00 B/s 407.08 B/s 0.00 % 0.00 % java -jar V2XRawDataServer.jar
...
Typical patterns mirror those for CPU alerts: sustained I/O from storage‑intensive applications, occasional spikes below 70% that are tolerable, increasingly frequent alerts that may point to bugs, and time‑bound spikes during traffic peaks.
Remediation strategies:
Limit per‑application I/O load (threads, concurrency, cache settings).
Patch vulnerable components or upgrade versions.
Balance traffic via clustering, caching, load balancing, or scheduling.
Upgrade disk performance (e.g., SSD) or migrate services to faster storage.
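Before drilling into individual processes with iotop, the wa field in the top summary line gives a quick host‑level view of I/O wait. The sketch below extracts it; top_iowait is a hypothetical helper name, and it assumes the %Cpu(s) summary format shown in the top output earlier in this article (older top versions label the line differently).

```shell
#!/usr/bin/env bash
# Sketch only: top_iowait is a hypothetical helper name.

# top_iowait
# Reads top batch output on stdin and prints the I/O wait percentage,
# i.e. the number immediately before the "wa" label on the %Cpu(s) line.
top_iowait() {
    awk '/^%Cpu/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^wa,?$/) { print $(i - 1); exit }
        }
    }'
}
```

Typical usage would be top -b -n 1 | top_iowait. To record per‑process detail non‑interactively, iotop also has a batch mode, for example iotop -o -b -n 3 -d 5, which prints three samples five seconds apart.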
TCP Connection Alerts
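Run netstat -antp to list TCP connections with their state and owning process, then count them by state (see the appendix). Focus on ESTABLISHED and TIME_WAIT. A very large number of ESTABLISHED connections suggests the service needs to scale out; a large number of TIME_WAIT connections, typical under heavy short‑lived connection load, can exhaust ephemeral ports and socket resources.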
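The counting step described above can be wrapped in small reusable functions. This is a sketch under stated assumptions: tcp_state_counts and tcp_state_exceeds are hypothetical helper names, the input is assumed to be in netstat -ant format (state in the sixth column), and any limit passed in is something to choose per service rather than a recommended value.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; input is `netstat -ant` text.

# tcp_state_counts
# Reads netstat output on stdin and prints "STATE COUNT" pairs, sorted by
# state name. Header lines are skipped by matching only tcp/tcp6 rows.
tcp_state_counts() {
    awk '$1 ~ /^tcp/ { count[$6]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# tcp_state_exceeds STATE LIMIT
# Reads "STATE COUNT" pairs on stdin; succeeds (exit 0) only if STATE
# is present with a count greater than LIMIT.
tcp_state_exceeds() {
    awk -v s="$1" -v limit="$2" '
        $1 == s && $2 + 0 > limit { found = 1 }
        END { exit found ? 0 : 1 }'
}
```

A periodic check could then look like netstat -ant | tcp_state_counts | tcp_state_exceeds TIME_WAIT 10000 && echo "TIME_WAIT high", where 10000 is only a placeholder limit.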
Solutions:
For server‑side overload, deploy multiple nodes, load balancers, or split services.
For client‑side overload, use connection pooling or distribute clients across nodes.
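Mitigate TIME_WAIT buildup by reusing connections (HTTP keep‑alive, long‑lived connections, connection pools) and by tuning kernel parameters such as net.ipv4.tcp_tw_reuse=1 and net.ipv4.tcp_fin_timeout=30. Note that tcp_tw_reuse only applies to outgoing connections, and tcp_fin_timeout shortens the FIN_WAIT_2 state rather than TIME_WAIT itself. Do not set net.ipv4.tcp_tw_recycle: it breaks connections from clients behind NAT and was removed in Linux 4.12.
# vim /etc/sysctl.conf
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=30
# sysctl -p
Appendix: TCP Statistics Commands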
Count connections by state:
# netstat -antp | awk -F '[ /]+' 'NR>2 {count[$6]++} END {for(state in count) print state, "\t\t", count[state] }'
LISTEN 16
CLOSE_WAIT 2
ESTABLISHED 273
FIN_WAIT2 1
TIME_WAIT 1
Count connections per process for a given state (e.g., ESTABLISHED):
# netstat -antp | grep -i established | awk -F '[ /]+' '{count[$8]++} END {for(app in count) print app, "\t\t", count[app] }'
java 124
mysqld 109
clickhouse-ser 6
sshd: 1
redis-server 31
MaGe Linux Operations
