Deep Dive into Server Performance: Analyzing CPU, Memory, Disk, and Network Bottlenecks
This article explains how to identify and troubleshoot the four main resource bottlenecks—CPU, memory, disk I/O, and network—by detailing Linux internals, key metrics, practical command examples, real‑world case studies, and a step‑by‑step decision tree for accurate diagnosis and tuning.
CPU – Scheduling and Key Metrics
How CPU works in Linux
Linux schedules processes in time slices and balances load across cores. The run queue (the r column in vmstat) shows how many processes are waiting for CPU time.
Key CPU metrics
us – time spent in user mode.
sy – time spent in kernel mode (system calls, scheduling, network stack).
ni – time spent on low‑priority (nice) processes.
id – idle time; low id alone does not prove a CPU bottleneck.
wa – I/O wait; high wa signals an I/O bottleneck, not a CPU one.
hi and si – hardware and soft interrupt time; high si often means heavy network traffic.
CPU diagnosis steps
Run top and compare us vs. id. us high + id low → user‑code issue; sy high + id low → kernel pressure.
Run vmstat 1 10 and check r. If r > 2 × CPU cores for several seconds, the run queue is saturated.
Identify the offending process with ps aux --sort=-%cpu | head -n 11 and drill down with top -p <PID> or pidstat -p <PID> 1 5.
Correlate with business context: a process that is expected to consume CPU (e.g., compression) may be normal; an unexpected spike in a web‑service process warrants deeper analysis.
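The run‑queue check above can be sketched mechanically. The vmstat sample line and its numbers below are invented for illustration; on a live host you would feed real `vmstat` output instead:

```shell
# Illustrative vmstat sample line (fields: r b swpd free buff cache si so bi bo in cs us sy id wa st).
# On a real host: vmstat 1 10 | tail -n +3
sample="6 0 0 812344 92120 1403220 0 0 12 48 310 540 72 10 0 18 0"

cores=$(getconf _NPROCESSORS_ONLN)               # logical CPU count
r=$(printf '%s\n' "$sample" | awk '{print $1}')  # run-queue length

if [ "$r" -gt $((cores * 2)) ]; then
    echo "run queue saturated: r=$r, cores=$cores"
else
    echo "run queue ok: r=$r, cores=$cores"
fi
```

The same comparison works in a monitoring script: sample repeatedly and alert only when the condition holds for several consecutive seconds, as the text recommends.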
Common pitfall
Zero idle alone does not prove a CPU bottleneck: if the run‑queue column r is near zero at the same time, the CPU may simply be churning through many short‑lived tasks or soft interrupts rather than being saturated by a compute workload.
Memory – Physical and Virtual Resources
Memory management basics
Linux separates physical memory from virtual address space. The Buddy allocator manages pages in powers‑of‑two blocks, while the Slab allocator caches frequently used kernel objects. Page Cache stores file data in RAM; writes are cached and flushed by background threads. When RAM runs out, the kernel swaps out rarely used pages.
Key memory metrics
free shows total free RAM, but available reflects memory that can actually be used by applications because it includes reclaimable cache.
In /proc/meminfo, MemAvailable = MemFree + reclaimable Page Cache + reclaimable Slab.
Swap activity ( si / so in vmstat) indicates physical memory pressure.
OOM Killer logs appear via dmesg | grep -i "out of memory".
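The MemAvailable formula can be applied directly to compute available memory as a percentage of total. The snippet below uses an invented /proc/meminfo excerpt so the numbers are reproducible; on a live host, point awk at /proc/meminfo itself:

```shell
# Compute MemAvailable as a percentage of MemTotal from a sample snippet
# (replace the heredoc with /proc/meminfo on a real system).
pct=$(awk '/^MemTotal:/{t=$2} /^MemAvailable:/{a=$2} END{printf "%d", a*100/t}' <<'EOF'
MemTotal:       32624148 kB
MemFree:          512340 kB
MemAvailable:    1523456 kB
EOF
)
echo "available: ${pct}% of total"   # below 20% means memory is tight
```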
Memory diagnosis steps
Check available vs. total memory; if available < 20 % of total and swap is increasing, memory is tight.
Watch si / so in vmstat. Persistent non‑zero values mean swapping.
Detect OOM Killer activation with dmesg | grep -i "out of memory" or dmesg | grep -i "killed process".
Inspect Slab usage with
cat /proc/slabinfo | awk '{print $1,$3,$4}' | sort -k3 -rn | head -20 or slabtop. Large SUnreclaim may indicate kernel memory leaks.
For Java processes, examine native memory with jcmd <PID> VM.native_memory or jmap -heap because -Xmx alone does not reflect true RSS.
Case study – Server starts swapping
# 1. View memory and swap
free -h
# Mem: total=31G used=30G available=1.5G
# Swap: total=8G used=2G free=6G
# 2. Find processes using swap
for f in /proc/*/status; do awk '/VmSwap/{s=$2}/Name/{n=$2}END{if(s+0>0) print n, s}' "$f"; done | sort -k2 -rn | head -10
# mysqld 1024000 kB, java 614400 kB, redis 307200 kB
# 3. Verify memory consumption
ps aux | awk '/mysqld|java|redis/{print $2,$6,$11}'
# mysqld RSS=28G, java RSS=6G, redis RSS=4G (total > 31G)
# 4. Inspect MySQL buffer pool
mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
# innodb_buffer_pool_size = 27GB
Conclusion: MySQL’s 27 GB buffer pool consumes most RAM; the remaining 2–3 GB are insufficient for Java and Redis, causing swap.
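The arithmetic behind that conclusion can be checked directly; the sizes below are simply the case‑study numbers in GiB:

```shell
# Case-study figures: 31 GiB RAM shared by three resident processes.
ram_gib=31; mysql_gib=27; java_gib=6; redis_gib=4

need=$((mysql_gib + java_gib + redis_gib))
if [ "$need" -gt "$ram_gib" ]; then
    echo "over-committed by $((need - ram_gib)) GiB -> swapping is inevitable"
fi
```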
Disk I/O – Persistent Storage Bottlenecks
Linux I/O stack
Application issues read() / write() syscalls.
VFS abstracts the file‑system specifics.
File‑system layer (ext4, xfs, …) writes to Page Cache.
Block layer queues requests and performs scheduling.
Scheduler (cfq, deadline, noop) decides order.
Controller driver (SCSI/SATA/NVMe).
Physical medium (HDD/SSD/NVMe).
Key I/O metrics (iostat)
rrqm/s / wrqm/s – merged requests per second.
r/s / w/s – IOPS.
rkB/s / wkB/s – throughput.
avgrq‑sz – average request size (large = sequential).
avgqu‑sz – average queue length; > 1 indicates backlog, > 10 severe congestion.
await – average wait (queue + service) time; the primary I/O health indicator.
svctm – average service time (excluding queueing).
%util – device utilization; near 100 % means saturation (though on RAID/NVMe this may be normal).
I/O bottleneck criteria
Mechanical disks: util > 80% and await > 50 ms.
Queue length avgqu‑sz > 2 sustained → request pile‑up.
SSD/NVMe: await > 10 ms is abnormal; util near 100 % often reflects concurrency limits.
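A sketch of the mechanical‑disk rule (util > 80 % and await > 50 ms) applied to one captured iostat -x line. Both the sample line and the awk field positions are illustrative only, since sysstat versions lay out columns differently:

```shell
# Sample iostat -x device line (device ... avgqu-sz await svctm %util).
line="sda 0.0 12.0 3.0 220.0 24.0 5400.0 48.7 45.2 85.3 1.1 98.4"

verdict=$(printf '%s\n' "$line" | awk '{dev=$1; await=$10; util=$12;
    if (util > 80 && await > 50)
        print dev ": disk I/O bottleneck (util=" util "%, await=" await "ms)";
    else
        print dev ": within thresholds"}')
echo "$verdict"
```

Swap in the NVMe thresholds from above (await > 10 ms) when the device is flash.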
Case study – Log‑induced I/O saturation
# 1. Confirm I/O problem
iostat -xz 1 3
# sda: util=98%, await=85ms, avgqu‑sz=45 → I/O bottleneck
# 2. Find process creating I/O
iotop -b -n 10 | head -30
# rsyslogd writes ~50 MB/s
# 3. Identify log files
lsof | grep rsyslog
# /var/log/messages, /var/log/nginx/access.log, …
# 4. Possible solutions (logrotate, move logs to SSD, adjust rsyslog buffer)
# $ActionFileEnableSync off
# $OMFileFlushInterval 1
Network – Data Transfer Path
Linux network stack
Packets flow from NIC driver → ring buffer → soft interrupt (IRQ) → TCP/UDP/IP stack → socket buffers → application. Ring‑buffer overflow causes packet drops; soft‑interrupt CPU usage appears as si in vmstat. Socket buffers are sized via net.core.rmem_max and net.core.wmem_max.
Key network metrics
/proc/net/dev – bytes, packets, errors, drops per interface.
ss -s – total connections, established, TIME_WAIT, orphan.
netstat -s | grep -i retrans – retransmission counters.
TCP TIME_WAIT count and port range ip_local_port_range affect short‑connection scalability.
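As a sketch of reading drop counters, the awk below extracts the receive‑drop field for eth0 from a captured /proc/net/dev snapshot. The counter values are invented; on a live host, read /proc/net/dev itself:

```shell
# In /proc/net/dev, the 4th numeric field after "iface:" is RX drops.
drops=$(awk -F'[: ]+' '/eth0/{print $6}' <<'EOF'
Inter-|   Receive                  |  Transmit
 face |bytes packets errs drop fifo|bytes packets errs drop fifo
  eth0: 91824 1200 0 37 0 0 0 0 84211 1100 0 0 0 0 0 0
EOF
)
echo "eth0 rx drops: $drops"
```

A nonzero, growing drop count points at ring‑buffer overflow or backlog pressure rather than bandwidth exhaustion.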
Network diagnosis steps
Check interface traffic with cat /proc/net/dev or ip -s link show. High utilization indicates bandwidth exhaustion.
Inspect TIME_WAIT and orphan counts via ss -s. Excessive TIME_WAIT suggests enabling net.ipv4.tcp_tw_reuse=1 (note this only helps outbound connections).
Identify high‑interrupt load with si and, if concentrated on few cores, adjust IRQ affinity or enable irqbalance.
Use ethtool -S eth0 to view dropped packets and buffer overruns.
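One number worth knowing for short‑connection scaling is the ephemeral‑port budget implied by ip_local_port_range. The range below mirrors the common Linux default; on a live host, read /proc/sys/net/ipv4/ip_local_port_range instead:

```shell
range="32768 60999"   # sample value of ip_local_port_range (common default)
set -- $range
capacity=$(( $2 - $1 + 1 ))
echo "usable ephemeral ports per (src ip, dst ip, dst port) tuple: $capacity"
```

If a client churns through connections faster than TIME_WAIT sockets expire, this budget is what it exhausts.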
Case study – Connection refused errors
# 1. Test port connectivity
nc -zv 127.0.0.1 8080
# Ncat: Connection refused.
# 2. Verify listening socket
ss -tlnp | grep 8080
# No output → service not listening
# 3. Check process status
ps aux | grep java | grep -v grep
# No Java process → crashed
# 4. Look for crash logs
dmesg | grep -i java
# Segmentation fault in libc
Conclusion: The failure is due to the Java process crashing, not a network issue.
Cross‑Verification of the Four Resources
In production, bottlenecks often intertwine: high CPU may be caused by I/O wait; memory pressure triggers swap, increasing disk I/O; I/O saturation can block processes, raising load; network congestion can fill connection queues, consuming CPU and memory. Cross‑checking prevents mis‑diagnosis.
Typical cross‑scenarios
CPU us low but load high → likely I/O or network wait (check iostat and the vmstat b column).
Free memory low but swap idle → normal Page Cache usage; monitor available instead.
I/O looks normal yet application latency high → investigate application‑level issues (slow queries, external service latency).
CPU sy high + heavy network traffic → soft‑interrupt overload; examine si and IRQ distribution.
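These cross‑checks can be encoded as a first‑pass classifier over a single vmstat sample. The sample line and thresholds below are illustrative, not authoritative cutoffs:

```shell
# fields: r b swpd free buff cache si so bi bo in cs us sy id wa st
sample="1 4 0 812344 92120 1403220 0 0 4200 16 310 540 5 3 40 52 0"

signal=$(printf '%s\n' "$sample" | awk '{b=$2; si=$7; so=$8; wa=$16;
    if (b > 0 && wa > 20)  print "blocked tasks + high wa: I/O wait, not CPU";
    else if (si + so > 0)  print "swap active: memory pressure";
    else                   print "no obvious cross-signal"}')
echo "$signal"
```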
Performance‑Issue Decision Tree
Server slow / stuck
│
├─ vmstat 1 3
│ ├─ r > CPU_cores*2 && us high → CPU compute bottleneck → ps → top -Hp → jstack/strace
│ ├─ b > 0 → I/O wait → iostat -x 1 → iotop → identify read/write heavy process
│ ├─ si/so > 0 continuously → memory shortage → free -h → locate memory‑hungry process → tune or add RAM
│ └─ cs high + sy high → excessive context switches → ss -s → optimize connection reuse
│
├─ top -b -n 1
│ ├─ us high → application code issue (algorithm, infinite loop)
│ ├─ sy high → many system calls (file/network)
│ ├─ si high → soft‑interrupt pressure (network bursts)
│ └─ wa high → disk or network I/O wait
│
└─ iostat -xz 1 3
├─ util > 80% + await > threshold → disk I/O bottleneck
├─ avgqu‑sz > 2 sustained → queue buildup
└─ w/s >> r/s → write‑heavy workload (logs, DB flush)
System‑Parameter Tuning Summary
CPU tuning
# Bind process to specific cores
taskset -cp 0-7 <PID>
# Adjust nice level
renice +10 <PID>
# Disable transparent hugepages (beneficial for some DB workloads)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Memory tuning
# Reduce swappiness (10‑30 for DB servers)
echo 10 > /proc/sys/vm/swappiness
echo "vm.swappiness=10" >> /etc/sysctl.conf
# Dirty page write‑back thresholds
echo 80 > /proc/sys/vm/dirty_ratio
echo 20 > /proc/sys/vm/dirty_background_ratio
# Raise file descriptor limits
echo "root soft nofile 655350" >> /etc/security/limits.conf
echo "root hard nofile 655350" >> /etc/security/limits.conf
ulimit -n 655350
# Increase max mmap count for Java
echo 655350 > /proc/sys/vm/max_map_count
Disk I/O tuning
# Change I/O scheduler (SSD: deadline or noop; mq-deadline/none on blk-mq kernels)
cat /sys/block/sda/queue/scheduler # view current
echo deadline > /sys/block/sda/queue/scheduler
# Increase request queue depth
echo 2048 > /sys/block/sda/queue/nr_requests
# Adjust read‑ahead size (e.g., 4 MiB)
blockdev --setra 8192 /dev/sda
# Set I/O priority for a process
ionice -c 2 -n 0 -p <PID> # best‑effort class, highest priority
ionice -c 3 -p <PID> # idle class
Network tuning
# sysctl tuning (apply with sysctl -p)
net.core.somaxconn = 655350
net.core.netdev_max_backlog = 655350
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_max_syn_backlog = 655350
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_congestion_control = bbr
# Adjust IRQ affinity for multi‑queue NICs
cat /proc/irq/73/smp_affinity # example mask
systemctl enable irqbalance && systemctl start irqbalance
Monitoring and Baselines
Collect metrics with sar, node_exporter + Prometheus + Grafana, or netdata. Establish normal baselines for CPU, memory, disk, and network, then configure alert thresholds (e.g., CPU > 80 % for 5 min, load > 1.5 × cores, disk await > 50 ms for HDD, swap usage > 0, etc.).
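A minimal sketch of such threshold alerts, with hard‑coded sample readings; a real collector would pull these values from sar or node_exporter:

```shell
cpu_pct=87      # sample 5-minute CPU utilization, percent (illustrative)
load1=12.4      # sample 1-minute load average (illustrative)
cores=$(getconf _NPROCESSORS_ONLN)

alerts=0
if [ "$cpu_pct" -gt 80 ]; then
    echo "ALERT: CPU ${cpu_pct}% > 80%"
    alerts=$((alerts + 1))
fi
# load > 1.5 x cores (floating-point compare done in awk)
if awk -v l="$load1" -v c="$cores" 'BEGIN{exit !(l > 1.5 * c)}'; then
    echo "ALERT: load $load1 > 1.5 x $cores cores"
    alerts=$((alerts + 1))
fi
echo "$alerts alert(s) raised"
```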
Quick Index of Common Symptoms
High load but high CPU idle: likely disk or network wait; check iostat and the vmstat b column.
High load and high sy: kernel‑mode pressure from many syscalls or short‑lived processes; examine context switches and connection counts.
Memory high but no swap: normal Page Cache usage; monitor available for true pressure.
Disk write heavy yet util normal: application‑level write intensity; consider log rotation or write buffering.
Application slow while system metrics normal: investigate application logs, slow SQL queries, external service latency.
Network packet loss with low link utilization: ring‑buffer overflow, TCP retransmits, or insufficient socket buffers; tune net.core.rmem_max / net.core.wmem_max and inspect ethtool -S stats.
Periodic slowdown: scheduled cron jobs, backup scripts, or DB maintenance tasks; correlate timestamps with pidstat or iotop snapshots.
Conclusion
The four resource dimensions share a common troubleshooting mindset: first pinpoint the bottleneck, then locate the process or code causing it. Tools like vmstat, iostat, top, free, and ss are the starting point, and understanding what each metric truly represents is essential for accurate judgment. Cross‑validation prevents misinterpretation—CPU idle low does not always mean a CPU problem, and low free memory may simply be healthy cache usage. Proactive monitoring with baselines and sensible alerts reduces mean‑time‑to‑detect and keeps services reliable.
MaGe Linux Operations