How to Diagnose Slow Server Responses: Full‑Scope CPU, Memory, Disk & Network Analysis
This guide walks Linux operators through a systematic, four‑dimensional investigation of server slowdown—covering CPU, memory, disk I/O, and network—using concrete commands, diagnostic scripts, real‑world scenarios, and step‑by‑step remediation strategies to pinpoint and resolve performance bottlenecks.
Quickly Identify the Bottleneck Resource
When a server becomes sluggish, first determine which of the four core resources (CPU, memory, disk I/O, network) is the limiting factor.
1.1 Use top for a global view
# top -bn1
# Observe the first three lines:
# 10:15:32 up 45 days, 3:22, 2 users, load average: 12.5, 10.2, 8.0
# Tasks: 1234 total, 4 running, 1230 sleeping, 0 stopped, 0 zombie
# %Cpu(s): 15.2 us, 3.1 sy, 0.0 ni, 81.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
The three lines show:
Load average vs. CPU core count
Process state distribution
CPU usage percentages
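Load average only means something relative to the number of cores. A minimal sketch of that comparison, assuming a hypothetical helper name `load_per_core` (it is not part of any standard tool):

```shell
#!/bin/bash
# load_per_core LOAD CORES -> prints LOAD divided by CORES with two decimals.
# Hypothetical helper; awk is used because bash has no floating-point maths.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Example on a live system (not executed here):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A result above 1.00 means more tasks were runnable or in uninterruptible sleep than there were cores. With the sample load of 12.5 and, say, 8 cores (an assumed figure), the ratio is 1.56.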
1.2 Use vmstat for overall system metrics
# vmstat 1 5 # output once per second, five times
# Columns: r b swpd free buff cache si so bi bo in cs us sy id wa st
Key column meanings:
r: runnable processes (running or waiting for a CPU)
b: processes blocked on I/O (D state)
swpd: used swap (KB)
free: free memory (KB)
buff: buffer memory (KB)
cache: cache memory (KB)
si: swap‑in rate (KB/s)
so: swap‑out rate (KB/s)
bi: blocks read from block devices per second
bo: blocks written to block devices per second
us: user‑space CPU usage
sy: system‑space CPU usage
id: idle CPU percentage (lower = busier)
wa: CPU time waiting for I/O (higher = I/O bottleneck)
Judgement rules:
r > CPU core count → CPU queue backlog
b > 5 → severe I/O wait
wa > 20% → I/O bottleneck (check together with id)
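si/so persistently > 0 → memory pressure, the system is actively swapping
free consistently low → possibly tight memory; confirm with "available" in free, since Linux uses spare RAM for cache
1.3 Use iostat for disk I/O details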
# iostat -xz 1 5 # detailed I/O every second, five samples
# Important fields:
# r/s, w/s – IOPS (reads/writes per second)
# rkB/s, wkB/s – throughput (KB/s)
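# aqu-sz (avgqu-sz in older sysstat) – average queue length (>1 indicates wait, >5 severe)
# await (split into r_await / w_await in newer sysstat) – average I/O wait time (ms, >20 ms slow)
# %util – device utilization (>80 % indicates disk is a bottleneck)
If %util approaches 100 %, the disk is likely the bottleneck. SSDs and RAID arrays serve requests in parallel, so confirm with await and queue length before concluding.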
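Because iostat column names and positions differ between sysstat versions, a threshold check is safer when it looks columns up by header name. A minimal sketch, assuming a hypothetical function name `flag_busy_disks` and the 80 % threshold from the list above:

```shell
#!/bin/bash
# flag_busy_disks [MAX_UTIL] -> reads `iostat -x` output on stdin and prints
# every device line whose %util exceeds MAX_UTIL (default 80).
# Hypothetical helper; the column is located via the "Device" header line.
flag_busy_disks() {
    awk -v util_max="${1:-80}" '
        /^Device/ {
            # Remember the position of each column name.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Only lines wide enough to contain %util are device rows;
        # this skips blank lines and the avg-cpu block.
        ("%util" in col) && NF >= col["%util"] {
            u = $(col["%util"])
            if (u + 0 > util_max) printf "%s util=%s%%\n", $1, u
        }
    '
}

# Example on a live system (not executed here):
#   iostat -xz 1 3 | flag_busy_disks 80
```

Note that the first report iostat prints is an average since boot, so only the later samples reflect current activity.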
1.4 Use free for memory status
# free -m
# total used free shared buff/cache available
# Mem: 32000 28000 4000 2000 8000 2000
# Swap: 8192 0 8192
Judgement rules:
available < total * 10% → memory tight
available stays low → memory shortage
swap used > 0 → pages have been swapped out at some point; confirm active swapping with si/so in vmstat
1.5 Quick‑diagnosis script
#!/bin/bash
# server_quick_diag.sh – locate performance bottleneck within a minute
echo "===== System Overview ====="
uptime
echo ""
echo "===== CPU Status ====="
nproc
vmstat 1 2 | tail -1
echo ""
echo "===== Memory Status ====="
free -m
echo ""
echo "===== Disk I/O ====="
iostat -xz 1 2 | tail -20
echo ""
echo "===== Network Status ====="
sar -n DEV 1 2 | grep -E "^Average|^Linux" | tail -10
echo ""
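echo "===== Top 5 CPU Processes ====="
ps aux --sort=-%cpu | head -6 # header + 5 processes
echo ""
echo "===== Top 5 Memory Processes ====="
ps aux --sort=-%mem | head -6 # header + 5 processes
echo ""
echo "===== Top 5 I/O Processes ====="
# Columns: kB/s read+written, PID, command (run as root to see all processes)
pidstat -d 1 1 | awk '/^Average:/ && $3 ~ /^[0-9]+$/ {print $4+$5, $3, $NF}' | sort -nr | head -5
CPU Bottleneck Investigation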
2.1 Confirm whether CPU is truly the bottleneck
# Show per-core CPU usage (mpstat is part of sysstat)
mpstat -P ALL 1 1
# High "us" → user‑space consumption
# High "sy" → kernel‑space consumption (many syscalls, context switches)
# Compare load average with core count
nproc # number of CPU cores
uptime # load average
# Load average > core count → processes queuing for CPU (or blocked on I/O; check wa)
2.2 Find the processes consuming the most CPU
# Interactive view
top # press Shift+P to sort by CPU
# Non‑interactive sorting
ps aux --sort=-%cpu | head -20
# Filter out defunct processes
ps aux | grep -v defunct | sort -k3nr | head -20
2.3 Analyse CPU consumption of a specific process
# Thread count of the process
ps -o nlwp= -p <pid>
# CPU usage trend of the process (requires sysstat)
pidstat -u -p <pid> 1 60 > /tmp/cpu_pidstat.log
# Process priority and nice value
ps -o pid,ni,pri,pcpu,comm -p <pid>
# ni: nice value (-20..19, lower = higher priority)
2.4 Common CPU bottleneck scenarios
Scenario 1: A single business process consumes 100 % CPU
# Identify the process
ps aux --sort=-%cpu | head -10
# Java process – inspect thread stacks
jstack <pid> > /tmp/jstack.log
# Python process – CPU-bound threads are serialised by the GIL
ps -o nlwp= -p <pid> # thread count
# Remedy: use multiprocessing instead of multithreading
# Nginx/PHP-FPM – adjust pm.max_children, pm.start_servers
pgrep -c php-fpm # number of PHP-FPM processes
Scenario 2: Massive short‑lived processes cause high scheduling overhead
# Check process creation rate
cat /proc/loadavg # fifth field = last created PID
# Rapid PID growth indicates many short‑lived processes
ps aux | wc -l
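Watching the last PID is an indirect measure. The kernel also keeps a cumulative fork counter in the `processes` line of /proc/stat, and sampling it twice gives a creation rate. A minimal sketch, with hypothetical helper names `read_forks` and `fork_rate`:

```shell
#!/bin/bash
# read_forks [FILE] -> cumulative number of forks since boot.
# FILE defaults to /proc/stat; the parameter exists only to ease testing.
read_forks() {
    awk '$1 == "processes" { print $2 }' "${1:-/proc/stat}"
}

# fork_rate BEFORE AFTER SECONDS -> forks per second (integer division).
fork_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Example on a live system (not executed here):
#   a=$(read_forks); sleep 5; b=$(read_forks); fork_rate "$a" "$b" 5
```

What counts as "too many" depends on the workload, so compare the result against a baseline taken when the server is healthy.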
# Identify which user spawns them
ps -eo user= | sort | uniq -c | sort -rn | head
Scenario 3: Excessive context switches
# Context switch count per second
vmstat 1 # "cs" column
# Which process has the most switches
pidstat -w 1 5 # cswch/s (voluntary), nvcswch/s (involuntary)
2.5 CPU bottleneck remediation
# 1. Adjust process priority temporarily
sudo renice -n -10 -p <pid>
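# 2. Limit CPU via cgroups
# cgroup v1 with libcgroup: edit /etc/cgconfig.conf
# (on systemd hosts: sudo systemctl set-property <unit> CPUQuota=50%)
group limit_cpu {
    cpu {
        # 50 % of one CPU: quota / period = 50000 / 100000
        cpu.cfs_quota_us = 50000;
        cpu.cfs_period_us = 100000;
    }
}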
# 3. If single‑threaded, consider multi‑process or multiple instances
# 4. Upgrade to more cores or higher frequency CPUs
# 5. Bind process to specific cores
sudo taskset -p -c 0,1,2,3 <pid>
Memory Bottleneck Investigation
3.1 Confirm whether memory is the bottleneck
# Show memory usage
free -m
# Focus on "available" (free + buff/cache – unreclaimable)
# If available < 10 % of total → memory pressure
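The 10 % rule can be checked without parsing `free` output, since the same figures are exposed in /proc/meminfo (the MemAvailable field needs kernel 3.14 or later). A minimal sketch, assuming a hypothetical helper name `mem_available_pct`:

```shell
#!/bin/bash
# mem_available_pct [FILE] -> MemAvailable as a whole percentage of MemTotal.
# FILE defaults to /proc/meminfo; the parameter exists only to ease testing.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) printf "%d\n", avail * 100 / total }
    ' "${1:-/proc/meminfo}"
}

# Example on a live system (not executed here):
#   [ "$(mem_available_pct)" -lt 10 ] && echo "memory pressure"
```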
# Check swap usage
swapon --show # used swap together with non-zero si/so in vmstat indicates insufficient RAM
3.2 Find processes consuming the most memory
# Sort by memory usage
ps aux --sort=-%mem | head -20
# Virtual and resident size of a process (use pmap -x <pid> for the full memory map)
ps -p <pid> -o pid,vsz,rss,comm
# Top memory consumers via top
top # press Shift+M to sort by memory
3.3 Analyse memory consumption of a process
# Read /proc/<pid>/status for detailed mapping
cat /proc/<pid>/status | grep -E "Vm|Rss|Pid"
# Example output:
# VmPeak: 524288 kB # peak virtual memory
# VmSize: 524288 kB # current virtual memory
# VmRSS: 102400 kB # resident physical memory
# VmData: 409600 kB # data segment size (includes the heap)
# Track memory trend (requires sar or custom script)
# Record RSS every minute with ps aux
3.4 Common memory bottleneck scenarios
Scenario 1: Memory leak
# Monitor RSS growth
watch -n 1 "ps -p <pid> -o pid,vsz,rss,comm"
# If RSS rises continuously under steady load, a leak is likely
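Watching by eye works for a few minutes; for a slow leak it helps to collect samples and compare the first with the last. A minimal sketch, with hypothetical helper names `rss_kb` and `rss_trend`:

```shell
#!/bin/bash
# rss_kb PID -> resident set size in kB, read from /proc/PID/status.
rss_kb() {
    awk '$1 == "VmRSS:" { print $2 }' "/proc/$1/status"
}

# rss_trend -> reads one RSS sample (kB) per line on stdin and prints
# "first last delta" so sustained growth stands out.
rss_trend() {
    awk 'NR == 1 { first = $1 } { last = $1 } END { if (NR) print first, last, last - first }'
}

# Example on a live system, ten samples one minute apart (not executed here;
# 1234 is a placeholder PID):
#   for i in $(seq 10); do rss_kb 1234; sleep 60; done | rss_trend
```

A positive delta alone is not proof of a leak; caches and warm-up also grow RSS, so look for growth that never levels off.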
# Java leak detection
jstat -gc <pid> 1s # monitor GC stats; OU growing without dropping after Full GC indicates a leak
Scenario 2: OOM Killer activation
# Search kernel logs for OOM events
sudo dmesg | grep -i "out of memory"
sudo journalctl -xb | grep -i "killed process"
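The kernel reports each victim in a "Killed process <pid> (<comm>)" message, so the log can be reduced to a list of PIDs and command names. A minimal sketch, assuming a hypothetical helper name `oom_victims`:

```shell
#!/bin/bash
# oom_victims -> reads kernel log text on stdin and prints "PID COMMAND"
# for every "Killed process <pid> (<comm>)" message it finds.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Example on a live system (not executed here):
#   sudo dmesg | oom_victims
```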
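# List the processes most likely to be chosen by the OOM Killer (oom_score, PID, command)
for p in /proc/[0-9]*; do echo "$(cat "$p/oom_score") ${p#/proc/} $(cat "$p/comm")"; done 2>/dev/null | sort -nr | head -20
# Lower OOM score to protect a process (-1000 exempts it from the OOM Killer)
sudo bash -c 'echo -1000 > /proc/<pid>/oom_score_adj'
Scenario 3: Heavy swap usage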
# Show swap usage
swapon -s
free -m
# Persistent swap‑in/out indicates memory pressure
vmstat 1 5 # watch "si" and "so"
# Find processes using swap (Linux ≥2.6.34)
for f in /proc/*/status; do awk '/VmSwap/{s+=$2}END{if(s>0)print FILENAME": "s" kB"}' $f; done 2>/dev/null | sort -t: -k2 -nr | head
# Remedies: limit process memory via cgroups, add RAM, lower vm.swappiness (default 60, set to 10)
sudo sysctl -w vm.swappiness=10
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.conf
3.5 Memory bottleneck remediation
# 1. Limit process memory (cgroups or systemd)
# systemd service example:
[Service]
MemoryMax=2G
MemoryHigh=1.8G
# 2. Disable memory overcommit so allocations fail instead of relying on the OOM Killer (not generally recommended)
sudo sysctl -w vm.overcommit_memory=2
# 3. Tune JVM heap for Java apps
-Xms512m -Xmx1024m # initial heap 512 MB, maximum 1 GB
-XX:+HeapDumpOnOutOfMemoryError
-XX:MaxMetaspaceSize=256m
# 4. For containers, set memory limits in the deployment spec
containers:
- name: <container-name>
  resources:
    limits:
      memory: "2Gi"
Disk I/O Bottleneck Investigation
4.1 Confirm whether disk I/O is the bottleneck
# iostat overview
iostat -xz 1 5
# Indicators:
# %util > 80 % → disk saturated
# await > 20 ms → slow I/O response
# avgqu-sz (aqu-sz in newer sysstat) > 1 → queue backlog
4.2 Find processes with highest I/O
# Requires root
sudo iotop -oa # live I/O view
# If iotop unavailable, use pidstat
sudo pidstat -d 1 5
# Inspect a specific process
cat /proc/<pid>/io
4.3 Common disk I/O bottleneck scenarios
Scenario 1: Heavy sequential writes (logs, backups)
# Identify write‑heavy processes
sudo iotop -b -o -a -n 3 | head -50 # batch mode so the output can be piped
# Check write rate
iostat -xz 1 | grep sda # high w/s, low r/s indicates write‑heavy
# Remedies:
# 1. Asynchronous log writes
# 2. Write logs to tmpfs (memory filesystem; contents are lost on reboot)
# 3. Batch writes, reduce fsync frequency
Scenario 2: Heavy random reads (databases, file services)
# Identify read‑heavy processes
sudo iotop -b -o -a -n 3 | head -50 # batch mode so the output can be piped
# Check read IOPS
iostat -xz 1 | grep sda # high r/s indicates read‑heavy
# Typical fix for MySQL InnoDB:
# - Increase innodb_buffer_pool_size
# - Use SSDs
# - Optimize queries to reduce random reads
Scenario 3: Swap‑induced I/O
# High si/so indicates memory pressure causing swap I/O
vmstat 1 5
# Locate processes using swap (same loop as in memory section)
for f in /proc/*/status; do awk '/VmSwap/{s+=$2}END{if(s>0)print FILENAME": "s" kB"}' $f; done 2>/dev/null | sort -t: -k2 -nr | head
# Root solution: add RAM or reduce memory consumption
4.4 Disk I/O remediation
# 1. Choose appropriate I/O scheduler
cat /sys/block/sda/queue/scheduler # e.g., [mq-deadline] kyber bfq none
# SSD → use "none" (the multi-queue counterpart of noop)
echo none | sudo tee /sys/block/sda/queue/scheduler
# HDD → use "mq-deadline" or "bfq"
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
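# 2. Adjust I/O priority with ionice (honoured only by schedulers that implement priorities, such as bfq)
sudo ionice -c 1 -n 0 -p <pid> # real-time class, highest priority (root only)
ionice -c 2 -n 7 -p <pid> # best-effort class, lowest priority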
# 3. Replace HDD with SSD
# 4. Use RAID controller cache
# 5. Separate high-I/O and low-I/O data onto different disks
Network Bottleneck Investigation
5.1 Confirm whether the network is the bottleneck
# Interface statistics
ip -s link
# Bandwidth usage
sar -n DEV 1 5
# TCP connection states
netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c
# Many TIME_WAIT → many short connections
# Many SYN_RECV → possible SYN flood attack
5.2 Find processes consuming bandwidth
# Tools like iptraf or nethogs are useful (nethogs needs root)
sudo nethogs -d 1
# Detailed connection info
ss -tunapl
# Raw interface counters
cat /proc/net/dev
5.3 Common network bottleneck scenarios
Scenario 1: Bandwidth saturated
# Check per‑interface traffic
sar -n DEV 1 5 | grep -E "^Average|^Linux"
# If rxkB/s or txkB/s approaches the interface limit → bandwidth full
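"Approaches the interface limit" requires a unit conversion: sar reports kilobytes per second (1 kB = 1024 bytes), while link speed is quoted in megabits per second. A minimal sketch of the arithmetic, assuming a hypothetical helper name `link_util_pct`:

```shell
#!/bin/bash
# link_util_pct KB_PER_SEC SPEED_MBIT -> link utilisation in percent (one decimal).
# KB_PER_SEC is rxkB/s or txkB/s from sar; SPEED_MBIT is the link speed in Mbit/s.
link_util_pct() {
    awk -v kb="$1" -v mbit="$2" \
        'BEGIN { printf "%.1f\n", kb * 1024 * 8 * 100 / (mbit * 1000000) }'
}

# Example on a live system (not executed here; eth0 is a placeholder name and
# the speed file may be unreadable or -1 on virtual interfaces):
#   link_util_pct 100000 "$(cat /sys/class/net/eth0/speed)"
```

Recent sysstat versions also print a %ifutil column in `sar -n DEV`, which gives the same figure directly when the interface speed is known.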
# Count connections
netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn
# Many ESTABLISHED → high concurrency
Scenario 2: Packet loss and retransmission
# Interface error counters
ip -s link | grep -A 5 "RX:"
# RX errors / RX dropped indicate loss
# TCP retransmission stats
netstat -s | grep -i retransmit
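`netstat -s` prints raw counters, not a rate. The same counters are available in /proc/net/snmp as OutSegs and RetransSegs, so the ratio can be computed directly. A minimal sketch, assuming a hypothetical helper name `tcp_retrans_pct`:

```shell
#!/bin/bash
# tcp_retrans_pct [FILE] -> retransmitted segments as a percentage of sent
# segments, two decimals. FILE defaults to /proc/net/snmp.
# The file holds two "Tcp:" lines: column names first, then values.
tcp_retrans_pct() {
    awk '
        $1 == "Tcp:" && !seen {
            for (i = 2; i <= NF; i++) col[$i] = i
            seen = 1
            next
        }
        $1 == "Tcp:" {
            out = $(col["OutSegs"]); re = $(col["RetransSegs"])
            if (out > 0) printf "%.2f\n", re * 100 / out
        }
    ' "${1:-/proc/net/snmp}"
}

# Example on a live system (not executed here):
#   tcp_retrans_pct
```

The counters are cumulative since boot, so this is a long-run average; for the current rate, take two readings and compare the differences.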
# Retransmission rate (retransmitted / sent segments) > 1 % signals network quality issues
Scenario 3: Slow DNS resolution
# Measure DNS query time
dig example.com # see "Query time" near the end of the output
# Slow responses affect any service that relies on DNS
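To compare resolvers, it helps to pull just the query time out of the dig output and repeat the measurement for each configured nameserver. A minimal sketch, with hypothetical helper names `dns_query_ms` and `list_nameservers`:

```shell
#!/bin/bash
# dns_query_ms -> reads dig output on stdin and prints the "Query time" in ms.
dns_query_ms() {
    awk '/Query time:/ { print $4 }'
}

# list_nameservers [FILE] -> nameserver addresses, in configured order.
# FILE defaults to /etc/resolv.conf.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Example on a live system (not executed here):
#   for ns in $(list_nameservers); do
#       echo "$ns: $(dig @"$ns" example.com | dns_query_ms) ms"
#   done
```

Run the loop more than once: the first query to a resolver may be a cache miss and is not representative.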
# Common causes:
# 1. Slow DNS server
# 2. Wrong order in /etc/resolv.conf
# 3. Firewall blocking port 53
# Optimisation:
# Put fast DNS servers first in /etc/resolv.conf
# Deploy dnsmasq for local caching
5.4 Network bottleneck remediation
# 1. Increase bandwidth (upgrade link or use CDN)
# 2. Deploy load balancers to spread traffic
# 3. Tune TCP parameters in /etc/sysctl.conf
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Apply changes
sudo sysctl -p
# 4. Mitigate DDoS / connection storms
sudo iptables -A INPUT -p tcp --dport 80 -m connlimit --connlimit-above 100 -j REJECT
Integrated Troubleshooting Cases
Case 1 – Load Average 30+, CPU usage low
Symptoms: Load average spikes to 30+, CPU usage (us+sy) ~15 %, request latency jumps from 100 ms to 5 s.
# Quick resource check
vmstat 1 3
# b=28 (processes blocked on I/O) while us+sy stays low → the load comes from I/O wait, not CPU
# wa=60 % → I/O bottleneck
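The reading above applies the judgement rules from section 1.2. A minimal sketch of those rules as a filter over one vmstat data line, assuming the classic 17-column layout shown in 1.2 (newer procps versions may append columns) and a hypothetical function name `vmstat_verdict`:

```shell
#!/bin/bash
# vmstat_verdict CORES -> reads one vmstat data line on stdin and prints
# every section 1.2 rule that the line triggers.
# Column positions: r=1 b=2 si=7 so=8 wa=16.
vmstat_verdict() {
    awk -v cores="$1" '
        {
            r = $1; b = $2; si = $7; so = $8; wa = $16
            if (r > cores)        print "cpu-queue: r=" r " > cores=" cores
            if (b > 5)            print "io-blocked: b=" b
            if (wa > 20)          print "io-wait: wa=" wa "%"
            if (si > 0 || so > 0) print "swapping: si=" si " so=" so
        }
    '
}

# Example on a live system (not executed here):
#   vmstat 1 2 | tail -1 | vmstat_verdict "$(nproc)"
```

The thresholds are the article's rules of thumb, not hard limits; a single sample can mislead, so repeat the check over several intervals.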
# Verify disk I/O
iostat -xz 1 3
# %util 98 %, avgqu‑sz 15, await 200 ms → disk saturated
# Identify offending process
sudo iotop -oa | head -30
# mysqld shows highest I/O
# MySQL analysis (write-heavy example)
# Root cause: InnoDB dirty-page flushing causing massive disk I/O
# Fix: tune innodb_io_capacity and the dirty-page threshold; relaxing
# innodb_flush_log_at_trx_commit or enlarging innodb_buffer_pool_size can also help
# Temporary adjustment (run in the MySQL client)
SET GLOBAL innodb_io_capacity = 2000;
SET GLOBAL innodb_max_dirty_pages_pct = 50;
# Permanent in /etc/my.cnf
[mysqld]
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_max_dirty_pages_pct = 50
Case 2 – Java service OOM, frequent Full GC
Symptoms: Java service stalls for ~5 s every 10 minutes.
# Check process memory
ps -p <pid> -o pid,vsz,rss,comm # RSS stays high
# GC statistics
jstat -gc <pid> 1s # OU grows, Full GC frequent
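With a dozen or more columns scrolling past, the two that matter here (old-generation usage OU and the Full GC count FGC) are easy to lose. A minimal sketch that isolates them by header name, which also copes with the extra columns newer JDKs print; `jstat_old_gen` is a hypothetical function name:

```shell
#!/bin/bash
# jstat_old_gen -> reads `jstat -gc` output on stdin (header line first) and
# prints "OU FGC" for every sample line.
jstat_old_gen() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        { print $(col["OU"]), $(col["FGC"]) }
    '
}

# Example on a live system, ten samples one second apart (not executed here;
# 1234 is a placeholder PID):
#   jstat -gc 1234 1s 10 | jstat_old_gen
```

If OU stays near the old-generation capacity while FGC keeps increasing, Full GCs are reclaiming little, which matches the leak or undersized-heap diagnosis below.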
# Dump heap for analysis
jmap -dump:format=b,file=/tmp/heap.hprof <pid>
# Analyse with MAT (Memory Analyzer Tool)
# Root cause: memory leak or insufficient heap size
# Remedy: increase heap or fix leak
# JVM tuning example
-Xms4g -Xmx4g # fixed 4 GB heap
-XX:+HeapDumpOnOutOfMemoryError
-XX:NewRatio=2 # Old = 2× New (with G1, fixing the young size overrides its adaptive sizing; omit unless needed)
-XX:+UseG1GC # G1 collector for large heaps
Preventive Measures & Routine Health Checks
7.1 Build a monitoring & alerting system
#!/bin/bash
# server_health_check.sh – scheduled every 5 minutes via cron
HOST=$(hostname)
DATE=$(date +%Y%m%d_%H%M%S)
LOG="/var/log/server_health/${DATE}.log"
mkdir -p /var/log/server_health
{
echo "===== Server Health Check - $HOST - $DATE ====="
echo "Load Average: $(awk '{print $1}' /proc/loadavg) (cores: $(nproc))"
free -m | awk '/^Mem/{printf "Memory: total=%s used=%s free=%s available=%s\n",$2,$3,$4,$7}'
df -h | awk '/^\/dev/{printf "Disk %s: usage=%s\n",$6,$5}'
vmstat 1 2 | tail -1 | awk '{printf "CPU: us=%s sy=%s id=%s wa=%s\n",$13,$14,$15,$16}'
echo "Top 3 CPU:"
ps aux --sort=-%cpu | awk 'NR>1 && NR<=4 {print " "$11" PID="$2" CPU="$3"% MEM="$4"%"}'
echo "Top 3 MEM:"
ps aux --sort=-%mem | awk 'NR>1 && NR<=4 {print " "$11" PID="$2" CPU="$3"% MEM="$4"%"}'
} > "$LOG"
# Simple alert if load ratio > 2× cores
LOAD=$(awk '{print $1}' /proc/loadavg)
CORES=$(nproc)
LOAD_RATIO=$(echo "scale=2; $LOAD/$CORES" | bc)
if [ "$(echo "$LOAD_RATIO > 2" | bc)" -eq 1 ]; then
echo "ALERT: Load Average $LOAD > 2x cores on $HOST" | tee -a /var/log/server_health/alerts.log
fi
7.2 Common optimisation parameters
# /etc/sysctl.conf – typical performance tweaks
# Network
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_tw_buckets = 262144
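# Memory – reduce swap usage
vm.swappiness = 10
# Writeback: the kernel default for dirty_ratio is 20; 60 allows larger write bursts at the cost of longer flush stalls
vm.dirty_ratio = 60
vm.dirty_background_ratio = 10
# File descriptors
fs.file-max = 655360
# The kernel default for fs.nr_open is 1048576, so this value lowers the per-process ceiling
fs.nr_open = 655360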
# Apply changes
sudo sysctl -p
Conclusion
The key to solving sudden server slowdown is to first pinpoint which resource (CPU, memory, disk I/O, or network) is the bottleneck and then conduct a deep dive on that resource to uncover the root cause.
Quick‑diagnosis four‑step method:
1. top – examine load average and CPU usage
2. vmstat – check r/b columns (process queue, blocked processes) and wa (I/O wait)
3. iostat – inspect %util and avgqu‑sz for disk saturation
4. free – look at available memory and swap usage
Typical symptom patterns for each bottleneck type:
CPU bottleneck: Load > cores, high CPU%, low iowait
Memory bottleneck: High load, low CPU%, available low, swap used (iowait rises once swapping starts)
I/O bottleneck: High load, low CPU%, high iowait, %util > 80%, await > 20ms
Network bottleneck: Load and CPU% often unremarkable, low iowait, bandwidth saturated, high packet loss/retransmit
Prioritise remediation in this order: memory issues first (they affect all other resources), then disk I/O, followed by CPU and network based on business impact. Avoid blind hardware upgrades; always verify the root cause and apply targeted software or configuration fixes before scaling the infrastructure.
Ops Community
A leading IT operations community where professionals share and grow together.
