
How to Diagnose Slow Server Responses: Full‑Scope CPU, Memory, Disk & Network Analysis

This guide walks Linux operators through a systematic, four‑dimensional investigation of server slowdown—covering CPU, memory, disk I/O, and network—using concrete commands, diagnostic scripts, real‑world scenarios, and step‑by‑step remediation strategies to pinpoint and resolve performance bottlenecks.

Ops Community

Quickly Identify the Bottleneck Resource

When a server becomes sluggish, first determine which of the four core resources (CPU, memory, disk I/O, network) is limiting.

1.1 Use top for a global view

# top -bn1
# Observe the first three lines:
# 10:15:32 up 45 days,  3:22,  2 users,  load average: 12.5, 10.2, 8.0
# Tasks: 1234 total,   4 running, 1230 sleeping,   0 stopped,   0 zombie
# %Cpu(s): 15.2 us,  3.1 sy,  0.0 ni, 81.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

The three lines show:

Load average vs. CPU core count

Process state distribution

CPU usage percentages
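
The first comparison, load average against core count, can be scripted. This is only a minimal sketch: the function name `load_status` and the threshold of 1.0 load per core are illustrative assumptions, not part of the original guide.

```bash
# load_status LOAD CORES
# Prints the load per core and a rough verdict.
# Assumption: more than 1.0 runnable task per core counts as "queued".
load_status() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    ratio = load / cores
    printf "%.2f %s\n", ratio, (ratio > 1.0 ? "queued" : "ok")
  }'
}
```

On a live host it could be called as `load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`.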

1.2 Use vmstat for overall system metrics

# vmstat 1 5   # output once per second, five times
# Columns: r b swpd free buff cache si so bi bo in cs us sy id wa st

Key column meanings:

r: runnable processes (running or waiting for a CPU)
b: processes in uninterruptible sleep, usually blocked on I/O (D state)
swpd: used swap (KB)
free: free memory (KB)
buff: buffer memory (KB)
cache: cache memory (KB)
si: swap‑in rate (KB/s)
so: swap‑out rate (KB/s)
bi: blocks read in from block devices per second
bo: blocks written out to block devices per second
us: user‑space CPU usage
sy: system‑space CPU usage
id: idle CPU percentage (lower = busier)
wa: CPU time waiting for I/O (higher = I/O bottleneck)

Judgement rules:

r > CPU core count → CPU queue backlog
b > 5 → severe I/O wait
wa > 20% → I/O bottleneck (check together with id)
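si/so persistently > 0 → memory pressure, active swapping
free low on its own is normal (Linux uses idle RAM as page cache); judge memory pressure by "available" in free and by si/so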
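
The rules above can be applied mechanically to one row of vmstat output. The sketch below assumes the 17-column layout listed earlier (r b swpd free buff cache si so bi bo in cs us sy id wa st); the function name `vmstat_flags` and the flag labels are hypothetical.

```bash
# vmstat_flags CORES
# stdin: one vmstat data row. Prints the rules that fire, or "none".
vmstat_flags() {
  awk -v cores="$1" '{
    out = ""
    if ($1 + 0 > cores + 0)           out = out " cpu-queue"    # r > core count
    if ($2 + 0 > 5)                   out = out " io-blocked"   # b > 5
    if ($16 + 0 > 20)                 out = out " io-wait"      # wa > 20 %
    if ($7 + 0 > 0 || $8 + 0 > 0)     out = out " swapping"     # si/so > 0
    if (out == "") out = " none"
    print substr(out, 2)
  }'
}
```

A possible invocation is `vmstat 1 2 | tail -1 | vmstat_flags "$(nproc)"`. The second sample is used because the first vmstat row reports averages since boot.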

1.3 Use iostat for disk I/O details

# iostat -xz 1 5   # detailed I/O every second, five samples
# Important fields:
# r/s, w/s – IOPS (reads/writes per second)
# rkB/s, wkB/s – throughput (KB/s)
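# aqu-sz (avgqu-sz in older sysstat) – average queue length (>1 indicates wait, >5 severe)
# r_await, w_await (a single await column in older sysstat) – average I/O wait time (ms, >20 ms slow)
# %util – device utilization (>80 % suggests the disk is a bottleneck)

If %util approaches 100 % and queue length and await are high as well, the disk is the bottleneck. On SSD, NVMe and RAID devices that serve requests in parallel, %util alone overstates saturation.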
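
Because the iostat column order differs between sysstat versions, a filter is safer if it looks up %util by its header name. A minimal sketch, with the hypothetical function name `busy_disks` and a default threshold of 80 taken from the rule above:

```bash
# busy_disks [THRESHOLD]
# stdin: `iostat -x` output. Prints "device %util" for devices above THRESHOLD.
busy_disks() {
  awk -v limit="${1:-80}" '
    $1 == "avg-cpu:" { col = 0; next }            # CPU block: ignore its number row
    $1 ~ /^Device/   {                            # header row: locate the %util column
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
  '
}
```

Typical use would be `iostat -xz 1 2 | busy_disks 80`.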

1.4 Use free for memory status

# free -m
#               total   used   free   shared  buff/cache  available
# Mem:          32000  26000   1000    2000     5000        2000
# Swap:          8192     0    8192

Judgement rules:

available < total * 10% → memory tight
available stays low → memory shortage
swap used > 0 → pages were swapped out at some point; ongoing swapping (si/so > 0 in vmstat) is what shows RAM is insufficient
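
The 10 % rule can be checked from the `free -m` output directly. A minimal sketch, assuming the modern `free` layout in which "available" is the seventh field of the `Mem:` row; the function name `mem_available_pct` is hypothetical.

```bash
# mem_available_pct
# stdin: `free -m` output. Prints "available" as a whole percentage of total.
mem_available_pct() {
  awk '/^Mem:/ { printf "%d\n", $7 * 100 / $2 }'
}
```

For example, `[ "$(free -m | mem_available_pct)" -lt 10 ] && echo "memory tight"` would flag the condition described above.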

1.5 Quick‑diagnosis script

#!/bin/bash
# server_quick_diag.sh – locate performance bottleneck within a minute

echo "===== System Overview ====="
 uptime

echo ""

echo "===== CPU Status ====="
 nproc
 vmstat 1 2 | tail -1

echo ""

echo "===== Memory Status ====="
 free -m

echo ""

echo "===== Disk I/O ====="
 iostat -xz 1 2 | tail -20

echo ""

echo "===== Network Status ====="
 sar -n DEV 1 2 | grep -E "^Average|^Linux" | tail -10

echo ""

echo "===== Top 5 CPU Processes ====="
 ps aux --sort=-%cpu | head -6   # header + 5 processes

echo ""

echo "===== Top 5 Memory Processes ====="
 ps aux --sort=-%mem | head -6   # header + 5 processes

echo ""

echo "===== Top 5 I/O Processes ====="
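 # kB/s (read + write), PID, command; requires sysstat
 pidstat -d 1 1 | awk '/^Average/ && $3 ~ /^[0-9]+$/ {print $4+$5, $3, $NF}' | sort -nr | head -5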

CPU Bottleneck Investigation

2.1 Confirm whether CPU is truly the bottleneck

# Show per‑core CPU usage (mpstat ships with sysstat; in interactive top, press 1)
 mpstat -P ALL 1 3
# High "us" → user‑space consumption
# High "sy" → kernel‑space consumption (many syscalls, context switches)
# Compare load average with core count
 nproc   # number of CPU cores
 uptime   # load average
# Load average > core count → tasks are queuing for CPU or, on Linux, blocked in uninterruptible I/O (D state)

2.2 Find the processes consuming the most CPU

# Interactive view
 top   # press Shift+P to sort by CPU
# Non‑interactive sorting
 ps aux --sort=-%cpu | head -20
# Filter out defunct processes
 ps aux | grep -v defunct | sort -k3nr | head -20

2.3 Analyse CPU consumption of a specific process

# Thread count (useful for multi‑threaded apps)
 ps -o nlwp= -p <pid>
# CPU usage trend of the process (requires sysstat)
 pidstat -u -p <pid> 1 60 > /tmp/cpu_pidstat.log
# Process priority and nice value
 ps -o pid,ni,pri,pcpu,comm -p <pid>
# ni: nice value (-20..19, lower = higher priority)

2.4 Common CPU bottleneck scenarios

Scenario 1: A single business process consumes 100 % CPU

# Identify the process
 ps aux --sort=-%cpu | head -10
# Java process – inspect thread stacks
 jstack <pid> > /tmp/jstack.log
# Python process – CPU‑bound threads are serialized by the GIL
 ps -o nlwp= -p <pid>   # thread count
# Remedy: use multiprocessing instead of multithreading for CPU‑bound work
# PHP‑FPM – adjust pm.max_children, pm.start_servers
 pgrep -c php-fpm   # number of php-fpm processes

Scenario 2: Massive short‑lived processes cause high scheduling overhead

# Check process creation rate
 cat /proc/loadavg   # fifth field = last created PID
# Rapid PID growth indicates many short‑lived processes
 ps aux | wc -l
# Identify which user spawns them
 ps -eo user= | sort | uniq -c | sort -nr | head
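
A more direct measure of the creation rate is the `processes` counter in /proc/stat, which holds the number of forks since boot. The sketch below samples it twice; the function names `forks_total` and `fork_rate` are hypothetical.

```bash
# forks_total [STAT_FILE]
# Cumulative number of forks since boot ("processes" line of /proc/stat).
forks_total() { awk '$1 == "processes" { print $2 }' "${1:-/proc/stat}"; }

# fork_rate BEFORE AFTER SECONDS
# Average forks per second between two samples (integer division).
fork_rate() { echo $(( ($2 - $1) / $3 )); }
```

Usage could look like `a=$(forks_total); sleep 5; b=$(forks_total); fork_rate "$a" "$b" 5`. What counts as "too many" depends on the workload, so compare against a known-good baseline.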

Scenario 3: Excessive context switches

# Context switch count per second
 vmstat 1   # "cs" column
# Which process has the most switches
 pidstat -w 1 5   # cswch/s (voluntary), nvcswch/s (involuntary)

2.5 CPU bottleneck remediation

# 1. Adjust process priority temporarily
 sudo renice -n -10 -p <pid>
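# 2. Limit CPU via cgroups (cgroup v1 with libcgroup; on cgroup v2 hosts set CPUQuota=50% on the systemd unit instead)
 # Edit /etc/cgconfig.conf (quota / period = 50000 / 100000 = 50 % of one CPU)
 group limit_cpu {
   cpu {
     cpu.cfs_quota_us = 50000;
     cpu.cfs_period_us = 100000;
   }
 }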
# 3. If single‑threaded, consider multi‑process or multiple instances
# 4. Upgrade to more cores or higher frequency CPUs
# 5. Bind process to specific cores
 sudo taskset -p -c 0,1,2,3 <pid>
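
The quota in step 2 is simply the desired CPU share multiplied by the period. A small sketch of that arithmetic; the helper name `cfs_quota_us` is hypothetical and the default period of 100000 µs is the value used in the example above.

```bash
# cfs_quota_us PERCENT [PERIOD_US]
# Quota in microseconds that grants PERCENT of one CPU per period.
# Values above 100 grant more than one CPU (200 = two full CPUs).
cfs_quota_us() { echo $(( $1 * ${2:-100000} / 100 )); }
```

`cfs_quota_us 50` yields the 50000 shown in the configuration, and `cfs_quota_us 200` yields 200000.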

Memory Bottleneck Investigation

3.1 Confirm whether memory is the bottleneck

# Show memory usage
 free -m
# Focus on "available" (free + buff/cache – unreclaimable)
# If available < 10 % of total → memory pressure
# Check swap usage
 swapon --show   # used swap alone does not prove a shortage; confirm with si/so in vmstat

3.2 Find processes consuming the most memory

# Sort by memory usage
 ps aux --sort=-%mem | head -20
# Virtual (VSZ) and resident (RSS) size of a process in KB; use pmap -x <pid> for a full memory map
 ps -p <pid> -o pid,vsz,rss,comm
# Top memory consumers via top
 top   # press Shift+M to sort by memory

3.3 Analyse memory consumption of a process

# Read /proc/<pid>/status for detailed mapping
 cat /proc/<pid>/status | grep -E "Vm|Rss|Pid"
# Example output:
# VmPeak: 524288 kB   # peak virtual memory
# VmSize: 524288 kB   # current virtual memory
# VmRSS: 102400 kB    # resident physical memory
# VmData: 409600 kB   # data segment (heap and other private data)
# Track memory trend (requires sar or custom script)
 # Record RSS every minute with ps aux
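
The "custom script" mentioned above can be very small. This sketch reads `VmRSS` from /proc rather than parsing `ps aux`; the function names `rss_kb` and `log_rss` and the optional PROC_ROOT argument (there only to make the functions testable) are assumptions.

```bash
# rss_kb PID [PROC_ROOT]
# Resident set size in kB, read from <PROC_ROOT>/<PID>/status.
rss_kb() { awk '$1 == "VmRSS:" { print $2 }' "${2:-/proc}/$1/status"; }

# log_rss PID INTERVAL_SECONDS COUNT [PROC_ROOT]
# Prints "epoch_seconds rss_kb" COUNT times, INTERVAL_SECONDS apart.
log_rss() {
  local i
  for (( i = 0; i < $3; i++ )); do
    echo "$(date +%s) $(rss_kb "$1" "${4:-/proc}")"
    if (( i + 1 < $3 )); then sleep "$2"; fi
  done
}
```

For a one-hour trace at one-minute resolution: `log_rss <pid> 60 60 >> /tmp/rss_trend.log`.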

3.4 Common memory bottleneck scenarios

Scenario 1: Memory leak

# Monitor RSS growth
 watch -n 1 "ps -p <pid> -o pid,vsz,rss,comm"
# If RSS keeps rising under steady load, a leak is likely
# Java leak detection
 jstat -gc <pid> 1s   # monitor GC stats; OU growing without drop indicates leak
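
Once RSS samples have been collected, a quick check for uninterrupted growth might look like the sketch below. The function name `rss_trend` and its output labels are hypothetical, and monotonic growth is only a hint of a leak, not proof.

```bash
# rss_trend
# stdin: one RSS sample (kB) per line, oldest first.
# Prints "growing +<delta>" when no sample is lower than its predecessor and
# the last sample exceeds the first; otherwise prints "not-monotonic".
rss_trend() {
  awk 'NR == 1        { first = $1 + 0; prev = first; next }
       $1 + 0 < prev  { dropped = 1 }
                      { prev = $1 + 0 }
       END {
         if (NR > 1 && !dropped && prev > first) print "growing +" (prev - first)
         else print "not-monotonic"
       }'
}
```

With a log of "timestamp rss" pairs, the second column can be fed in with `awk '{print $2}' /tmp/rss_trend.log | rss_trend`.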

Scenario 2: OOM Killer activation

# Search kernel logs for OOM events
 sudo dmesg | grep -i "out of memory"
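 sudo journalctl -k -b | grep -i "killed process"
# Rank processes by OOM score (the highest score is the most likely victim): score, PID, command
 for p in /proc/[0-9]*; do echo "$(cat "$p/oom_score" 2>/dev/null) ${p#/proc/} $(cat "$p/comm" 2>/dev/null)"; done | sort -nr | head -20
# Lower OOM score to protect a process (-1000 exempts it from the OOM killer entirely)
 sudo bash -c 'echo -1000 > /proc/<pid>/oom_score_adj'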

Scenario 3: Heavy swap usage

# Show swap usage
 swapon -s
 free -m
# Persistent swap‑in/out indicates memory pressure
 vmstat 1 5   # watch "si" and "so"
# Find processes using swap (Linux ≥2.6.34)
 for f in /proc/*/status; do awk '/VmSwap/{s+=$2}END{if(s>0)print FILENAME": "s" kB"}' $f; done 2>/dev/null | sort -t: -k2 -nr | head
# Remedies: limit process memory via cgroups, add RAM, lower vm.swappiness (default 60, set to 10)
 sudo sysctl -w vm.swappiness=10
 echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.conf   # sudo does not apply to a plain >> redirect

3.5 Memory bottleneck remediation

# 1. Limit process memory (cgroups or systemd)
 # systemd service example:
 [Service]
 MemoryMax=2G
 MemoryHigh=1.8G
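# 2. Switch to strict overcommit accounting so allocations fail up front instead of triggering the OOM killer later (not generally recommended)
 sudo sysctl -w vm.overcommit_memory=2
# 3. Tune JVM heap for Java apps
 -Xms512m -Xmx1024m   # initial heap 512 MB, maximum 1 GB (set both equal for a fixed heap)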
 -XX:+HeapDumpOnOutOfMemoryError
 -XX:MaxMetaspaceSize=256m
# 4. For containers, set memory limits in the deployment spec (Kubernetes)
 containers:
   - name: <container-name>
     resources:
       limits:
         memory: "2Gi"

Disk I/O Bottleneck Investigation

4.1 Confirm whether disk I/O is the bottleneck

# iostat overview
 iostat -xz 1 5
# Indicators:
# %util > 80 % → disk saturated
# await > 20 ms → slow I/O response
# aqu-sz (avgqu-sz in older sysstat) > 1 → queue backlog

4.2 Find processes with highest I/O

# Requires root
 sudo iotop -oa   # live I/O view
# If iotop unavailable, use pidstat
 sudo pidstat -d 1 5
# Inspect a specific process
 cat /proc/<pid>/io
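
/proc/<pid>/io holds cumulative counters, so a rate needs two snapshots. A minimal sketch follows; the function names `io_bytes` and `io_rate_kb` are hypothetical, while `read_bytes` and `write_bytes` are the kernel's field names for I/O that actually reached the storage layer.

```bash
# io_bytes FILE
# Prints "read_bytes write_bytes" from a /proc/<pid>/io snapshot.
io_bytes() {
  awk -F': ' 'BEGIN { r = 0; w = 0 }
              $1 == "read_bytes"  { r = $2 }
              $1 == "write_bytes" { w = $2 }
              END { print r, w }' "$1"
}

# io_rate_kb BEFORE AFTER SECONDS
# Average kB/s between two byte counters (integer division).
io_rate_kb() { echo $(( ($2 - $1) / 1024 / $3 )); }
```

For instance: `read r1 w1 < <(io_bytes /proc/<pid>/io); sleep 10; read r2 w2 < <(io_bytes /proc/<pid>/io); io_rate_kb "$w1" "$w2" 10`. Reading another user's process requires root.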

4.3 Common disk I/O bottleneck scenarios

Scenario 1: Heavy sequential writes (logs, backups)

# Identify write‑heavy processes
 sudo iotop -boa -n 3 | head -50   # -b (batch) and -n make the output safe to pipe
# Check write rate
 iostat -xz 1 | grep sda   # high w/s, low r/s indicates write‑heavy
# Remedies:
# 1. Asynchronous log writes
# 2. Write logs to tmpfs (memory filesystem); contents are lost on reboot, so use it only for disposable logs
# 3. Batch writes, reduce fsync frequency

Scenario 2: Heavy random reads (databases, file services)

# Identify read‑heavy processes
 sudo iotop -boa -n 3 | head -50   # -b (batch) and -n make the output safe to pipe
# Check read IOPS
 iostat -xz 1 | grep sda   # high r/s indicates read‑heavy
# Typical fix for MySQL InnoDB:
# - Increase innodb_buffer_pool_size
# - Use SSDs
# - Optimize queries to reduce random reads

Scenario 3: Swap‑induced I/O

# High si/so indicates memory pressure causing swap I/O
 vmstat 1 5
# Locate processes using swap (same loop as in memory section)
 for f in /proc/*/status; do awk '/VmSwap/{s+=$2}END{if(s>0)print FILENAME": "s" kB"}' $f; done 2>/dev/null | sort -t: -k2 -nr | head
# Root solution: add RAM or reduce memory consumption

4.4 Disk I/O remediation

# 1. Choose appropriate I/O scheduler
 cat /sys/block/sda/queue/scheduler   # e.g., none [mq-deadline] kyber bfq (brackets mark the active scheduler)
 # SSD/NVMe → use "none" (called noop on older single‑queue kernels)
 echo none | sudo tee /sys/block/sda/queue/scheduler
 # HDD → use "mq-deadline" or "bfq"
 echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
# 2. Adjust I/O priority with ionice (takes effect only with schedulers that honor I/O priorities, such as bfq)
 sudo ionice -c 1 -n 0 -p <pid>   # real‑time highest priority (root only)
 ionice -c 2 -n 7 -p <pid>   # best‑effort low priority
# 3. Replace HDD with SSD
# 4. Use RAID controller cache
# 5. Separate high‑I/O and low‑I/O data onto different disks
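
When auditing many disks it helps to extract only the active scheduler, which the kernel marks with square brackets. A minimal sketch, with the hypothetical function name `active_scheduler`:

```bash
# active_scheduler
# stdin: contents of /sys/block/<dev>/queue/scheduler.
# Prints the bracketed (active) scheduler name.
active_scheduler() { sed -n 's/.*\[\(.*\)\].*/\1/p'; }
```

A loop such as `for q in /sys/block/*/queue/scheduler; do echo "$q: $(active_scheduler < "$q")"; done` would then list every block device with its current setting.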

Network Bottleneck Investigation

5.1 Confirm whether the network is the bottleneck

# Interface statistics
 ip -s link
# Bandwidth usage
 sar -n DEV 1 5
# TCP connection states
 netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c
# Many TIME_WAIT → many short connections
# Many SYN_RECV → possible SYN flood attack

5.2 Find processes consuming bandwidth

# nethogs lists bandwidth per process (requires root); iptraf breaks traffic down per connection
 sudo nethogs -d 1
# Detailed connection info
 ss -tunapl
# Raw interface counters
 cat /proc/net/dev

5.3 Common network bottleneck scenarios

Scenario 1: Bandwidth saturated

# Check per‑interface traffic
 sar -n DEV 1 5 | grep -E "^Average|^Linux"
# If rxkB/s or txkB/s approaches interface limit → bandwidth full
# Count connections
 netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn
# Many ESTABLISHED → high concurrency

Scenario 2: Packet loss and retransmission

# Interface error counters
 ip -s link | grep -A 5 "RX:"
# RX errors / RX dropped indicate loss
# TCP retransmission stats
 netstat -s | grep -i retransmit
# Retransmission rate > 1 % signals network quality issues
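
The retransmission rate can be derived from the `OutSegs` and `RetransSegs` counters in /proc/net/snmp. In this sketch the function name `tcp_retrans_pct` is hypothetical and columns are looked up by header name, since their positions are not guaranteed. The counters are cumulative since boot, so the result is a long-term average; for a current rate, take two samples and compare the deltas.

```bash
# tcp_retrans_pct [SNMP_FILE]
# Prints RetransSegs as a percentage of OutSegs (two decimals).
tcp_retrans_pct() {
  awk '$1 == "Tcp:" {
         if (!seen) { for (i = 2; i <= NF; i++) name[i] = $i; seen = 1; next }   # header row
         for (i = 2; i <= NF; i++) val[name[i]] = $i                             # value row
       }
       END {
         if (val["OutSegs"] + 0 > 0)
           printf "%.2f\n", val["RetransSegs"] * 100 / val["OutSegs"]
       }' "${1:-/proc/net/snmp}"
}
```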

Scenario 3: Slow DNS resolution

# Measure DNS query time
 dig example.com | grep "Query time"   # prints the ";; Query time: N msec" line
# Slow responses affect any service that relies on DNS
# Common causes:
# 1. Slow DNS server
# 2. Wrong order in /etc/resolv.conf
# 3. Firewall blocking port 53
# Optimisation:
# Put fast DNS servers first in /etc/resolv.conf
# Deploy dnsmasq for local caching

5.4 Network bottleneck remediation

# 1. Increase bandwidth (upgrade link or use CDN)
# 2. Deploy load balancers to spread traffic
# 3. Tune TCP parameters in /etc/sysctl.conf
 net.ipv4.tcp_tw_reuse = 1
 net.ipv4.tcp_fin_timeout = 30
 net.core.somaxconn = 65535
 net.ipv4.tcp_max_syn_backlog = 65535
# Apply changes
 sudo sysctl -p
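# 4. Mitigate connection storms by capping concurrent connections per source IP (not sufficient against distributed attacks)
 sudo iptables -A INPUT -p tcp --syn --dport 80 -m connlimit --connlimit-above 100 -j REJECT --reject-with tcp-reset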

Integrated Troubleshooting Cases

Case 1 – Load Average 30+, CPU usage low

Symptoms: Load average spikes to 30+ while user and system CPU time stay low (idle ~15 %, most of the rest is I/O wait); request latency jumps from 100 ms to 5 s.

# Quick resource check
 vmstat 1 3
# b=28 (processes blocked on I/O) while us+sy stays low → the load comes from I/O wait, not CPU
# wa=60 % → I/O bottleneck
# Verify disk I/O
 iostat -xz 1 3
# %util 98 %, aqu-sz 15, await 200 ms → disk saturated
# Identify offending process
 sudo iotop -boa -n 3 | head -30
# mysqld shows highest I/O
# MySQL analysis (write‑heavy example)
# Adjust innodb_flush_log_at_trx_commit or increase innodb_buffer_pool_size
# Root cause: InnoDB dirty‑page flushing causing massive disk I/O
# Fix: adjust innodb_io_capacity
# Temporary adjustment
 SET GLOBAL innodb_io_capacity = 2000;
 SET GLOBAL innodb_max_dirty_pages_pct = 50;
# Permanent in /etc/my.cnf
 innodb_io_capacity = 2000
 innodb_io_capacity_max = 4000
 innodb_max_dirty_pages_pct = 50

Case 2 – Java service OOM, frequent Full GC

Symptoms: Java service stalls for ~5 s every 10 minutes.

# Check process memory
 ps -p <pid> -o pid,vsz,rss,comm   # RSS stays high
# GC statistics
 jstat -gc <pid> 1s   # OU grows, Full GC frequent
# Dump heap for analysis
 jmap -dump:format=b,file=/tmp/heap.hprof <pid>
# Analyse with MAT (Memory Analyzer Tool)
# Root cause: memory leak or insufficient heap size
# Remedy: increase heap or fix leak
# JVM tuning example
 -Xms4g -Xmx4g               # fixed 4 GB heap
 -XX:+HeapDumpOnOutOfMemoryError
 -XX:NewRatio=2               # Old = 2× New; with G1, fixing the young size overrides the pause‑time goal, so prefer default sizing there
 -XX:+UseG1GC                # G1 collector for large heaps
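
The effect of `-XX:NewRatio` on the example heap is plain arithmetic: the young generation gets heap / (ratio + 1) and the old generation gets the rest. A small sketch of that calculation follows; the helper name `heap_split` is hypothetical, and the numbers are approximate because the JVM aligns region sizes.

```bash
# heap_split HEAP_MB NEW_RATIO
# Prints the approximate young/old generation sizes in MB for old:young = NEW_RATIO:1.
heap_split() {
  local heap=$1 ratio=$2
  local young=$(( heap / (ratio + 1) ))
  echo "young=${young} old=$(( heap - young ))"
}
```

For the 4 GB heap above, `heap_split 4096 2` gives roughly a third of the heap to the young generation.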

Preventive Measures & Routine Health Checks

7.1 Build a monitoring & alerting system

#!/bin/bash
# server_health_check.sh – scheduled every 5 minutes via cron
HOST=$(hostname)
DATE=$(date +%Y%m%d_%H%M%S)
LOG="/var/log/server_health/${DATE}.log"

mkdir -p /var/log/server_health
{
  echo "===== Server Health Check - $HOST - $DATE ====="
  echo "Load Average: $(awk '{print $1}' /proc/loadavg) (cores: $(nproc))"
  free -m | awk '/^Mem/{printf "Memory: total=%s used=%s free=%s available=%s\n",$2,$3,$4,$7}'
  df -h | awk '/^\/dev/{printf "Disk %s: usage=%s\n",$6,$5}'
  vmstat 1 2 | tail -1 | awk '{printf "CPU: us=%s sy=%s id=%s wa=%s\n",$13,$14,$15,$16}'
  echo "Top 3 CPU:"
  ps aux --sort=-%cpu | head -4 | awk 'NR>1 {print "  "$11" PID="$2" CPU="$3"% MEM="$4"%"}'   # NR>1 skips the ps header
  echo "Top 3 MEM:"
  ps aux --sort=-%mem | head -4 | awk 'NR>1 {print "  "$11" PID="$2" CPU="$3"% MEM="$4"%"}'
} > "$LOG"

# Simple alert if load ratio > 2× cores
LOAD=$(awk '{print $1}' /proc/loadavg)
CORES=$(nproc)
LOAD_RATIO=$(echo "scale=2; $LOAD/$CORES" | bc)
if [ "$(echo "$LOAD_RATIO > 2" | bc)" -eq 1 ]; then
  echo "ALERT: Load Average $LOAD > 2x cores on $HOST" | tee -a /var/log/server_health/alerts.log
fi

7.2 Common optimisation parameters

# /etc/sysctl.conf – typical performance tweaks
# Network
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_tw_buckets = 262144
# Memory – reduce swap usage
vm.swappiness = 10
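# Dirty‑page thresholds: a high dirty_ratio lets more dirty data accumulate before writers block, which can lengthen flush stalls; test before adopting
vm.dirty_ratio = 60
vm.dirty_background_ratio = 10
# File descriptors: check current values first (sysctl fs.file-max fs.nr_open); where the defaults are already higher, these lines would lower them
fs.file-max = 655360
fs.nr_open = 655360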
# Apply changes
sudo sysctl -p

Conclusion

The key to solving sudden server slowdown is to first pinpoint which resource (CPU, memory, disk I/O, or network) is the bottleneck and then conduct a deep dive on that resource to uncover the root cause.

Quick‑diagnosis four‑step method:

1. top – examine load average and CPU usage
2. vmstat – check r/b columns (process queue, blocked processes) and wa (I/O wait)
3. iostat – inspect %util and aqu-sz (avgqu-sz in older sysstat) for disk saturation
4. free – look at available memory and swap usage

Typical symptom patterns for each bottleneck type:

CPU bottleneck:   Load > cores, high CPU%, low iowait
Memory bottleneck: available low, si/so active; load and iowait climb once swapping starts
I/O bottleneck:    High load, low CPU%, high iowait, %util > 80%, await > 20ms
Network bottleneck: Load often normal, low CPU%, low iowait, bandwidth saturated, high packet loss/retransmit

Prioritise remediation in this order: memory issues first (they affect all other resources), then disk I/O, followed by CPU and network based on business impact. Avoid blind hardware upgrades; always verify the root cause and apply targeted software or configuration fixes before scaling the infrastructure.
