Master Linux Performance Troubleshooting: From top to perf in One Complete Workflow
This guide presents a systematic USE (Utilization, Saturation, Errors) methodology applied across four dimensions of a Linux system, walking through a quick 60‑second overview with top, vmstat, iostat and ss, then diving into detailed CPU, memory, disk I/O and network investigations using tools such as mpstat, perf, bcc/eBPF utilities, and flame graphs.
Overview
The article presents a practical USE (Utilization / Saturation / Errors) methodology applied across four dimensions (CPU, memory, disk I/O and network) for Linux system performance troubleshooting. It explains why guessing is insufficient and shows how to locate bottlenecks in each layer using a reproducible step‑by‑step process.
Environment Requirements
Operating System : CentOS 7+ or Ubuntu 20.04+ (kernel ≥ 4.9 for full perf and eBPF support)
sysstat (v11.0+): provides mpstat, iostat, pidstat, sar
perf: version must match the running kernel (usually provided by linux-tools-$(uname -r))
bcc‑tools (v0.12+): eBPF‑based tracing utilities (requires kernel ≥ 4.9)
dstat (v0.7+): combined real‑time monitoring
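On the two supported distributions the tool set installs roughly as follows (a sketch: package names vary slightly by release, and linux-tools must match the running kernel):
# Ubuntu / Debian (bcc is packaged as bpfcc-tools here)
apt-get install -y sysstat dstat linux-tools-$(uname -r) bpfcc-tools
# CentOS / RHEL
yum install -y sysstat dstat perf bcc-tools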
Step‑by‑Step Procedure
1. Quick 60‑Second Overview
uptime – shows load average and uptime.
top -bn1 – snapshot of CPU, memory, swap and process states.
vmstat 1 10 – system‑level snapshot (processes, memory, swap, I/O, CPU).
dstat -tcmsdnl --top-cpu --top-io 5 – unified view of CPU, memory, disk, network.
iostat -xz 1 5 – detailed disk utilization and latency.
ss -s – summary of socket states.
Interpretation tips: high load average with low CPU usage often means many processes are in uninterruptible I/O wait (D state); %util > 80 % indicates a saturated device; wa > 5 % points to I/O bottlenecks.
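The whole 60‑second pass can be captured in a trivial wrapper; a minimal sketch using the commands above:
# Run each snapshot command in sequence, labeling the output
for cmd in "uptime" "top -bn1" "vmstat 1 10" "iostat -xz 1 5" "ss -s"; do
    echo "===== $cmd ====="
    $cmd   # intentional word splitting: every entry is a simple command line
done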
2. CPU Deep Dive
mpstat -P ALL 1 5 – per‑core utilization.
pidstat -u 1 5 – per‑process CPU usage.
perf top -p <PID> – real‑time hotspot functions.
perf record -p <PID> -g -- sleep 30 – sampling for flame‑graph generation.
# Example: locate hot thread in a Java process
top -Hp 12345 -bn1 | head -20
printf "%x
" 12378 # convert TID to hex for jstack
jstack 12345 > /tmp/jstack.txt
grep -A30 "nid=0x305a" /tmp/jstack.txtFlame‑graph generation (requires Brendan Gregg’s FlameGraph repo):
# Install FlameGraph
git clone https://github.com/brendangregg/FlameGraph.git /opt/FlameGraph
# Generate flame graph
perf script | /opt/FlameGraph/stackcollapse-perf.pl | /opt/FlameGraph/flamegraph.pl > cpu_flame.svg
3. Memory Investigation
free -h – focus on the available column rather than free.
cat /proc/meminfo – detailed counters (Slab, Dirty, AnonPages, HugePages).
smem -rkt -s pss – process‑level proportional set size (more accurate than RSS).
pmap -x <PID> – memory map of a specific process.
cat /proc/<PID>/status – quick view of VmSize, VmRSS, VmSwap.
dmesg | grep -i oom – OOM Killer events.
# Check OOM logs
dmesg | grep -i "out of memory" -A20
# Show OOM score of all processes
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
name=$(cat /proc/$pid/comm 2>/dev/null)
score=$(cat /proc/$pid/oom_score 2>/dev/null)
[ -n "$score" ] && [ $score -gt 100 ] && echo "$pid $name $score"
done | sort -k3 -rn | head -10
4. Disk I/O Investigation
iostat -xz 1 5 – %util, await, queue depth, throughput.
iotop -oP -b -n 5 -d 1 – processes with active I/O.
pidstat -d 1 5 – per‑process I/O statistics.
blktrace -d /dev/sda -o trace -w 10 – low‑level block tracing (short capture).
df -hT and df -i – filesystem and inode usage.
When %util > 80 % (or > 95 % on SSD) and await is high, the disk is saturated; use blktrace or deeper iostat analysis.
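A minimal blktrace workflow for a saturated device (the device name /dev/sda is illustrative; blkparse and btt ship with the blktrace package):
# Capture 10 seconds of block-layer events
blktrace -d /dev/sda -o trace -w 10
# Merge the per-CPU trace files into a single binary stream
blkparse -i trace -d trace.bin > /dev/null
# Per-stage latency summary (Q2C = queue-to-completion, D2C = driver-to-completion)
btt -i trace.bin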
5. Network Investigation
ss -s – socket state summary.
ss -tn state established – list of established TCP connections.
ss -tan state time-wait | wc -l – TIME_WAIT count.
sar -n DEV 1 5 – per‑interface bandwidth and errors.
nstat -az | grep -i tcp – TCP counters (retransmissions, listen overflows).
tcpdump -i eth0 port 80 -c 10000 -w /tmp/capture.pcap – packet capture with limits.
ethtool -S eth0 – NIC error statistics.
cat /proc/net/softnet_stat – per‑CPU soft‑interrupt statistics (a growing second column indicates kernel‑level packet‑processing overload).
Key interpretation: high rx_dropped or a growing second column in /proc/net/softnet_stat suggests packet‑processing overload; tune net.core.netdev_max_backlog, ring buffers, or enable RPS/RFS.
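The softnet_stat values are hexadecimal; a quick per‑CPU decoder (assumes gawk for strtonum):
# Column 1 = packets processed, column 2 = dropped, column 3 = time_squeeze (budget exhausted)
awk '{printf "CPU%-3d processed=%d dropped=%d squeezed=%d\n", NR-1, strtonum("0x"$1), strtonum("0x"$2), strtonum("0x"$3)}' /proc/net/softnet_stat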
Automation Scripts
Performance Snapshot Script ( perf_snapshot.sh )
#!/bin/bash
set -euo pipefail
OUTPUT_DIR=${1:-/tmp/perf_snapshot_$(date +%Y%m%d_%H%M%S)}
mkdir -p "$OUTPUT_DIR"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
collect(){ name=$1; shift; log "Collect $name ..."; "$@" > "$OUTPUT_DIR/$name.txt" 2>&1 || echo "Failed $name: $?" >> "$OUTPUT_DIR/errors.log"; }
log "Start collection, output: $OUTPUT_DIR"
# Basic info
collect "uname" uname -a
collect "uptime" uptime
collect "date" date '+%Y-%m-%d %H:%M:%S %Z'
collect "hostname" hostname -f
# CPU
collect "top_snapshot" bash -c "top -bn1 | head -50"
collect "mpstat" mpstat -P ALL 1 5
collect "pidstat_cpu" pidstat -u 1 5
collect "pidstat_context" pidstat -w 1 5
# Memory
collect "free" free -h
collect "meminfo" cat /proc/meminfo
collect "slabtop" slabtop -o
collect "smem" bash -c "smem -rkt -s pss 2>/dev/null || echo 'smem not installed'"
# Disk I/O
collect "iostat" iostat -xz 1 5
collect "pidstat_io" pidstat -d 1 5
collect "df" df -hT
collect "df_inode" df -i
# Network
collect "ss_summary" ss -s
collect "ss_established" bash -c "ss -tn state established | head -100"
collect "ss_time_wait" bash -c "ss -tan state time-wait | wc -l"
collect "ss_listen" ss -tlnp
collect "netstat_stats" bash -c "nstat -az 2>/dev/null || netstat -s"
# vmstat & dmesg
collect "vmstat" vmstat 1 10
collect "dmesg_errors" bash -c "dmesg -T 2>/dev/null | tail -100"
collect "dmesg_oom" bash -c "dmesg | grep -i 'oom\|out of memory\|killed process' || echo 'No OOM events'"
# Process list
collect "ps_aux" bash -c "ps aux --sort=-%mem | head -30"
collect "ps_d_state" bash -c "ps aux | awk '$8~/D/' || echo 'No D state processes'"
# Recent journal / syslog
collect "journal_recent" bash -c "journalctl --since '10 minutes ago' --no-pager 2>/dev/null | tail -200 || tail -200 /var/log/syslog 2>/dev/null || echo 'No syslog access'"
log "Collection finished. Files:"; ls -la "$OUTPUT_DIR"
# Package
tar czf "${OUTPUT_DIR}.tar.gz" -C "$(dirname "$OUTPUT_DIR")" "$(basename "$OUTPUT_DIR")"
log "Packaged: ${OUTPUT_DIR}.tar.gz (size: $(du -sh "${OUTPUT_DIR}.tar.gz" | awk '{print $1}')"Continuous Monitoring Script ( perf_monitor.sh )
#!/bin/bash
set -euo pipefail
INTERVAL=${1:-5}
DURATION=${2:-3600}
OUTPUT=/tmp/perf_monitor_$(date +%Y%m%d_%H%M%S).csv
echo "timestamp,load1,load5,load15,cpu_us,cpu_sy,cpu_wa,cpu_st,mem_used_pct,swap_used_mb,disk_util,net_rx_kb,net_tx_kb,tcp_estab,tcp_tw,context_switch,interrupts" > "$OUTPUT"
END_TIME=$((SECONDS + DURATION))
while [ $SECONDS -lt $END_TIME ]; do
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
read LOAD1 LOAD5 LOAD15 <<<$(awk '{print $1,$2,$3}' /proc/loadavg)
read CPU_US CPU_SY CPU_WA CPU_ST <<<$(vmstat 1 2 | tail -1 | awk '{print $13,$14,$16,$17}')
MEM_USED_PCT=$(free | awk '/Mem:/{printf "%.1f",($3/$2)*100}')
SWAP_USED=$(free -m | awk '/Swap:/{print $3}')
DISK_UTIL=$(iostat -xz 1 2 | awk '/^[a-z]/{if($NF+0>max)max=$NF+0} END{print max+0}')
read NET_RX NET_TX <<<$(sar -n DEV 1 1 2>/dev/null | awk '/Average:/ && !/lo/{print $5,$6}' || echo "0 0")
TCP_ESTAB=$(ss -tn state established 2>/dev/null | wc -l)
TCP_TW=$(ss -tan state time-wait 2>/dev/null | wc -l)
read CS INTR <<<$(vmstat 1 2 | tail -1 | awk '{print $12,$11}')
echo "$TIMESTAMP,$LOAD1,$LOAD5,$LOAD15,$CPU_US,$CPU_SY,$CPU_WA,$CPU_ST,$MEM_USED_PCT,$SWAP_USED,$DISK_UTIL,$NET_RX,$NET_TX,$TCP_ESTAB,$TCP_TW,$CS,$INTR" >> "$OUTPUT"
sleep $INTERVAL
done
echo "Monitoring finished, data file: $OUTPUT (lines: $(wc -l < "$OUTPUT"))"Real‑World Cases
Case 1 – Java CPU 100 % caused by ReDoS regex
Identify the Java process: top -bn1 | grep java.
Find the hottest thread: top -Hp <PID> -bn1 | head -20.
Convert thread ID to hex for jstack: printf "%x\n" <TID>.
Dump stack traces: jstack <PID> > /tmp/jstack.txt and search for the hex ID.
The stack shows the thread stuck in a complex email‑validation regular expression, confirming a ReDoS vulnerability.
Solution: restart the service, replace the regex or add a timeout, and add regex ReDoS testing to CI.
Case 2 – MySQL I/O Wait
Confirm I/O saturation: iostat -xz 1 3 (e.g., %util 95 %, w_await 45 ms).
Identify the offending process: iotop -oP -b -n 3 -d 1 (shows mysqld writing 180 MB/s).
Inspect MySQL workload: mysql -e "SHOW PROCESSLIST\G" and the slow‑query log.
Analyze with pt-query-digest and add a composite index on (status, create_time).
Increase InnoDB buffer pool, tune innodb_io_capacity, and restart MySQL.
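A sketch of the fix via the mysql client (the table name and sizes here are illustrative, not from the original case; note the buffer pool is resizable online in MySQL 5.7+):
# Composite index suggested by pt-query-digest (hypothetical table)
mysql -e "ALTER TABLE orders ADD INDEX idx_status_ctime (status, create_time);"
# Raise the InnoDB background-flush I/O budget (dynamic variable)
mysql -e "SET GLOBAL innodb_io_capacity = 2000;"
# Grow the buffer pool to 8 GB (illustrative size)
mysql -e "SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;"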
Case 3 – High Load Average with Low CPU Usage
Check for D‑state processes: vmstat 1 5 (column b high) and ps aux | awk '$8~/^D/{print}'.
Identify I/O bottleneck: iostat -xz 1 3 (high %util, large await).
Find the culprit: iotop -oP -b -n 3 (many rsync processes).
Root cause: simultaneous backup jobs from dozens of servers saturate the disk.
Fix: stagger backups, add --bwlimit to rsync, or upgrade storage.
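For example, capping bandwidth and staggering start times in each backup job (paths and limits are illustrative):
# Random delay up to 30 minutes so jobs from many hosts don't align
sleep $((RANDOM % 1800))
# --bwlimit is in KB/s, so 20000 is roughly 20 MB/s
rsync -a --bwlimit=20000 /data/ backup-host:/backup/$(hostname)/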
Case 4 – Network Packet Loss
Verify loss: ping -c 100 10.0.0.2 (e.g., 5 % loss).
Check NIC errors: ethtool -S eth0 | grep -E "drop|error|fifo" (high rx_dropped).
Inspect ring buffer size: ethtool -g eth0 and enlarge with ethtool -G eth0 rx 4096.
Analyze soft‑interrupt distribution: cat /proc/net/softnet_stat (a growing second column means backlog drops; a non‑zero third column means the NAPI budget was exhausted).
Enable RPS/RFS: echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus.
Retest: ping shows 0 % loss.
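A sketch of the full RPS/RFS setup (mask ff covers CPUs 0-7; table sizes are illustrative):
# RPS: spread receive processing for every rx queue across CPUs 0-7
for q in /sys/class/net/eth0/queues/rx-*/rps_cpus; do echo ff > "$q"; done
# RFS: global flow table, then a per-queue share
echo 32768 > /proc/sys/net/core/rps_sock_flow_cnt
for q in /sys/class/net/eth0/queues/rx-*/rps_flow_cnt; do echo 2048 > "$q"; done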
Best Practices and Pitfalls
Kernel Parameter Tuning
vm.swappiness = 10 – reduces swap usage; set to 1 for databases.
vm.dirty_ratio = 10 and vm.dirty_background_ratio = 5 – limit dirty page buildup.
net.core.somaxconn = 65535 and net.ipv4.tcp_max_syn_backlog = 65535 – enlarge connection queues for web services.
net.core.netdev_max_backlog = 50000 – prevent NIC receive‑queue overflow on high‑throughput NICs.
net.ipv4.tcp_tw_reuse = 1 – allow TIME_WAIT reuse in non‑NAT environments.
fs.file-max = 2097152 – raise the global file descriptor limit.
Persist settings in /etc/sysctl.d/99-performance.conf and apply with sysctl --system (sysctl -p alone only reloads /etc/sysctl.conf).
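The resulting file, using the values listed above:
# /etc/sysctl.d/99-performance.conf
vm.swappiness = 10
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_tw_reuse = 1
fs.file-max = 2097152
# Apply without reboot:
sysctl --system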
Security Hardening
Restrict perf profiling: echo 2 > /proc/sys/kernel/perf_event_paranoid (a value of 2 limits unprivileged users to user‑space measurements, so kernel profiling requires root).
Audit execution of performance tools:
auditctl -a always,exit -F path=/usr/bin/perf -F perm=x -k perf_usage.
Limit tcpdump captures with -c or -G and delete pcap files after analysis.
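For example, rotating the capture so no single file grows unbounded (interface and filter are illustrative):
# Rotate every 300 s, keep at most 3 files, then exit
tcpdump -i eth0 port 80 -G 300 -W 3 -w '/tmp/cap_%Y%m%d_%H%M%S.pcap'
# Remove captures once analysis is done
rm -f /tmp/cap_*.pcap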
Constant Monitoring Stack
Deploy node_exporter on each host, scrape with Prometheus, and visualise in Grafana. Example node_exporter systemd unit is omitted for brevity.
# /etc/prometheus/rules/node_alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage too high on {{ $labels.instance }}"
          description: "CPU usage {{ $value | printf \"%.1f\" }}% for more than 5 minutes"
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage too high on {{ $labels.instance }}"
          description: "Available memory below 10% ({{ $value | printf \"%.1f\" }}%)"
      - alert: HighLoadAverage
        expr: node_load1 / count without(cpu,mode) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Load average exceeds twice the CPU count on {{ $labels.instance }}"
          description: "Load average is high for 5 minutes"
      - alert: HighDiskUtilization
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Disk I/O saturation on {{ $labels.instance }}"
          description: "Device {{ $labels.device }} utilization > 90%"
      - alert: DiskSpaceRunningOut
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Mount {{ $labels.mountpoint }} usage {{ $value | printf \"%.1f\" }}%"
      - alert: HighNetworkErrors
        expr: rate(node_network_receive_errs_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Network receive errors on {{ $labels.instance }}"
          description: "Device {{ $labels.device }} error rate {{ $value | printf \"%.1f\" }}/s"
      - alert: SwapUsageIncreasing
        expr: rate(node_memory_SwapFree_bytes[10m]) < -1048576
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Swap usage increasing on {{ $labels.instance }}"
          description: "Swap is being consumed, possible memory leak"
Monitoring Indicators
CPU Utilization (us+sy) < 70 % (alert > 85 % for 5 min).
Load Average (1 min) < CPU cores (alert > 2 × CPU cores for 5 min).
Memory Available > 20 % (alert < 10 %).
Swap Used = 0 (alert > 0 and growing).
Disk %util < 70 % (alert > 90 % for 1 min; SSD can tolerate up to 95 %).
IO await < 10 ms (HDD) / < 2 ms (SSD) (alert > 30 ms HDD, > 10 ms SSD).
Network %ifutil < 70 % (alert > 85 %).
TCP retransmission rate < 0.1 % (alert > 1 %).
TIME_WAIT count < 20 000 (alert > 50 000).
Context switches per second < 50 000 (alert > 100 000).
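A quick check of the retransmission figure against this baseline (nstat -a reads cumulative counters since boot):
# Cumulative TCP retransmission rate
out=$(nstat -az TcpOutSegs | awk '/TcpOutSegs/{print $2}')
ret=$(nstat -az TcpRetransSegs | awk '/TcpRetransSegs/{print $2}')
awk -v o="$out" -v r="$ret" 'BEGIN{if (o > 0) printf "TCP retransmission rate: %.3f%%\n", r / o * 100}'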
Conclusion
The USE methodology combined with a rapid 60‑second global snapshot and a layered deep‑dive toolbox (mpstat, pidstat, perf, eBPF, flame graphs, iostat, ss, etc.) enables engineers to locate and resolve Linux performance problems within minutes. Proper kernel tuning, security hardening, and continuous Prometheus‑based monitoring turn reactive troubleshooting into proactive reliability engineering.