
Linux System Performance Troubleshooting: Complete End‑to‑End Workflow from top to perf

This article presents a systematic, USE‑methodology‑based workflow for diagnosing Linux performance issues, covering CPU, memory, disk I/O and network bottlenecks with step‑by‑step commands, detailed examples, scripts, case studies, best‑practice recommendations and monitoring guidelines.

Raymond Ops

Overview

Performance incidents such as high latency, load spikes, OOM kills or packet loss can be diagnosed systematically with the USE methodology (Utilization / Saturation / Errors). The guide provides a layered workflow across four resource dimensions (CPU, memory, disk I/O and network), using only standard Linux tools and eBPF-based utilities.

Environment Requirements

Operating system: CentOS 7+ or Ubuntu 20.04+ (kernel ≥ 4.9 for full perf/BPF support; CentOS 7 ships kernel 3.10, so the eBPF-based tools need a newer kernel there)

sysstat ≥ 11.0 (mpstat, iostat, pidstat, sar)

perf matching the running kernel (linux‑tools-$(uname -r))

bcc‑tools ≥ 0.12 (eBPF tracing)

dstat ≥ 0.7 (unified monitoring)

Quick 60‑Second Diagnosis

Load average (uptime)

Run uptime. The three numbers are the 1-, 5-, and 15-minute load averages. If 1 min > 5 min > 15 min, the load is rising. Compare the absolute value with the CPU core count (on an 8-core machine, load ≈ 8 means fully busy and load ≫ 8 means severe queuing). Linux load also counts processes in uninterruptible sleep (D state), so a high value does not always mean CPU saturation.
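The reading rules above can be sketched as two small shell helpers. This is only an illustration: the function names load_ratio and load_trend are hypothetical, and the interpretation of the numbers stays a judgment call.

```shell
# Hypothetical helpers illustrating the load-average reading rules.

# load_ratio LOAD CORES: print load per core with two decimals.
load_ratio() {
  awk -v l="$1" -v c="$2" 'BEGIN { printf "%.2f\n", l / c }'
}

# load_trend L1 L5 L15: print rising, falling, or mixed.
load_trend() {
  awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN {
    if (a > b && b > c)      print "rising"
    else if (a < b && b < c) print "falling"
    else                     print "mixed"
  }'
}
```

On a live host the inputs could come from /proc/loadavg and nproc, for example: read -r l1 l5 l15 _ < /proc/loadavg; load_ratio "$l1" "$(nproc)"; load_trend "$l1" "$l5" "$l15". A ratio near 1.00 means the cores are roughly fully busy, with the D-state caveat still applying.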

Global resource view (top)

Run top -bn1 | head -20. Key fields and alert thresholds:

us > 80 % → user-space CPU bound

sy > 20 % → system-call intensive or lock contention

wa > 5 % → I/O wait

st > 5 % → VM steal (hypervisor over-commit)

si high → network soft-interrupt overload

System‑level snapshot (vmstat)

Run vmstat 1 10. The first output line reports averages since boot, so judge from the later samples. Important columns:

r > CPU cores → run-queue saturation

b > 0 → processes in uninterruptible sleep (I/O wait)

si/so > 0 → swap activity (memory pressure)

cs > ≈ 100 k/s per core → warning (context switches)

wa → I/O wait percentage
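As a rough sketch, the vmstat thresholds can be applied automatically with awk. The function name vmstat_flags is hypothetical, and the field numbers assume the default 17-column procps layout (r, b, swpd, free, buff, cache, si, so, bi, bo, in, cs, us, sy, id, wa, st).

```shell
# vmstat_flags CORES: read `vmstat 1 N` text on stdin and report samples
# that cross the thresholds listed above. Hypothetical helper.
vmstat_flags() {
  awk -v cores="$1" '
    $1 !~ /^[0-9]+$/ { next }          # skip the two header lines
    ++n == 1 { next }                  # first sample is the since-boot average
    {
      msg = ""
      if ($1 > cores)  msg = msg " runq=" $1
      if ($2 > 0)      msg = msg " blocked=" $2
      if ($7 + $8 > 0) msg = msg " swap=" $7 "/" $8
      if ($16 > 5)     msg = msg " iowait=" $16 "%"
      if (msg != "") print "sample " (n - 1) ":" msg
    }'
}
```

A typical call would be vmstat 1 10 | vmstat_flags "$(nproc)". Samples that stay within all thresholds produce no output.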

Unified real‑time monitoring (dstat)

Run dstat -tcmsdnl --top-cpu --top-io 5. Options:

-t timestamp

-c CPU

-m memory

-s swap

-d disk

-n network

-l load average

--top-cpu highest CPU consumers

--top-io highest I/O consumers

CPU Deep Dive

Per‑CPU utilization (mpstat)

mpstat -P ALL 1 5

Look for a core stuck at 100 % (single‑thread bottleneck) or high %sys (system‑call intensive) or high %soft (network soft‑interrupt concentration). Check /proc/interrupts if needed.

Process‑level CPU (pidstat)

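# Show processes using >10 % CPU every second
# ($8 is %CPU with 24-hour timestamps on sysstat releases that print a %wait column; adjust the field for other layouts)
pidstat -u 1 5 | awk 'NR<=3 || $8>10'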
# Show threads of a specific PID
pidstat -u -t -p <PID> 1 5

Hot functions (perf top / perf record)

# Real-time hot functions
perf top -p <PID>
# Install the FlameGraph scripts (once)
git clone https://github.com/brendangregg/FlameGraph.git /opt/FlameGraph
# Sample for 30 s and generate a flame graph
perf record -p <PID> -g -- sleep 30
perf script | /opt/FlameGraph/stackcollapse-perf.pl \
    | /opt/FlameGraph/flamegraph.pl > cpu_flame.svg

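Note: on most distributions perf top needs root to profile other users' processes or the whole system. An administrator can lower /proc/sys/kernel/perf_event_paranoid (for example to 1) so that unprivileged users can profile their own processes, including kernel samples.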
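The levels of perf_event_paranoid are easy to mix up, so here is a small lookup sketch based on the mainline kernel documentation for this sysctl. The function name paranoid_meaning is hypothetical, and values above 2 are distribution-specific extensions that this sketch does not try to interpret.

```shell
# paranoid_meaning LEVEL: summarize what unprivileged users may do at a
# given perf_event_paranoid level (mainline semantics). Hypothetical helper.
paranoid_meaning() {
  case "$1" in
    -1) echo "no restrictions" ;;
    0)  echo "system-wide CPU events allowed; raw tracepoints need privilege" ;;
    1)  echo "own processes only; kernel and user-space samples" ;;
    2)  echo "own processes only; user-space samples only" ;;
    *)  echo "outside the mainline range; check the distribution documentation" ;;
  esac
}
```

To describe the current setting: paranoid_meaning "$(cat /proc/sys/kernel/perf_event_paranoid)".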

Memory Investigation

Memory overview (free)

free -h

Focus on the available column (includes reclaimable cache). available < 10 % of total RAM is a warning. Any non‑zero Swap usage that grows indicates physical memory shortage. The shared column shows tmpfs usage (e.g., PostgreSQL shared memory).
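The 10 % rule can be checked directly against /proc/meminfo. The sketch below assumes a kernel that exposes MemAvailable (3.14 or later); the function name mem_available_pct is hypothetical.

```shell
# mem_available_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a percentage of MemTotal (one decimal). Hypothetical helper.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%.1f\n", avail * 100 / total
      else exit 1
    }'
}
```

Usage: mem_available_pct < /proc/meminfo. A result below 10.0 matches the warning threshold given above.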

Detailed metrics (/proc/meminfo)

cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|Slab|SReclaimable|SUnreclaim|Dirty|Writeback|AnonPages|Mapped|Shmem|HugePages_Total"

Key fields:

Slab / SUnreclaim – kernel slab usage; a steadily growing SUnreclaim may indicate a kernel memory leak

Dirty / Writeback – high values mean dirty pages cannot be flushed fast enough

AnonPages – anonymous (heap and stack) memory actually used by processes

HugePages_Total – huge-page allocation (databases)

Kernel slab analysis (slabtop)

slabtop -o | head -20   # sort by memory size
slabtop -d 2            # continuous monitoring every 2 s
# Drop caches (requires root); run sync first so dirty pages are written back and become droppable
sync
echo 3 > /proc/sys/vm/drop_caches   # pagecache + dentries + inodes
# Production recommendation: drop only dentries and inodes
echo 2 > /proc/sys/vm/drop_caches

Warning: Dropping caches causes a short I/O spike; avoid during peak traffic.

Process memory (smem)

# Install
yum install -y smem   # CentOS
apt install -y smem   # Ubuntu
# Show top processes by PSS (proportional set size)
smem -rkt -s pss | head -20

PSS distributes shared library memory proportionally, giving a more accurate per‑process footprint than RSS. Example: 50 PHP‑FPM workers show 8 GB RSS total but only 3 GB PSS.
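To make the PSS arithmetic concrete, here is a minimal sketch with two hypothetical helpers: pss_estimate_kb reproduces the proportional split by hand, and sum_pss_kb totals the Pss: lines of smaps-style text. The numbers in the usage note are illustrative, not measurements.

```shell
# pss_estimate_kb PRIVATE_KB SHARED_KB NPROC: private memory plus an equal
# share of the shared pages. Hypothetical helper for illustration.
pss_estimate_kb() {
  awk -v p="$1" -v s="$2" -v n="$3" 'BEGIN { printf "%d\n", p + s / n }'
}

# sum_pss_kb: add up every "Pss:" line (kB) from smaps-style text on stdin.
sum_pss_kb() {
  awk '$1 == "Pss:" { kb += $2 } END { print kb + 0 }'
}
```

For example, a worker with 20 MB of private memory that shares a 100 MB mapping with 49 siblings is charged pss_estimate_kb 20480 102400 50, while its RSS would count the full 100 MB. For a real process, sum_pss_kb < /proc/<PID>/smaps gives the total (subject to the usual permission checks).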

Process‑level memory details

# Memory map of a process
pmap -x <PID> | tail -1
# Detailed status
cat /proc/<PID>/status | grep -E "VmSize|VmRSS|VmSwap|Threads"
# Full smaps summary (kernel ≥ 4.14; needs root for other users' processes)
cat /proc/<PID>/smaps_rollup

OOM Killer investigation

# Recent OOM logs
dmesg | grep -i "oom\|out of memory" | tail -20
# Which process was killed
dmesg | grep "Killed process"
# Current OOM score (higher = more likely to be killed)
cat /proc/<PID>/oom_score
# Adjust OOM priority (‑1000 = never killed)
echo -1000 > /proc/<PID>/oom_score_adj

Disk I/O Investigation

Device utilization (iostat)

iostat -xz 1 5

Important columns and alert thresholds:

%util > 80 % → near saturation on HDD (SSD and NVMe devices serve requests in parallel, so a high %util alone is not conclusive there)

r_await / w_await > 20 ms (HDD) or > 5 ms (SSD) → latency concern

aqu-sz > 4 → serious queue length
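Because iostat column positions differ between sysstat versions, a filter that looks columns up by header name is more robust than fixed field numbers. The sketch below assumes the header contains r_await, w_await and %util (older releases print a single await column instead); iostat_flags is a hypothetical name.

```shell
# iostat_flags [UTIL_MAX] [AWAIT_MAX_MS]: read `iostat -x` text on stdin and
# print devices whose %util, r_await or w_await exceed the limits.
# Hypothetical helper; defaults follow the HDD thresholds above.
iostat_flags() {
  awk -v util_max="${1:-80}" -v await_max="${2:-20}" '
    NF == 0 { in_dev = 0; next }            # a blank line ends a device table
    $1 ~ /^Device/ {                        # header: remember column positions
      for (i = 1; i <= NF; i++) col[$i] = i
      in_dev = 1
      next
    }
    in_dev {
      u = $(col["%util"]) + 0
      r = $(col["r_await"]) + 0
      w = $(col["w_await"]) + 0
      if (u > util_max || r > await_max || w > await_max)
        printf "%s util=%.1f r_await=%.1f w_await=%.1f\n", $1, u, r, w
    }'
}
```

Usage: iostat -dxz 1 3 | iostat_flags 80 20 for HDDs, or a lower await limit such as iostat_flags 80 5 for SSDs. The first report covers the time since boot, so later reports are the ones to act on.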

I/O‑heavy processes (iotop)

# Show only processes doing I/O
iotop -oP
# Scriptable mode for automation
iotop -oP -b -n 5 -d 1

Per‑process I/O (pidstat -d)

# System‑wide I/O snapshot
pidstat -d 1 5
# Specific process
pidstat -d -p <PID> 1 5

Block‑device tracing (blktrace)

# Trace /dev/sda for 10 s (writes one trace.blktrace.<cpu> file per CPU)
blktrace -d /dev/sda -o trace -w 10
# Parse all per-CPU files and also write a binary dump for btt
blkparse -i trace -d trace.bin -o trace.txt
# Generate latency distribution
btt -i trace.bin -o btt_output

Note: blktrace generates large data; run only for short periods.

Filesystem checks

# Filesystem usage
df -hT
# Inode usage (exhausted inodes also cause write failures)
df -i
# Largest directories
du -sh /* 2>/dev/null | sort -rh | head -10
# Deleted but still open files
lsof +L1

Network Investigation

Connection statistics (ss)

# Summary
ss -s
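# TIME_WAIT count (skip the header line)
ss -tan state time-wait | tail -n +2 | wc -l
# Top local ports by ESTABLISHED connections (with a state filter the State column is omitted, so $3 is the local address)
ss -tn state established | awk 'NR>1 {print $3}' | awk -F: '{print $NF}' | sort | uniq -c | sort -rn | head -10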
# Listening sockets
ss -tlnp

Guidelines: TIME_WAIT > 20 k is a warning; > 50 k suggests optimization. Compare ESTABLISHED count with application connection‑pool settings.
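A quick way to compare TIME_WAIT and ESTABLISHED counts against these guidelines is to tally the first column of ss -tan. The helper name count_tcp_states is hypothetical; it assumes the unfiltered output, where the first column is the state.

```shell
# count_tcp_states: read `ss -tan` text on stdin and print "COUNT STATE"
# lines, highest count first. Hypothetical helper.
count_tcp_states() {
  awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}
```

Usage: ss -tan | count_tcp_states. The header row is skipped, so the counts are exact.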

Network traffic (sar)

# Interface stats (real‑time)
sar -n DEV 1 5
# Errors
sar -n EDEV 1 5
# TCP stats
sar -n TCP,ETCP 1 5

Key metrics:

rxkB/s / txkB/s – bandwidth vs NIC rating

rxpck/s / txpck/s – packet rate (10 GbE tops out at about 14.88 Mpps with minimum-size 64-byte frames)

%ifutil – interface utilization

retrans/s > 0 (from sar -n ETCP) → retransmissions, usually a sign of packet loss

Protocol‑stack counters (nstat)

nstat -az | grep -E "TcpRetransSegs|TcpExtTCPLostRetransmit|TcpExtListenOverflows|TcpExtListenDrops|TcpExtTCPAbortOnMemory"

Important counters:

TcpRetransSegs – a growing value indicates loss

TcpExtListenOverflows / TcpExtListenDrops – full listen queue; increase somaxconn and the application backlog

TcpExtTCPAbortOnMemory – connections aborted due to memory shortage
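Raw counters are easier to judge as a ratio. The sketch below derives a retransmission percentage from TcpRetransSegs and TcpOutSegs; note that TcpOutSegs is not in the grep filter above, so it needs the full nstat -az output. The name retrans_pct is hypothetical.

```shell
# retrans_pct: read `nstat -az` text on stdin and print TcpRetransSegs as a
# percentage of TcpOutSegs (two decimals), or n/a. Hypothetical helper.
retrans_pct() {
  awk '
    $1 == "TcpOutSegs"     { out = $2 }
    $1 == "TcpRetransSegs" { re  = $2 }
    END {
      if (out > 0) printf "%.2f\n", re * 100 / out
      else print "n/a"
    }'
}
```

Usage: nstat -az | retrans_pct. With -a the counters are absolute since boot, so the result is a long-term average; comparing two runs a minute apart gives a current rate.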

Packet capture (tcpdump)

# Capture 10 000 packets on port 80
tcpdump -i eth0 port 80 -w /tmp/capture.pcap -c 10000
# Capture traffic to a specific IP
tcpdump -i eth0 host 10.0.0.1 -nn
# Show only SYN packets (connection setup)
tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0' -nn
# Show only RST packets (connection reset)
tcpdump -i eth0 'tcp[tcpflags] & tcp-rst != 0' -nn

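Warning: Always limit packet count (-c) or capture duration in production to avoid filling disks. -G only rotates the output file every N seconds; combine it with -W so tcpdump exits after a fixed number of files, or wrap the command in timeout.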

NIC‑level checks

# Error statistics
ethtool -S eth0 | grep -i error
# Drop statistics
ethtool -S eth0 | grep -i drop
# Ring buffer size
ethtool -g eth0
# Soft‑interrupt stats per CPU
cat /proc/net/softnet_stat
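
/proc/net/softnet_stat prints one row per CPU, with hexadecimal columns:

1st – processed packets

2nd – dropped packets (backlog queue full; increase netdev_max_backlog)

3rd – time_squeeze (the poll loop stopped because netdev_budget or its time limit ran out)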
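Since the file is hexadecimal, a small decoder helps when reading it by eye. This sketch follows the mainline kernel's column order (processed, dropped, time_squeeze) and assumes rows appear in CPU order; softnet_decode is a hypothetical name.

```shell
# softnet_decode: read /proc/net/softnet_stat-style text on stdin and print
# the first three columns in decimal, one line per row. Hypothetical helper.
softnet_decode() {
  cpu=0
  while read -r processed dropped squeezed _; do
    echo "cpu$cpu processed=$((0x$processed)) dropped=$((0x$dropped)) time_squeeze=$((0x$squeezed))"
    cpu=$((cpu + 1))
  done
}
```

Usage: softnet_decode < /proc/net/softnet_stat. A dropped counter that keeps growing points at the backlog queue, while a growing time_squeeze points at the polling budget.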

Automation Scripts

One‑click performance snapshot

#!/bin/bash
set -euo pipefail
OUTPUT_DIR=${1:-/tmp/perf_snapshot_$(date +%Y%m%d_%H%M%S)}
mkdir -p "$OUTPUT_DIR"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
collect(){
  name=$1; shift
  log "Collecting $name ..."
  "$@" > "$OUTPUT_DIR/${name}.txt" 2>&1 || echo "Failed $name: $?" >> "$OUTPUT_DIR/errors.log"
}
log "Starting collection, output: $OUTPUT_DIR"
collect "os_release" cat /etc/os-release
collect "kernel" uname -a
collect "uptime" uptime
collect "date" date '+%Y-%m-%d %H:%M:%S %Z'
collect "hostname" hostname -f
collect "top_snapshot" bash -c "top -bn1 | head -50"
collect "mpstat" mpstat -P ALL 1 5
collect "pidstat_cpu" pidstat -u 1 5
collect "pidstat_ctx" pidstat -w 1 5
collect "free" free -h
collect "meminfo" cat /proc/meminfo
collect "slabtop" slabtop -o
collect "smem" bash -c "smem -rkt -s pss 2>/dev/null || echo 'smem not installed'"
collect "iostat" iostat -xz 1 5
collect "pidstat_io" pidstat -d 1 5
collect "df" df -hT
collect "df_inode" df -i
collect "ss_summary" ss -s
collect "ss_established" bash -c "ss -tn state established | head -100"
collect "ss_time_wait" bash -c "ss -tan state time-wait | wc -l"
collect "ss_listen" ss -tlnp
collect "netstat_stats" bash -c "nstat -az 2>/dev/null || netstat -s"
collect "vmstat" vmstat 1 10
collect "dmesg_errors" bash -c "dmesg -T 2>/dev/null | tail -100"
collect "dmesg_oom" bash -c "dmesg | grep -i 'oom\|out of memory\|killed process' || echo 'No OOM events'"
collect "ps_aux" bash -c "ps aux --sort=-%mem | head -30"
collect "ps_d_state" bash -c "ps aux | awk '\$8 ~ /^D/'"
collect "journal_recent" bash -c "journalctl --since '10 minutes ago' --no-pager 2>/dev/null | tail -200 || tail -200 /var/log/syslog 2>/dev/null || echo 'No syslog access'"
log "Collection finished, files:"
ls -la "$OUTPUT_DIR"
# Package results
tar czf "${OUTPUT_DIR}.tar.gz" -C "$(dirname "$OUTPUT_DIR")" "$(basename "$OUTPUT_DIR")"
log "Packaged: ${OUTPUT_DIR}.tar.gz (size: $(du -sh "${OUTPUT_DIR}.tar.gz" | awk '{print $1}'))"

Continuous monitor script

#!/bin/bash
set -euo pipefail
INTERVAL=${1:-5}
DURATION=${2:-3600}
OUTPUT="/tmp/perf_monitor_$(date +%Y%m%d_%H%M%S).csv"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
echo "timestamp,load1,load5,load15,cpu_us,cpu_sy,cpu_wa,cpu_st,mem_used_pct,swap_used_mb,disk_util,net_rx_kb,net_tx_kb,tcp_estab,tcp_tw,context_switch,interrupts" > "$OUTPUT"
END_TIME=$((SECONDS + DURATION))
while [ "$SECONDS" -lt "$END_TIME" ]; do
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  read -r LOAD1 LOAD5 LOAD15 _ < /proc/loadavg
  # One vmstat run supplies CPU percentages, context switches and interrupts
  read -r CPU_US CPU_SY CPU_WA CPU_ST CS INTR <<<"$(vmstat 1 2 | tail -1 | awk '{print $13,$14,$16,$17,$12,$11}')"
  MEM_USED_PCT=$(free | awk '/Mem:/{printf "%.1f",($3/$2)*100}')
  SWAP_USED=$(free -m | awk '/Swap:/{print $3}')
  # Highest %util in the second (interval) report; assumes %util is the last column
  DISK_UTIL=$(iostat -dxz 1 2 | awk '/^Device/{rep++; next} rep==2 && NF>1 {if($NF+0>max) max=$NF+0} END{print max+0}')
  # Skip the "Average: IFACE ..." header row and the loopback interface
  read -r NET_RX NET_TX <<<"$(sar -n DEV 1 1 2>/dev/null | awk '/^Average:/ && $2!="IFACE" && $2!="lo"{print $5,$6; f=1; exit} END{if(!f) print "0 0"}')"
  # tail -n +2 drops the ss header line so the counts are exact
  TCP_ESTAB=$(ss -tn state established 2>/dev/null | tail -n +2 | wc -l)
  TCP_TW=$(ss -tan state time-wait 2>/dev/null | tail -n +2 | wc -l)
  echo "$TIMESTAMP,$LOAD1,$LOAD5,$LOAD15,$CPU_US,$CPU_SY,$CPU_WA,$CPU_ST,$MEM_USED_PCT,$SWAP_USED,$DISK_UTIL,$NET_RX,$NET_TX,$TCP_ESTAB,$TCP_TW,$CS,$INTR" >> "$OUTPUT"
  sleep "$INTERVAL"
done
log "Monitoring finished, data file: $OUTPUT"
log "Total records: $(($(wc -l < "$OUTPUT") - 1))"

Real‑World Cases

Case 1 – Java CPU 100 %

Scenario: an online Java service spikes to 100 % CPU, causing timeouts.

# Identify the Java process
top -bn1 | grep java   # PID 12345 shows 398 % on a 4‑core box
# Find the hottest thread
top -Hp 12345 -bn1 | head -20   # Thread 12378 uses 98 %
# Convert thread ID to hex for jstack
printf "%x\n" 12378   # 305a
# Dump Java stack traces
jstack 12345 > /tmp/jstack_12345.txt
# Locate the thread in the dump
grep -A 30 "nid=0x305a" /tmp/jstack_12345.txt

Root cause: a vulnerable regular expression caused catastrophic backtracking (ReDoS) in validateEmail.

Resolution:

Temporary: restart the service.

Permanent: replace the regex or add timeout controls.

Prevention: run ReDoS tests on new regexes.

Advanced analysis – Java flame graph:

# Install perf‑map‑agent
git clone https://github.com/jvm-profiling-tools/perf-map-agent.git
cd perf-map-agent && cmake . && make
# Create symbol map for the PID
bin/create-java-perf-map.sh 12345
# Sample for 30 s (start the JVM with -XX:+PreserveFramePointer, otherwise perf cannot walk the Java stacks)
perf record -p 12345 -g -- sleep 30
# Generate flame graph
perf script | /opt/FlameGraph/stackcollapse-perf.pl | /opt/FlameGraph/flamegraph.pl > java_flame.svg

Case 2 – MySQL I/O Wait High

Scenario: iowait > 30 %, MySQL slow queries increase, API timeouts.

# Verify I/O bottleneck
iostat -xz 1 3   # sda %util 95 %, w_await 45 ms
# Identify MySQL I/O
iotop -oP -b -n 3 -d 1   # mysqld writes 180 MB/s
# Show running queries
mysql -e "SHOW PROCESSLIST\G" | grep -B5 "Query"
# Inspect slow‑query log
tail -100 /var/log/mysql/slow.log
# Analyze with pt‑query‑digest
pt-query-digest /var/log/mysql/slow.log --since '1h' | head -50

Finding: a full‑table scan on orders (5 M rows) took 12.5 s.

Fix:

Add composite index (status, create_time).

Increase InnoDB buffer pool (e.g., 6 GB on a 16 GB machine).

Tune innodb_io_capacity, innodb_io_capacity_max, and disable innodb_flush_neighbors on SSD.

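Apply the changes. The innodb_io_capacity, innodb_io_capacity_max and innodb_flush_neighbors settings are dynamic (SET GLOBAL), and innodb_buffer_pool_size can be resized online from MySQL 5.7.5; restart MySQL only where a setting cannot be changed at runtime.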

Case 3 – High Load, Low CPU Utilization

Scenario: load average > 50 on an 8‑core box, but top shows only 30 % CPU usage.

# vmstat shows many D‑state processes
vmstat 1 5   # column b constantly > 40
# List D‑state processes
ps aux | awk '$8~/^D/{print $0}'
# Disk I/O is saturated
iostat -xz 1 3   # sda %util 100 %, await 850 ms
# Identify I/O hogs
iotop -oP -b -n 3
# Find the culprit (e.g., 40 rsync jobs)
crontab -l   # backup jobs run simultaneously

Root cause: backup tasks were not staggered, causing massive disk I/O and many processes entering D state.

Solution:

Temporary: kill some rsync processes.

Permanent: stagger backups (e.g., 5‑minute offsets).

Optimization: add --bwlimit=50000 to rsync to cap bandwidth (the unit is KiB/s, so roughly 50 MB/s per job).
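Staggering can be generated rather than typed by hand. The sketch below prints crontab lines whose start times are spaced by a fixed offset; the function name stagger_cron and the backup script path in the usage note are hypothetical.

```shell
# stagger_cron COUNT OFFSET_MIN START_HOUR CMD: print COUNT crontab lines,
# each starting OFFSET_MIN minutes after the previous one. The job index is
# appended as an argument to CMD. Hypothetical helper.
stagger_cron() {
  count=$1 offset=$2 hour=$3 cmd=$4
  i=0
  while [ "$i" -lt "$count" ]; do
    total=$((i * offset))
    echo "$((total % 60)) $(( (hour + total / 60) % 24 )) * * * $cmd $i"
    i=$((i + 1))
  done
}
```

For instance, stagger_cron 40 5 2 /usr/local/bin/backup.sh spreads 40 jobs from 02:00 onward at 5-minute intervals, rolling over into the next hour when the minute field would exceed 59.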

Case 4 – Network Packet Loss

Scenario: application logs show many connection timeouts; occasional ping loss.

# Verify packet loss
ping -c 100 10.0.0.2   # 5 % loss
# NIC error stats
ethtool -S eth0 | grep -E "drop|error|fifo"
# Increase ring buffer if needed
ethtool -G eth0 rx 4096
# Softnet stats – backlog
cat /proc/net/softnet_stat
# Enable RPS to distribute soft interrupts
echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus
# Verify loss resolved
ping -c 100 10.0.0.2   # 0 % loss
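The value ff written to rps_cpus is a hexadecimal CPU bitmask (bits 0 to 7, that is, CPUs 0 through 7). A minimal sketch for deriving the mask from a CPU count follows; rps_mask is a hypothetical name, and the single-word form only covers machines with fewer than 63 CPUs, since larger masks are written as comma-separated 32-bit groups.

```shell
# rps_mask NCPUS: print the hex bitmask that selects CPUs 0..NCPUS-1.
# Hypothetical helper; valid for NCPUS < 63.
rps_mask() {
  printf '%x\n' $(( (1 << $1) - 1 ))
}
```

Usage: rps_mask "$(nproc)" prints the mask covering every online CPU on small machines; rps_mask 8 reproduces the ff used above.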

Kernel Parameter Tuning (Production‑Validated)

Backup current values before changing:

sysctl -a > /tmp/sysctl_backup_$(date +%Y%m%d).conf

Memory‑related parameters:

# Swappiness – tendency to use swap
sysctl -w vm.swappiness=10
# Dirty page limits – avoid large writeback bursts
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
# Overcommit – keep default (0) for most services
sysctl -w vm.overcommit_memory=0

Network‑related parameters:

# Listen queue size
sysctl -w net.core.somaxconn=65535
# SYN backlog
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# NIC receive queue length
sysctl -w net.core.netdev_max_backlog=50000
# TIME_WAIT reuse (applies to outgoing connections only; the NAT-unsafe option was tcp_tw_recycle, removed in kernel 4.12)
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_max_tw_buckets=50000
# TCP keepalive
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=3
# TCP memory limits in pages, not bytes (example for 16 GB RAM: 1 / 2 / 3 GiB with 4 KiB pages)
sysctl -w net.ipv4.tcp_mem="262144 524288 786432"
# Socket buffers
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
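Unlike the socket buffer settings, which are in bytes, net.ipv4.tcp_mem is expressed in memory pages. A small conversion sketch makes the example values easier to sanity-check; pages_to_mib is a hypothetical name, and the page size defaults to whatever getconf PAGESIZE reports.

```shell
# pages_to_mib PAGES [PAGE_SIZE_BYTES]: convert a page count to MiB.
# Hypothetical helper; the page size defaults to the system value.
pages_to_mib() {
  page_size=${2:-$(getconf PAGESIZE)}
  echo $(( $1 * page_size / 1048576 ))
}
```

With 4 KiB pages, pages_to_mib 262144 4096 gives 1024, so the three tcp_mem values above correspond to 1, 2 and 3 GiB.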

File descriptor limits:

# System‑wide limit
sysctl -w fs.file-max=2097152
# Persist in /etc/sysctl.d/99-performance.conf
cat > /etc/sysctl.d/99-performance.conf <<'EOF'
vm.swappiness = 10
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 50000
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
fs.file-max = 2097152
EOF
sysctl -p /etc/sysctl.d/99-performance.conf

Security Hardening

Restrict unprivileged perf usage to user-space only:

# Level 2: unprivileged users may sample user space only (no kernel profiling)
echo 2 > /proc/sys/kernel/perf_event_paranoid
# Hide kernel pointers
echo 1 > /proc/sys/kernel/kptr_restrict
# Audit execution of perf and tcpdump
auditctl -a always,exit -F path=/usr/bin/perf -F perm=x -k perf_usage
auditctl -a always,exit -F path=/usr/sbin/tcpdump -F perm=x -k tcpdump_usage

Delete captured pcap files after analysis to avoid leaking sensitive data.

Continuous Monitoring Stack

Deploy node_exporter + Prometheus + Grafana for long‑term metrics.

# Install node_exporter 1.7.0
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Systemd unit
cat > /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes \
    --collector.tcpstat \
    --web.listen-address=:9100
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
useradd -r -s /sbin/nologin node_exporter
systemctl daemon-reload
systemctl enable --now node_exporter

Example Prometheus alert rules (node_alerts.yml):

groups:
- name: node_alerts
  rules:
  - alert: HighCpuUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage too high on {{ $labels.instance }}"
      description: "CPU usage {{ $value | printf \"%.1f\" }}% for over 5 minutes"
  - alert: HighMemoryUsage
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage too high on {{ $labels.instance }}"
      description: "Available memory <10%, current usage {{ $value | printf \"%.1f\" }}%"
  - alert: HighLoadAverage
    expr: node_load1 / count without(cpu,mode) (node_cpu_seconds_total{mode="idle"}) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Load average too high on {{ $labels.instance }}"
      description: "1‑minute load > 2×CPU cores"
  - alert: HighDiskUtilization
    expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Disk I/O saturation on {{ $labels.instance }}"
      description: "Device {{ $labels.device }} utilization >90%"
  - alert: DiskSpaceRunningOut
    expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Disk space low on {{ $labels.instance }}"
      description: "Mount {{ $labels.mountpoint }} usage {{ $value | printf \"%.1f\" }}%"
  - alert: HighNetworkErrors
    expr: rate(node_network_receive_errs_total[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Network receive errors on {{ $labels.instance }}"
      description: "Device {{ $labels.device }} error rate {{ $value | printf \"%.1f\" }}/s"
  - alert: SwapUsageIncreasing
    expr: deriv(node_memory_SwapFree_bytes[10m]) < -1048576
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Swap usage increasing on {{ $labels.instance }}"
      description: "Swap is continuously growing, possible memory leak"

Alert Thresholds

CPU usage (100 % − idle) > 85 % for 5 min → HighCpuUsage

Load average > 2 × CPU cores for 5 min → HighLoadAverage

Memory available < 10 % → HighMemoryUsage

Swap usage growing → SwapUsageIncreasing

Disk %util > 90 % for 1 min → HighDiskUtilization

Disk await > 30 ms (HDD) / > 10 ms (SSD) → performance degradation

Network receive errors > 10 /s for 5 min → HighNetworkErrors

TCP retransmits > 1 % of sent segments → investigate packet loss

TIME_WAIT > 50 000 → investigate connection handling

Context switches > 100 000 /s per core → possible CPU contention

Baseline Collection

Collect a 5‑minute baseline snapshot for later comparison:

#!/bin/bash
# collect_baseline.sh – weekly 5-minute baseline
set -euo pipefail
BASELINE_DIR="/opt/perf_baseline"
DATE=$(date +%Y%m%d)
OUTPUT="$BASELINE_DIR/baseline_$DATE"
mkdir -p "$OUTPUT"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
log "Collecting baseline…"
vmstat 1 300 > "$OUTPUT/vmstat.txt" &
iostat -xz 1 300 > "$OUTPUT/iostat.txt" &
mpstat -P ALL 1 300 > "$OUTPUT/mpstat.txt" &
sar -n DEV 1 300 > "$OUTPUT/sar_net.txt" &
pidstat -u -d -r 5 60 > "$OUTPUT/pidstat.txt" &
wait
free -h > "$OUTPUT/free.txt"
df -hT > "$OUTPUT/df.txt"
ss -s > "$OUTPUT/ss.txt"
cat /proc/meminfo > "$OUTPUT/meminfo.txt"
sysctl -a > "$OUTPUT/sysctl.txt" 2>/dev/null
# Keep last 30 days (top-level baseline directories only)
find "$BASELINE_DIR" -mindepth 1 -maxdepth 1 -name "baseline_*" -mtime +30 -exec rm -rf {} +
log "Baseline saved to $OUTPUT"

Summary

The USE methodology provides a disciplined view of every resource type along three dimensions: Utilization, Saturation and Errors. A quick 60-second global scan (uptime, top, vmstat, dstat) points to the bottleneck direction; layered tools (mpstat, pidstat, perf, slabtop, iotop, blktrace, nstat, tcpdump) pinpoint the exact cause. Production-validated kernel parameters, careful cache dropping, and controlled use of privileged tools keep the system stable during diagnosis. Continuous monitoring with node_exporter, Prometheus and well-crafted alert rules catches regressions early, while baseline snapshots and historical sar data aid post-mortem analysis. The case studies illustrate typical CPU, disk I/O and network problems and their step-by-step resolutions, giving a practical playbook for Linux performance engineering.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
