Linux System Performance Troubleshooting: Complete End‑to‑End Workflow from top to perf
This article presents a systematic, USE‑methodology‑based workflow for diagnosing Linux performance issues, covering CPU, memory, disk I/O and network bottlenecks with step‑by‑step commands, detailed examples, scripts, case studies, best‑practice recommendations and monitoring guidelines.
Overview
Performance incidents such as high latency, load spikes, OOM kills or packet loss are diagnosed with the USE methodology (Utilization / Saturation / Errors). The guide provides a four‑dimensional, layered workflow that covers CPU, memory, disk I/O and network, using only standard Linux tools and eBPF‑based utilities.
Environment Requirements
Operating system: CentOS 7+ or Ubuntu 20.04+ (kernel ≥ 4.9 for full perf/BPF support; the stock CentOS 7 kernel is 3.10, so the eBPF‑based tools need a newer kernel there)
sysstat ≥ 11.0 (mpstat, iostat, pidstat, sar)
perf matching the running kernel (linux‑tools-$(uname -r))
bcc‑tools ≥ 0.12 (eBPF tracing)
dstat ≥ 0.7 (unified monitoring)
Quick 60‑Second Diagnosis
Load average (uptime)
Run uptime. The three numbers are the 1‑, 5‑, and 15‑minute load averages. If 1 min > 5 min > 15 min, the load is rising. Compare the absolute value with the CPU core count (e.g., on an 8‑core machine, load ≈ 8 means fully loaded and load ≫ 8 means severe overload). Load also counts processes in uninterruptible sleep (D state), so a high value does not always mean CPU saturation.
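The comparison of load against core count can be scripted. The following is a minimal sketch: the function name classify_load is hypothetical, and the 2× cut‑off for "severe" is an assumed threshold rather than anything defined by the kernel.

```shell
#!/bin/sh
# classify_load LOAD CORES -> prints ok | saturated | severe
# Assumption: "severe" starts at 2x the core count (illustrative threshold).
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        if (ratio < 1)      print "ok"
        else if (ratio < 2) print "saturated"
        else                print "severe"
    }'
}

# Live check, only where /proc/loadavg and nproc are available.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _rest < /proc/loadavg
    echo "load1=$load1 cores=$(nproc) -> $(classify_load "$load1" "$(nproc)")"
fi
```

Because load includes D‑state processes, a "severe" result should be cross‑checked against vmstat before concluding that the CPU is the bottleneck.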
Global resource view (top)
Run top -bn1 | head -20. Key fields and alert thresholds:
us > 80 % → user‑space CPU bound
sy > 20 % → system‑call intensive or lock contention
wa > 5 % → I/O wait
st > 5 % → VM steal (hypervisor over‑commit)
si high → network soft‑interrupt overload
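These thresholds can be checked automatically. The sketch below assumes the procps‑ng summary line format ("%Cpu(s): 5.9 us, 2.0 sy, ..."); the function name check_top_cpu is hypothetical, and si is left out because the text gives no numeric threshold for it.

```shell
#!/bin/sh
# check_top_cpu: read `top -bn1` output on stdin and print one line per
# threshold exceeded. Assumes the procps-ng summary line, e.g.
#   %Cpu(s):  5.9 us,  2.0 sy,  0.0 ni, 91.2 id,  0.5 wa,  0.0 hi,  0.4 si,  0.0 st
check_top_cpu() {
    awk '/Cpu\(s\)/ {
        sub(/^[^:]*:/, "")                 # drop the "%Cpu(s):" label
        n = split($0, part, ",")
        for (i = 1; i <= n; i++) {
            split(part[i], kv, " ")        # kv[1] = value, kv[2] = field name
            v[kv[2]] = kv[1] + 0
        }
        if (v["us"] > 80) printf "us %.1f: user-space CPU bound\n", v["us"]
        if (v["sy"] > 20) printf "sy %.1f: system-call intensive or lock contention\n", v["sy"]
        if (v["wa"] > 5)  printf "wa %.1f: I/O wait\n", v["wa"]
        if (v["st"] > 5)  printf "st %.1f: VM steal\n", v["st"]
    }'
}
```

Typical use: top -bn1 | check_top_cpu. No output means none of the four thresholds was exceeded in that snapshot.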
System‑level snapshot (vmstat)
Run vmstat 1 10. Important columns:
r > CPU cores → run‑queue saturation
b > 0 → processes in uninterruptible sleep (I/O wait)
si/so > 0 → swap activity (memory pressure)
cs > ≈ 100 k/s per core → warning (context switches)
wa → I/O wait percentage
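A small filter can flag the r, b and si/so conditions above. This is a sketch with a hypothetical function name (check_vmstat); it finds columns by header name and skips the first data row, which vmstat reports as an average since boot.

```shell
#!/bin/sh
# check_vmstat CORES: read `vmstat 1 N` output on stdin and flag saturation.
check_vmstat() {
    awk -v cores="$1" '
        # Header row: remember each column position by name.
        $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Skip anything that is not a data row.
        !("r" in col) || $1 !~ /^[0-9]+$/ { next }
        # The first data row is the average since boot, not a live sample.
        ++seen == 1 { next }
        {
            r = $(col["r"]) + 0
            b = $(col["b"]) + 0
            swap = $(col["si"]) + $(col["so"])
            if (r > cores + 0) printf "sample %d: run queue %d > %d cores\n", seen - 1, r, cores
            if (b > 0)         printf "sample %d: %d blocked (D state) processes\n", seen - 1, b
            if (swap > 0)      printf "sample %d: swap activity si+so=%d\n", seen - 1, swap
        }'
}
```

Typical use: vmstat 1 10 | check_vmstat "$(nproc)".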
Unified real‑time monitoring (dstat)
Run dstat -tcmsdnl --top-cpu --top-io 5. Options:
-t timestamp
-c CPU
-m memory
-s swap
-d disk
-n network
-l load average
--top-cpu highest CPU consumers
--top-io highest I/O consumers
CPU Deep Dive
Per‑CPU utilization (mpstat)
mpstat -P ALL 1 5
Look for a core stuck at 100 % (single‑thread bottleneck), high %sys (system‑call intensive), or high %soft (network soft‑interrupt concentration). Check /proc/interrupts if needed.
Process‑level CPU (pidstat)
# Show processes using >10 % CPU every second
pidstat -u 1 5 | awk 'NR<=3 || $8>10'
# Show threads of a specific PID
pidstat -u -t -p <PID> 1 5
Hot functions (perf top / perf record)
# Real‑time hot functions
perf top -p <PID>
# Sample for 30 s and generate a flame graph
perf record -p <PID> -g -- sleep 30
perf script | /opt/FlameGraph/stackcollapse-perf.pl \
| /opt/FlameGraph/flamegraph.pl > cpu_flame.svg
# Install FlameGraph repository
git clone https://github.com/brendangregg/FlameGraph.git /opt/FlameGraph
Note: perf top normally requires root. To let non‑root users profile their own processes (including kernel samples), an administrator can set /proc/sys/kernel/perf_event_paranoid to 1.
Memory Investigation
Memory overview (free)
free -h
Focus on the available column (it includes reclaimable cache). available < 10 % of total RAM is a warning. Swap usage that is non‑zero and growing indicates a shortage of physical memory. The shared column reports Shmem, i.e. tmpfs and shared‑memory segments (e.g., PostgreSQL shared memory).
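The 10 % rule can be evaluated directly from /proc/meminfo, which is where free gets its numbers. A minimal sketch, with mem_available_pct as a hypothetical helper name:

```shell
#!/bin/sh
# mem_available_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a percentage of MemTotal.
mem_available_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }'
}

if [ -r /proc/meminfo ]; then
    pct=$(mem_available_pct < /proc/meminfo)
    # awk handles the floating-point comparison that plain sh cannot do.
    if awk -v p="$pct" 'BEGIN { exit !(p < 10) }'; then
        echo "WARNING: only ${pct}% of RAM is available"
    else
        echo "OK: ${pct}% of RAM is available"
    fi
fi
```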
Detailed metrics (/proc/meminfo)
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|Slab|SReclaimable|SUnreclaim|Dirty|Writeback|AnonPages|Mapped|Shmem|HugePages_Total"
Key fields:
Slab / SUnreclaim – kernel slab usage; a growing SUnreclaim may indicate a kernel memory leak.
Dirty / Writeback – high values mean dirty pages cannot be flushed fast enough.
AnonPages – anonymous memory actually used by processes.
HugePages_Total – huge‑page allocation (common for databases).
Kernel slab analysis (slabtop)
slabtop -o | head -20 # sort by memory size
slabtop -d 2 # continuous monitoring every 2 s
# Drop caches (requires root)
echo 3 > /proc/sys/vm/drop_caches # pagecache + dentries + inodes
# Production recommendation: drop only dentries and inodes
echo 2 > /proc/sys/vm/drop_caches
Warning: Dropping caches causes a short I/O spike; avoid it during peak traffic.
Process memory (smem)
# Install
yum install -y smem # CentOS
apt install -y smem # Ubuntu
# Show top processes by PSS (proportional set size)
smem -rkt -s pss | head -20
PSS distributes shared‑library memory proportionally among the processes that map it, giving a more accurate per‑process footprint than RSS. Example: 50 PHP‑FPM workers show 8 GB RSS in total but only 3 GB PSS.
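To see the RSS/PSS difference without smem, the per‑mapping Rss and Pss fields of /proc/<PID>/smaps can be summed. For instance, a 4 MB library mapped by four processes adds 4 MB to each process's RSS but only 1 MB to each PSS. The helper name sum_smaps below is hypothetical; this is a sketch, not a replacement for smem.

```shell
#!/bin/sh
# sum_smaps: read /proc/<PID>/smaps-style text on stdin and print total
# Rss and Pss in kB.
sum_smaps() {
    awk '$1 == "Rss:" { rss += $2 }
         $1 == "Pss:" { pss += $2 }
         END { printf "rss_kb=%d pss_kb=%d\n", rss, pss }'
}

# Demo on the current shell, where smaps is readable.
if [ -r /proc/self/smaps ]; then
    sum_smaps < /proc/self/smaps
fi
```

Typical use: sum_smaps < /proc/<PID>/smaps (reading another user's process requires root).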
Process‑level memory details
# Memory map of a process
pmap -x <PID> | tail -1
# Detailed status
cat /proc/<PID>/status | grep -E "VmSize|VmRSS|VmSwap|Threads"
# Full smaps summary (requires root)
cat /proc/<PID>/smaps_rollup
OOM Killer investigation
# Recent OOM logs
dmesg | grep -i "oom\|out of memory" | tail -20
# Which process was killed
dmesg | grep "Killed process"
# Current OOM score (higher = more likely to be killed)
cat /proc/<PID>/oom_score
# Adjust OOM priority (‑1000 = never killed)
echo -1000 > /proc/<PID>/oom_score_adj
Disk I/O Investigation
Device utilization (iostat)
iostat -xz 1 5
Important columns and alert thresholds:
%util > 80 % → near saturation on HDD (higher values are acceptable on SSD)
r_await / w_await > 20 ms (HDD) or > 5 ms (SSD) → latency concern
aqu-sz > 4 → serious queue length
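The iostat thresholds can be applied with a small filter. Column order differs between sysstat versions, so this sketch looks columns up by header name; check_iostat is a hypothetical name, the defaults are the HDD values from the list above, and the fallback to avgqu-sz covers older sysstat releases that use that column name.

```shell
#!/bin/sh
# check_iostat [UTIL_MAX] [AWAIT_MAX_MS]: read `iostat -x` output on stdin
# and print one line per device metric above its threshold.
check_iostat() {
    awk -v util_max="${1:-80}" -v await_max="${2:-20}" '
        # Device header: map column names to positions.
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) col[$i] = i; in_dev = 1; next }
        # A blank line ends the device table.
        NF == 0 { in_dev = 0; next }
        in_dev && ("%util" in col) {
            if ($(col["%util"]) + 0 > util_max)
                printf "%s: %%util %.1f > %d\n", $1, $(col["%util"]), util_max
            n = split("r_await w_await", names, " ")
            for (k = 1; k <= n; k++)
                if ((names[k] in col) && $(col[names[k]]) + 0 > await_max)
                    printf "%s: %s %.1f ms > %d ms\n", $1, names[k], $(col[names[k]]), await_max
            q = ("aqu-sz" in col) ? "aqu-sz" : "avgqu-sz"
            if ((q in col) && $(col[q]) + 0 > 4)
                printf "%s: %s %.1f > 4\n", $1, q, $(col[q])
        }'
}
```

Typical use: iostat -xz 1 5 | check_iostat for HDDs, or iostat -xz 1 5 | check_iostat 95 5 as a possible SSD setting (the 95 is an assumed value; the text only says higher utilization is acceptable on SSD). The first iostat report covers the time since boot, so its lines reflect long‑term averages rather than the current state.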
I/O‑heavy processes (iotop)
# Show only processes doing I/O
iotop -oP
# Scriptable mode for automation
iotop -oP -b -n 5 -d 1
Per‑process I/O (pidstat -d)
# System‑wide I/O snapshot
pidstat -d 1 5
# Specific process
pidstat -d -p <PID> 1 5
Block‑device tracing (blktrace)
# Trace /dev/sda for 10 s
blktrace -d /dev/sda -o trace -w 10
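# Parse all per‑CPU trace files and also write a binary dump for btt
blkparse -i trace -d trace.bin -o trace.txt
# Generate latency distribution (btt reads the binary dump produced by blkparse -d)
btt -i trace.bin -o btt_output
Note: blktrace generates a large amount of data; run it only for short periods.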
Filesystem checks
# Filesystem usage
df -hT
# Inode usage (exhausted inodes also cause write failures)
df -i
# Largest directories
du -sh /* 2>/dev/null | sort -rh | head -10
# Deleted but still open files
lsof +L1
Network Investigation
Connection statistics (ss)
# Summary
ss -s
# TIME_WAIT count
ss -tan state time-wait | wc -l
# Top ESTABLISHED ports
ss -tn state established | awk '{print $4}' | awk -F: '{print $NF}' | sort | uniq -c | sort -rn | head -10
# Listening sockets
ss -tlnp
Guidelines: TIME_WAIT > 20 k is a warning; above 50 k, connection handling should be optimized. Compare the ESTABLISHED count with the application's connection‑pool settings.
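A per‑state summary and the TIME_WAIT guideline can be combined in a short script. The function names count_tcp_states and tw_level are hypothetical; the sketch assumes the usual ss -tan layout with one header line and the state in the first column.

```shell
#!/bin/sh
# count_tcp_states: read `ss -tan` output on stdin, print "STATE COUNT"
# sorted by count, highest first.
count_tcp_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort -k2,2nr
}

# tw_level COUNT: map a TIME_WAIT count to the guideline levels above.
tw_level() {
    if [ "$1" -gt 50000 ]; then
        echo optimize
    elif [ "$1" -gt 20000 ]; then
        echo warning
    else
        echo ok
    fi
}
```

Typical use: ss -tan | count_tcp_states, then pass the TIME-WAIT count to tw_level.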
Network traffic (sar)
# Interface stats (real‑time)
sar -n DEV 1 5
# Errors
sar -n EDEV 1 5
# TCP stats
sar -n TCP,ETCP 1 5
Key metrics:
rxkB/s / txkB/s – bandwidth, to be compared with the NIC rating
rxpck/s / txpck/s – packet rate (10 GbE tops out at about 14.88 Mpps with minimum‑size 64‑byte frames)
%ifutil – interface utilization
retrans/s persistently > 0 → packet loss or congestion
Protocol‑stack counters (nstat)
nstat -az | grep -E "TcpRetransSegs|TcpExtTCPLostRetransmit|TcpExtListenOverflows|TcpExtListenDrops|TcpExtTCPAbortOnMemory"
Important counters:
TcpRetransSegs – a growing value indicates packet loss
TcpExtListenOverflows / TcpExtListenDrops – full listen queue; increase somaxconn and the application backlog
TcpExtTCPAbortOnMemory – connections aborted due to memory shortage
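Since these counters matter only when they grow, two samples taken some seconds apart can be compared. The sketch below assumes each sample was saved with nstat -az > file (counter name in column 1, absolute value in column 2); nstat_delta is a hypothetical helper name.

```shell
#!/bin/sh
# nstat_delta BEFORE AFTER: print the counters that increased between two
# saved `nstat -az` samples, with the size of the increase.
nstat_delta() {
    awk 'NR == FNR { if ($1 !~ /^#/) before[$1] = $2; next }
         $1 !~ /^#/ && ($1 in before) && $2 + 0 > before[$1] + 0 {
             printf "%s +%d\n", $1, $2 - before[$1]
         }' "$1" "$2"
}
```

Typical use: nstat -az > /tmp/n1; sleep 10; nstat -az > /tmp/n2; nstat_delta /tmp/n1 /tmp/n2 | grep -E "Retrans|Listen|AbortOnMemory".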
Packet capture (tcpdump)
# Capture 10 000 packets on port 80
tcpdump -i eth0 port 80 -w /tmp/capture.pcap -c 10000
# Capture traffic to a specific IP
tcpdump -i eth0 host 10.0.0.1 -nn
# Show only SYN packets (connection setup)
tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0' -nn
# Show only RST packets (connection reset)
tcpdump -i eth0 'tcp[tcpflags] & tcp-rst != 0' -nn
Warning: Always limit the packet count (-c) or the capture duration (e.g., -G <seconds> together with -W 1 when writing with -w) in production to avoid filling disks.
NIC‑level checks
# Error statistics
ethtool -S eth0 | grep -i error
# Drop statistics
ethtool -S eth0 | grep -i drop
# Ring buffer size
ethtool -g eth0
# Soft‑interrupt stats per CPU
cat /proc/net/softnet_stat
/proc/net/softnet_stat columns (hexadecimal, one row per CPU):
1st – processed packets
2nd – backlog drops (increase netdev_max_backlog)
3rd – time squeeze, i.e. budget exits (netdev_budget exhausted)
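Because the values are hexadecimal, a small decoder makes them easier to read. In the kernel's output the first three fields are processed, dropped and time_squeeze, in that order. The function name decode_softnet is hypothetical, and the row index is used as the CPU number, which is an approximation when CPUs are offline.

```shell
#!/bin/sh
# decode_softnet: read /proc/net/softnet_stat-style text on stdin and print
# the first three counters of every row in decimal.
decode_softnet() {
    cpu=0
    while read -r processed dropped squeezed _rest; do
        # printf accepts C-style hex constants, so prefix the fields with 0x.
        printf 'cpu%d processed=%d dropped=%d time_squeeze=%d\n' \
            "$cpu" "0x$processed" "0x$dropped" "0x$squeezed"
        cpu=$((cpu + 1))
    done
}
```

Typical use: decode_softnet < /proc/net/softnet_stat. A growing dropped value points at netdev_max_backlog, a growing time_squeeze value at netdev_budget.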
Automation Scripts
One‑click performance snapshot
#!/bin/bash
set -euo pipefail
OUTPUT_DIR=${1:-/tmp/perf_snapshot_$(date +%Y%m%d_%H%M%S)}
mkdir -p "$OUTPUT_DIR"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
collect(){
name=$1; shift
log "Collecting $name ..."
"$@" > "$OUTPUT_DIR/${name}.txt" 2>&1 || echo "Failed $name: $?" >> "$OUTPUT_DIR/errors.log"
}
log "Starting collection, output: $OUTPUT_DIR"
collect "os_release" cat /etc/os-release
collect "kernel" uname -a
collect "uptime" uptime
collect "date" date '+%Y-%m-%d %H:%M:%S %Z'
collect "hostname" hostname -f
collect "top_snapshot" bash -c "top -bn1 | head -50"
collect "mpstat" mpstat -P ALL 1 5
collect "pidstat_cpu" pidstat -u 1 5
collect "pidstat_ctx" pidstat -w 1 5
collect "free" free -h
collect "meminfo" cat /proc/meminfo
collect "slabtop" slabtop -o
collect "smem" bash -c "smem -rkt -s pss 2>/dev/null || echo 'smem not installed'"
collect "iostat" iostat -xz 1 5
collect "pidstat_io" pidstat -d 1 5
collect "df" df -hT
collect "df_inode" df -i
collect "ss_summary" ss -s
collect "ss_established" bash -c "ss -tn state established | head -100"
collect "ss_time_wait" bash -c "ss -tan state time-wait | wc -l"
collect "ss_listen" ss -tlnp
collect "netstat_stats" bash -c "nstat -az 2>/dev/null || netstat -s"
collect "vmstat" vmstat 1 10
collect "dmesg_errors" bash -c "dmesg -T 2>/dev/null | tail -100"
collect "dmesg_oom" bash -c "dmesg | grep -i 'oom\|out of memory\|killed process' || echo 'No OOM events'"
collect "ps_aux" bash -c "ps aux --sort=-%mem | head -30"
collect "ps_d_state" bash -c "ps aux | awk 'NR == 1 || \$8 ~ /^D/'"
collect "journal_recent" bash -c "journalctl --since '10 minutes ago' --no-pager 2>/dev/null | tail -200 || tail -200 /var/log/syslog 2>/dev/null || echo 'No syslog access'"
log "Collection finished, files:"
ls -la "$OUTPUT_DIR"
# Package results
tar czf "${OUTPUT_DIR}.tar.gz" -C "$(dirname "$OUTPUT_DIR")" "$(basename "$OUTPUT_DIR")"
log "Packaged: ${OUTPUT_DIR}.tar.gz (size: $(du -sh "${OUTPUT_DIR}.tar.gz" | awk '{print $1}'))"
Continuous monitor script
#!/bin/bash
set -euo pipefail
INTERVAL=${1:-5}
DURATION=${2:-3600}
OUTPUT="/tmp/perf_monitor_$(date +%Y%m%d_%H%M%S).csv"
echo "timestamp,load1,load5,load15,cpu_us,cpu_sy,cpu_wa,cpu_st,mem_used_pct,swap_used_mb,disk_util,net_rx_kb,net_tx_kb,tcp_estab,tcp_tw,context_switch,interrupts" > "$OUTPUT"
END_TIME=$((SECONDS + DURATION))
while [ $SECONDS -lt $END_TIME ]; do
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
read LOAD1 LOAD5 LOAD15 <<<$(awk '{print $1,$2,$3}' /proc/loadavg)
read CPU_US CPU_SY CPU_WA CPU_ST <<<$(vmstat 1 2 | tail -1 | awk '{print $13,$14,$16,$17}')
MEM_USED_PCT=$(free | awk '/Mem:/{printf "%.1f",($3/$2)*100}')
SWAP_USED=$(free -m | awk '/Swap:/{print $3}')
DISK_UTIL=$(iostat -xz 1 2 | awk '/^[a-z]/{if($NF+0>max) max=$NF+0} END{print max+0}')
read NET_RX NET_TX <<<$(sar -n DEV 1 1 2>/dev/null | awk '/^Average:/ && $2 != "IFACE" && $2 != "lo" {print $5,$6; exit}' || echo "0 0")
TCP_ESTAB=$(ss -tn state established 2>/dev/null | wc -l)
TCP_TW=$(ss -tan state time-wait 2>/dev/null | wc -l)
read CS INTR <<<$(vmstat 1 2 | tail -1 | awk '{print $12,$11}')
echo "$TIMESTAMP,$LOAD1,$LOAD5,$LOAD15,$CPU_US,$CPU_SY,$CPU_WA,$CPU_ST,$MEM_USED_PCT,$SWAP_USED,$DISK_UTIL,$NET_RX,$NET_TX,$TCP_ESTAB,$TCP_TW,$CS,$INTR" >> "$OUTPUT"
sleep $INTERVAL
done
echo "Monitoring finished, data file: $OUTPUT"
echo "Total records: $(wc -l < "$OUTPUT")"
Real‑World Cases
Case 1 – Java CPU 100 %
Scenario: an online Java service spikes to 100 % CPU, causing timeouts.
# Identify the Java process
top -bn1 | grep java # PID 12345 shows 398 % on a 4‑core box
# Find the hottest thread
top -Hp 12345 -bn1 | head -20 # Thread 12378 uses 98 %
# Convert thread ID to hex for jstack
printf "%x\n" 12378 # 305a
# Dump Java stack traces
jstack 12345 > /tmp/jstack_12345.txt
# Locate the thread in the dump
grep -A 30 "nid=0x305a" /tmp/jstack_12345.txt
Root cause: a vulnerable regular expression caused catastrophic backtracking (ReDoS) in validateEmail.
Resolution:
Temporary: restart the service.
Permanent: replace the regex or add timeout controls.
Prevention: run ReDoS tests on new regexes.
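The thread‑ID conversion used in this case is easy to wrap in a helper. jstack prints the native thread ID as a hexadecimal nid, so the decimal TID from top -H has to be converted first. The function name tid_to_nid is hypothetical.

```shell
#!/bin/sh
# tid_to_nid TID: convert a decimal Linux thread ID (as shown by top -H)
# into the "nid=0x..." form that appears in jstack output.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}
```

Typical use: grep -A 30 "$(tid_to_nid 12378)" /tmp/jstack_12345.txt.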
Advanced analysis – Java flame graph:
# Install perf‑map‑agent
git clone https://github.com/jvm-profiling-tools/perf-map-agent.git
cd perf-map-agent && cmake . && make
# Create symbol map for the PID
bin/create-java-perf-map.sh 12345
# Sample for 30 s
perf record -p 12345 -g -- sleep 30
# Generate flame graph
perf script | /opt/FlameGraph/stackcollapse-perf.pl | /opt/FlameGraph/flamegraph.pl > java_flame.svg
Case 2 – MySQL I/O Wait High
Scenario: iowait > 30 %, MySQL slow queries increase, API timeouts.
# Verify I/O bottleneck
iostat -xz 1 3 # sda %util 95 %, w_await 45 ms
# Identify MySQL I/O
iotop -oP -b -n 3 -d 1 # mysqld writes 180 MB/s
# Show running queries
mysql -e "SHOW PROCESSLIST\G" | grep -B5 "Query"
# Inspect slow‑query log
tail -100 /var/log/mysql/slow.log
# Analyze with pt‑query‑digest
pt-query-digest /var/log/mysql/slow.log --since '1h' | head -50
Finding: a full‑table scan on orders (5 M rows) took 12.5 s.
Fix:
Add composite index (status, create_time).
Increase InnoDB buffer pool (e.g., 6 GB on a 16 GB machine).
Tune innodb_io_capacity, innodb_io_capacity_max, and disable innodb_flush_neighbors on SSD.
Restart MySQL.
Case 3 – High Load, Low CPU Utilization
Scenario: load average > 50 on an 8‑core box, but top shows only 30 % CPU usage.
# vmstat shows many D‑state processes
vmstat 1 5 # column b constantly > 40
# List D‑state processes
ps aux | awk '$8~/^D/{print $0}'
# Disk I/O is saturated
iostat -xz 1 3 # sda %util 100 %, await 850 ms
# Identify I/O hogs
iotop -oP -b -n 3
# Find the culprit (e.g., 40 rsync jobs)
crontab -l # backup jobs run simultaneously
Root cause: backup tasks were not staggered, causing massive disk I/O and many processes entering D state.
Solution:
Temporary: kill some rsync processes.
Permanent: stagger backups (e.g., 5‑minute offsets).
Optimization: add --bwlimit=50000 to rsync to cap bandwidth.
Case 4 – Network Packet Loss
Scenario: application logs show many connection timeouts; occasional ping loss.
# Verify packet loss
ping -c 100 10.0.0.2 # 5 % loss
# NIC error stats
ethtool -S eth0 | grep -E "drop|error|fifo"
# Increase ring buffer if needed
ethtool -G eth0 rx 4096
# Softnet stats – backlog
cat /proc/net/softnet_stat
# Enable RPS to distribute soft interrupts
echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus
# Verify loss resolved
ping -c 100 10.0.0.2 # 0 % loss
Kernel Parameter Tuning (Production‑Validated)
Backup current values before changing:
sysctl -a > /tmp/sysctl_backup_$(date +%Y%m%d).conf
Memory‑related parameters:
# Swappiness – tendency to use swap
sysctl -w vm.swappiness=10
# Dirty page limits – avoid large writeback bursts
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
# Overcommit – keep default (0) for most services
sysctl -w vm.overcommit_memory=0
Network‑related parameters:
# Listen queue size
sysctl -w net.core.somaxconn=65535
# SYN backlog
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# NIC receive queue length
sysctl -w net.core.netdev_max_backlog=50000
# TIME_WAIT reuse for outbound connections (requires tcp_timestamps; the NAT breakage applied to tcp_tw_recycle, which was removed in kernel 4.12)
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_max_tw_buckets=50000
# TCP keepalive
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=3
# TCP memory limits (example for 16 GB RAM)
sysctl -w net.ipv4.tcp_mem="262144 524288 786432"
# Socket buffers
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
File descriptor limits:
# System‑wide limit
sysctl -w fs.file-max=2097152
# Persist in /etc/sysctl.d/99-performance.conf
cat > /etc/sysctl.d/99-performance.conf <<'EOF'
vm.swappiness = 10
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 50000
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
fs.file-max = 2097152
EOF
sysctl -p /etc/sysctl.d/99-performance.conf
Security Hardening
Restrict unprivileged perf usage to user‑space sampling:
# 2 = unprivileged users may sample user space only (1 still permits kernel profiling)
echo 2 > /proc/sys/kernel/perf_event_paranoid
# Hide kernel pointers
echo 1 > /proc/sys/kernel/kptr_restrict
# Audit execution of perf and tcpdump
auditctl -a always,exit -F path=/usr/bin/perf -F perm=x -k perf_usage
auditctl -a always,exit -F path=/usr/sbin/tcpdump -F perm=x -k tcpdump_usage
Delete captured pcap files after analysis to avoid leaking sensitive data.
Continuous Monitoring Stack
Deploy node_exporter + Prometheus + Grafana for long‑term metrics.
# Install node_exporter 1.7.0
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Systemd unit
cat > /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
Type=simple
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes \
--collector.tcpstat \
--web.listen-address=:9100
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
useradd -r -s /sbin/nologin node_exporter
systemctl daemon-reload
systemctl enable --now node_exporter
Example Prometheus alert rules (node_alerts.yml):
groups:
  - name: node_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage too high on {{ $labels.instance }}"
          description: "CPU usage {{ $value | printf \"%.1f\" }}% for over 5 minutes"
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage too high on {{ $labels.instance }}"
          description: "Available memory <10%, current usage {{ $value | printf \"%.1f\" }}%"
      - alert: HighLoadAverage
        expr: node_load1 / count without(cpu,mode) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Load average too high on {{ $labels.instance }}"
          description: "1‑minute load > 2×CPU cores"
      - alert: HighDiskUtilization
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Disk I/O saturation on {{ $labels.instance }}"
          description: "Device {{ $labels.device }} utilization >90%"
      - alert: DiskSpaceRunningOut
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Mount {{ $labels.mountpoint }} usage {{ $value | printf \"%.1f\" }}%"
      - alert: HighNetworkErrors
        expr: rate(node_network_receive_errs_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Network receive errors on {{ $labels.instance }}"
          description: "Device {{ $labels.device }} error rate {{ $value | printf \"%.1f\" }}/s"
      - alert: SwapUsageIncreasing
        # SwapFree is a gauge, so use deriv(); rate() is for counters and never goes negative
        expr: deriv(node_memory_SwapFree_bytes[10m]) < -1048576
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Swap usage increasing on {{ $labels.instance }}"
          description: "Swap is continuously growing, possible memory leak"
Alert Thresholds
CPU usage (all non‑idle time) > 85 % for 5 min → HighCpuUsage
Load average > 2 × CPU cores for 5 min → HighLoadAverage
Memory available < 10 % → HighMemoryUsage
Swap usage growing → SwapUsageIncreasing
Disk %util > 90 % for 1 min → HighDiskUtilization
Disk await > 30 ms (HDD) / > 10 ms (SSD) → performance degradation
Network receive errors > 10 /s for 5 min → HighNetworkErrors
TCP retransmits > 1 % → investigate packet loss
TIME_WAIT > 50 000 → investigate connection handling
Context switches > 100 000 /s → possible CPU contention
Baseline Collection
Collect a 5‑minute baseline snapshot for later comparison:
#!/bin/bash
# collect_baseline.sh – weekly 5‑minute baseline
set -euo pipefail
BASELINE_DIR="/opt/perf_baseline"
DATE=$(date +%Y%m%d)
OUTPUT="$BASELINE_DIR/baseline_$DATE"
mkdir -p "$OUTPUT"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
log "Collecting baseline…"
vmstat 1 300 > "$OUTPUT/vmstat.txt" &
iostat -xz 1 300 > "$OUTPUT/iostat.txt" &
mpstat -P ALL 1 300 > "$OUTPUT/mpstat.txt" &
sar -n DEV 1 300 > "$OUTPUT/sar_net.txt" &
pidstat -u -d -r 5 60 > "$OUTPUT/pidstat.txt" &
wait
free -h > "$OUTPUT/free.txt"
df -hT > "$OUTPUT/df.txt"
ss -s > "$OUTPUT/ss.txt"
cat /proc/meminfo > "$OUTPUT/meminfo.txt"
sysctl -a > "$OUTPUT/sysctl.txt" 2>/dev/null
# Keep last 30 days
find "$BASELINE_DIR" -maxdepth 1 -type d -name "baseline_*" -mtime +30 -exec rm -rf {} +
log "Baseline saved to $OUTPUT"
Summary
The USE methodology provides a disciplined, three‑dimensional view (Utilization, Saturation, Errors) of every resource type. A quick 60‑second global scan (uptime, top, vmstat, dstat) points to the direction of the bottleneck; layered tools (mpstat, pidstat, perf, slabtop, iotop, blktrace, nstat, tcpdump) then pinpoint the exact cause. Production‑validated kernel parameters, cautious cache dropping, and controlled use of privileged tools keep the system stable. Continuous monitoring with node_exporter, Prometheus and well‑crafted alert rules catches regressions early, while baseline snapshots and historical sar data aid post‑mortem analysis. The case studies illustrate typical CPU, I/O and network problems and their step‑by‑step resolutions, giving a practical playbook for Linux performance engineering.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
