
Linux Performance Troubleshooting Cheat Sheet: Diagnose CPU, Memory, Disk & Network

This guide is a comprehensive, system-wide Linux performance troubleshooting cheat sheet covering CPU, memory, disk I/O, network, and process metrics. It collects over 20 essential commands with usage scenarios, key parameters, output interpretation, alert thresholds, and practical case studies to help you quickly pinpoint and resolve production issues.


Introduction

In production environments, performance problems can appear suddenly, causing slow responses, user complaints, and high server load. Operators must quickly identify whether the root cause is CPU saturation, memory leaks, disk I/O bottlenecks, or network congestion.

This article presents a systematic Linux performance troubleshooting cheat sheet covering CPU, memory, disk, network, and process dimensions, summarizing more than 20 practical commands and tools. Each tool includes usage scenarios, key parameters, output interpretation, and optimization guidance.

Applicable scenarios: emergency fault handling, routine inspection, performance tuning, capacity planning, and technical interview preparation.

Technical Background

Linux Performance Metric Model

Linux performance observation is commonly organized around the USE methodology (Utilization, Saturation, Errors) and the RED methodology (Rate, Errors, Duration).

CPU dimension: utilization, context switches, run queue, CPU cache hit rate

Memory dimension: physical memory usage, page cache, swap, memory reclaim pressure, OOM events

Disk I/O dimension: IOPS, throughput, I/O queue depth, average response time

Network dimension: bandwidth utilization, packet loss rate, retransmission rate, connection state distribution

Process dimension: process state, file descriptor usage, system call tracing

Core Content

1. CPU Performance Diagnosis

Quick load view

# View system load (1/5/15-minute averages)
uptime
# Output: load average: 2.15, 1.98, 1.75

# Real-time monitoring (press 1 to toggle the per-CPU view)
top
# Key metrics: %Cpu(s): us user | sy system | id idle | wa I/O wait | st steal

Per‑core statistics

# Install sysstat
sudo apt install sysstat -y   # Ubuntu/Debian
sudo yum install sysstat -y   # RHEL/CentOS

# Show all CPU cores every 2 seconds
mpstat -P ALL 2
# Sample output: CPU 0 %usr=78%, CPU 1 %usr=32.5%

Performance event sampling

# Sample CPU events for 30 seconds (requires root)
sudo perf record -F 99 -a -g -- sleep 30

# View the report
sudo perf report --stdio

# Generate a flame graph (requires the FlameGraph scripts)
sudo perf script | \
  FlameGraph/stackcollapse-perf.pl | \
  FlameGraph/flamegraph.pl > flame.svg

Alert thresholds

Load Average > number of CPUs → overload

%iowait > 20% → disk I/O bottleneck

%sy > 30% → possible kernel module issue
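
As a quick automated check, the first threshold can be scripted; below is a minimal sketch (the warning text and the strict ">" comparison are illustrative choices):

# Compare the 1-minute load average against the CPU core count
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
# awk handles the floating-point comparison that bash integer arithmetic cannot
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
  echo "WARNING: 1-min load ${load1} exceeds ${cores} cores"
fi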

2. Memory Management Diagnosis

Memory overview

# Show memory statistics (-h: human-readable)
free -h
# Sample output:
#        total  used   free   shared  buff/cache  available
# Mem:   15Gi   8.2Gi  1.5Gi  256Mi   5.8Gi       6.5Gi
# Focus on "available" (the truly usable memory, Linux 3.14+)

Process memory details

# Sort processes by memory usage
ps aux --sort=-%mem | head -20

# View a specific process's memory mapping (replace <PID>)
sudo cat /proc/<PID>/smaps_rollup

# Key fields:
# Rss: actual physical memory
# Pss: proportionally shared memory (more accurate)
# Private_Dirty: memory exclusively owned and modified (check for leaks)

Memory pressure monitoring

# Monitor memory and swap usage in real time
vmstat 3
# Key fields:
# si/so: swap in/out (non-zero indicates memory pressure)
# r: run queue (>2× CPU cores indicates CPU saturation)

Alert conditions

available < 10% and Swap used > 50% → investigate memory pressure

si/so continuously > 100 pages/s → need more memory

dmesg shows "OOM Killer" → severe memory shortage
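
The first condition can be checked directly from /proc/meminfo; a minimal sketch using the 10% threshold from the list above:

# Flag memory pressure when MemAvailable falls below 10% of MemTotal (values in kB)
read -r total avail < <(awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {print t, a}' /proc/meminfo)
pct=$(( avail * 100 / total ))
if [ "$pct" -lt 10 ]; then
  echo "WARNING: only ${pct}% of memory is available"
fi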

3. Disk I/O Performance

Disk I/O statistics

# Refresh extended statistics every 2 seconds
iostat -xz 2

# Key metric interpretation:
# %util: device utilization (>80% indicates saturation)
# await: average I/O response time (HDD <10ms, SSD <1ms)
# r/s + w/s: IOPS
# rrqm/s: read request merge rate (higher is better)

Process‑level I/O monitoring

# Show only processes actively performing I/O (-o)
sudo iotop -o
# Sample output: per-process read/write rates (KB/s) and I/O percentage

Disk benchmark

# Install fio
sudo apt install fio -y

# Sequential write test (in /tmp/lab, to avoid damaging production data)
sudo mkdir -p /tmp/lab
sudo fio --name=seqwrite --rw=write --bs=128k --size=1G \
  --numjobs=1 --runtime=60 --directory=/tmp/lab

# Clean up after the test
sudo rm -rf /tmp/lab

Performance baseline

SSD random read: >100k IOPS, latency <1ms

HDD sequential read: >500 IOPS, latency <10ms

%util sustained >80% → need capacity expansion or optimization
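
To measure a device against these baselines, a random-read fio job complements the sequential-write example above; a minimal sketch (job parameters such as iodepth=32 are illustrative):

# Random 4k read test; compare "read: IOPS=..." and the latency lines to the baselines
sudo mkdir -p /tmp/lab
sudo fio --name=randread --rw=randread --bs=4k --size=1G \
  --numjobs=1 --runtime=60 --iodepth=32 --ioengine=libaio \
  --direct=1 --directory=/tmp/lab
sudo rm -rf /tmp/lab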

4. Network Performance Diagnosis

Connection state

# Count TCP connections by state
ss -tan | awk '{print $1}' | sort | uniq -c
# Sample output:
# 45 ESTAB      (established)
# 12 TIME-WAIT  (waiting to close)
#  3 LISTEN     (listening)
# Alert condition: TIME-WAIT > 1000 or excessive CLOSE-WAIT
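
That alert condition can be scripted with ss's built-in state filter; a minimal sketch using the threshold from the comment above:

# Count TIME-WAIT sockets (tail skips the header line) and alert past the threshold
tw=$(ss -tan state time-wait | tail -n +2 | wc -l)
if [ "$tw" -gt 1000 ]; then
  echo "WARNING: ${tw} TIME-WAIT connections"
fi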

Traffic monitoring

# Install iftop
sudo apt install iftop -y

# Monitor eth0 traffic in real time
sudo iftop -i eth0
# Shows per-connection bandwidth usage (send/receive rates)

Packet capture

# Capture port-80 traffic (save to a pcap file, stop after 1000 packets)
sudo tcpdump -i eth0 port 80 -w /tmp/http.pcap -c 1000

# Read the capture back
tcpdump -r /tmp/http.pcap -nn | head -50

# Advanced filter: capture SYN packets from a specific IP
sudo tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0 and src host 192.168.1.100'

NIC hardware statistics

# View NIC statistics (drops, errors, collisions)
ethtool -S eth0 | grep -E 'drop|error|coll'

# View the negotiated link speed
ethtool eth0 | grep Speed
# Ensure 1000Mb/s or higher

5. Process & System Call Analysis

Process tree and state

# Show process hierarchy
pstree -p | grep <process_name>

# Find zombie processes
ps aux | awk '$8 ~ /Z/ {print}'
# State-Z processes cannot be killed directly; terminate their parent or fix its code

# Count a process's open file descriptors
ls /proc/<PID>/fd | wc -l

System call tracing

# Trace system calls of a running process (replace <PID>)
sudo strace -p <PID> -T -tt -e trace=open,read,write,close -o /tmp/strace.log

# Launch a program under tracing
strace -T -tt curl https://example.com 2>&1 | head -100

# Option notes:
# -T: show time spent in each syscall
# -tt: microsecond timestamps
# -e trace: limit tracing to the listed syscalls

Library function tracing

# Trace dynamic library function calls
sudo ltrace -p <PID> -o /tmp/ltrace.log
# Useful for diagnosing third-party library issues

6. System Log Analysis

Kernel logs

# View recent kernel messages (including hardware errors and OOM events)
sudo dmesg -T | tail -200

# Search for OOM Killer events
sudo dmesg -T | grep -i 'killed process'

# On systemd systems, use journalctl
sudo journalctl -k -p err -n 100

Application log locations

Ubuntu/Debian: /var/log/syslog, /var/log/auth.log
RHEL/CentOS: /var/log/messages, /var/log/secure

# Monitor the system log in real time
sudo tail -f /var/log/syslog

# Search for failed SSH logins
sudo grep 'Failed password' /var/log/auth.log | tail -50

7. Kernel Parameter Optimization

View and configure

# Show all kernel parameters
sysctl -a | less

# View a specific parameter
sysctl net.core.somaxconn

# Temporary change (lost on reboot)
sudo sysctl -w net.core.somaxconn=8192

# Permanent change
echo "net.core.somaxconn = 8192" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

Common tuning parameters

# TCP accept queue (listen backlog) limit (≥4096 recommended for high concurrency)
net.core.somaxconn = 8192

# Enable TIME-WAIT socket reuse for outbound connections
net.ipv4.tcp_tw_reuse = 1

# File descriptor limit
fs.file-max = 1048576

# Virtual memory dirty page ratios (reduce I/O blocking)
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

Backup before changes

# Back up the sysctl configuration first
sudo cp /etc/sysctl.conf /etc/sysctl.conf.bak
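
On systemd-based distributions, a cleaner alternative to appending to /etc/sysctl.conf is a dedicated drop-in file under /etc/sysctl.d/, which is trivial to roll back; a minimal sketch (the file name is arbitrary):

# Write the tuning parameters to a drop-in file
sudo tee /etc/sysctl.d/99-perf-tuning.conf > /dev/null <<'EOF'
net.core.somaxconn = 8192
net.ipv4.tcp_tw_reuse = 1
EOF

# Load every sysctl configuration source (sysctl.d, sysctl.conf, ...)
sudo sysctl --system

# Roll back: sudo rm /etc/sysctl.d/99-perf-tuning.conf && sudo sysctl --system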

Practical Cases

Case 1: CPU spikes to 100%

Symptom: Server CPU usage suddenly reaches 100% and order processing times out.

Investigation steps

# Step 1: Locate offending process
top -c   # find java process, PID 1234, consuming 380% CPU

# Step 2: Thread‑level analysis
top -H -p 1234   # thread ID 1250 consumes 98%

# Step 3: Get thread stack
printf '%x\n' 1250   # convert thread ID 1250 to hex: 4e2
sudo jstack 1234 | grep -A 20 '0x4e2'   # shows infinite loop in coupon calculation

# Step 4: Performance sampling verification
sudo perf record -p 1234 -g -- sleep 10
sudo perf report --stdio | head -50

Optimization

Immediate: restart application, enable rate limiting

Code fix: optimise algorithm, add caching

Capacity planning: vertical scale to 8 cores, horizontal add 3 nodes

Result: CPU average drops from 95% to 45%; P99 response time improves from 4800 ms to 220 ms.

Case 2: Memory leak detection

Symptom: Service restarts after 3 days; logs show the OOM Killer firing.

Investigation steps

# Confirm OOM event
sudo dmesg -T | grep -i 'killed process'

# Memory usage tracking (every 5 s; --no-headers keeps the log machine-parseable)
while true; do
  ps -p 5678 -o pid,vsz,rss,%mem,cmd --no-headers >> /tmp/mem_monitor.log
  sleep 5
done

# Analyse the trend: column 3 is RSS in kB (here it grew from 2 GB to 14 GB)
awk '{print $3}' /tmp/mem_monitor.log | head -100

# Detailed memory mapping
sudo cat /proc/5678/smaps_rollup
# Private_Dirty: 13 GB → abnormal

# Use valgrind to detect leaks
valgrind --leak-check=full ./video_service

Fix: release buffers with av_free() after use, and set a systemd memory limit (MemoryMax=) as a safety net.

Result: Service runs stably for 30 days, memory stays around 3.5 GB, and OOM events are eliminated.

Case 3: Disk I/O bottleneck

Symptom: Database query latency rises from 50 ms to 2 s and the slow-query count surges.

Investigation steps

# I/O statistics
iostat -xz 2
# %util: 99.8, await: 152.3 ms (abnormal)

# Locate I/O‑heavy process
sudo iotop -o   # TID 3456 mysqld reads 45.2 M/s

# Analyse slow queries
sudo tail -100 /var/log/mysql/mysql-slow.log
# many full table scans, missing indexes

# Block device tracing (10 s sample)
sudo blktrace -d /dev/sda -o - | blkparse -i - | head -200

Optimization

Add appropriate indexes (ALTER TABLE ADD INDEX)

Migrate data directory to SSD

Change I/O scheduler: echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

Result: Disk %util drops from 99.8% to 35%; P95 query latency improves from 1800 ms to 55 ms.

Best‑Practice Checklist

Establish performance baselines during low‑traffic periods.

Follow layered diagnosis: macro (load/top) then micro (perf/strace).

Preserve fault evidence before restarting services.

Standardise tool versions across hosts (e.g., sysstat ≥ 12.x) so output formats are comparable.

Automate health checks with cron-collected sar data (30-day retention); see the sketch after this list.

Avoid destructive actions on production mounts; test in /tmp/lab.

Configure log rotation to prevent /var/log from filling disks.

Apply packet‑capture limits (count or duration) and obtain approvals for sensitive environments.

Follow kernel‑parameter change workflow: backup → validate → apply → monitor 24 h.

Cross‑validate metrics (perf vs sar) before conclusions.

Trigger capacity planning when resource usage exceeds 70%.

Implement tiered alerting: P0 immediate, P1/P2 graded response.

Document common troubleshooting paths in a runbook.

Enforce least‑privilege sudo policies.

Conduct quarterly performance‑failure drills.
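
For the sar-collection item above, a minimal sketch assuming Debian/Ubuntu file locations (RHEL/CentOS enables collection through the sysstat service and stores data under /var/log/sa instead):

# Enable sysstat's periodic sar collector
sudo sed -i 's/^ENABLED=.*/ENABLED="true"/' /etc/default/sysstat

# Keep 30 days of history
sudo sed -i 's/^HISTORY=.*/HISTORY=30/' /etc/sysstat/sysstat
sudo systemctl restart sysstat

# Later: review CPU history from a given day's data file (saDD, DD = day of month)
sar -u -f /var/log/sysstat/sa01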

Summary & Outlook

This guide systematically outlines Linux performance troubleshooting tools and methodologies, providing three complete case studies to help operators quickly pinpoint issues.

Key takeaways:

Apply USE/RED methodology for layered diagnosis.

Monitor critical metric thresholds.

Tailor optimisations to business scenarios.

Keep the monitoring toolchain up‑to‑date.

Technical trends:

eBPF adoption: bpftrace replaces traditional tools without kernel changes.

Observability standardisation: OpenTelemetry integrates metrics, logs, and traces.

AI‑assisted diagnosis: anomaly detection based on historical data.

Container‑native monitoring: cAdvisor for Kubernetes environments.
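
As a taste of that eBPF tooling, a classic bpftrace one-liner traces every file opened system-wide, with no kernel changes or service restarts (requires root and the bpftrace package):

# Print the process name and path for each openat() syscall
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'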

FAQ

Q1: What Load Average is considered abnormal? Normal: single‑core < 1.0, multi‑core < core count (e.g., 4‑core system Load < 4.0). High %wa indicates I/O bottleneck rather than CPU saturation.

Q2: Does 90% memory usage require scaling? Not necessarily. Check the available field (>10% is healthy) and swap usage. Persistent growth with non‑zero si/so suggests scaling.

Q3: How to choose an I/O scheduler? SSD/NVMe: none or mq-deadline. HDD: mq-deadline or bfq. Databases: mq-deadline for balanced throughput and latency.

Q4: Does perf sampling affect production performance? Minimal impact; keep frequency ≤ 99 Hz (overhead < 1 % CPU) and sample for 30–60 seconds.

Q5: How to optimise excessive TIME‑WAIT connections? Enable TIME‑WAIT reuse: sudo sysctl -w net.ipv4.tcp_tw_reuse=1 (note: this only affects connections the host initiates).

Q6: How to measure fault‑handling effectiveness? MTTD < 2 min, MTTR < 15 min, impact < 5 % of users, zero repeat incidents.
