
Essential Linux Performance Troubleshooting Cheat Sheet: From CPU to Network

This guide provides a systematic Linux performance troubleshooting cheat sheet covering CPU, memory, disk I/O, network, processes, system calls, logs, and kernel parameters, complete with over 20 practical commands, real‑world case studies, best‑practice checklists, and an FAQ to help ops engineers quickly pinpoint and resolve performance bottlenecks.

MaGe Linux Operations

Oct 16, 2025

Introduction

In production, performance problems can appear suddenly and show up as slow responses, user complaints, and high server load. Operators must locate the root cause quickly, whether it is CPU saturation, a memory leak, a disk I/O bottleneck, or network congestion.

This article offers a systematic Linux performance troubleshooting cheat sheet covering CPU, memory, disk, network, and process dimensions, summarizing more than 20 practical commands and tools. Each tool includes usage scenarios, key parameters, output interpretation, and optimization guidance to help you move from symptoms to root cause.

Technical Background

Linux Performance Metric Models

Linux performance observation follows the USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration) methodologies:

CPU dimension: utilization, context switches, run queue, CPU cache hit rate

Memory dimension: physical memory usage, Page Cache, Swap, memory pressure, OOM events

Disk I/O dimension: IOPS, throughput, queue depth, average response time

Network dimension: bandwidth utilization, packet loss, retransmission rate, connection state distribution

Process dimension: process state, file descriptor usage, system call tracing

Core Content

1. CPU Performance Diagnosis

Quick load view

# Show system load (1/5/15-minute averages)
uptime
# Example output: load average: 2.15, 1.98, 1.75

# Real-time monitoring (press 1 to toggle the per-CPU view)
top
# Key fields: %Cpu(s): us user | sy system | id idle | wa I/O wait | st steal

Per‑core statistics

# Install sysstat
sudo apt install sysstat -y   # Ubuntu/Debian
sudo yum install sysstat -y   # RHEL/CentOS

# Show all CPU cores every 2 seconds
mpstat -P ALL 2
# Example output: CPU 0 %usr=78%, CPU 1 %usr=32.5%

Performance event sampling

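# Sample CPU events for 30 seconds (requires root)
sudo perf record -F 99 -a -g -- sleep 30

# View the report
sudo perf report --stdio

# Generate a flame graph (assumes the FlameGraph scripts are checked out in ./FlameGraph)
sudo perf script | \
  FlameGraph/stackcollapse-perf.pl | \
  FlameGraph/flamegraph.pl > flame.svg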

Key alert thresholds

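Load Average > number of CPU cores → overload (on Linux the load figure also counts tasks blocked on I/O, so check %wa as well)

%iowait > 20% → likely disk I/O bottleneck

%sy > 30% → heavy kernel-side work, such as frequent system calls, context switching, or a misbehaving kernel module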
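The load threshold above can be checked mechanically. The following is a minimal sketch, not a monitoring tool: the function name load_status is hypothetical, and it only compares a load figure against a core count.

```shell
# load_status LOAD CORES
# Prints "overload" when LOAD exceeds CORES, otherwise "ok".
# awk does the floating-point comparison that plain sh arithmetic cannot.
load_status() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (load + 0 > cores + 0) print "overload"; else print "ok"
  }'
}

# The 1-minute load from the uptime sample, on a hypothetical 2-core host
load_status 2.15 2    # prints: overload
```

On a live host, one way to feed it is load_status "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)".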

2. Memory Management Diagnosis

Memory overview

# Show memory statistics (-h for human-readable units)
free -h
# Focus on "available" (memory that new workloads can actually use)

Process‑level memory details

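# Sort processes by memory usage
ps aux --sort=-%mem | head -20

# View the memory summary of a specific process (replace <PID>)
sudo cat /proc/<PID>/smaps_rollup
# Key fields: Rss (resident physical memory), Pss (shared pages split proportionally), Private_Dirty (private modified memory)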

Memory pressure monitoring

# Monitor memory and Swap activity in real time
vmstat 3
# Key fields: si/so (Swap In/Out), r (run queue; > 2 × CPU cores indicates CPU saturation)

Alert conditions

available < 10% and Swap used > 50% → memory pressure

si/so persistently > 100 (vmstat reports these in KiB/s by default) → memory expansion needed

dmesg shows OOM Killer messages ("Killed process ...") → severe memory shortage
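The "available < 10%" condition can be computed directly from /proc/meminfo. This is a small sketch under the assumption that the kernel exposes the MemAvailable field; the function name mem_available_pct and the sample values are illustrative.

```shell
# mem_available_pct: read /proc/meminfo-formatted text on stdin and print
# MemAvailable as a percentage of MemTotal, with one decimal place.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%.1f\n", avail * 100 / total
      else exit 1
    }'
}

# Illustrative sample (values in kB, as in /proc/meminfo)
mem_available_pct <<'EOF'
MemTotal:       16000000 kB
MemFree:          400000 kB
MemAvailable:    1200000 kB
EOF
# prints: 7.5
```

On a live host the same function can read the real file: mem_available_pct < /proc/meminfo.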

3. Disk I/O Performance

Disk I/O statistics

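# Refresh extended statistics every 2 seconds
iostat -xz 2
# Key metrics: %util (device utilization; >80% suggests saturation), await (average response time), r/s + w/s (IOPS), rrqm/s (merged read requests per second)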

Process‑level I/O monitoring

# Show which processes are doing I/O (-o lists only those with active I/O)
sudo iotop -o
# Example output: per-process read/write rates (KB/s) and I/O percentage

Disk benchmark

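# Install fio
sudo apt install fio -y

# Sequential write test (in /tmp/lab, to avoid damaging production data)
# Note: if /tmp is a tmpfs mount, this measures RAM rather than disk
sudo mkdir -p /tmp/lab
sudo fio --name=seqwrite --rw=write --bs=128k --size=1G \
  --numjobs=1 --runtime=60 --directory=/tmp/lab

# Clean up after the test
sudo rm -rf /tmp/lab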

Performance baselines

SSD random read: >100k IOPS, latency <1 ms

HDD sequential read: >500 IOPS, latency <10 ms

%util persistently >80% → add capacity or optimize the workload
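To apply the %util threshold to iostat output without reading every row by eye, a filter along these lines may help. It is a sketch only: the function name busy_devices is hypothetical, and the sample table is abbreviated and illustrative (real iostat -x output has many more columns). The %util column is located by header name because its position differs between sysstat versions.

```shell
# busy_devices THRESHOLD: read `iostat -x` style text on stdin and print
# "device %util" for every device whose %util exceeds THRESHOLD.
busy_devices() {
  awk -v limit="$1" '
    NF == 0 { col = 0; next }         # a blank line ends the device table
    /%util/ {                         # header row: remember the %util column
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
  '
}

# Abbreviated, illustrative device table
busy_devices 80 <<'EOF'
Device   r/s    w/s    await  %util
sda      12.0   850.0  152.3  99.8
sdb      3.0    5.0    1.2    4.1
EOF
# prints: sda 99.8
```

A live invocation could look like iostat -xz 2 3 | busy_devices 80; keep in mind that the first iostat report shows averages since boot rather than the current interval.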

4. Network Performance Diagnosis

Connection state overview

# Count TCP connections by state (NR > 1 skips the header row)
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c
# Example: 45 ESTAB, 12 TIME-WAIT, 3 LISTEN
# Alert: TIME-WAIT > 1000, or a large number of CLOSE-WAIT
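The alert rule in the last comment can be turned into a small filter. This is a sketch: the function name tcp_state_alerts is hypothetical, and because the article gives no number for "too many CLOSE-WAIT", the second limit is an assumption the caller must choose.

```shell
# tcp_state_alerts TW_LIMIT CW_LIMIT
# Reads `ss -tan` output on stdin, counts sockets per state, and prints a
# warning for TIME-WAIT above TW_LIMIT or CLOSE-WAIT above CW_LIMIT.
tcp_state_alerts() {
  awk -v tw="$1" -v cw="$2" '
    $1 == "State" { next }            # skip the ss header row
    NF > 0 { count[$1]++ }
    END {
      if (count["TIME-WAIT"] + 0 > tw + 0)
        print "WARN TIME-WAIT " count["TIME-WAIT"]
      if (count["CLOSE-WAIT"] + 0 > cw + 0)
        print "WARN CLOSE-WAIT " count["CLOSE-WAIT"]
    }'
}
```

Example use, with 100 as an arbitrary CLOSE-WAIT limit: ss -tan | tcp_state_alerts 1000 100. No output means both counts are within limits.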

Traffic monitoring

# Install iftop
sudo apt install iftop -y
# Monitor eth0 traffic in real time
sudo iftop -i eth0
# Shows the send/receive rate of each connection

Packet capture

# Capture traffic on port 80 (save as pcap, stop after 1000 packets)
sudo tcpdump -i eth0 port 80 -w /tmp/http.pcap -c 1000
# View the first 50 records
tcpdump -r /tmp/http.pcap -nn | head -50
# Advanced filter example: capture SYN packets from a specific IP
sudo tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0 and src host 192.168.1.100'

NIC hardware statistics

# View NIC statistics (drops, errors, collisions)
ethtool -S eth0 | grep -E 'drop|error|coll'
# View the negotiated link speed
ethtool eth0 | grep Speed
# Expect 1000Mb/s or higher
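Driver counter lists can run to hundreds of lines, most of them zero. The sketch below keeps only the non-zero drop, error, and collision counters. The function name nonzero_nic_errors is hypothetical, counter names vary by driver, and the sample values are illustrative.

```shell
# nonzero_nic_errors: read `ethtool -S` output on stdin and print only the
# drop/error/collision counters whose value is not zero, as name=value.
nonzero_nic_errors() {
  awk -F ':' '
    tolower($1) ~ /drop|err|coll/ && $2 + 0 != 0 {
      name = $1; value = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim surrounding whitespace
      gsub(/^[ \t]+|[ \t]+$/, "", value)
      print name "=" value
    }'
}

# Illustrative sample
nonzero_nic_errors <<'EOF'
NIC statistics:
     rx_packets: 1200345
     rx_dropped: 12
     tx_errors: 0
EOF
# prints: rx_dropped=12
```

On a live host: ethtool -S eth0 | nonzero_nic_errors. Running it twice a few seconds apart shows whether the counters are still increasing.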

5. Process and System Call Analysis

Process tree and state

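# Show the process hierarchy
pstree -p | grep <process_name>
# Find zombie processes
ps aux | awk '$8 ~ /Z/ {print}'
# Count the file descriptors held by a process (plain ls avoids counting the "total", "." and ".." lines)
sudo ls /proc/<PID>/fd | wc -l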
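A raw descriptor count is easier to judge next to the process limit. The sketch below relates the two, assuming the usual /proc/<PID>/limits layout where the soft limit is the fourth field of the "Max open files" row. The function name fd_usage_pct and its optional second argument are hypothetical; the argument exists only so the function can be tried against a copy of the proc tree.

```shell
# fd_usage_pct PID [PROC_ROOT]
# Prints open file descriptors as a whole-number percentage of the soft
# "Max open files" limit. PROC_ROOT defaults to /proc.
fd_usage_pct() {
  root=${2:-/proc}
  used=$(ls "$root/$1/fd" | wc -l | tr -d ' ')
  soft=$(awk '/^Max open files/ { print $4 }' "$root/$1/limits")
  case $soft in
    ''|0|*[!0-9]*)                      # missing, zero, or "unlimited"
      echo "no usable soft limit for PID $1" >&2
      return 1 ;;
  esac
  echo $(( used * 100 / soft ))
}
```

Reading another user's /proc/<PID>/fd requires root, so the function is best run from a root shell in that case. A value that keeps climbing toward 100 suggests a descriptor leak.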

System call tracing

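# Trace system calls of a running process (replace <PID>)
# openat is included because modern glibc opens files through it
sudo strace -p <PID> -T -tt -e trace=open,openat,read,write,close -o /tmp/strace.log
# Launch a program and trace it
strace -T -tt curl https://example.com 2>&1 | head -100
# Options: -T shows time spent in each call, -tt prints microsecond timestamps, -e trace selects system calls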

Library call tracing

# Trace dynamic library function calls
sudo ltrace -p <PID> -o /tmp/ltrace.log
# Useful for troubleshooting third-party library issues

6. System Log Analysis

Kernel logs

# View recent kernel messages (including hardware errors and OOM events)
sudo dmesg -T | tail -200
# Search for OOM Killer events
sudo dmesg -T | grep -i 'killed process'
# On systemd systems, use journalctl
sudo journalctl -k -p err -n 100

Application log locations

Ubuntu/Debian: /var/log/syslog, /var/log/auth.log

RHEL/CentOS: /var/log/messages, /var/log/secure

# Follow the system log in real time
sudo tail -f /var/log/syslog
# Search for failed SSH logins
sudo grep 'Failed password' /var/log/auth.log | tail -50

7. Kernel Parameter Optimization

View and configure

# Show all kernel parameters
sysctl -a | less
# View a specific parameter
sysctl net.core.somaxconn
# Temporary change (lost on reboot)
sudo sysctl -w net.core.somaxconn=8192
# Permanent change
echo "net.core.somaxconn = 8192" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

Common tuning parameters

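# TCP listen queue size (≥4096 recommended for high concurrency)
net.core.somaxconn = 8192
# Enable TIME-WAIT reuse (applies to outgoing connections)
net.ipv4.tcp_tw_reuse = 1
# System-wide file descriptor limit
fs.file-max = 1048576
# Dirty page ratios for virtual memory (reduce I/O stalls)
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5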

Backup before changes

sudo cp /etc/sysctl.conf /etc/sysctl.conf.bak
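Appending with tee -a adds a duplicate line every time the same change is applied. One way to keep a configuration file to a single assignment per key is sketched below. The function name set_conf_key is hypothetical, and the sketch assumes the file already exists and is writable by the caller.

```shell
# set_conf_key FILE KEY VALUE
# Rewrites FILE so that KEY appears exactly once, as "KEY = VALUE".
# Earlier assignments of KEY are dropped; all other lines keep their order.
set_conf_key() {
  file=$1
  key=$2
  value=$3
  tmp=$(mktemp) || return 1
  awk -v key="$key" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)          # ignore leading whitespace
      split(line, parts, /[ \t]*=/)     # text before the first "=" is the name
      if (parts[1] == key) next         # drop the old assignment of this key
      print
    }' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
  printf '%s = %s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"                  # overwrite in place, keeping owner and mode
  rm -f "$tmp"
}
```

Trying it on a scratch copy first is the safer path, for example cp /etc/sysctl.conf /tmp/sysctl.test followed by set_conf_key /tmp/sysctl.test net.core.somaxconn 8192. Commented-out lines such as "# net.core.somaxconn = 1024" are left untouched because their name does not match the key exactly.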

Practical Cases

Case 1: CPU Spike to 100%

Symptom: Server CPU usage jumps to 100%, and order processing times out.

Investigation steps

# Step 1: Locate offending process
top -c   # finds java PID 1234 using 380%

# Step 2: Thread-level analysis
top -H -p 1234   # thread 1250 uses 98%

# Step 3: Get thread stack
printf '%x\n' 1250   # prints 4e2, matching nid=0x4e2 in jstack output
sudo jstack 1234 | grep -A 20 '0x4e2'   # shows infinite loop in coupon calculation

# Step 4: Performance sampling verification
sudo perf record -p 1234 -g -- sleep 10
sudo perf report --stdio | head -50
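Step 3 can be wrapped in two small helpers. This is a sketch that assumes jstack prints the native thread ID in hexadecimal as nid=0x..., as in this case, and separates thread stanzas with blank lines. The function names tid_to_nid and stack_for_nid are hypothetical.

```shell
# tid_to_nid TID: print the decimal thread ID from top -H in the
# hexadecimal form used by the nid= field.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# stack_for_nid NID: read a thread dump on stdin and print only the stanza
# whose header carries nid=NID. Paragraph mode (RS = "") treats each
# blank-line-separated stanza as one record, so the whole stack is printed
# instead of a fixed number of lines.
stack_for_nid() {
  awk -v nid="nid=$1 " 'BEGIN { RS = "" } index($0, nid) { print }'
}

tid_to_nid 1250    # prints: 0x4e2
```

With the PIDs from this case, the two combine as sudo jstack 1234 | stack_for_nid "$(tid_to_nid 1250)". The trailing space in the match keeps 0x4e2 from also matching a longer ID such as 0x4e2f.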

Optimization

Immediate: restart service, enable rate limiting

Code fix: optimise algorithm, add caching

Capacity planning: vertical scale to 8 cores, add 3 horizontal nodes

Result: CPU average dropped from 95% to 45%; P99 latency reduced from 4800 ms to 220 ms.

Case 2: Memory Leak Detection

Symptom: Service restarts after 3 days, OOM Killer triggered.

Investigation steps

# Confirm OOM event
sudo dmesg -T | grep -i 'killed process'

# Record memory usage every 5 s (stop with Ctrl-C; --no-headers keeps repeated header rows out of the log)
while true; do
  ps --no-headers -p 5678 -o pid,vsz,rss,%mem,cmd >> /tmp/mem_monitor.log
  sleep 5
done

# Analyse trend (RSS grew from 2 GB to 14 GB)
awk '{print $3}' /tmp/mem_monitor.log | head -100

# Detailed memory map
sudo cat /proc/5678/smaps_rollup
# Private_Dirty shows 13 GB → abnormal

# Use valgrind to detect leaks
valgrind --leak-check=full ./video_service

Fix: Call av_free() after buffer use and set systemd memory limits.

Result: After 30 days the process stays at ~3.5 GB, OOM events eliminated.
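The trend in /tmp/mem_monitor.log can be summarized instead of read line by line. The sketch below assumes the ps column order used in this case (pid, vsz, rss, %mem, cmd) with RSS in KiB, and it skips any header rows that ended up in the log. The function name rss_growth and the sample values are illustrative.

```shell
# rss_growth INTERVAL_SECONDS: read ps samples on stdin and print the first
# RSS, the last RSS, and the average growth in KiB per hour.
rss_growth() {
  awk -v step="$1" '
    $3 ~ /^[0-9]+$/ {                 # keep numeric RSS samples, skip headers
      if (n == 0) first = $3
      last = $3
      n++
    }
    END {
      if (n < 2) { print "need at least two samples"; exit 1 }
      hours = (n - 1) * step / 3600
      printf "first=%d last=%d growth_kib_per_hour=%.0f\n", first, last, (last - first) / hours
    }'
}

# Illustrative samples taken 1800 s apart
rss_growth 1800 <<'EOF'
5678 9000000 2000000 12.5 ./video_service
5678 9000000 2100000 13.1 ./video_service
5678 9000000 2200000 13.7 ./video_service
EOF
# prints: first=2000000 last=2200000 growth_kib_per_hour=200000
```

For the log recorded above, the call would be rss_growth 5 < /tmp/mem_monitor.log. A steady positive growth rate under constant traffic is the pattern that points to a leak rather than normal cache warm-up.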

Case 3: Disk I/O Bottleneck

Symptom: Database query latency rises from 50 ms to 2 s, slow‑query count spikes.

Investigation steps

# I/O stats
iostat -xz 2
# %util 99.8, await 152.3 ms → abnormal

# Locate I/O heavy process
sudo iotop -o
# TID 3456 mysqld reads 45.2 M/s

# Analyse slow queries
sudo tail -100 /var/log/mysql/mysql-slow.log
# Many full table scans, missing indexes

# Block device tracing (10 s sample; -w sets the run time in seconds)
sudo blktrace -d /dev/sda -w 10 -o - | blkparse -i - | head -200

Optimization

Add database indexes (ALTER TABLE ADD INDEX)

Move data directory to SSD

Change I/O scheduler: echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

Result: Disk %util dropped to 35%; P95 query latency improved from 1800 ms to 55 ms.
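After a scheduler change it is worth confirming which scheduler is actually active. The kernel lists the available schedulers in the queue/scheduler file and marks the active one with square brackets. The sketch below extracts that name; the function name active_scheduler is hypothetical.

```shell
# active_scheduler: read the content of a queue/scheduler file on stdin and
# print the scheduler currently in use (the name inside square brackets).
active_scheduler() {
  sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

echo 'none [mq-deadline] kyber bfq' | active_scheduler    # prints: mq-deadline
```

For the device in this case: active_scheduler < /sys/block/sda/queue/scheduler. A change written to this file does not survive a reboot, so it needs to be reapplied at boot if it should persist.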

Best‑Practice Checklist

Establish performance baseline: collect CPU/memory/I/O metrics during low‑traffic periods.

Layered investigation principle: start with the macro view (load/top), then drill down (perf/strace).

Preserve fault evidence: gather data before reboot.

Tool version management: install the same versions of sysstat (≥12.x), perf, and iotop on every host.

Automated inspection: schedule cron jobs to collect sar data, retain 30 days of history.

Avoid destructive actions: test in /tmp/lab, never on production mounts.

Log rotation configuration: prevent /var/log from filling up.

Network capture guidelines: limit packet count or duration, obtain approval for sensitive environments.

Kernel parameter change workflow: backup → validate → write → monitor 24 h.

Cross‑validation: combine perf/sar to confirm top results.

Capacity planning foresight: trigger expansion when resource usage >70%.

Alert severity tiers: P0 immediate, P1/P2 graded response.

Document troubleshooting paths: maintain runbooks for common issues.

Least‑privilege principle: grant only necessary sudo rights.

Regular drills: quarterly performance‑failure rehearsals.

Summary and Outlook

This article systematically outlines core Linux performance troubleshooting tools and methodologies, using three complete real‑world cases to help operations engineers quickly locate problems.

Key takeaways:

Follow USE/RED model for layered diagnosis.

Watch critical metric thresholds.

Tailor optimizations to business scenarios.

Keep the toolchain up‑to‑date.

Technical evolution trends:

eBPF adoption: bpftrace can stand in for many traditional tracing tools without kernel changes.

Observability standardization: OpenTelemetry integrates metrics, logs, traces.

AI‑assisted diagnosis: anomaly detection based on historical data.

Container‑native monitoring: cAdvisor for Kubernetes.

FAQ

Q1: How high is a problematic Load Average?

Normal: below 1.0 per core, so on a multi‑core system the load should stay below the number of cores (e.g., Load < 4.0 on a 4‑core system). Also check %wa: a high %wa points to an I/O bottleneck rather than CPU.

Q2: Does 90% memory usage require scaling?

Not necessarily. Check the available field (>10% is healthy) and Swap usage. Persistent growth with non‑zero si/so suggests scaling.

Q3: How to choose an I/O scheduler?

SSD/NVMe: none or mq-deadline

HDD: mq-deadline or bfq

Database workloads: mq-deadline (balances throughput and latency)

Q4: Does perf sampling affect production performance?

Impact is minimal; keep sampling frequency ≤99 Hz (overhead <1% CPU) and sample for 30–60 seconds.

Q5: How to optimise excessive TIME‑WAIT connections?

Enable port reuse: sudo sysctl -w net.ipv4.tcp_tw_reuse=1

Q6: How to measure fault‑handling effectiveness?

Use MTTD (<2 min), MTTR (<15 min), impact (<5% of users), and zero repeat‑failure rate.
