Essential Linux Performance Troubleshooting Cheat Sheet: From CPU to Network
This guide provides a systematic Linux performance troubleshooting cheat sheet covering CPU, memory, disk I/O, network, processes, system calls, logs, and kernel parameters, complete with over 20 practical commands, real‑world case studies, best‑practice checklists, and an FAQ to help ops engineers quickly pinpoint and resolve performance bottlenecks.
Introduction
In production environments, performance problems can appear suddenly and cause severe impact such as slow responses, user complaints, and high server load. Operators must locate the root cause quickly—whether it is CPU saturation, memory leaks, disk I/O bottlenecks, or network congestion.
This article offers a systematic Linux performance troubleshooting cheat sheet covering CPU, memory, disk, network, and process dimensions, summarizing more than 20 practical commands and tools. Each tool includes usage scenarios, key parameters, output interpretation, and optimization guidance to help you move from symptoms to root cause.
Technical Background
Linux Performance Metric Models
Linux performance observation typically follows the USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration) methodologies. The key metrics for each dimension are:
CPU dimension : utilization, context switches, run queue, CPU cache hit rate
Memory dimension : physical memory usage, Page Cache, Swap, memory pressure, OOM events
Disk I/O dimension : IOPS, throughput, queue depth, average response time
Network dimension : bandwidth utilization, packet loss, retransmission rate, connection state distribution
Process dimension : process state, file descriptor usage, system call tracing
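As a small illustration of the "utilization" part of USE, CPU busy time can be derived from two samples of the aggregate cpu line in /proc/stat. This is a minimal sketch, not part of the original article: the helper name cpu_busy_pct is hypothetical, and it assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# cpu_busy_pct SAMPLE1 SAMPLE2
# Hypothetical helper: busy percentage between two aggregate "cpu" lines
# taken from /proc/stat. Fields after the label are assumed to be:
# user nice system idle iowait irq softirq steal ...
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            idle = $5 + $6                  # idle + iowait count as not busy
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Usage on a live system (1-second interval chosen for illustration):
#   a=$(grep '^cpu ' /proc/stat); sleep 1; b=$(grep '^cpu ' /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

Working from two samples rather than one matters because the counters in /proc/stat are cumulative since boot.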
Core Content
1. CPU Performance Diagnosis
Quick load view
# Show system load (1/5/15-minute averages)
uptime
# Example output: load average: 2.15, 1.98, 1.75
# Real-time monitoring (press 1 to toggle the per-CPU view)
top
# Key metrics: %Cpu(s): us user | sy system | id idle | wa I/O wait | st steal
Per‑core statistics
# Install sysstat
sudo apt install sysstat -y # Ubuntu/Debian
sudo yum install sysstat -y # RHEL/CentOS
# Show all CPU cores every 2 seconds
mpstat -P ALL 2
# Example output: CPU 0 %usr=78%, CPU 1 %usr=32.5%
Performance event sampling
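# Sample CPU events for 30 seconds (requires root)
sudo perf record -F 99 -a -g -- sleep 30
# View the report
sudo perf report --stdio
# Generate a flame graph (assumes the FlameGraph scripts are available in ./FlameGraph)
sudo perf script | \
FlameGraph/stackcollapse-perf.pl | \
FlameGraph/flamegraph.pl > flame.svg
Key alert thresholds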
Load Average consistently > number of CPUs → overload
%iowait > 20% → likely disk I/O bottleneck
%sy > 30% → excessive kernel time; check system calls, interrupts, or a problematic kernel module
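The load threshold above can be checked mechanically. The following is a minimal sketch rather than a definitive check: load_status is a hypothetical helper name, and it applies only the simple "load greater than core count" rule from the list.

```shell
# load_status LOAD CORES
# Hypothetical helper: prints "overloaded" when LOAD exceeds CORES, else "ok".
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 > cores + 0) print "overloaded"; else print "ok"
    }'
}

# Usage on a live system (1-minute load average vs. online CPUs):
#   load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A single reading is only a hint; comparing the 1-, 5-, and 15-minute values shows whether the load is rising or already falling.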
2. Memory Management Diagnosis
Memory overview
# Show memory statistics (-h: human-readable)
free -h
# Focus on "available" (memory actually available to applications)
Process‑level memory details
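# Sort processes by memory usage
ps aux --sort=-%mem | head -20
# View the memory summary of a specific process (replace <PID>)
sudo cat /proc/<PID>/smaps_rollup
# Key fields: Rss (resident physical memory), Pss (shared pages split proportionally among sharers), Private_Dirty (private modified memory)
Memory pressure monitoring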
# Monitor memory and swap activity every 3 seconds
vmstat 3
# Key fields: si/so (swap in/out), r (run queue; > 2× CPU cores indicates CPU saturation)
Alert conditions
available < 10% and Swap used > 50% → memory pressure
si/so persistently > 100 per second (vmstat reports these in KiB/s by default) → need memory expansion
dmesg shows OOM Killer messages → severe memory shortage
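To apply the "available < 10%" condition without reading free output by eye, the ratio can be computed from /proc/meminfo. This is a sketch under stated assumptions: mem_available_pct is a hypothetical helper, and it relies on the MemTotal and MemAvailable fields that modern kernels expose.

```shell
# mem_available_pct
# Hypothetical helper: reads /proc/meminfo-style text on stdin and prints
# MemAvailable as a percentage of MemTotal.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) { print "error: MemTotal not found" > "/dev/stderr"; exit 1 }
            printf "%.1f\n", 100 * avail / total
        }'
}

# Usage on a live system:
#   mem_available_pct < /proc/meminfo
```

Reading from stdin keeps the helper easy to test against saved meminfo snapshots from an incident.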
3. Disk I/O Performance
Disk I/O statistics
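# Refresh extended statistics every 2 seconds
iostat -xz 2
# Key metrics: %util (device utilization; > 80% suggests saturation), await (average response time), r/s + w/s (IOPS), rrqm/s (read requests merged per second)
Process‑level I/O monitoring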
# Show only the processes that are actually doing I/O
sudo iotop -o
# Example output: per-process read/write rates (KB/s) and I/O percentage
Disk benchmark
# Install fio
sudo apt install fio -y
# Sequential write test (in /tmp/lab, to avoid damaging production data)
sudo mkdir -p /tmp/lab
sudo fio --name=seqwrite --rw=write --bs=128k --size=1G \
--numjobs=1 --runtime=60 --directory=/tmp/lab
# Clean up after the test
sudo rm -rf /tmp/lab
Performance baselines
SSD random read > 100k IOPS, latency < 1 ms
HDD sequential read > 500 IOPS, latency < 10 ms
%util persistently > 80% → need capacity or optimization
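The %util threshold can be applied to iostat output with a small filter. The sketch below is illustrative only: busy_devices is a hypothetical helper, and it finds the %util column by its header name because the column layout differs between sysstat versions.

```shell
# busy_devices [THRESHOLD]
# Hypothetical helper: reads `iostat -x` output on stdin and prints
# "device %util" for devices whose %util exceeds THRESHOLD (default 80).
busy_devices() {
    awk -v limit="${1:-80}" '
        $1 ~ /^Device/ {
            # Locate the %util column in this header line.
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
    '
}

# Usage on a live system (three 2-second reports):
#   iostat -xz 2 3 | busy_devices 80
```

Note that the first iostat report shows averages since boot, so later reports are usually the more meaningful ones.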
4. Network Performance Diagnosis
Connection state overview
# Count TCP connections by state (NR>1 skips the header line)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c
# Example: 45 ESTAB, 12 TIME-WAIT, 3 LISTEN
# Alert: TIME-WAIT > 1000 or a large number of CLOSE-WAIT
Traffic monitoring
# Install iftop
sudo apt install iftop -y
# Monitor eth0 traffic in real time
sudo iftop -i eth0
# Shows the send/receive rate of each connection
Packet capture
# Capture port 80 traffic (saved as pcap, limited to 1000 packets)
sudo tcpdump -i eth0 port 80 -w /tmp/http.pcap -c 1000
# View the first 50 records
tcpdump -r /tmp/http.pcap -nn | head -50
# Advanced filter example: capture SYN packets from a specific IP
sudo tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0 and src host 192.168.1.100'
NIC hardware statistics
# View NIC statistics (drops, errors, collisions)
ethtool -S eth0 | grep -E 'drop|error|coll'
# View the negotiated speed
ethtool eth0 | grep Speed
# Ensure 1000Mb/s or higher
5. Process and System Call Analysis
Process tree and state
# Show the process hierarchy
pstree -p | grep <process_name>
# Find zombie processes
ps aux | awk '$8 ~ /Z/ {print}'
# Count the file descriptors a process has open
sudo ls /proc/<PID>/fd | wc -l
System call tracing
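# Trace system calls of a running process (replace <PID>)
sudo strace -p <PID> -T -tt -e trace=open,openat,read,write,close -o /tmp/strace.log
# Launch a program and trace it
strace -T -tt curl https://example.com 2>&1 | head -100
# Options: -T shows time spent in each call, -tt prints microsecond timestamps, -e trace selects system calls
Library call tracing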
# Trace dynamic library function calls
sudo ltrace -p <PID> -o /tmp/ltrace.log
# Useful for troubleshooting third-party library issues
6. System Log Analysis
Kernel logs
# View recent kernel messages (including hardware errors and OOM events)
sudo dmesg -T | tail -200
# Search for OOM Killer events
sudo dmesg -T | grep -i 'killed process'
# On systemd systems, use journalctl
sudo journalctl -k -p err -n 100
Application log locations
Ubuntu/Debian: /var/log/syslog, /var/log/auth.log
RHEL/CentOS: /var/log/messages, /var/log/secure
# Follow the system log in real time
sudo tail -f /var/log/syslog
# Search for failed SSH logins
sudo grep 'Failed password' /var/log/auth.log | tail -50
7. Kernel Parameter Optimization
View and configure
# Show all kernel parameters
sysctl -a | less
# View a specific parameter
sysctl net.core.somaxconn
# Temporary change (lost on reboot)
sudo sysctl -w net.core.somaxconn=8192
# Permanent change
echo "net.core.somaxconn = 8192" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
Common tuning parameters
# TCP listen queue size (≥ 4096 recommended for high concurrency)
net.core.somaxconn = 8192
# Enable TIME-WAIT reuse
net.ipv4.tcp_tw_reuse = 1
# System-wide file handle limit
fs.file-max = 1048576
# Dirty page ratios (reduce I/O stalls)
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
Backup before changes
sudo cp /etc/sysctl.conf /etc/sysctl.conf.bak
Practical Cases
Case 1: CPU Spike to 100%
Symptom : Server CPU usage jumps to 100%, order processing times out.
Investigation steps
# Step 1: Locate offending process
top -c # finds java PID 1234 using 380%
# Step 2: Thread‑level analysis
top -H -p 1234 # thread 1250 uses 98%
# Step 3: Get thread stack
printf '%x\n' 1250 # prints 4e2, which appears as nid=0x4e2 in the thread dump
sudo jstack 1234 | grep -A 20 '0x4e2' # shows infinite loop in coupon calculation
# Step 4: Performance sampling verification
sudo perf record -p 1234 -g -- sleep 10
sudo perf report --stdio | head -50
Optimization
Immediate: restart service, enable rate limiting
Code fix: optimise algorithm, add caching
Capacity planning: vertical scale to 8 cores, add 3 horizontal nodes
Result : CPU average dropped from 95% to 45%; P99 latency reduced from 4800 ms to 220 ms.
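Step 3 of this case depends on converting the decimal thread ID from top -H into the hexadecimal nid that jstack prints. A minimal sketch of that conversion follows; tid_to_nid is a hypothetical helper name, and the PIDs in the usage line are the ones from the case above.

```shell
# tid_to_nid TID
# Hypothetical helper: converts a decimal thread ID (as shown by `top -H`)
# to the hexadecimal form used in the nid= field of a jstack thread dump.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Usage with the IDs from this case:
#   sudo jstack 1234 | grep -A 20 "nid=$(tid_to_nid 1250)"
```

Matching on the full nid=0x... string avoids accidental hits on other hex values in the dump.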
Case 2: Memory Leak Detection
Symptom : Service restarts after 3 days, OOM Killer triggered.
Investigation steps
# Confirm OOM event
sudo dmesg -T | grep -i 'killed process'
# Record memory usage every 5 s
while true; do
ps -p 5678 -o pid,vsz,rss,%mem,cmd >> /tmp/mem_monitor.log
sleep 5
done
# Analyse trend (RSS grew from 2 GB to 14 GB)
awk '{print $3}' /tmp/mem_monitor.log | head -100
# Detailed memory map
sudo cat /proc/5678/smaps_rollup
# Private_Dirty shows 13 GB → abnormal
# Use valgrind to detect leaks
valgrind --leak-check=full ./video_service
Fix : Call av_free() after buffer use and set systemd memory limits.
Result : After 30 days the process stays at ~3.5 GB, OOM events eliminated.
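The trend analysis in this case can be summarised from the monitoring log instead of read line by line. The sketch below assumes the log format produced by the ps loop above (pid, vsz, rss, %mem, cmd, with a header repeated on every sample); rss_growth is a hypothetical helper name.

```shell
# rss_growth
# Hypothetical helper: reads the ps monitoring log on stdin and prints the
# first RSS, last RSS, and growth, all in KiB. Repeated header lines are
# skipped by requiring a numeric PID in column 1.
rss_growth() {
    awk '
        $1 ~ /^[0-9]+$/ {
            if (!seen) { first = $3; seen = 1 }
            last = $3
        }
        END {
            if (!seen) { print "no samples"; exit 1 }
            printf "first=%d last=%d growth=%d\n", first, last, last - first
        }'
}

# Usage with the log from this case:
#   rss_growth < /tmp/mem_monitor.log
```

Steady growth across many samples is what points to a leak; a single large jump is more likely a legitimate allocation.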
Case 3: Disk I/O Bottleneck
Symptom : Database query latency rises from 50 ms to 2 s, slow‑query count spikes.
Investigation steps
# I/O stats
iostat -xz 2
# %util 99.8, await 152.3 ms → abnormal
# Locate I/O heavy process
sudo iotop -o
# TID 3456 mysqld reads 45.2 M/s
# Analyse slow queries
sudo tail -100 /var/log/mysql/mysql-slow.log
# Many full table scans, missing indexes
# Block device tracing (10 s sample)
sudo blktrace -d /dev/sda -w 10 -o - | blkparse -i - | head -200
Optimization
Add database indexes (ALTER TABLE ADD INDEX)
Move data directory to SSD
Change I/O scheduler: echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
Result : Disk %util dropped to 35%; P95 query latency improved from 1800 ms to 55 ms.
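After changing the I/O scheduler it is worth confirming which one is active. The kernel lists the available schedulers in /sys/block/<dev>/queue/scheduler and marks the active one with square brackets. A minimal sketch, with active_scheduler as a hypothetical helper name:

```shell
# active_scheduler
# Hypothetical helper: reads the contents of /sys/block/<dev>/queue/scheduler
# on stdin and prints the scheduler enclosed in square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage for the device from this case:
#   active_scheduler < /sys/block/sda/queue/scheduler
```

A value written this way does not persist across reboots, so the check is also useful after a restart.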
Best‑Practice Checklist
Establish performance baseline : collect CPU/memory/I/O metrics during low‑traffic periods.
Layered investigation principle : start with macro view (load/top) then drill down (perf/strace).
Preserve fault evidence : gather data before reboot.
Tool version management : keep sysstat, perf, and iotop at consistent versions across hosts (sysstat ≥12.x).
Automated inspection : schedule cron jobs to collect sar data, retain 30 days history.
Avoid destructive actions : test in /tmp/lab, never on production mounts.
Log rotation configuration : prevent /var/log from filling up.
Network capture guidelines : limit packet count or duration, obtain approval for sensitive environments.
Kernel parameter change workflow : backup → validate → write → monitor 24 h.
Cross‑validation : combine perf/sar to confirm top results.
Capacity planning foresight : trigger expansion when resource usage >70%.
Alert severity tiers : P0 immediate, P1/P2 graded response.
Document troubleshooting paths : maintain runbooks for common issues.
Least‑privilege principle : grant only necessary sudo rights.
Regular drills : quarterly performance‑failure rehearsals.
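The "backup → validate → write" part of the kernel parameter workflow in the checklist can be scripted so that repeated changes do not pile up duplicate entries. The following is a sketch under stated assumptions, not a definitive tool: set_sysctl_conf is a hypothetical helper, it must run as root against /etc/sysctl.conf, and the value should first be validated with a temporary sysctl -w as described in the kernel parameter section.

```shell
# set_sysctl_conf FILE KEY VALUE
# Hypothetical helper: backs up FILE to FILE.bak, then sets "KEY = VALUE",
# replacing an existing entry for KEY instead of appending a duplicate.
set_sysctl_conf() {
    local file="$1" key="$2" value="$3" tmp
    cp -p "$file" "$file.bak" || return 1        # backup
    tmp=$(mktemp) || return 1
    # Copy every line except an existing (uncommented) entry for KEY.
    awk -v key="$key" -F'=' '
        { k = $1; gsub(/[ \t]/, "", k) }
        k != key { print }
    ' "$file" > "$tmp" || return 1
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Overwrite in place so FILE keeps its ownership and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage (as root), followed by a reload and a read-back check:
#   set_sysctl_conf /etc/sysctl.conf net.core.somaxconn 8192
#   sysctl -p && sysctl -n net.core.somaxconn
```

The monitoring step of the workflow still has to happen outside the script, by watching the relevant metrics after the change.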
Summary and Outlook
This article systematically outlines core Linux performance troubleshooting tools and methodologies, using three complete real‑world cases to help operations engineers quickly locate problems.
Key takeaways :
Follow USE/RED model for layered diagnosis.
Watch critical metric thresholds.
Tailor optimizations to business scenarios.
Keep the toolchain up‑to‑date.
Technical evolution trends :
eBPF adoption: bpftrace and other eBPF tools add dynamic tracing without kernel modifications and increasingly complement traditional tools.
Observability standardization: OpenTelemetry integrates metrics, logs, traces.
AI‑assisted diagnosis: anomaly detection based on historical data.
Container‑native monitoring: cAdvisor for Kubernetes.
FAQ
Q1: How high is a problematic Load Average? Normal: per‑core < 1.0, multi‑core < number of cores (e.g., 4‑core system Load < 4.0). Consider %wa – high %wa indicates I/O bottleneck rather than CPU.
Q2: Does 90% memory usage require scaling? Not necessarily. Check the available field (>10% is healthy) and Swap usage. Persistent growth with non‑zero si/so suggests scaling.
Q3: How to choose an I/O scheduler?
SSD/NVMe: none or mq-deadline
HDD: mq-deadline or bfq
Database workloads: mq-deadline (balances throughput and latency)
Q4: Does perf sampling affect production performance? Impact is minimal; keep sampling frequency ≤99 Hz (overhead <1% CPU) and sample for 30–60 seconds.
Q5: How to optimise excessive TIME‑WAIT connections? Enable port reuse: sudo sysctl -w net.ipv4.tcp_tw_reuse=1 (this setting applies to outgoing connections only).
Q6: How to measure fault‑handling effectiveness? Use MTTD (<2 min), MTTR (<15 min), impact (<5% of users), and zero repeat‑failure rate.
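For Q5, it helps to confirm how many sockets are actually in TIME-WAIT before changing any kernel parameter. A minimal sketch that counts sockets in one state from ss -tan output; count_state is a hypothetical helper name.

```shell
# count_state STATE
# Hypothetical helper: reads `ss -tan` output on stdin and prints how many
# sockets are in STATE (for example TIME-WAIT or CLOSE-WAIT). The header
# line never matches because its first column is "State".
count_state() {
    awk -v state="$1" '$1 == state { n++ } END { print n + 0 }'
}

# Usage on a live system:
#   ss -tan | count_state TIME-WAIT
```

The result can be compared with the TIME-WAIT > 1000 alert level given in the network section.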
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
MaGe Linux Operations
