How to Diagnose Linux Performance Issues with Flame Graphs and System Tools
This guide explains how to systematically analyze Linux performance problems (CPU, memory, disk I/O, network, and load) using the 5W2H methodology, built-in monitoring commands, perf, and flame graphs. It closes with a real-world Nginx case study that shows how to pinpoint and resolve a bottleneck.
Background
Complex production systems often encounter performance anomalies that are not immediately obvious from monitoring dashboards. Deep analysis on the server is required to locate the root cause, which demands a solid toolbox and systematic methodology.
Methodology
We adopt the 5W2H framework to structure performance investigations. The full framework has seven questions; the six used here (Who is omitted) are:
What – describe the observed symptom.
When – identify the time window when it occurs.
Why – hypothesize possible reasons.
Where – pinpoint the subsystem (CPU, memory, I/O, network, etc.).
How much – quantify resource consumption.
How to do – define concrete steps and tools for diagnosis.
CPU Analysis
Key Concepts
CPU time is split into on‑CPU (user and system execution) and off‑CPU (waiting for I/O, locks, scheduling). Understanding thread states, CPI/IPC, scheduler queues, and CPU cache behavior is essential.
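As a rough illustration of the split between executing and waiting, the shell sketch below derives user, system, I/O-wait, and idle percentages from two samples of the aggregate cpu line in /proc/stat. The function name cpu_breakdown is hypothetical, and folding nice time into user and irq/softirq time into system is a simplifying assumption, so the numbers may not match top or mpstat exactly.

```shell
# cpu_breakdown BEFORE AFTER
# BEFORE and AFTER are two samples of the aggregate "cpu ..." line from
# /proc/stat, taken some interval apart (hypothetical helper, not a standard tool).
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # First line: remember the starting tick counters (fields 2..9).
        NR == 1 { for (i = 2; i <= 9; i++) start[i] = $i; next }
        {
            total = 0
            for (i = 2; i <= 9; i++) { d[i] = $i - start[i]; total += d[i] }
            if (total == 0) { print "no ticks elapsed"; exit 1 }
            # Fields: 2 user, 3 nice, 4 system, 5 idle, 6 iowait,
            #         7 irq, 8 softirq, 9 steal
            printf "user=%.1f%% system=%.1f%% iowait=%.1f%% idle=%.1f%%\n",
                100 * (d[2] + d[3]) / total,
                100 * (d[4] + d[7] + d[8]) / total,
                100 * d[6] / total,
                100 * d[5] / total
        }'
}

# Example usage on a live system:
#   before=$(grep '^cpu ' /proc/stat); sleep 1; after=$(grep '^cpu ' /proc/stat)
#   cpu_breakdown "$before" "$after"
```

A high iowait or idle share alongside slow requests points toward off-CPU causes, while a high user or system share points toward on-CPU profiling with perf.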
Analysis Tools
uptime, vmstat, mpstat, top, pidstat – provide overall CPU and load metrics.
perf – captures per‑function CPU usage and can target kernel functions.
Typical Commands
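    # Show overall CPU usage
    top
    # Show per-core statistics, refreshed every second
    mpstat -P ALL 1
    # Show CPU usage and run-queue load every second
    vmstat 1
    # Show CPU statistics for one process (replace pid with the process ID)
    pidstat -u 1 -p pid
    # Sample function-level CPU usage inside a process
    perf top -p pid -e cpu-clock
Memory Analysis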
Key Concepts
Memory performance involves understanding physical RAM, virtual memory, page cache, OOM events, and allocator behavior (glibc, jemalloc, SLUB, etc.).
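To make the relationship between physical RAM and page cache concrete, here is a minimal sketch that summarizes /proc/meminfo-style input. The function name mem_summary is hypothetical, and the sketch assumes a kernel that reports MemAvailable (added in Linux 3.14); on older kernels that field would simply be reported as 0.

```shell
# mem_summary: read /proc/meminfo-formatted text on stdin and print a
# one-line summary in MiB (hypothetical helper for illustration).
mem_summary() {
    awk '
        $1 == "MemTotal:"     { total = $2 }   # values are in kB
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "Cached:"       { cached = $2 }  # page cache, excludes SwapCached
        END {
            if (total == 0) { print "MemTotal missing"; exit 1 }
            printf "total_mib=%d available_mib=%d cached_mib=%d available_pct=%.1f\n",
                total / 1024, avail / 1024, cached / 1024, 100 * avail / total
        }'
}

# Example usage on a live system:
#   mem_summary < /proc/meminfo
```

A large cached value with a healthy available percentage usually means memory is being used productively as page cache rather than being exhausted.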
Analysis Tools
free, vmstat, top, pidstat, pmap – report overall and per‑process memory usage.
valgrind – detects memory leaks.
dtrace – dynamic tracing of kernel memory functions (requires custom scripts).
Typical Commands
    # Show system memory usage in MiB
    free -m
    # Show virtual memory statistics every second
    vmstat 1
    # Show system memory usage
    top
    # Show memory statistics for one process at a 1-second interval
    pidstat -p pid -r 1
    # Show the memory map of a process
    pmap -d pid
    # Check a program for memory errors and leaks
    valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program
Disk I/O Analysis
Key Concepts
Storage is typically the slowest component of a system. Understanding file systems, VFS, page cache, buffer cache, and inode structures is crucial for I/O bottleneck analysis.
Analysis Tools
iostat – detailed I/O statistics.
iotop – real‑time per‑process I/O usage.
perf – can record block I/O events.
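When iostat is not installed, device busy time can be approximated directly from /proc/diskstats. The sketch below assumes two snapshots of that file saved a known interval apart; the function name disk_util and the file paths in the usage comment are hypothetical. It uses field 13 (milliseconds spent doing I/O), which is the same counter that a utilization percentage is normally derived from.

```shell
# disk_util BEFORE_FILE AFTER_FILE INTERVAL_MS
# Print "device busy_pct" for every device present in both snapshots of
# /proc/diskstats (hypothetical helper for illustration).
disk_util() {
    awk -v ms="$3" '
        # First file: remember the "time spent doing I/O" counter per device.
        NR == FNR { busy[$3] = $13; next }
        # Second file: busy percentage = delta of that counter / interval.
        ($3 in busy) { printf "%s %.1f\n", $3, 100 * ($13 - busy[$3]) / ms }
    ' "$1" "$2"
}

# Example usage on a live system:
#   cat /proc/diskstats > /tmp/before; sleep 1; cat /proc/diskstats > /tmp/after
#   disk_util /tmp/before /tmp/after 1000
```

Values near 100 suggest a saturated device, although on devices that serve requests in parallel (SSDs, RAID arrays) a high busy percentage does not by itself prove saturation.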
Typical Commands
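    # Show per-process I/O activity
    iotop
    # Show extended device statistics in kB, every second, 10 times
    iostat -d -x -k 1 10
    # Show I/O statistics for one process
    pidstat -d 1 -p pid
    # Record block request issue events system-wide (stop with Ctrl-C), then view the report
    perf record -e block:block_rq_issue -ag
    perf report
Network Analysis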
Key Concepts
Network performance is affected by latency, packet loss, congestion, and external devices (switches, routers, wireless). Adaptive NICs adjust speed based on link conditions.
Analysis Tools
netstat, ss – display socket statistics.
sar – collect TCP/UDP metrics.
tcpdump, tcpflow – packet capture and flow analysis.
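As a sketch of how socket statistics can be turned into a quick health check, the function below counts TCP sockets per state from `ss -tan` output. The function name count_tcp_states is hypothetical, and the sketch assumes the first output line is a header and the first column is the state, which is how ss prints TCP sockets when no state filter is given.

```shell
# count_tcp_states: read `ss -tan` output on stdin and print "STATE count",
# sorted by state name (hypothetical helper for illustration).
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ }                     # skip the header line
         END { for (s in count) print s, count[s] }' | sort
}

# Example usage on a live system:
#   ss -tan | count_tcp_states
```

A large number of sockets in TIME-WAIT or CLOSE-WAIT is often the first hint of connection churn or of an application that is not closing sockets.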
Typical Commands
    # Show per-protocol network statistics
    netstat -s
    # Show current UDP sockets
    netstat -nu
    # Count TCP connections by state
    netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
    # Show all TCP sockets
    ss -t -a
    # Show a socket summary
    ss -s
    # Capture packets for host 192.168.1.1 on port 80
    tcpdump -i eth1 host 192.168.1.1 and port 80
    # Capture TCP flows and print them to the console
    tcpflow -cp host 192.168.1.1
System Load
Key Concepts
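Load average is an exponentially damped average, over 1, 5, and 15 minutes, of the number of tasks that are runnable or, on Linux, in uninterruptible sleep (typically waiting on disk I/O). It gives a high-level view of system pressure and is most meaningful when read relative to the number of CPU cores.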
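Because a load of 8 means something very different on 4 cores than on 64, a common first step is to normalize load by core count. The sketch below does only that division; the function name load_per_core is hypothetical, and the idea that values well above 1.0 indicate queuing is a rule of thumb rather than a hard threshold.

```shell
# load_per_core LOAD NCPU
# Print the load average divided by the number of CPUs, to two decimals
# (hypothetical helper for illustration).
load_per_core() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (cpus <= 0) { print "invalid CPU count"; exit 1 }
        printf "%.2f\n", load / cpus
    }'
}

# Example usage on a live system (1-minute load and online CPU count):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```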
Analysis Tools
uptime, top, vmstat – quick load checks.
strace – measures time spent in system calls.
dmesg – kernel log for hardware or scheduler warnings.
Typical Commands
    # Check load
    uptime
    top
    vmstat
    # Summarize system call counts and time for a process
    strace -c -p pid
    # Trace one system call with per-call timing
    strace -T -e epoll_wait -p pid
    # View kernel log messages
    dmesg
Flame Graphs
What They Are
Flame graphs visualize sampled call stacks. The x-axis shows each frame's share of the samples (frames are sorted alphabetically, not by time), and the y-axis shows stack depth. Wider boxes indicate functions present in more samples, which for a CPU profile means more CPU time.
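To show what box width corresponds to, the sketch below works on the folded stack format that the FlameGraph stackcollapse scripts emit: one line per unique stack, frames separated by semicolons, followed by a sample count. The function name inclusive_samples is hypothetical. For each function it sums the samples of every stack that contains it, which approximates the total width of that function's boxes.

```shell
# inclusive_samples: read folded stacks ("frame1;frame2;... count") on stdin
# and print "samples function", widest first (hypothetical helper).
inclusive_samples() {
    awk '{
        count = $NF                      # last field is the sample count
        stack = $0
        sub(/ [0-9]+$/, "", stack)       # strip the count, keep the frames
        n = split(stack, frames, ";")
        split("", seen)                  # reset per-stack de-duplication
        for (i = 1; i <= n; i++) {
            # Count a function once per stack, even if it recurses.
            if (!(frames[i] in seen)) {
                seen[frames[i]] = 1
                total[frames[i]] += count
            }
        }
    }
    END { for (f in total) print total[f], f }' | sort -rn
}

# Example usage, assuming out.folded was produced by stackcollapse-perf.pl:
#   inclusive_samples < out.folded | head
```

The functions at the top of this list are the ones that would appear as the widest boxes, so the list is a text-only way to sanity-check what the SVG shows.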
Installation
    # Install SystemTap
    yum install -y systemtap systemtap-runtime
    # Install debug symbols matching the running kernel
    debuginfo-install --enablerepo=debuginfo kernel
Getting the Tools
    git clone https://github.com/brendangregg/FlameGraph.git
Generating On-CPU Flame Graphs
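    # On-CPU, user space: run the wrapper script against the target process.
    # ngx_on_cpu_u.sh is not part of the FlameGraph repository; it is assumed
    # to be present in the current directory.
    sh ngx_on_cpu_u.sh pid
    # Enter the results directory
    cd ngx_on_cpu_u
    # Start a temporary HTTP server (Python 2; on Python 3 use: python3 -m http.server 8088)
    python -m SimpleHTTPServer 8088
    # Open 127.0.0.1:8088/pid.svg in a browser
Generating Off-CPU Flame Graphs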
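    # Off-CPU, user space: run the wrapper script against the target process
    # (ngx_off_cpu_u.sh is likewise assumed to be in the current directory)
    sh ngx_off_cpu_u.sh pid
    # Enter the results directory
    cd ngx_off_cpu_u
    # Start a temporary HTTP server (Python 2; on Python 3 use: python3 -m http.server 8088)
    python -m SimpleHTTPServer 8088
    # Open 127.0.0.1:8088/pid.svg in a browser
Differential (Red-Blue) Flame Graphs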
Capture two profiles (before and after a change), collapse stacks, and run difffolded.pl to highlight regressions (red) and improvements (blue).
    cd quick_location
    # Capture baseline profile 1
    perf record -F 99 -p pid -g -- sleep 30 && perf script > out.stacks1
    # Capture profile 2 after the change
    perf record -F 99 -p pid -g -- sleep 30 && perf script > out.stacks2
    # Collapse both profiles and generate the differential flame graph
    ./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
    ./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2
    ./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff.svg
Case Study – Nginx Cluster Anomaly (Sept 2017)
Observed Symptoms
Spike of 499 and 5xx HTTP status codes.
Sustained high CPU usage on Nginx workers.
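A spike like this can be quantified straight from the access log before any deeper profiling. The sketch below is only an illustration: the function name status_summary is hypothetical, and it assumes Nginx's default combined log format, in which the status code is the ninth whitespace-separated field. A custom log_format would need a different field number.

```shell
# status_summary: read Nginx access log lines (combined format assumed) on
# stdin and report how many requests ended in 499 or 5xx (hypothetical helper).
status_summary() {
    awk '
        { total++ }
        $9 == 499              { c499++ }   # client closed the connection
        $9 >= 500 && $9 <= 599 { c5xx++ }   # server-side errors
        END {
            if (total == 0) { print "no requests"; exit 1 }
            printf "total=%d 499=%d 5xx=%d error_pct=%.1f\n",
                total, c499, c5xx, 100 * (c499 + c5xx) / total
        }'
}

# Example usage (the log path is an assumption and varies by installation):
#   status_summary < /var/log/nginx/access.log
```

Running it over the window before and during the incident gives the "How much" figure that the 5W2H framework asks for.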
Step‑by‑Step Diagnosis
Checked request volume – traffic actually decreased, so load surge was not the cause.
Analyzed Nginx response time – increased latency suggested either Nginx itself or upstream services.
Inspected upstream response time – also increased, indicating a possible backend bottleneck.
Used top to confirm Nginx worker CPU was high.
Ran perf top -p <pid> – identified heavy time spent in free, malloc, and JSON parsing.
Generated on‑CPU flame graph – highlighted the JSON library as the dominant consumer.
Findings
Backend upstream latency contributed to overall request slowdown.
Inside Nginx, a JSON‑parsing module consumed excessive CPU and memory allocations.
Resolution
Temporarily disabled the high‑cost JSON module, which immediately lowered CPU usage and restored normal request rates. The upstream latency remained, but the Nginx‑side bottleneck was eliminated.
References
http://www.brendangregg.com/
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html
http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
https://github.com/openresty/openresty-systemtap-toolkit
https://github.com/brendangregg/FlameGraph
https://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs
Liangxu Linux
Liangxu, a self-taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge: fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc.
