Master Linux Performance: Tools & Flame Graphs for Fast Issue Diagnosis
This article presents a comprehensive guide to Linux performance analysis, covering CPU, memory, disk I/O, network, system load, flame‑graph techniques, and a real‑world Nginx case study, enabling engineers to quickly locate and resolve bottlenecks.
1. Background
Sometimes we hit difficult problems that monitoring plugins cannot immediately reveal, and deep analysis on the server is required. Locating such issues takes accumulated experience and broad knowledge, and good analysis tools greatly improve efficiency.
2. Description
This article introduces various problem‑location tools and combines case studies for analysis.
3. Problem‑analysis methodology
Applying the 5W2H method raises several performance‑analysis questions:
What – what does the phenomenon look like
When – when does it happen
Why – why does it happen
Where – where does it happen
How much – how many resources are consumed
How – how to solve it
4. CPU
4.1 Description
For applications we usually focus on kernel CPU scheduler functionality and performance. Thread‑state analysis distinguishes on‑CPU (user and sys time) and off‑CPU (waiting for I/O, lock, paging, etc.).
If most time is spent on‑CPU, CPU profiling quickly explains the cause; if most time is off‑CPU, locating the problem takes longer. Key concepts include:
Processor
Core
Hardware thread
CPU cache
Clock frequency
CPI / IPC
Instruction set
Utilization
User time / kernel time
Scheduler
Run queue
Preemption
Multi‑process
Multi‑thread
Word size
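As a rough illustration of the on‑CPU/off‑CPU split, the aggregate busy percentage can be derived from two samples of /proc/stat. This is a minimal sketch, not a replacement for mpstat; the field arithmetic assumes the standard Linux layout (user nice system idle iowait irq softirq steal):

```shell
#!/bin/sh
# Sketch: sample /proc/stat twice, one second apart, and report what
# fraction of CPU time was busy (on-CPU) during the interval.
# "cpu" line fields: user nice system idle iowait irq softirq steal ...
sample() { awk '/^cpu /{print $2+$3+$4+$7+$8+$9, $5+$6}' /proc/stat; }

set -- $(sample); busy1=$1 idle1=$2
sleep 1
set -- $(sample); busy2=$1 idle2=$2

dbusy=$((busy2 - busy1)); didle=$((idle2 - idle1))
# Integer percentage of the interval the CPUs spent on-CPU
echo "cpu busy: $((100 * dbusy / (dbusy + didle)))%"
```

This is essentially what vmstat and mpstat compute from the same counters.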
4.2 Analysis tools
uptime, vmstat, mpstat, top, pidstat – show CPU and load usage.
perf – traces function‑level CPU time and can target specific kernel functions.
4.3 Usage
<code># view system CPU usage
top
# view per‑CPU info
mpstat -P ALL 1
# view CPU usage and load average
vmstat 1
# per‑process CPU statistics
pidstat -u 1 -p <pid>
# trace function‑level CPU usage of a process
perf top -p <pid> -e cpu-clock</code>
5. Memory
5.1 Description
Memory problems affect not only performance but also service availability. Key concepts include:
Main memory
Virtual memory
Resident memory
Address space
OOM
Page cache
Page fault
Swapping
Swap space
Allocator libraries (libc, glibc, libmalloc, mtmalloc)
Linux SLUB allocator
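To make the page‑cache point concrete, MemAvailable in /proc/meminfo estimates how much memory is usable without swapping, because it counts reclaimable page cache. A quick sketch (the awk keys are the standard /proc/meminfo field names):

```shell
# Sketch: report available memory as a percentage of total.
# MemAvailable counts reclaimable page cache, so it is a better
# "free" figure than MemFree alone.
awk '/^MemTotal:/     {total = $2}
     /^MemAvailable:/ {avail = $2}
     END {printf "available: %d%% of %d MiB total\n",
          100 * avail / total, total / 1024}' /proc/meminfo
```

free -m derives its "available" column from the same counter.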
5.2 Analysis tools
free, vmstat, top, pidstat, pmap – report memory usage.
valgrind – detects memory leaks.
dtrace – dynamic tracing of kernel and user functions via scripts in the D language (not shipped on most Linux distributions; SystemTap is the common Linux equivalent).
5.3 Usage
<code># view system memory usage
free -m
# view virtual memory stats
vmstat 1
# view memory usage
top
# per‑process memory stats
pidstat -p <pid> -r 1
# view a process's memory map
pmap -d <pid>
# detect memory leaks
valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program</code>
6. Disk I/O
6.1 Description
Disk is the slowest subsystem and a common performance bottleneck. Understanding file system, VFS, page cache, buffer cache, inode, and I/O scheduling is essential.
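The %util column that iostat reports can be approximated directly from /proc/diskstats: field 13 is the cumulative milliseconds a device has spent doing I/O, so its delta over one second is the device's busy time. A minimal sketch (it defaults to the first listed device, which may be a loop or ram device; pass a real one such as sda as an argument):

```shell
#!/bin/sh
# Sketch: approximate a device's %util over one second from /proc/diskstats.
# Field 3 is the device name; field 13 is total time spent doing I/O (ms).
dev=${1:-$(awk 'NR==1{print $3}' /proc/diskstats)}
snap() { awk -v d="$dev" '$3 == d {print $13}' /proc/diskstats; }

t1=$(snap); sleep 1; t2=$(snap)
# A device busy close to 1000 ms out of 1000 ms is saturated (~100% util)
echo "$dev was busy $((t2 - t1)) ms out of the last ~1000 ms"
```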
6.2 Analysis tools
iotop, iostat, pidstat – report disk I/O at the system, device, and process level.
perf – traces block‑layer events (e.g. block:block_rq_issue) to investigate I/O anomalies.
6.3 Usage
<code># per‑process I/O overview
iotop
# detailed device I/O stats
iostat -d -x -k 1 10
# per‑process I/O
pidstat -d 1 -p <pid>
# investigate I/O anomalies at the block layer
perf record -e block:block_rq_issue -ag
perf report</code>
7. Network
7.1 Description
Network monitoring is complex due to latency, blocking, collisions, packet loss, and external devices such as routers and switches.
7.2 Analysis tools
netstat, ss, sar – report connections, socket statistics, and network I/O.
tcpdump, tcpflow – capture and inspect traffic on the wire.
7.3 Usage
<code># protocol statistics
netstat -s
# UDP connections
netstat -nu
# UDP port usage
netstat -apu
# count TCP connections per state
netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
# TCP connections
ss -t -a
# socket summary
ss -s
# UDP sockets
ss -u -a
# TCP and TCP error stats
sar -n TCP,ETCP 1
# per‑interface network I/O
sar -n DEV 1
# packet capture
tcpdump -i eth1 host 192.168.1.1 and port 80
# flow capture (print to console, no promiscuous mode)
tcpflow -cp host 192.168.1.1</code>
8. System Load
8.1 Description
Load measures how much demand is placed on the system; Load Average is the average run‑queue length over 1, 5, and 15 minutes (on Linux it also counts tasks in uninterruptible sleep, so heavy disk I/O inflates it).
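Because Load Average is a run‑queue length, it only means something relative to the CPU count. A quick sketch of that comparison (the "load above CPU count" threshold is a rule of thumb, not a hard limit):

```shell
#!/bin/sh
# Sketch: flag possible saturation when the 1-minute load average
# exceeds the number of CPUs.
read load1 load5 load15 rest < /proc/loadavg
cpus=$(nproc)
echo "1-min load: $load1 on $cpus CPUs"
awk -v l="$load1" -v c="$cpus" \
    'BEGIN { if (l + 0 > c + 0) print "run queue backing up";
             else               print "headroom available" }'
```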
8.2 Analysis tools
uptime, top, vmstat – show load averages and run‑queue length.
strace – traces a process's system calls and their latency.
dmesg – shows kernel messages that may explain sudden load spikes.
8.3 Usage
<code># view load
uptime
top
vmstat
# summarize system-call counts and latency
strace -c -p <pid>
# trace a specific syscall with per-call timings
strace -T -e epoll_wait -p <pid>
# view kernel logs
dmesg</code>
9. Flame Graphs
9.1 Description
Flame Graphs, created by Brendan Gregg, visualize sampled call stacks. The y‑axis shows stack depth; the x‑axis spans the sample population (sorted alphabetically, not by time), so the wider a frame, the more often that function appeared in samples and the more CPU time it consumed.
9.2 Install dependencies
<code># install systemtap
yum install systemtap systemtap-runtime
# check the running kernel version
uname -r
# then install the matching kernel-debuginfo and kernel-devel packages</code>
9.3 Install
<code>git clone https://github.com/lidaohang/quick_location.git
cd quick_location</code>
9.4 On‑CPU flame graph
High CPU usage can be pinpointed to the functions that dominate the on‑CPU flame graph.
<code># on‑CPU, user mode
sh ngx_on_cpu_u.sh <pid>
cd ngx_on_cpu_u
python -m SimpleHTTPServer 8088
# open http://127.0.0.1:8088/<pid>.svg in a browser</code>
9.4.1 on‑CPU
CPU time is split into user and kernel.
9.4.2 off‑CPU
Off‑CPU time represents waiting for I/O, locks, paging, etc.
<code># off‑CPU, user mode
sh ngx_off_cpu_u.sh <pid>
cd ngx_off_cpu_u
python -m SimpleHTTPServer 8088</code>
9.5 Memory‑level flame graph
Useful for locating memory‑leak hotspots.
<code>sh ngx_on_memory.sh <pid>
cd ngx_on_memory
python -m SimpleHTTPServer 8088</code>
9.6 Differential (red‑blue) flame graph
Compares two profiles to highlight performance regressions.
<code># capture before the change (perf record writes a data file, not stdout;
# use -o to name it, then perf script to dump the samples as text)
perf record -F 99 -p <pid> -g -o perf.data.before -- sleep 30
perf script -i perf.data.before > out1
# capture after the change
perf record -F 99 -p <pid> -g -o perf.data.after -- sleep 30
perf script -i perf.data.after > out2
# generate the red‑blue differential
./FlameGraph/stackcollapse-perf.pl out1 > folded1
./FlameGraph/stackcollapse-perf.pl out2 > folded2
./FlameGraph/difffolded.pl folded1 folded2 | ./FlameGraph/flamegraph.pl > diff.svg</code>
10. Case Study – Nginx Cluster Anomaly
10.1 Symptom
On 2017‑09‑25 the Nginx cluster showed many 499/5xx responses and increased CPU usage.
10.2 Nginx metrics analysis
Traffic did not spike; response time increased, likely due to upstream latency.
10.3 System CPU analysis
Top showed high CPU usage by Nginx workers; perf top revealed most time spent in free, malloc, and JSON parsing.
10.4 Flame‑graph analysis
On‑CPU flame graph confirmed heavy JSON parsing cost.
10.5 Summary
Root causes: upstream latency and inefficient JSON parsing in Nginx modules. Disabling the costly module reduced CPU usage and restored normal traffic.
11. References
http://www.brendangregg.com/
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html
http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
https://github.com/openresty/openresty-systemtap-toolkit
https://github.com/brendangregg/FlameGraph
https://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing original technical articles with a focus on operations transformation.