Mastering Linux Performance: From CPU to Flame Graphs
This article is a comprehensive guide to Linux performance analysis: background, methodology, and tooling, with step-by-step case studies covering CPU, memory, disk I/O, network, system load, and flame-graph techniques for quickly locating and resolving bottlenecks.
Background
When monitoring plugins cannot immediately reveal the root cause of issues, deeper server‑side analysis is required; this demands technical experience and a broad knowledge base to pinpoint problems efficiently.
Explanation
The article introduces various problem‑location tools and demonstrates their use with real‑world cases.
Problem‑analysis Methodology
Applying the 5W2H method helps formulate performance‑analysis questions:
What – what is the phenomenon?
When – when does it occur?
Why – why does it happen?
Where – where does it happen?
How much – how many resources are consumed?
How – how do we solve it?
CPU
Explanation
For applications, we usually focus on kernel CPU scheduler functionality and performance.
Thread‑state analysis classifies time spent as:
on‑CPU: execution time, divided into user‑mode (user) and kernel‑mode (sys).
off‑CPU: time spent waiting for the next CPU slice, for I/O, locks, or paging, with sub‑states such as runnable, anonymous paging, sleep, lock, and idle.
If most time is on‑CPU, CPU profiling can quickly explain the cause; if time is off‑CPU, diagnosis becomes more time‑consuming.
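A first pass at the on‑CPU/off‑CPU split is to look at thread states: `ps -eLo stat=` prints one STAT code per thread (R = runnable, S = sleeping, D = uninterruptible wait, often disk I/O). A minimal sketch of tallying those codes, run here against canned sample output rather than a live `ps`:

```shell
# Tally ps-style STAT codes; the sample stands in for `ps -eLo stat=` output.
# R = runnable (competing for CPU), S = sleeping, D = uninterruptible wait.
sample='R
S
S
D
R
S'
summary=$(printf '%s\n' "$sample" |
  awk '{ c[substr($1, 1, 1)]++ } END { for (s in c) printf "%s=%d\n", s, c[s] }' |
  sort)
printf '%s\n' "$summary"
```

Many threads stuck in D state point toward off‑CPU analysis (I/O, locks); a pile-up of R states points toward CPU profiling.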
Key CPU concepts include:
Processor
Core
Hardware thread
CPU cache
Clock frequency
CPI / IPC
CPU instructions
Utilization
User / kernel time
Scheduler
Run queue
Preemption
Multi‑process
Multi‑thread
Word length
Analysis Tools
Typical tools:
uptime, vmstat, mpstat, top, pidstat – monitor CPU and load.
perf – profile CPU usage per function, including kernel functions.
Usage
```shell
# View system CPU usage
top
# Show per-CPU info
mpstat -P ALL 1
# Show CPU usage and average load
vmstat 1
# Process CPU stats
pidstat -u 1 -p <pid>
# Trace process functions
perf top -p <pid> -e cpu-clock
```
Memory
Explanation
Memory issues affect not only performance but also service availability; key concepts include:
Main memory
Virtual memory
Resident memory
Address space
OOM
Page cache
Page faults
Swapping
Allocators (glibc malloc, jemalloc, tcmalloc)
Linux SLUB allocator
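The page-cache concept above is visible directly in `free -m` output: memory in buff/cache is reclaimable, so "available" is the number that matters, not "free". A sketch parsing the Mem: line, using made-up sample output in the modern procps-ng column layout instead of a live run:

```shell
# Parse the Mem: line of `free -m` (procps-ng layout; sample values are made up).
sample='              total        used        free      shared  buff/cache   available
Mem:          15888        4521        2345         612        9021       10432
Swap:          2047           0        2047'
available=$(printf '%s\n' "$sample" | awk '/^Mem:/ { print $7 }')
cached=$(printf '%s\n' "$sample" | awk '/^Mem:/ { print $6 }')
echo "available=${available}MiB buff/cache=${cached}MiB"
```

Here only 2345 MiB is "free", yet 10432 MiB is actually available because the 9021 MiB of page cache can be reclaimed on demand.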
Analysis Tools
Common utilities:
free, vmstat, top, pidstat, pmap – show memory usage.
valgrind – detect memory leaks.
dtrace – dynamic tracing of kernel functions via D scripts.
Usage
```shell
# Show system memory usage
free -m
# Virtual memory stats
vmstat 1
# System memory view
top
# Per-process memory stats
pidstat -p <pid> -r 1
# Process memory map
pmap -d <pid>
# Detect memory leaks
valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program
```
Disk I/O
Explanation
Disk is the slowest subsystem and a common performance bottleneck due to mechanical latency; understanding basic concepts such as file system, VFS, caches, inode, and I/O schedulers is essential.
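When scanning `iostat -x` output, %util close to 100% means the device was busy nearly the whole sampling interval, a strong saturation signal. A small sketch that flags busy devices; column layout varies across sysstat versions, so this assumes %util is the last field and runs on made-up sample data:

```shell
# Flag devices whose %util (last field) exceeds 80% in iostat -x style output.
# Sample data stands in for a live `iostat -x` run; numbers are made up.
sample='Device  r/s   w/s   rkB/s  wkB/s   await  %util
sda     12.0  80.5  480.0  3220.0  15.2   92.3
sdb     1.0   2.0   8.0    64.0    0.8    3.1'
busy=$(printf '%s\n' "$sample" | awk 'NR > 1 && $NF + 0 > 80 { print $1 }')
echo "saturated: $busy"
```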
Analysis Tools
Typical tools include iostat, iotop, and related utilities.
Usage
```shell
# View I/O stats
iotop
# Detailed I/O
iostat -d -x -k 1 10
# Process-level I/O
pidstat -d 1 -p <pid>
# Block request tracing
perf record -e block:block_rq_issue -ag
# Report block traces
perf report
```
Network
Explanation
Network monitoring is complex due to latency, blocking, collisions, packet loss, and external equipment influences; adaptive NICs adjust to varying network conditions.
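A workhorse idiom for TCP health is counting connections per state from `netstat -a` output (the state is the last field of each tcp line); a spike in TIME_WAIT or CLOSE_WAIT is often the first visible symptom. The same awk pipeline, demonstrated here on canned lines rather than live output:

```shell
# Count TCP connections per state; sample stands in for `netstat -a` output.
sample='tcp  0  0 10.0.0.1:80  10.0.0.2:5100  ESTABLISHED
tcp  0  0 10.0.0.1:80  10.0.0.3:5101  TIME_WAIT
tcp  0  0 10.0.0.1:80  10.0.0.4:5102  TIME_WAIT
tcp  0  0 10.0.0.1:22  0.0.0.0:*      LISTEN'
counts=$(printf '%s\n' "$sample" |
  awk '/^tcp/ { ++S[$NF] } END { for (a in S) print a, S[a] }' | sort)
printf '%s\n' "$counts"
```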
Analysis Tools
Common commands: netstat, ss, sar, tcpdump, tcpflow.
Usage
```shell
# Show network stats
netstat -s
# Show UDP connections
netstat -nu
# Show UDP port usage
netstat -apu
# Count connections per state
netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
# Show sockets summary
ss -s
# Show all UDP sockets
ss -u -a
# Show all TCP sockets
ss -t -a
# TCP connection stats
sar -n TCP,ETCP 1
# Network I/O stats
sar -n DEV 1
# Packet capture
tcpdump -i eth1 host 192.168.1.1 and port 80
# Flow capture
tcpflow -cp host 192.168.1.1
```
System Load
Explanation
Load measures the amount of work the system is doing; Load Average is the average number of runnable (and, on Linux, uninterruptible) tasks over the last 1, 5, and 15 minutes, i.e. the length of the run‑queue.
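Load only means something relative to the core count: a 1‑minute load above the number of CPUs means tasks are queueing for CPU time. A sketch of the arithmetic, with sample `/proc/loadavg` content and a fixed core count of 4 standing in for a live read and `nproc`:

```shell
# Compute 1-minute load per core; loadavg content and core count are made up.
loadavg='6.20 3.10 1.05 2/713 12345'
cores=4
per_core=$(printf '%s %s\n' "$loadavg" "$cores" | awk '{ printf "%.2f", $1 / $NF }')
echo "1-min load per core: $per_core"
```

A per-core value above 1.0, as here, indicates CPU contention; sustained values well below 1.0 suggest the load is elsewhere (I/O, locks).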
Analysis Tools
Typical tools: uptime, top, vmstat, strace, dmesg.
Usage
```shell
# View load
uptime
# System overview
top
# VM statistics
vmstat
# System call timing
strace -c -p <pid>
# Trace a specific syscall (e.g., epoll_wait)
strace -T -e epoll_wait -p <pid>
# Kernel log
dmesg
```
Flame Graphs
Explanation
Flame graphs visualize CPU call stacks; the y‑axis represents stack depth, the x‑axis represents sample count. Wider bars indicate functions that consume more CPU time.
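Under the hood, flamegraph.pl consumes "folded" stacks: one line per unique call stack, frames joined by semicolons, followed by a sample count. Collapsing raw stacks into that form is essentially counting duplicates, sketched here on made-up stack traces:

```shell
# Fold raw stack samples into flamegraph.pl's "stack count" input format.
# The sample stacks are made up for illustration.
stacks='main;parse_json;malloc
main;parse_json;malloc
main;parse_json
main;handle_io'
folded=$(printf '%s\n' "$stacks" | sort | uniq -c |
  awk '{ print $2, $1 }' | sort)
printf '%s\n' "$folded"
```

In the rendered graph, `main;parse_json;malloc` would be the widest leaf because it has the most samples.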
Installation
```shell
# Install systemtap and its runtime
# Install the kernel debuginfo/devel packages matching your running kernel
git clone https://github.com/lidaohang/quick_location.git && cd quick_location
```
On‑CPU Flame Graph
Generate and view on‑CPU flame graphs to locate hot functions.
```shell
# Generate the on-CPU flame graph
sh ngx_on_cpu_u.sh <pid>
cd ngx_on_cpu_u
# Serve the SVG (Python 2's built-in server)
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/<pid>.svg
```
Off‑CPU Flame Graph
```shell
# Generate the off-CPU flame graph
sh ngx_off_cpu_u.sh <pid>
cd ngx_off_cpu_u
# Serve the SVG (Python 2's built-in server)
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/<pid>.svg
```
Memory‑Level Flame Graph
```shell
# Generate the memory flame graph
sh ngx_on_memory.sh <pid>
cd ngx_on_memory
# Serve the SVG (Python 2's built-in server)
python -m SimpleHTTPServer 8088
```
Differential Flame Graph (Red‑Blue)
Compare two profiles to highlight performance regressions.
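The core of the comparison is a join on the folded stack: emit each stack with its before and after counts, so the renderer can color growth red and shrinkage blue. A minimal sketch of that idea (not the actual difffolded.pl implementation), using two made-up folded profiles:

```shell
# Join two folded-stack profiles on the stack name, printing "stack before after".
# The folded data is made up; missing stacks default to a before count of 0.
before='main;parse_json 120
main;handle_io 40'
after='main;parse_json 300
main;handle_io 35'
diff=$({ printf '%s\n' "$before" | sed 's/^/B /'
         printf '%s\n' "$after"  | sed 's/^/A /'; } |
  awk '$1 == "B" { b[$2] = $3; next } { print $2, b[$2] + 0, $3 }' | sort)
printf '%s\n' "$diff"
```

Here `main;parse_json` grew from 120 to 300 samples, exactly the kind of regression the red-blue graph highlights.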
```shell
# Capture before the change
perf record -F 99 -p <pid> -g -- sleep 30 && perf script > out.stacks1
# Capture after the change
perf record -F 99 -p <pid> -g -- sleep 30 && perf script > out.stacks2
# Collapse stacks
./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2
# Generate the differential flame graph
./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff.svg
```
Case Study: Nginx Cluster Anomaly
Symptoms
On 2017‑09‑25, the Nginx cluster showed many 499 and 5xx responses, with rising CPU usage.
Analysis Steps
Check request traffic – no spike, traffic actually decreased.
Inspect Nginx response time – increased, possibly due to Nginx or upstream latency.
Examine upstream response time – also increased, suggesting backend delay.
Observe system CPU via top – Nginx workers consume high CPU.
Profile Nginx process with perf top – heavy cost in free, malloc, JSON parsing.
Generate on‑CPU flame graph – identified frequent JSON parsing as CPU hotspot.
Conclusion
The high CPU usage stemmed from an inefficient JSON parsing module within Nginx; disabling the module reduced CPU load and normalized request traffic.
References
http://www.brendangregg.com/
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html
http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
https://github.com/openresty/openresty-systemtap-toolkit
https://github.com/brendangregg/FlameGraph
https://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs
Efficient Ops
This public account is run by Xiaotianguo and friends and regularly publishes original technical articles. We focus on the operations field and aim to accompany you throughout your operations career, growing together.