Master Linux Performance: CPU, Memory, IO, and Flame Graphs for Nginx Troubleshooting
This guide explains how to diagnose Linux performance bottlenecks—CPU, memory, disk I/O, network, and system load—using tools such as top, vmstat, perf, and flame graphs, and demonstrates a real‑world Nginx case study to pinpoint high‑CPU JSON parsing and upstream latency issues.
1. Background
Sometimes complex issues arise that monitoring plugins cannot immediately reveal; deep analysis on the server is required, demanding technical experience across many domains to locate the root cause.
Effective analysis tools can greatly accelerate problem identification, saving time for deeper work.
2. Overview
This article introduces various troubleshooting tools and demonstrates their use with case studies.
3. Problem‑analysis Methodology
Applying the 5W2H framework yields key performance questions:
What : What is the observed phenomenon?
When : When does it occur?
Why : Why does it happen?
Where : Where does it happen?
How much : How many resources are consumed?
How : How can it be resolved?
4. CPU
4.1 Overview
For applications, the kernel CPU scheduler’s functionality and performance are primary concerns. Thread‑state analysis distinguishes on‑CPU (user and system time) from off‑CPU (waiting for I/O, locks, paging, etc.).
Heavy on‑CPU time indicates a need for CPU profiling; extensive off‑CPU time suggests bottlenecks elsewhere.
4.2 Tools
uptime, vmstat, mpstat, top, pidstat – basic CPU and load metrics.
perf – detailed per‑function CPU usage, can target kernel functions.
4.3 Usage
<code># View overall CPU usage
top
# Show per-CPU statistics
mpstat -P ALL 1
# Run queue, memory, and CPU summary
vmstat 1
# Process-level CPU stats
pidstat -u 1 -p <pid>
# Profile a process with perf
perf top -p <pid> -e cpu-clock
</code>
5. Memory
5.1 Overview
Memory issues can affect performance, service availability, or cause crashes. Key concepts include:
Main memory
Virtual memory
Resident memory
Address space
OOM (Out‑of‑Memory)
Page cache
Page faults and swapping
User-space allocators (glibc malloc, jemalloc, tcmalloc, etc.)
Linux kernel SLUB allocator
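Several of these concepts meet in the output of free: memory held by the page cache looks "used" but is reclaimable, so the available column is the honest headroom figure. A small sketch parsing a hard-coded sample of `free -m` output (modern procps column layout assumed; the numbers are illustrative):

```shell
# Parse a sample `free -m` output; $4=free, $6=buff/cache, $7=available
awk '/^Mem:/ {
  printf "free: %d MiB, available: %d MiB (buff/cache %d MiB is reclaimable)\n",
         $4, $7, $6
}' <<'EOF'
              total        used        free      shared  buff/cache   available
Mem:          15885        4203        1024         312       10657       11002
Swap:          2047           0        2047
EOF
# -> free: 1024 MiB, available: 11002 MiB (buff/cache 10657 MiB is reclaimable)
```

The lesson: low "free" alone is not an OOM warning sign; watch available and swap activity instead.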
5.2 Tools
free, vmstat, top, pidstat, pmap – memory usage statistics.
valgrind – memory leak detection.
systemtap/dtrace – dynamic tracing of kernel and user functions (requires deep kernel knowledge).
5.3 Usage
<code># Show system memory usage
free -m
# Virtual memory stats
vmstat 1
# Process memory map
pmap -d <pid>
# Detect leaks with valgrind
valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program
</code>
6. Disk I/O
6.1 Overview
Disk subsystems are often the slowest component, introducing performance bottlenecks due to mechanical latency. Understanding file systems, VFS, page cache, buffer cache, inode structures, and I/O schedulers is essential.
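When reading iostat -x output, the two numbers that matter most are await (average time a request spends queued plus being serviced) and %util. A sketch that flags a device as saturated from one hard-coded sample line, assuming a simplified column layout (device, r/s, w/s, rkB/s, wkB/s, await, svctm, %util); real iostat layouts vary by version, so adjust the field indices accordingly:

```shell
# Sample iostat -x style line (assumed columns: dev r/s w/s rkB/s wkB/s await svctm %util)
sample="sda 12.0 340.0 96.0 2720.0 85.3 9.2 96.4"
echo "$sample" | awk '{
  dev=$1; await=$6; util=$NF
  verdict = (util > 90 && await > 20) ? "likely saturated" : "ok"
  printf "%s: await=%.1f ms, util=%.1f%% -> %s\n", dev, await, util, verdict
}'
# -> sda: await=85.3 ms, util=96.4% -> likely saturated
```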
6.2 Tools
iotop, iostat, pidstat – device-level and per-process I/O statistics.
perf – block-layer tracing via tracepoints.
6.3 Usage
<code># Real-time I/O monitoring
iotop
# Detailed I/O stats
iostat -d -x -k 1 10
# Process-level I/O
pidstat -d 1 -p <pid>
# Block-level tracing with perf
perf record -e block:block_rq_issue -ag
perf report
</code>
7. Network
7.1 Overview
Network monitoring is complex due to latency, congestion, packet loss, and interactions with routers, switches, and wireless signals. Adaptive NICs adjust to varying link speeds and modes.
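One concrete health number worth computing from sar -n TCP,ETCP output is the retransmission rate: retrans/s divided by oseg/s. A sketch with hard-coded sample counters (the values are illustrative, not from a real capture); sustained rates above roughly 1% are usually treated as a sign of loss or congestion:

```shell
# Illustrative counters from one `sar -n TCP,ETCP` interval
oseg=5200      # oseg/s: TCP segments sent per second
retrans=18     # retrans/s: segments retransmitted per second
awk -v o="$oseg" -v r="$retrans" 'BEGIN {
  printf "retransmission rate: %.2f%%\n", 100 * r / o
}'
# -> retransmission rate: 0.35%
```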
7.2 Tools
netstat, ss – connection states and socket statistics.
sar – TCP/ETCP and per-interface throughput statistics.
tcpdump, tcpflow – packet and flow capture.
7.3 Usage
<code># Network statistics
netstat -s
# UDP connections
netstat -nu
# UDP port usage
netstat -apu
# Count connections by state
netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
# TCP sockets
ss -t -a
# Socket summary
ss -s
# UDP sockets
ss -u -a
# TCP/ETCP stats
sar -n TCP,ETCP 1
# Network I/O per interface
sar -n DEV 1
# Packet capture
tcpdump -i eth1 host 192.168.1.1 and port 80
# Flow capture
tcpflow -c -p host 192.168.1.1
</code>
8. System Load
8.1 Overview
Load averages measure demand on the system: the average number of runnable and uninterruptible (D-state) processes, reported over 1-, 5-, and 15-minute windows. On Linux, unlike classic UNIX, tasks blocked in uninterruptible sleep (often on disk I/O) are included, so a high load average does not necessarily mean CPU saturation.
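Because load averages are absolute counts, they only mean something relative to the CPU count: a load of 6 is unremarkable on 32 cores and badly saturated on 4. A minimal sketch with hard-coded values (on a real host, take the CPU count from nproc and the load from the first field of /proc/loadavg):

```shell
cpus=4        # assumed CPU count (use `nproc` on a real host)
load1=6.20    # assumed 1-minute load average (first field of /proc/loadavg)
awk -v l="$load1" -v c="$cpus" 'BEGIN {
  printf "load per CPU: %.2f -> %s\n", l / c, (l > c) ? "saturated" : "ok"
}'
# -> load per CPU: 1.55 -> saturated
```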
8.2 Tools
uptime, top, vmstat – load averages and run-queue length.
strace – system-call counts and latency.
dmesg – kernel ring-buffer messages.
8.3 Usage
<code># Load averages and uptime
uptime
top
vmstat
# System-call counts and total time
strace -c -p <pid>
# Trace specific syscalls with per-call timing
strace -T -e epoll_wait -p <pid>
# Kernel messages
dmesg
</code>
9. Flame Graphs
9.1 Overview
Flame graphs visualize sampled call stacks: the y-axis shows stack depth, and the x-axis spans the sample population (sorted alphabetically, not by time). A frame's width is proportional to how often it appeared in the samples, so wide frames mark the functions consuming the most CPU time. Variants include on-CPU, off-CPU, memory, and differential flame graphs.
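The input to flamegraph.pl is the "folded" format produced by the stackcollapse scripts: one line per unique stack, frames joined by semicolons, followed by a sample count; frame width in the SVG is proportional to that count. A sketch with hypothetical Nginx-like stacks (the frame names are made up for illustration):

```shell
# Hypothetical folded stacks: semicolon-joined frames, then a sample count
cat > folded.txt <<'EOF'
nginx;ngx_http_process_request;json_decode;malloc 180
nginx;ngx_http_process_request;json_decode 540
nginx;ngx_event_process_posted 90
EOF
# Render (assumes the FlameGraph repo is checked out alongside):
# ./FlameGraph/flamegraph.pl folded.txt > out.svg
awk '{ total += $NF } END { print total " samples" }' folded.txt
# -> 810 samples
```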
9.2 Installing Dependencies
<code># Install systemtap
yum install systemtap systemtap-runtime
# Install matching kernel debug packages
yum install kernel-debuginfo-$(uname -r) kernel-devel-$(uname -r) kernel-debuginfo-common-$(uname -r)
# Install additional debug info
debuginfo-install --enablerepo=debuginfo glibc kernel
</code>
9.3 Getting the Toolkit
<code>git clone https://github.com/lidaohang/quick_location.git
cd quick_location
</code>
9.4 CPU-level Flame Graphs
9.4.1 On‑CPU
Generate and view an on‑CPU flame graph for a process:
<code># Record user-space CPU usage
sh ngx_on_cpu_u.sh <pid>
cd ngx_on_cpu_u
# Serve the SVG (on Python 3: python3 -m http.server 8088)
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/<pid>.svg
</code>
9.4.2 Off-CPU
Generate an off‑CPU flame graph to locate waiting time:
<code># Record off-CPU time
sh ngx_off_cpu_u.sh <pid>
cd ngx_off_cpu_u
python -m SimpleHTTPServer 8088
</code>
9.5 Memory-level Flame Graphs
Use the provided script to capture memory‑related flame graphs:
<code>sh ngx_on_memory.sh <pid>
cd ngx_on_memory
python -m SimpleHTTPServer 8088
</code>
9.6 Differential (Red-Blue) Flame Graphs
Compare two profiling runs to highlight performance regressions:
<code># Capture baseline (perf record writes a binary perf.data; perf script dumps text stacks)
perf record -F 99 -p <pid> -g -o perf1.data -- sleep 30
perf script -i perf1.data > out1.stacks
# Capture after changes
perf record -F 99 -p <pid> -g -o perf2.data -- sleep 30
perf script -i perf2.data > out2.stacks
# Collapse and diff
./FlameGraph/stackcollapse-perf.pl out1.stacks > out1.folded
./FlameGraph/stackcollapse-perf.pl out2.stacks > out2.folded
./FlameGraph/difffolded.pl out1.folded out2.folded | ./FlameGraph/flamegraph.pl > diff.svg
</code>
10. Case Study: Nginx Cluster Anomaly
10.1 Symptoms
On 2017‑09‑25, monitoring showed a surge of 499 and 5xx responses from an Nginx cluster, accompanied by rising CPU usage.
10.2 Nginx Metrics Analysis
Request traffic had actually decreased, so the spike was not traffic‑related.
Response times increased, possibly due to Nginx itself or upstream latency.
Upstream response times grew, suggesting backend delays affecting Nginx.
10.3 System CPU Investigation
top revealed high CPU consumption by the Nginx worker processes.
perf top identified hotspots in free, malloc, and JSON-parsing functions.
10.4 Flame‑Graph Insight
On‑CPU flame graph confirmed intensive JSON parsing by a low‑performance library.
10.5 Summary
Upstream latency contributed to request anomalies.
Internal Nginx modules, especially JSON parsing and memory allocation, caused high CPU usage.
Disabling the costly module reduced CPU load and normalized traffic.
11. References
http://www.brendangregg.com/index.html
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html
http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
https://github.com/openresty/openresty-systemtap-toolkit
https://github.com/brendangregg/FlameGraph
https://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles on operations.