Mastering Linux Performance: From CPU to Flame Graphs and Real‑World Case Studies
This comprehensive guide explains how to diagnose Linux performance issues using systematic 5W2H analysis, essential monitoring tools for CPU, memory, disk I/O, network, and flame‑graph visualizations, and demonstrates the methodology with a detailed nginx case study to quickly locate bottlenecks.
Background
When monitoring plugins cannot immediately reveal the root cause of a problem, logging into the server for deeper analysis is required. Effective analysis demands technical experience and a broad knowledge base, and good tools can dramatically speed up troubleshooting.
Purpose
This article introduces various problem‑location tools and illustrates their use with real‑world examples.
Problem‑analysis Methodology (5W2H)
What – what is the phenomenon?
When – when does it occur?
Why – why does it happen?
Where – where does it happen?
How much – how many resources are consumed?
How to do – how to solve it?
CPU
Explanation
For applications we usually focus on the kernel CPU scheduler and its performance. Thread‑state analysis distinguishes on‑CPU (user and system time) and off‑CPU (waiting for I/O, locks, paging, etc.).
Key Concepts
Processor
Core
Hardware thread
CPU cache
Clock frequency
CPI / IPC
Instructions
Utilization
User time / kernel time
Scheduler
Run queue
Preemption
Multi‑process / multi‑thread
Word size
Analysis Tools
uptime, vmstat, mpstat, top, and pidstat show CPU usage and load. perf samples where CPU time is spent, down to individual user‑space and kernel functions.
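As a rough illustration of where these utilization figures come from, the sketch below derives a busy percentage from two samples of the aggregate cpu line in /proc/stat. The function name cpu_busy_pct is hypothetical, and treating iowait as idle time is an assumption made here; top and mpstat report it as a separate column.

```shell
# cpu_busy_pct PREV CUR   (hypothetical helper, not part of any tool)
# PREV and CUR are two samples of the aggregate "cpu ..." line from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal ...
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal; guest time is already included in user/nice.
            for (i = 2; i <= 9; i++) total += $i
            idle = $5 + $6              # assumption: iowait counts as idle
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Example: sample the counters one second apart.
# prev=$(grep '^cpu ' /proc/stat); sleep 1; cur=$(grep '^cpu ' /proc/stat)
# cpu_busy_pct "$prev" "$cur"
```

Because the counters are cumulative since boot, only the difference between two samples says anything about current load.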
Usage
# view system CPU usage
top
# view per‑CPU information
mpstat -P ALL 1
# view CPU usage and run‑queue length
vmstat 1
# process‑level CPU statistics
pidstat -u 1 -p <pid>
# trace function‑level CPU usage in a process
perf top -p <pid> -e cpu-clock
Memory
Explanation
Memory issues affect not only performance but also service availability. Important concepts include main memory, virtual memory, resident set, address space, OOM, page cache, page faults, swapping, and the Linux SLUB allocator.
Analysis Tools
free, vmstat, top, pidstat, pmap show memory usage; valgrind detects leaks; dtrace can trace kernel functions.
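The system-wide numbers printed by free can be approximated directly from /proc/meminfo. The sketch below is a minimal example under two assumptions: the helper name mem_used_pct is hypothetical, and the kernel exposes the MemAvailable field (present in reasonably recent kernels; older ones lack it).

```shell
# mem_used_pct  (hypothetical helper)
# Reads /proc/meminfo-formatted text on stdin and prints the share of
# memory that is not available to new workloads, as a percentage.
mem_used_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }   # values are in kB
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) exit 1             # malformed or empty input
            printf "%.1f\n", 100 * (total - avail) / total
        }'
}

# Example:
# mem_used_pct < /proc/meminfo
```

Using MemAvailable rather than MemFree avoids counting reclaimable page cache as used memory.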
Usage
# view system memory usage
free -m
# view virtual memory statistics
vmstat 1
# view system memory details
top
# per‑process memory statistics
pidstat -p <pid> -r 1
# view process memory map
pmap -d <pid>
# detect memory leaks
valgrind --tool=memcheck --leak-check=full --log-file=log.txt ./program
Disk I/O
Explanation
Disk is the slowest subsystem and a common performance bottleneck. Understanding file systems, VFS, page cache, buffer cache, inode cache, and I/O schedulers is essential.
Analysis Tools
iotop, iostat, pidstat, perf record (block events).
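To make the latency column of iostat less of a black box, the sketch below computes an average per-request latency from two samples of one device line in /proc/diskstats. The helper name disk_await_ms is hypothetical, and the formula (change in time spent on reads and writes divided by change in completed requests) is a simplified approximation of the await figure, not the iostat implementation.

```shell
# disk_await_ms PREV CUR   (hypothetical helper)
# PREV and CUR are two samples of the same device line from /proc/diskstats.
# Columns used: 4 = reads completed, 7 = ms spent reading,
#               8 = writes completed, 11 = ms spent writing.
disk_await_ms() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { io0 = $4 + $8; ms0 = $7 + $11 }
        NR == 2 { io1 = $4 + $8; ms1 = $7 + $11 }
        END {
            dio = io1 - io0
            if (dio <= 0) { print "0.00"; exit }   # no I/O completed
            printf "%.2f\n", (ms1 - ms0) / dio
        }'
}

# Example (device name sda is an assumption):
# prev=$(grep ' sda ' /proc/diskstats); sleep 1; cur=$(grep ' sda ' /proc/diskstats)
# disk_await_ms "$prev" "$cur"
```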
Usage
# monitor I/O in real time
iotop
# extended per‑device statistics in kB, every second, 10 reports
iostat -d -x -k 1 10
# per‑process I/O statistics
pidstat -d 1 -p <pid>
# trace block I/O events
perf record -e block:block_rq_issue -a
perf report
Network
Explanation
Network monitoring is complex due to latency, blocking, collisions, packet loss, and interactions with routers, switches, and wireless signals.
Analysis Tools
netstat, ss, sar, tcpdump, tcpflow.
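Packet loss usually shows up first as TCP retransmissions. As a minimal sketch, the helper below reads /proc/net/snmp-formatted text and prints retransmitted segments as a percentage of segments sent. The name tcp_retrans_pct is hypothetical, and the result covers the whole period since boot, so it is only a coarse indicator compared with the per-second rates from sar.

```shell
# tcp_retrans_pct  (hypothetical helper)
# Reads /proc/net/snmp-formatted text on stdin. The first "Tcp:" line holds
# column names, the second holds the matching counter values.
tcp_retrans_pct() {
    awk '
        $1 == "Tcp:" && !seen {
            for (i = 2; i <= NF; i++) col[$i] = i   # map name -> column index
            seen = 1
            next
        }
        $1 == "Tcp:" && seen {
            out = $(col["OutSegs"])
            re  = $(col["RetransSegs"])
            if (out <= 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * re / out
            exit
        }'
}

# Example:
# tcp_retrans_pct < /proc/net/snmp
```

Looking columns up by name keeps the sketch working even if the kernel adds counters to the line.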
Usage
# show network statistics
netstat -s
# show current UDP connections
netstat -nu
# show UDP port usage
netstat -apu
# count TCP connections per state
netstat -an | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
# show TCP connections
ss -t -a
# show socket summary
ss -s
# show all UDP sockets
ss -u -a
# show TCP/ETCP statistics
sar -n TCP,ETCP 1
# show network I/O statistics
sar -n DEV 1
# capture packets to or from a host on a given port
tcpdump -i eth1 host 192.168.1.1 and port 80
# capture and display packet contents as a flow
tcpflow -cp host 192.168.1.1
System Load
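Load measures how much work the system is doing. On Linux it counts tasks that are running, waiting to run, or in uninterruptible sleep (typically waiting on disk I/O). The load average is that count averaged over the last 1, 5, and 15 minutes.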
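A load figure only means something relative to the number of CPUs. The sketch below normalizes the 1-minute load average by CPU count; the helper name load_per_cpu is hypothetical, and reading a value above 1.0 as "more demand than capacity" is a rule of thumb rather than a hard threshold.

```shell
# load_per_cpu LOADAVG_LINE NCPU   (hypothetical helper)
# LOADAVG_LINE is the content of /proc/loadavg, e.g. "2.00 1.50 1.00 3/200 1234".
# Prints the 1-minute load average divided by the CPU count.
load_per_cpu() {
    echo "$1" | awk -v n="$2" '{
        if (n <= 0) exit 1          # guard against a bad CPU count
        printf "%.2f\n", $1 / n
    }'
}

# Example:
# load_per_cpu "$(cat /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)"
```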
Analysis Tools
# view load
uptime
# interactive view
top
# view system statistics
vmstat
# summarize system call counts and time spent per call
strace -c -p <pid>
# show time spent in specific syscalls (e.g., epoll_wait)
strace -T -e epoll_wait -p <pid>
# view kernel logs
dmesg
Flame Graphs
Explanation
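Flame graphs (by Brendan Gregg) visualize sampled call stacks. The y‑axis shows stack depth; the x‑axis spans the sample population and does not represent the passage of time. The wider a box, the more samples contain that function, so wide boxes mark the code paths that consume the most CPU.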
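Box widths are derived from sample counts in "folded" stack files, where each line is a semicolon-separated stack followed by a count. As a hedged sketch of that idea, the helper below reports what share of samples each leaf function (the frame on top of the stack) accounts for. The name top_leaf_frames is hypothetical; this is an illustration of the data format, not a replacement for flamegraph.pl.

```shell
# top_leaf_frames  (hypothetical helper)
# Reads folded stacks on stdin, one per line: "frame1;frame2;...;leaf COUNT".
# Prints each leaf function with its share of all samples, largest first.
top_leaf_frames() {
    awk '
        {
            count = $NF                    # sample count is the last field
            stack = $0
            sub(/ [0-9]+$/, "", stack)     # strip the trailing count
            n = split(stack, frames, ";")
            self[frames[n]] += count       # attribute samples to the leaf
            total += count
        }
        END {
            for (f in self) printf "%.1f%% %s\n", 100 * self[f] / total, f
        }' | LC_ALL=C sort -rn
}

# Example, assuming out.folded was produced by stackcollapse-perf.pl:
# top_leaf_frames < out.folded | head
```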
Installation
# install systemtap (if not already present)
yum install systemtap systemtap-runtime
# install kernel debug packages matching the running kernel
# required packages; versions must match the running kernel (uname -r):
#   kernel-debuginfo-<kernel-version>.rpm
#   kernel-devel-<kernel-version>.rpm
#   kernel-debuginfo-common-<kernel-version>.rpm
# enable debuginfo repo and install
debuginfo-install --enablerepo=debuginfo search kernel
debuginfo-install --enablerepo=debuginfo search glibc
Usage
Clone the flame‑graph repository and generate graphs for a target process:
git clone https://github.com/lidaohang/quick_location.git
cd quick_location
CPU‑level Flame Graphs
On‑CPU flame graphs show where the CPU spends time in user or kernel mode. Off‑CPU graphs show where threads are waiting.
On‑CPU Example
# generate user‑mode on‑CPU flame graph
sh ngx_on_cpu_u.sh <pid>
cd ngx_on_cpu_u
# serve the output directory over HTTP (Python 2; on Python 3 use: python3 -m http.server 8088)
python -m SimpleHTTPServer 8088
# open http://127.0.0.1:8088/<pid>.svg
Off‑CPU Example
# generate off‑CPU flame graph
sh ngx_off_cpu_u.sh <pid>
cd ngx_off_cpu_u
python -m SimpleHTTPServer 8088
# open http://127.0.0.1:8088/<pid>.svg
Memory‑level Flame Graphs
# generate memory flame graph
sh ngx_on_memory.sh <pid>
cd ngx_on_memory
python -m SimpleHTTPServer 8088
# open http://127.0.0.1:8088/<pid>.svg
Diff Flame Graphs (Red‑Blue)
Capture two profiles (before and after a change) and generate a differential flame graph to highlight regressions (red) and improvements (blue).
# profile before change
perf record -F 99 -p <pid> -g -- sleep 30
perf script > out.stacks1
# profile after change
perf record -F 99 -p <pid> -g -- sleep 30
perf script > out.stacks2
# collapse and diff
./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2
./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff.svg
Case Study – Nginx Cluster Issue
Problem
On 2017‑09‑25 the Nginx cluster showed many 499 and 5xx responses, and CPU usage spiked.
Analysis Steps
Check request traffic – traffic was actually decreasing, so the spike was not due to load.
Analyze Nginx response time – response time increased, possibly due to Nginx itself or upstream latency.
Analyze upstream response time – upstream latency grew, suggesting backend delay.
Inspect system CPU – top showed high Nginx worker CPU usage.
Profile Nginx process – perf top -p <pid> revealed most time spent in JSON parsing and memory allocation.
Generate on‑CPU flame graph – identified the JSON library as the hotspot.
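The first step of an investigation like this, quantifying the error responses, can be sketched with a short awk filter over the access log. This is an illustration only: the helper name status_summary is hypothetical, and it assumes the default combined log format, in which the status code is the ninth whitespace-separated field. The article does not show the log format the cluster actually used.

```shell
# status_summary  (hypothetical helper)
# Reads Nginx access-log lines on stdin and counts 499 and 5xx responses.
# Assumption: default "combined" log format, status code in field 9.
status_summary() {
    awk '
        { total++ }
        $9 == 499            { c499++ }
        $9 ~ /^5[0-9][0-9]$/ { c5xx++ }
        END { printf "total=%d 499=%d 5xx=%d\n", total, c499, c5xx }'
}

# Example (log path is an assumption):
# status_summary < /var/log/nginx/access.log
```

A 499 means the client closed the connection before Nginx responded, so a rise in 499s alongside growing response times is consistent with clients giving up on slow requests.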
Conclusion
The root cause was an inefficient JSON parser consuming excessive CPU; the upstream delay was a symptom, not the cause. Disabling the problematic module reduced CPU usage and restored normal traffic.
