Mastering Linux Performance: From CPU/Memory Profiling to Flame Graphs
This guide explains how to systematically diagnose Linux performance issues using tools such as top, vmstat, perf, and flame graphs, covering CPU, memory, disk I/O, network, and load analysis, and demonstrates a real-world nginx case study with step‑by‑step commands and visualizations.
1. Background
Sometimes difficult problems arise that monitoring plugins cannot immediately pinpoint, requiring server login for deeper analysis. Effective problem analysis demands technical experience and broad domain knowledge to locate root causes.
Having good analysis tools accelerates troubleshooting and saves time for deeper work.
2. Explanation
This article introduces various problem‑location tools and combines them with case studies.
3. Problem‑analysis Methodology
Applying the 5W2H method raises key performance analysis questions:
What – what is the phenomenon?
When – when does it occur?
Why – why does it happen?
Where – where does it happen?
How much – how many resources are consumed?
How – how to solve it?
4. CPU
4.1 Explanation
For applications, we mainly care about how the kernel CPU scheduler allocates time to them and how well that scheduling performs.
Thread‑state analysis examines where thread time is spent. Thread states are generally classified as:
on‑CPU: executing time, split into user‑mode (user) and kernel‑mode (sys).
off‑CPU: waiting for the next CPU slice, I/O, locks, paging, etc., with sub‑states such as runnable, anonymous paging, sleep, lock, idle.
If most time is on‑CPU, CPU profiling quickly reveals reasons; if most time is off‑CPU, locating the issue can be time‑consuming. Key concepts include processor, core, hardware thread, CPU cache, clock frequency, CPI, IPC, instruction set, utilization, user/kernel time, scheduler, run queue, preemption, multi‑process, multi‑thread, word size, etc.
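Before reaching for a profiler, the kernel's context-switch counters give a quick hint at whether a thread is mostly on-CPU or off-CPU. A minimal sketch reading /proc/&lt;pid&gt;/status (the shell's own PID stands in for the target here):

```shell
# Voluntary context switches accumulate when the task blocks (off-CPU
# waits: I/O, locks, sleeps); nonvoluntary ones accumulate when it is
# preempted while still runnable (CPU contention).
pid=$$   # substitute the PID under investigation
grep -E '^(voluntary|nonvoluntary)_ctxt_switches' "/proc/$pid/status"
```

A high voluntary count points toward off-CPU analysis; a high nonvoluntary count points toward run-queue pressure and CPU profiling.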
4.2 Analysis Tools
uptime, vmstat, mpstat, top, pidstat – query CPU and load usage.
perf – trace per‑function CPU time inside processes and can target specific kernel functions.
4.3 Usage
// View system CPU usage
top
// View per‑CPU information
mpstat -P ALL 1
// View CPU usage and average load
vmstat 1
// Process‑level CPU statistics
pidstat -u 1 -p <pid>
// Trace function‑level CPU usage in a process
perf top -p <pid> -e cpu-clock
5. Memory
5.1 Explanation
Memory issues affect not only performance but also service availability. Important concepts include main memory, virtual memory, resident memory, address space, OOM, page cache, page faults, swapping, allocators (libc, glibc, libmalloc, mtmalloc), and the kernel SLUB allocator.
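Page-fault counters are among the cheapest memory signals available. A minimal sketch reading minflt/majflt from /proc/&lt;pid&gt;/stat (field positions per the proc(5) man page; the shell's own PID stands in for the target):

```shell
# Minor faults are satisfied from memory (e.g. page cache); major
# faults require disk I/O and are far more expensive.
pid=$$   # substitute the PID under investigation
stat=$(cat "/proc/$pid/stat")
# Strip everything up to the ') ' closing the comm field (which may
# itself contain spaces); in the remainder, field 8 is minflt and
# field 10 is majflt.
rest=${stat##*) }
set -- $rest
echo "minflt=$8 majflt=${10}"
```

A steadily climbing majflt value while latency degrades usually indicates swapping or page-cache misses.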
5.2 Analysis Tools
free, vmstat, top, pidstat, pmap – report memory usage.
valgrind – detect memory leaks.
dtrace/SystemTap – dynamic tracing of kernel and user-space functions via scripts (SystemTap is the usual counterpart to DTrace on Linux).
5.3 Usage
// View system memory usage
free -m
// Virtual memory statistics
vmstat 1
// System memory view
top
// Per‑process memory statistics (1 s interval)
pidstat -p <pid> -r 1
// Process memory map
pmap -d <pid>
// Detect memory leaks
valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program
6. Disk I/O
6.1 Explanation
Disk is the slowest subsystem and a common performance bottleneck. Understanding file systems, VFS, caches (page cache, buffer cache, directory cache), inodes, and I/O scheduling strategies is essential.
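The page cache's write-behind behavior can be watched directly. A small sketch, assuming a Linux /proc/meminfo:

```shell
# Dirty: pages modified in the page cache but not yet flushed to disk.
# Writeback: pages currently being written out. Sustained growth in
# either often precedes I/O stalls.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

Sampling this in a loop (e.g. with watch) during a latency incident shows whether writeback pressure coincides with the slowdown.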
6.2 Analysis Tools
iotop, iostat, pidstat – report disk I/O usage at the system and process level.
perf – trace block-layer events such as request issue.
6.3 Usage
// View system I/O information
iotop
// Detailed I/O statistics
iostat -d -x -k 1 10
// Process‑level I/O information
pidstat -d 1 -p <pid>
// Investigate I/O anomalies
perf record -e block:block_rq_issue -ag
perf report
7. Network
7.1 Explanation
Network monitoring is complex due to latency, blocking, collisions, packet loss, and interactions with routers, switches, and wireless signals. Modern NICs are adaptive, adjusting speed and mode automatically.
7.2 Analysis Tools
netstat, ss, sar – report connection states and protocol statistics.
tcpdump, tcpflow – capture packets and reassemble streams.
7.3 Usage
// Show network statistics
netstat -s
// Show current UDP connections
netstat -nu
// Show UDP port usage
netstat -apu
// Count connections per state
netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
// Show TCP connections
ss -t -a
// Show socket summary
ss -s
// Show all UDP sockets
ss -u -a
// TCP/ETCP state statistics
sar -n TCP,ETCP 1
// Network I/O statistics
sar -n DEV 1
// Capture packets (packet‑wise)
tcpdump -i eth1 host 192.168.1.1 and port 80
// Capture streams and display content
tcpflow -cp host 192.168.1.1
8. System Load
8.1 Explanation
Load measures the amount of work a machine is doing. On Linux it counts tasks that are runnable plus tasks in uninterruptible (typically I/O) sleep, so it reflects both CPU and disk demand, not just run-queue length. Load Average is an exponentially damped average over the last 1, 5, and 15 minutes.
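The raw numbers behind uptime live in /proc/loadavg. A minimal sketch comparing the 1-minute load to the CPU count (nproc from coreutils assumed available):

```shell
# /proc/loadavg: 1-, 5-, 15-minute load averages, running/total task
# counts, and the most recently created PID.
read load1 load5 load15 tasks lastpid < /proc/loadavg
cpus=$(nproc)
# A 1-minute load persistently above the CPU count suggests saturation.
echo "load1=$load1 cpus=$cpus"
```

Because load also counts uninterruptible sleepers, a high value with idle CPUs usually points at disk or NFS waits rather than CPU contention.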
8.2 Analysis Tools
uptime, top, vmstat – report run-queue length and load averages.
strace – trace system calls and their latency.
dmesg – inspect kernel messages (e.g. OOM kills).
8.3 Usage
// View load information
uptime
top
vmstat
// Trace system call latency
strace -c -p <pid>
// Trace specific syscalls (e.g., epoll_wait)
strace -T -e epoll_wait -p <pid>
// View kernel logs
dmesg
9. Flame Graphs
9.1 Explanation
Flame Graphs, created by Brendan Gregg, visualize sampled call stacks. Each box is a function; the Y‑axis is stack depth. The X‑axis spans the sample population (sorted alphabetically, not by time), so a wider box means a function appeared in more samples and therefore consumed more CPU time.
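The input that flamegraph.pl consumes is "folded" stacks: one line per unique stack, frames joined by semicolons, followed by a sample count. A self-contained sketch with synthetic data (not real profiler output) that sums samples per leaf frame, which corresponds to the width of each top edge in the rendered graph:

```shell
# Synthetic folded stacks in the format emitted by stackcollapse-perf.pl
cat > out.folded <<'EOF'
main;foo1;foo3 980
main;foo1 15
main;foo2;foo3 4
main;foo2 1
EOF
# Sum sample counts by leaf frame (the function actually on-CPU)
awk '{n = split($1, f, ";"); leaf[f[n]] += $2}
     END {for (l in leaf) print l, leaf[l]}' out.folded
```

Here foo3 accumulates 984 of the 1000 samples, so it would dominate the top of the flame graph even though main is at the base of every stack.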
9.2 Installing Dependencies
// Install systemtap and runtime
yum install systemtap systemtap-runtime
// Install matching kernel debug packages
debuginfo-install --enablerepo=debuginfo kernel
// (additional glibc debug packages as needed)
9.3 Installation
git clone https://github.com/lidaohang/quick_location.git
cd quick_location
9.4 CPU‑level Flame Graphs
High CPU usage can be pinpointed with flame graphs, revealing which functions dominate.
9.4.1 on‑CPU
// Generate on‑CPU flame graph for a user process
sh ngx_on_cpu_u.sh <pid>
cd ngx_on_cpu_u
// Serve the SVG via a temporary HTTP server (on Python 3: python3 -m http.server 8088)
python -m SimpleHTTPServer 8088
// Open http://127.0.0.1:8088/<pid>.svg in a browser
Demo code used for profiling:
#include <stdio.h>
#include <stdlib.h>

void foo3() {}

void foo2() {
    int i;
    for (i = 0; i < 10; i++)
        foo3();
}

void foo1() {
    int i;
    for (i = 0; i < 1000; i++)
        foo3();
}

int main(void) {
    int i;
    for (i = 0; i < 1000000000; i++) {
        foo1();
        foo2();
    }
    return 0;
}
9.4.2 off‑CPU
// Generate off‑CPU flame graph for a user process
sh ngx_off_cpu_u.sh <pid>
cd ngx_off_cpu_u
python -m SimpleHTTPServer 8088
// Open http://127.0.0.1:8088/<pid>.svg
9.5 Memory‑level Flame Graphs
// Generate memory‑level flame graph
sh ngx_on_memory.sh <pid>
cd ngx_on_memory
python -m SimpleHTTPServer 8088
// Open http://127.0.0.1:8088/<pid>.svg
9.6 Differential (Red‑Blue) Flame Graphs
When performance regresses, differential flame graphs highlight functions that have increased (red) or decreased (blue) between two profiles.
// Capture baseline profile
perf record -F 99 -p <pid> -g -- sleep 30
perf script > out.stacks1
// Capture new profile
perf record -F 99 -p <pid> -g -- sleep 30
perf script > out.stacks2
// Generate folded stacks
./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2
// Create differential flame graph
./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff.svg
10. Case Study: Nginx Cluster Anomaly
10.1 Observed Symptoms
On 2017‑09‑25 at 19:00 the nginx cluster showed many 499 and 5xx responses, with rising CPU usage.
10.2 Nginx Metrics Analysis
a) Request traffic – the traffic actually decreased, so the spike is not traffic‑related.
b) Response time – increased, possibly due to nginx itself or upstream latency.
c) Upstream response time – also increased, suggesting backend slowdown.
10.3 System CPU Analysis
a) Top output showed high CPU usage by nginx workers.
b) perf top -p revealed most overhead in free, malloc, and JSON parsing.
10.4 Flame Graph CPU Analysis
// Generate on‑CPU flame graph for nginx worker
sh ngx_on_cpu_u.sh <pid>
cd ngx_on_cpu_u
python -m SimpleHTTPServer 8088
// Open http://127.0.0.1:8088/<pid>.svg
Conclusion: frequent JSON parsing by a low‑performance JSON library consumed significant CPU.
10.5 Summary
a) Traffic anomaly traced to prolonged upstream response time.
b) High CPU usage caused by costly JSON parsing and memory allocation in nginx.
10.5.1 Deep Dive
The upstream delay does not directly cause the CPU‑intensive JSON parsing, which occurs only during request handling.
10.5.2 Resolution
Disabling the high‑CPU module reduced CPU usage and returned traffic to normal. The upstream delay was partly caused by a loopback call back into nginx itself.
11. References
http://www.brendangregg.com/index.html
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html
http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
https://github.com/openresty/openresty-systemtap-toolkit
https://github.com/brendangregg/FlameGraph
https://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs
Author: Lucien_168 Link: https://www.jianshu.com/p/0bbac570fa4c
Open Source Linux