Mastering Linux Performance: From 5W2H Methodology to Flame Graphs

This article introduces a systematic approach to diagnosing Linux performance issues, covering the 5W2H analysis framework, essential CPU, memory, disk I/O and network monitoring tools, practical command examples, flame‑graph generation, and a real‑world nginx case study with actionable insights.

Open Source Linux

When encountering obscure problems that monitoring plugins cannot instantly pinpoint, deeper server‑side analysis is required; this article presents a comprehensive methodology and toolset for locating performance bottlenecks in Linux systems.

2. Explanation

This article introduces the main tools for locating problems and works through a case study to show how they are combined.

3. Problem‑analysis methodology

Applying the 5W2H method to a performance problem yields the following questions (this list does not use the "Who" question):

What – what is the symptom?

When – when does it occur?

Why – why does it happen?

Where – where does it happen?

How much – how many resources are consumed?

How – how can the problem be solved?

4. CPU

4.1 Explanation

For applications, the main concern is how the kernel CPU scheduler behaves and performs. Thread-state analysis examines where a thread's time goes: on-CPU (executing in user or system mode) or off-CPU (runnable and waiting for a CPU, anonymous paging, sleeping, blocked on a lock, idle, and so on). It helps to be familiar with the following concepts: processor, core, hardware thread, CPU caches, clock frequency, CPI (cycles per instruction) and IPC (instructions per cycle), utilization, scheduler, run queue, preemption, multi-process and multi-thread execution, and word length.

4.2 Analysis tools

[Figure: CPU analysis tools]

uptime, vmstat, mpstat, top, pidstat – show CPU utilization and system load.

perf – samples function-level CPU time inside a process and can also target specific kernel functions.

4.3 Usage

# View overall CPU usage
top

# View per-CPU statistics, refreshed every second
mpstat -P ALL 1

# View CPU usage, run-queue length, and context switches
vmstat 1

# Per-process CPU statistics (replace pid with the target process ID)
pidstat -u -p pid 1

# Sample function-level CPU usage inside a process
perf top -p pid -e cpu-clock
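The utilization figures these tools report come from kernel counters in /proc/stat. As a rough sketch of the arithmetic (the function name cpu_busy_pct is hypothetical, and counting iowait as idle time is one common convention rather than the only one), the busy percentage between two samples can be computed like this:

```shell
# cpu_busy_pct PREV CUR
# PREV and CUR are two aggregate "cpu" lines from /proc/stat, taken some
# seconds apart. Prints the percentage of CPU time that was not idle.
# Example:
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#   cpu_busy_pct "$a" "$b"
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      # Fields 2..9: user nice system idle iowait irq softirq steal.
      # guest and guest_nice (fields 10, 11) are already counted in user and nice.
      for (i = 2; i <= 9; i++) total += $i
      idle = $5 + $6               # assumption: iowait is treated as idle time
      if (NR == 1) { t0 = total; i0 = idle }
      else         { t1 = total; i1 = idle }
    }
    END {
      dt = t1 - t0
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
    }'
}
```

The same subtraction applied to the per-CPU lines (cpu0, cpu1, ...) gives per-core figures like those shown by mpstat -P ALL.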

5. Memory

5.1 Explanation

Memory problems affect not only performance but also service availability. Key concepts include main memory, virtual memory, resident memory, address space, OOM (out of memory), page cache, page faults, swapping, user-space allocators (the malloc implementations in libc/glibc, libmalloc, and mtmalloc), and the kernel SLUB allocator.

5.2 Analysis tools

[Figure: Memory analysis tools]

free, vmstat, top, pidstat, pmap – show memory usage.

valgrind – detects memory leaks.

dtrace – dynamic tracing of kernel functions via D scripts.

5.3 Usage

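# View system memory usage in MiB
free -m

# Virtual memory statistics (swap in/out, paging), refreshed every second
vmstat 1

# View per-process memory usage interactively
top

# Per-process memory statistics (page faults, VSZ, RSS)
pidstat -r -p pid 1

# View the memory map of a process
pmap -d pid

# Detect memory leaks (runs ./program under valgrind, so expect a large slowdown)
valgrind --tool=memcheck --leak-check=full --log-file=log.txt ./program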
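When scripting a memory check, it can be simpler to read /proc/meminfo directly than to parse the output of free. The following is a minimal sketch; the function name mem_used_pct is hypothetical, and it assumes a kernel that exposes MemAvailable (3.14 or later), so it will not work on older kernels such as the 2.6.18 example used later in this article.

```shell
# mem_used_pct: reads /proc/meminfo-formatted text on stdin and prints the
# percentage of memory that is not available for new allocations.
# Example: mem_used_pct < /proc/meminfo
mem_used_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2; have = 1 }
    END {
      # Fail rather than guess when either field is missing.
      if (total <= 0 || !have) exit 1
      printf "%.1f\n", 100 * (total - avail) / total
    }'
}
```

MemAvailable is used instead of MemFree because page cache that the kernel can reclaim should not be counted as memory pressure.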

6. Disk I/O

6.1 Explanation

Disk is usually the slowest subsystem and therefore a common performance bottleneck. Monitoring it well requires a basic understanding of file systems, the VFS layer, file system caches, inodes, and I/O scheduling.

6.2 Analysis tools

[Figure: Disk I/O analysis tools]

6.3 Usage

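# View per-process I/O activity (usually requires root)
iotop

# Extended per-device statistics in kB, every second, 10 reports
iostat -d -x -k 1 10

# Per-process I/O statistics
pidstat -d -p pid 1

# Trace block I/O requests system-wide with call stacks (stop with Ctrl-C)
perf record -e block:block_rq_issue -ag
perf report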
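The %util column of iostat -x is derived from the "time spent doing I/O" counter in /proc/diskstats (field 13, in milliseconds). A small sketch of that calculation follows; the function name disk_util_pct is hypothetical, and the caller is assumed to supply the sampling interval in milliseconds.

```shell
# disk_util_pct PREV CUR INTERVAL_MS
# PREV and CUR are two /proc/diskstats lines for the same device, taken
# INTERVAL_MS milliseconds apart. Prints the share of that interval during
# which the device had at least one request in flight.
# Example:
#   a=$(grep ' sda ' /proc/diskstats); sleep 1; b=$(grep ' sda ' /proc/diskstats)
#   disk_util_pct "$a" "$b" 1000
disk_util_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk -v ms="$3" '
    NR == 1 { t0 = $13 }   # field 13: milliseconds spent doing I/O
    NR == 2 { t1 = $13 }
    END { if (ms > 0) printf "%.1f\n", 100 * (t1 - t0) / ms }'
}
```

A value near 100 means the device was busy for almost the whole interval; on devices that serve requests in parallel (SSDs, RAID arrays) that does not necessarily mean the device is saturated.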

7. Network

7.1 Explanation

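Network monitoring is the most complex of the Linux subsystems to analyze, because many factors are involved: latency, blocking, collisions, packet loss, and the behavior of routers, switches, and wireless links along the path. Modern NICs also auto-negotiate link speed and duplex mode with the devices they connect to.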

7.2 Analysis tools

[Figure: Network analysis tools]

7.3 Usage

# Show per-protocol network statistics
netstat -s

# Show current UDP connections
netstat -nu

# Show UDP port usage
netstat -apu

# Count TCP connections per state (-n skips name resolution)
netstat -an | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

# Show TCP connections
ss -t -a

# Show socket summary
ss -s

# Show all UDP sockets
ss -u -a

# Show TCP traffic (TCP) and TCP error (ETCP) statistics every second
sar -n TCP,ETCP 1

# Show per-interface network throughput every second
sar -n DEV 1

# Capture packets to or from a host on port 80
tcpdump -i eth1 host 192.168.1.1 and port 80

# Capture TCP streams and print their contents to the console
tcpflow -cp host 192.168.1.1
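The per-state connection count shown above for netstat can also be done with ss, which is generally faster on hosts with many sockets. This is a sketch only: the function name count_tcp_states is hypothetical, and it assumes the input has one header line followed by rows whose first column is the state, as printed by ss -tan.

```shell
# count_tcp_states: reads `ss -tan` output on stdin and prints one
# "STATE COUNT" line per TCP state, sorted by state name.
# Example: ss -tan | count_tcp_states
count_tcp_states() {
  # NR > 1 skips the header line; column 1 is the connection state.
  awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}
```

A large number of TIME-WAIT or CLOSE-WAIT sockets is often the first visible hint of connection churn or of an application that is not closing its sockets.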

8. System Load

8.1 Explanation

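Load measures the amount of work a computer is doing. The load average is reported over the last 1, 5, and 15 minutes. On Linux it counts tasks that are running or waiting in the run queue plus tasks in uninterruptible sleep (usually waiting on disk I/O), so a high load average can indicate either CPU contention or I/O stalls.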

8.2 Analysis tools

[Figure: Load analysis tools]

8.3 Usage

# View load averages
uptime

top

vmstat

# Summarize system call counts, errors, and time for a process (stop with Ctrl-C)
strace -c -p pid

# Show the time spent in each call of a specific syscall (e.g., epoll_wait)
strace -T -e epoll_wait -p pid

# View kernel logs
dmesg
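A raw load average is hard to judge without knowing how many CPUs the machine has. One common rule of thumb, offered here as an assumption rather than a hard threshold, is to divide the 1-minute load by the number of CPUs and look more closely when the result stays above 1. The function name load_per_core is hypothetical.

```shell
# load_per_core CORES: reads a /proc/loadavg-style line on stdin and prints
# the 1-minute load average divided by the number of CPUs.
# Example: load_per_core "$(nproc)" < /proc/loadavg
load_per_core() {
  # Field 1 of /proc/loadavg is the 1-minute load average.
  awk -v cores="$1" 'cores > 0 { printf "%.2f\n", $1 / cores }'
}
```

Because the Linux load average also counts tasks in uninterruptible sleep, a high value with low CPU usage usually points toward disk or network file system waits rather than CPU contention.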

9. Flame Graphs

9.1 Explanation

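Flame Graphs, created by Brendan Gregg, visualize sampled stack traces. The y-axis shows stack depth. The x-axis spans the sample population and is not a timeline: the width of each frame is proportional to the number of samples in which that function appears on the stack, so wider frames correspond to functions that account for more CPU time. Types include on-CPU, off-CPU, memory, hot/cold, and differential flame graphs.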
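Flame graph tools work on "folded" stacks: one line per unique stack, with frames separated by semicolons and followed by a sample count, for example main;parse;malloc 30. To make the meaning of frame width concrete, here is a small sketch (the function name frame_share is hypothetical) that computes the percentage of samples in which a given function appears anywhere on the stack:

```shell
# frame_share FUNC: reads folded stacks ("f1;f2;f3 count") on stdin and prints
# the percentage of all samples whose stack contains FUNC.
frame_share() {
  awk -v fn="$1" '
    {
      count = $NF
      total += count
      # Drop the trailing count, then split the stack into frames.
      stack = $0
      sub(/ [0-9]+$/, "", stack)
      n = split(stack, frames, ";")
      for (i = 1; i <= n; i++)
        if (frames[i] == fn) { hit += count; break }   # count each stack once
    }
    END { if (total > 0) printf "%.1f\n", 100 * hit / total }'
}
```

For a function that is not recursive, this percentage matches the combined width of its frames in the rendered graph.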

9.2 Installing dependencies

# Install systemtap (if not already installed)
yum install systemtap systemtap-runtime

# Kernel debug packages must match the running kernel version (see uname -r).
# Example for kernel 2.6.18-308.el5:
#   kernel-debuginfo-2.6.18-308.el5.x86_64.rpm
#   kernel-devel-2.6.18-308.el5.x86_64.rpm
#   kernel-debuginfo-common-2.6.18-308.el5.x86_64.rpm

# Install kernel and glibc debug info via yum
# (the debuginfo repository id may differ on your distribution)
debuginfo-install --enablerepo=debuginfo kernel
debuginfo-install --enablerepo=debuginfo glibc

9.3 Installation

git clone https://github.com/lidaohang/quick_location.git
cd quick_location

9.4 CPU‑level flame graphs

When CPU usage is abnormally high, or when utilization fails to rise under load, a flame graph quickly shows which functions are responsible.

9.4.1 on‑CPU

On-CPU time is split between user mode and kernel (system) mode.

Usage:

# on-CPU, user mode
sh ngx_on_cpu_u.sh pid
cd ngx_on_cpu_u

# on-CPU, kernel mode
sh ngx_on_cpu_k.sh pid
cd ngx_on_cpu_k

# Serve the generated SVG (on Python 2: python -m SimpleHTTPServer 8088)
python3 -m http.server 8088
# Then open http://127.0.0.1:8088/pid.svg

9.4.2 off‑CPU

Off-CPU time is the time a thread spends not running: blocked on I/O, locks, timers, or paging, or waiting in the run queue for a CPU.

Usage:

# off-CPU, user mode
sh ngx_off_cpu_u.sh pid
cd ngx_off_cpu_u

# off-CPU, kernel mode
sh ngx_off_cpu_k.sh pid
cd ngx_off_cpu_k

# Serve the generated SVG (on Python 2: python -m SimpleHTTPServer 8088)
python3 -m http.server 8088
# Then open http://127.0.0.1:8088/pid.svg

9.5 Memory‑level flame graphs

Memory flame graphs help locate memory leaks or excessive allocations.

Usage:

sh ngx_on_memory.sh pid
cd ngx_on_memory
# Serve the generated SVG (on Python 2: python -m SimpleHTTPServer 8088)
python3 -m http.server 8088
# Then open http://127.0.0.1:8088/pid.svg

9.6 Performance regression – red/blue differential flame graphs

Capture two profiles, one before and one after a change, and generate a differential flame graph from them. In the result, red frames mark code paths whose cost increased and blue frames mark code paths whose cost decreased.

# Capture the baseline profile (99 Hz sampling with call graphs, for 30 seconds)
perf record -F 99 -p pid -g -- sleep 30
perf script > out.stacks1

# Capture the post-change profile
perf record -F 99 -p pid -g -- sleep 30
perf script > out.stacks2

# Fold the stacks (assumes the FlameGraph repository is cloned into ./FlameGraph)
./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2

# Generate the differential flame graph
./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff2.svg
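Before rendering the SVG, it can be useful to list the stacks that changed most between the two folded files. The sketch below is not difffolded.pl; it is a simplified illustration with a hypothetical function name, folded_delta, and it compares raw sample counts, so it assumes both profiles were captured for the same duration at the same sampling rate.

```shell
# folded_delta BEFORE AFTER: prints "DELTA STACK" for every stack found in
# either folded file, sorted so the largest increases come first.
# Example: folded_delta out.folded1 out.folded2 | head
folded_delta() {
  awk '
    # First file: remember the baseline count per stack.
    NR == FNR { c = $NF; sub(/ [0-9]+$/, ""); before[$0] += c; seen[$0] = 1; next }
    # Second file: remember the post-change count per stack.
              { c = $NF; sub(/ [0-9]+$/, ""); after[$0]  += c; seen[$0] = 1 }
    END { for (s in seen) printf "%d %s\n", after[s] - before[s], s }
  ' "$1" "$2" | sort -rn
}
```

Positive deltas correspond to the red regions of the differential flame graph and negative deltas to the blue regions.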

10. Case Study – Nginx Cluster Anomaly

10.1 Observation

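On 2017-09-25 at 19:00, the Nginx cluster showed a surge of 499 responses (Nginx's status code for a client closing the connection before the response was sent) and 5xx responses, together with increased CPU usage.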
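A surge like this can be confirmed directly from the access log. The sketch below uses a hypothetical function name, count_status_classes, and assumes the default "combined" log format with a three-token request line, in which the status code is the ninth whitespace-separated field; a custom log_format would need a different field number.

```shell
# count_status_classes: reads Nginx access log lines on stdin and prints the
# total number of requests plus the number of 499 and 5xx responses.
# Example: count_status_classes < /var/log/nginx/access.log
count_status_classes() {
  awk '
    { code = $9 }                            # assumption: status is field 9
    code == 499               { c499++ }
    code ~ /^5[0-9][0-9]$/    { c5xx++ }
    END { printf "total=%d 499=%d 5xx=%d\n", NR, c499, c5xx }'
}
```

Running it over log slices from before and during the incident shows whether the error ratio changed or only the absolute count.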

10.2 Nginx metrics analysis

a) Request traffic

[Figure: Nginx request traffic]

Conclusion: Traffic did not spike; it actually decreased, so the issue is not traffic‑related.

b) Response time

[Figure: Nginx response time]

Conclusion: Response time increased, possibly due to Nginx itself or upstream latency.

c) Upstream response time

[Figure: Nginx upstream response time]

Conclusion: Upstream response time increased, likely dragging Nginx performance.

10.3 System CPU analysis

a) Top output

[Figure: Top snapshot]

Conclusion: Nginx worker CPU usage is high.

b) Perf top on Nginx process

Command: perf top -p pid

Conclusion: Main overhead comes from free, malloc, and JSON parsing.

10.4 Flame‑graph CPU analysis

The generated user-mode CPU flame graph shows that most CPU time goes to JSON parsing and memory allocation.

[Figure: CPU flame graph]

10.5 Summary

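a) The Nginx metrics pointed to increased upstream response time as the direct cause of the request anomalies.

b) CPU profiling identified costly JSON parsing and memory allocation inside an Nginx module.

Solution: After the CPU-heavy module was disabled, CPU usage dropped and request traffic returned to normal. Upstream response time had risen because the upstream backend service may itself call this Nginx, forming a loop, so the slowdown in Nginx fed back into the upstream latency it was measuring.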

11. References

Brendan Gregg – Performance Analysis

CPU Flame Graphs

Memory Flame Graphs

Off‑CPU Flame Graphs

Differential Flame Graphs

OpenResty SystemTap Toolkit

FlameGraph Repository

Blazing Performance with Flame Graphs
