Master Linux Performance: From 5W2H Methodology to Flame Graphs

This comprehensive guide explains how to diagnose Linux performance issues using a structured 5W2H approach, introduces essential monitoring tools for CPU, memory, disk I/O, and network, and demonstrates practical flame‑graph techniques—including on‑CPU, off‑CPU, memory, and differential analyses—to quickly locate and resolve bottlenecks.

21CTO

Nov 14, 2018

1. Background

Sometimes we encounter difficult problems that monitoring plugins cannot immediately pinpoint. In such cases we need to log into the server for deeper analysis. Analyzing problems requires experience and broad knowledge, and good tools can greatly speed up locating issues.

2. Description

This article introduces a range of troubleshooting tools and works through a case study to show how they are used.

3. Problem‑analysis methodology

Applying the 5W2H method yields several performance‑analysis questions:

What – what is the phenomenon?

When – when does it happen?

Why – why does it happen?

Where – where does it happen?

How much – how many resources are consumed?

How – how do we solve it?

4. CPU

4.1 Description

For applications, we usually focus on the functionality and performance of the kernel's CPU scheduler.

Thread‑state analysis examines where thread time is spent. States include:

on‑CPU: running time, split into user‑time and system‑time.

off‑CPU: time spent waiting for a CPU slice, I/O, locks, paging, and so on, with sub‑states such as runnable, anonymous paging, sleeping, lock, and idle.

If most of the time is on‑CPU, CPU profiling usually explains the cause quickly; if most of it is off‑CPU, locating the problem takes longer.

Key concepts include processor, core, hardware thread, CPU cache, clock frequency, CPI/IPC, instruction set, utilization, user/kernel time, scheduler, run queue, preemption, multi‑process, multi‑thread, and word length.
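The split of on‑CPU time into user and system time can be illustrated with a minimal Python sketch. It assumes two snapshots of the aggregate `cpu` line from `/proc/stat`; the function name `cpu_breakdown` and the sample counter values are hypothetical, and steal time is counted in the total but not reported separately.

```python
def cpu_breakdown(before: str, after: str) -> dict:
    """Percent of CPU time per state between two aggregate 'cpu' lines."""
    def counters(line: str) -> list:
        # Fields after the "cpu" label: user nice system idle iowait irq softirq steal
        values = [int(v) for v in line.split()[1:9]]
        return values + [0] * (8 - len(values))  # older kernels expose fewer fields

    delta = [a - b for b, a in zip(counters(before), counters(after))]
    user, nice, system, idle, iowait, irq, softirq, steal = delta
    total = sum(delta)
    if total == 0:
        return {"user": 0.0, "system": 0.0, "iowait": 0.0, "idle": 0.0}
    return {
        "user": 100.0 * (user + nice) / total,
        "system": 100.0 * (system + irq + softirq) / total,
        "iowait": 100.0 * iowait / total,
        "idle": 100.0 * idle / total,
    }


# Hypothetical snapshots taken one second apart.
first = "cpu  100 0 50 800 50 0 0 0 0 0"
second = "cpu  160 0 70 900 70 0 0 0 0 0"
print(cpu_breakdown(first, second))
```

On a live system the two lines would come from reading the first line of `/proc/stat` twice with a short sleep in between; tools such as top and mpstat report similar percentages.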

4.2 Analysis tools

uptime, vmstat, mpstat, top, pidstat – show CPU and load usage.

perf – samples CPU time per function, including kernel functions.

4.3 Usage

# View system CPU usage
top

# View per-core CPU usage
mpstat -P ALL 1

# View CPU usage and run-queue length
vmstat 1

# Per-process CPU statistics (replace pid with the process ID)
pidstat -u 1 -p pid

# Sample per-function CPU usage of a process
perf top -p pid -e cpu-clock

5. Memory

5.1 Description

Memory is designed for efficiency, but memory problems can affect service availability. Important concepts include main memory, virtual memory, resident memory, address space, OOM, page cache, page fault, swapping, user allocators (libc, glibc, libmalloc, mtmalloc), and the kernel SLUB allocator.
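To make the relationship between main memory and the page cache concrete, here is a small Python sketch that parses text in the `/proc/meminfo` format. The helper names and the sample values are hypothetical, and the "used" figure is one possible definition (total minus `MemAvailable`), not necessarily the one a given tool reports.

```python
def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo key to its integer value (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_summary(text: str) -> dict:
    """Share of memory in use, and share held by buffers plus page cache."""
    m = parse_meminfo(text)
    total = m["MemTotal"]
    cache = m.get("Buffers", 0) + m.get("Cached", 0)
    return {
        "used_pct": 100.0 * (total - m["MemAvailable"]) / total,
        "cache_pct": 100.0 * cache / total,
    }


# Hypothetical sample in /proc/meminfo format.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    5000000 kB
Buffers:          200000 kB
Cached:          3000000 kB
"""
print(memory_summary(SAMPLE))
```

A low `MemFree` value alone is not a sign of memory pressure, because much of the page cache can be reclaimed; that is why the sketch relies on `MemAvailable`.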

5.2 Analysis tools

free, vmstat, top, pidstat, pmap – report memory usage.

valgrind – detects memory leaks.

dtrace – dynamic tracing of kernel functions via D scripts.

5.3 Usage

# View system memory usage
free -m

# Virtual memory statistics
vmstat 1

# View system memory status
top

# Per-process memory statistics (1 s interval)
pidstat -p pid -r 1

# View process memory map
pmap -d pid

# Detect memory leaks and invalid memory accesses
valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program

6. Disk I/O

6.1 Description

Disk is usually the slowest subsystem and a common performance bottleneck: it sits farthest from the CPU, and spinning disks depend on mechanical operations such as head seeks and platter rotation. Monitoring I/O performance requires an understanding of basic concepts such as the file system, VFS, page cache, buffer cache, inodes, and the I/O scheduler.

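6.2 Analysis tools

iotop, iostat, pidstat – report system-wide and per-process disk I/O.

perf – records block I/O request events.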

6.3 Usage

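# View per-process I/O in real time
iotop

# Extended device statistics in kB, every second, 10 reports
iostat -d -x -k 1 10

# Per-process I/O
pidstat -d 1 -p pid

# Record block I/O requests system-wide with call stacks (stop with Ctrl-C), then report
perf record -e block:block_rq_issue -ag
perf report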
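The per-device figures that iostat prints are derived from kernel counters. As a rough illustration, the Python sketch below computes IOPS, average request latency, and utilization from two `/proc/diskstats` lines for the same device. The function name and the sample counters are hypothetical, and the formulas are a simplified approximation rather than iostat's exact implementation.

```python
def disk_delta(before: str, after: str, interval_s: float) -> dict:
    """Approximate iostat-style metrics from two /proc/diskstats lines."""
    b, a = before.split(), after.split()

    def d(i: int) -> int:
        return int(a[i]) - int(b[i])

    # Columns: 3 = reads completed, 6 = ms spent reading,
    #          7 = writes completed, 10 = ms spent writing,
    #          12 = ms during which the device had I/O in flight.
    ios = d(3) + d(7)
    return {
        "device": a[2],
        "iops": ios / interval_s,
        "await_ms": (d(6) + d(10)) / ios if ios else 0.0,
        "util_pct": 100.0 * d(12) / (interval_s * 1000.0),
    }


# Hypothetical snapshots taken one second apart.
t0 = "8 0 sda 1000 10 80000 2000 500 5 40000 3000 0 4000 5000"
t1 = "8 0 sda 1100 10 88000 2300 600 5 48000 3700 0 4500 6000"
print(disk_delta(t0, t1, 1.0))
```

High utilization together with a rising average latency suggests that the device, rather than the application, is the limiting factor.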

7. Network

7.1 Description

Networking is the most complex Linux subsystem to monitor, because of factors such as latency, blocking, collisions, and packet loss, and because external devices such as routers and switches also affect overall performance.

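7.2 Analysis tools

netstat, ss – show socket and protocol statistics.

sar – reports TCP and per-interface network statistics.

tcpdump, tcpflow – capture packets and TCP flows.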

7.3 Usage

# Show per-protocol network statistics
netstat -s

# Show current UDP connections
netstat -nu

# Show UDP sockets with the owning process
netstat -apu

# Count TCP connections per state
netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

# Show all TCP sockets
ss -t -a

# Show socket summary
ss -s

# Show all UDP sockets
ss -u -a

# TCP and TCP error statistics
sar -n TCP,ETCP 1

# Per-interface network I/O
sar -n DEV 1

# Capture packets by host and port
tcpdump -i eth1 host 192.168.1.1 and port 80

# Capture TCP flows for a host
tcpflow -cp host 192.168.1.1
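Packet loss often shows up as TCP retransmissions. The following Python sketch estimates a retransmission percentage from two snapshots of the `Tcp:` lines in `/proc/net/snmp`. The function names are hypothetical, and the sample text is trimmed to three counters for brevity; the real file has more columns, which the parser handles the same way.

```python
def tcp_counters(snmp_text: str) -> dict:
    """Pair the 'Tcp:' header line with the 'Tcp:' value line."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def retrans_pct(before: str, after: str) -> float:
    """Retransmitted segments as a percentage of segments sent in the interval."""
    b, a = tcp_counters(before), tcp_counters(after)
    sent = a["OutSegs"] - b["OutSegs"]
    retrans = a["RetransSegs"] - b["RetransSegs"]
    return 100.0 * retrans / sent if sent else 0.0


# Hypothetical, trimmed snapshots.
snap1 = "Tcp: InSegs OutSegs RetransSegs\nTcp: 100000 80000 400\n"
snap2 = "Tcp: InSegs OutSegs RetransSegs\nTcp: 112000 90000 600\n"
print(retrans_pct(snap1, snap2))
```

What counts as an acceptable retransmission rate depends on the network, so the value is most useful when compared against a baseline from the same hosts.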

8. System Load

8.1 Description

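Load measures how much work the system is doing. On Linux, the load average counts tasks that are running, runnable, or in uninterruptible sleep (typically waiting on disk I/O), and is reported as 1-, 5-, and 15-minute averages.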
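A load average is easier to interpret relative to the number of CPU cores. The Python sketch below divides the three values from a `/proc/loadavg`-style line by a core count; the function name and the sample line are hypothetical.

```python
def load_per_core(loadavg_text: str, cores: int) -> tuple:
    """Return the 1-, 5-, and 15-minute load averages divided by the core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    one, five, fifteen = (float(v) for v in loadavg_text.split()[:3])
    return (one / cores, five / cores, fifteen / cores)


# Hypothetical /proc/loadavg content on a 4-core machine.
sample = "8.00 4.00 2.00 3/345 6789"
print(load_per_core(sample, 4))
```

A per-core value that stays above 1.0 suggests more runnable or blocked tasks than the cores can service. On a live system the inputs could come from `open("/proc/loadavg").read()` and `os.cpu_count()`.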

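8.2 Analysis tools

uptime, top, vmstat – show load averages and run-queue state.

strace – traces system calls and their latency.

dmesg – shows kernel log messages.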

8.3 Usage

# View load
uptime
top
vmstat

# Summarize system call counts, errors, and time for a process
strace -c -p pid

# Trace a specific system call with per-call duration (for example, epoll_wait)
strace -T -e epoll_wait -p pid

# View kernel logs
dmesg

9. Flame Graphs

9.1 Description

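Flame graphs, created by Brendan Gregg, visualize sampled call stacks. The Y‑axis shows stack depth, and the X‑axis spans the sample population, sorted alphabetically rather than by time. The wider a frame, the more often that function appeared in the sampled stacks, so wide frames on a CPU flame graph mark the functions that consume the most CPU.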
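The way sample counts turn into frame widths can be sketched in a few lines of Python. The example folds sampled stacks into the semicolon-separated "folded" form that flame graph tooling consumes, then computes the relative width of each frame. The function names and the stacks (`main`, `parse`, `read`, `render`) are hypothetical, and this is an illustration of the idea, not the FlameGraph implementation.

```python
from collections import Counter


def fold(samples: list) -> dict:
    """Collapse root-first stacks into 'a;b;c' -> sample count."""
    return dict(Counter(";".join(stack) for stack in samples))


def frame_widths(folded: dict) -> dict:
    """Relative width of every frame, keyed by its path from the root."""
    total = sum(folded.values())
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        # A sample contributes to each frame on its path, from root to leaf.
        for depth in range(1, len(frames) + 1):
            widths[";".join(frames[:depth])] += count
    return {path: n / total for path, n in widths.items()}


# Ten hypothetical samples.
samples = ([["main", "parse", "read"]] * 6
           + [["main", "parse"]] * 2
           + [["main", "render"]] * 2)
folded = fold(samples)
for path, width in sorted(frame_widths(folded).items()):
    print(f"{path:20s} {width:.0%}")
```

The root frame always spans the full width, and a child can never be wider than its parent, which is why the graph narrows toward the top.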

9.2 Installing dependencies

# Install systemtap
yum install systemtap systemtap-runtime

# Kernel debug packages required; versions must match the running kernel
kernel-debuginfo-$(uname -r).rpm
kernel-devel-$(uname -r).rpm
kernel-debuginfo-common-$(uname -r).rpm

# Install debug info for the kernel and glibc
debuginfo-install --enablerepo=debuginfo kernel
debuginfo-install --enablerepo=debuginfo glibc

9.3 Installation

git clone https://github.com/lidaohang/quick_location.git
cd quick_location

9.4 CPU‑level flame graphs

9.4.1 On‑CPU

High CPU usage can be pinpointed to specific functions using flame graphs.

# on-CPU, user space
sh ngx_on_cpu_u.sh pid
cd ngx_on_cpu_u
# Serve the SVG over HTTP (Python 2; with Python 3 use: python3 -m http.server 8088)
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/pid.svg

Example C program used for demonstration:

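#include <stdio.h>
#include <stdlib.h>

/* Leaf function. It does nothing, so build with -O0 or the calls may be optimized away. */
void foo3(void) {}

/* Calls foo3 10 times per invocation. */
void foo2(void) {
    int i;
    for (i = 0; i < 10; i++) foo3();
}

/* Calls foo3 1000 times per invocation, so it should dominate the profile. */
void foo1(void) {
    int i;
    for (i = 0; i < 1000; i++) foo3();
}

int main(void) {
    int i;
    /* Run long enough for the profiler to collect samples. */
    for (i = 0; i < 1000000000; i++) {
        foo1();
        foo2();
    }
    return 0;
}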
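As a rough sanity check on what the on‑CPU flame graph for this program should look like, the Python sketch below estimates how the `foo3` calls divide between the two callers. It assumes every `foo3` call costs the same and ignores loop overhead, so the real sample split will differ somewhat; the function name `expected_share` is hypothetical.

```python
# foo3 calls made per iteration of main's loop, taken from the C program above.
FOO3_CALLS_PER_ITERATION = {"foo1": 1000, "foo2": 10}


def expected_share(calls: dict) -> dict:
    """Fraction of all calls attributable to each caller."""
    total = sum(calls.values())
    return {caller: n / total for caller, n in calls.items()}


for caller, share in expected_share(FOO3_CALLS_PER_ITERATION).items():
    print(f"{caller}: about {share:.1%} of foo3 calls")
```

Under these assumptions the `foo1` tower should take up nearly the whole width of the graph, with `foo2` visible only as a thin sliver.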

9.4.2 Off‑CPU

Off‑CPU flame graphs show where threads spend time waiting.

# off-CPU, user space
sh ngx_off_cpu_u.sh pid
cd ngx_off_cpu_u
# Serve the SVG over HTTP (Python 2; with Python 3 use: python3 -m http.server 8088)
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/pid.svg

9.5 Memory‑level flame graphs

Memory‑level flame graphs help locate memory leaks or excessive allocations.

sh ngx_on_memory.sh pid
cd ngx_on_memory
# Serve the SVG over HTTP (Python 2; with Python 3 use: python3 -m http.server 8088)
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/pid.svg

9.6 Differential (red‑blue) flame graphs

Differential flame graphs compare two profiles: red marks code paths whose sample count grew (a possible regression), and blue marks paths whose sample count shrank (a possible improvement).

cd quick_location
# Record baseline profile
perf record -F 99 -p pid -g -- sleep 30
perf script > out.stacks1

# Record changed profile
perf record -F 99 -p pid -g -- sleep 30
perf script > out.stacks2

# Generate diff flame graph
./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2
./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff2.svg
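The comparison behind a differential flame graph comes down to per-stack deltas between two folded profiles. The Python sketch below shows that idea on hypothetical data; it is not a reimplementation of `difffolded.pl`, and the function names and stack names are assumptions made for illustration.

```python
def parse_folded(text: str) -> dict:
    """Parse folded lines of the form 'frame1;frame2 count'."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, n = line.rpartition(" ")
        counts[stack] = counts.get(stack, 0) + int(n)
    return counts


def diff_profiles(before: dict, after: dict) -> list:
    """Per-stack change in sample count, largest growth first."""
    stacks = set(before) | set(after)
    deltas = {s: after.get(s, 0) - before.get(s, 0) for s in stacks}
    return sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical folded profiles.
baseline = parse_folded("main;parse 40\nmain;render 60\n")
changed = parse_folded("main;parse 70\nmain;render 55\nmain;log 5\n")
for stack, delta in diff_profiles(baseline, changed):
    print(f"{delta:+5d}  {stack}")
```

Positive deltas correspond to frames that would be drawn in red, and negative deltas to frames drawn in blue.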

10. Case Study: Nginx Cluster Anomaly

10.1 Symptom

On 2017‑09‑25 at 19:00, the Nginx cluster showed many 499 and 5xx responses and rising CPU usage.
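A symptom like this can be quantified from the access log. The Python sketch below counts responses per status class and tracks 499 separately. It assumes the log uses Nginx's default "combined" format; the regular expression, the function name, and the sample lines are hypothetical and are not taken from the incident.

```python
import re
from collections import Counter

# In the combined format the status code follows the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')


def status_classes(lines) -> dict:
    """Count responses per class (2xx, 5xx, ...), keeping 499 separate."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if not match:
            continue
        code = match.group(1)
        counts["499" if code == "499" else code[0] + "xx"] += 1
    return dict(counts)


# Hypothetical log lines.
sample_log = [
    '10.0.0.1 - - [25/Sep/2017:19:00:01 +0800] "GET /api HTTP/1.1" 200 512 "-" "curl/7.29.0"',
    '10.0.0.2 - - [25/Sep/2017:19:00:02 +0800] "GET /api HTTP/1.1" 499 0 "-" "curl/7.29.0"',
    '10.0.0.3 - - [25/Sep/2017:19:00:02 +0800] "GET /api HTTP/1.1" 502 173 "-" "curl/7.29.0"',
    '10.0.0.4 - - [25/Sep/2017:19:00:03 +0800] "GET /api HTTP/1.1" 504 173 "-" "curl/7.29.0"',
    '10.0.0.5 - - [25/Sep/2017:19:00:03 +0800] "GET /api HTTP/1.1" 499 0 "-" "curl/7.29.0"',
]
print(status_classes(sample_log))
```

Nginx logs 499 when the client closes the connection before a response is sent, so a rise in 499 alongside 5xx is consistent with slow responses.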

10.2 Nginx metrics analysis

Request traffic did not spike; it actually decreased, indicating the issue is not traffic‑related.

Response time increased, possibly due to Nginx itself or upstream latency.

Upstream response time also grew, suggesting backend services may be slowing Nginx.

10.3 System CPU analysis

top shows high CPU usage by the Nginx worker processes.

perf top reveals most overhead in free, malloc, and JSON parsing.

10.4 Flame‑graph CPU analysis

User‑mode flame graph identifies JSON parsing as a hot spot.

10.5 Summary

Two root causes were identified: (a) upstream latency causing request delays, and (b) expensive JSON parsing and memory allocation inside Nginx. Disabling the high‑CPU module reduced CPU usage and normalized traffic.

11. References

http://www.brendangregg.com/index.html

http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html

http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html

http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html

https://github.com/openresty/openresty-systemtap-toolkit

https://github.com/brendangregg/FlameGraph

https://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs
