
Linux Performance Mastery: Tools, 5W2H Methodology & Flame Graph Case Study

This comprehensive guide explains how to diagnose Linux system performance issues using a structured 5W2H approach, covering CPU, memory, disk I/O, network, load, and flame‑graph analysis, with practical command examples and a real‑world nginx case study.

Efficient Ops

1. Background

When complex problems arise that monitoring plugins cannot instantly pinpoint, deeper server analysis is required. Effective analysis demands experience and broad knowledge, and having the right tools can dramatically speed up root‑cause identification.

2. Overview

This article introduces various problem‑location tools and demonstrates their use with case studies.

3. Problem‑analysis Methodology

Applying the 5W2H method helps formulate performance‑analysis questions:

What – what is the phenomenon?

When – when does it occur?

Why – why does it happen?

Where – where does it happen?

How much – how many resources are consumed?

How – how can it be solved?

4. CPU

4.1 Overview

For applications, the focus is on the kernel CPU scheduler's behavior and performance. Thread‑state analysis distinguishes on‑CPU time (user and sys) from off‑CPU time (waiting on I/O, locks, paging, and so on). Understanding concepts such as processor, core, hardware thread, cache, CPI/IPC, scheduler, run queue, preemption, multi‑process/multi‑thread, and word size is essential.

4.2 Analysis Tools

uptime, vmstat, mpstat, top, pidstat – basic CPU/load metrics.

perf – detailed per‑function CPU usage, can target kernel functions.

4.3 Usage

<code># View overall CPU usage
top

# Show per-CPU statistics
mpstat -P ALL 1

# Show CPU usage and load average
vmstat 1

# Per-process CPU stats
pidstat -u 1 -p <pid>

# Sample function-level CPU usage for a process
perf top -p <pid> -e cpu-clock</code>

5. Memory

5.1 Overview

Memory issues affect not only performance but also service availability. Key concepts include main memory, virtual memory, resident set, address space, OOM, page cache, page faults, swapping, and allocators (libc, glibc, jemalloc, SLUB).

5.2 Analysis Tools

free, vmstat, top, pidstat, pmap – memory usage statistics.

valgrind – memory leak detection.

dtrace – dynamic tracing of kernel functions (driven by D‑language scripts; SystemTap is the common Linux equivalent).

5.3 Usage

<code># Show system memory usage
free -m

# Show virtual memory stats
vmstat 1

# Show memory usage via top
top

# Per-process memory stats (1-second interval)
pidstat -p <pid> -r 1

# Show process memory map
pmap -d <pid>

# Detect memory leaks with valgrind
valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program</code>

6. Disk I/O

6.1 Overview

Disk is usually the slowest subsystem and a common performance bottleneck. Understanding the filesystem layer, VFS, page cache, buffer cache, inodes, and I/O schedulers (e.g., noop, deadline, cfq) is necessary before monitoring.

6.2 Analysis Tools

iotop, iostat – system‑wide disk I/O activity and per‑device statistics.

pidstat – per‑process I/O statistics.

perf – tracing block‑layer events such as block_rq_issue.

6.3 Usage

<code># View I/O activity
iotop

# Detailed I/O stats
iostat -d -x -k 1 10

# Per-process I/O stats
pidstat -d 1 -p <pid>

# Investigate abnormal I/O with perf
perf record -e block:block_rq_issue -ag
^C    # stop sampling with Ctrl-C
perf report</code>

7. Network

7.1 Overview

Network monitoring is complex due to latency, loss, congestion, and external devices (routers, switches, wireless). Modern NICs are adaptive, adjusting to link conditions.

7.2 Analysis Tools

netstat, ss – socket, connection‑state, and protocol statistics.

sar – TCP/UDP rates and per‑interface throughput.

tcpdump, tcpflow – packet and flow capture.

7.3 Usage

<code># Network statistics
netstat -s

# UDP connections
netstat -nu

# UDP port usage
netstat -apu

# Count connections per TCP state
netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

# Show TCP sockets
ss -t -a

# Summary of sockets
ss -s

# Show UDP sockets
ss -u -a

# TCP/ETCP stats
sar -n TCP,ETCP 1

# Network I/O stats
sar -n DEV 1

# Capture packets (host & port filter)
tcpdump -i eth1 host 192.168.1.1 and port 80

# Capture flows
tcpflow -cp host 192.168.1.1</code>

8. System Load

8.1 Overview

Load measures the amount of work the system is doing. The load average is averaged over 1, 5, and 15 minutes and on Linux counts both runnable and uninterruptible (D‑state) tasks — effectively the length of the run queue.

8.2 Analysis Tools

uptime, top, vmstat – load averages and run‑queue length.

strace – system‑call tracing and latency.

dmesg – kernel log messages.

8.3 Usage

<code># View load
uptime
top
vmstat

# Summarize system-call counts and latency
strace -c -p <pid>

# Trace specific syscalls with timings (e.g., epoll_wait)
strace -T -e epoll_wait -p <pid>

# Kernel logs
dmesg</code>

9. Flame Graphs

9.1 Overview

Flame Graphs, created by Brendan Gregg, visualize sampled call stacks. The Y‑axis shows stack depth; the X‑axis spans the sample population, with stacks merged and sorted alphabetically rather than by time. A frame's width is proportional to how often it appeared in the samples, so wider blocks indicate functions that consume more CPU time.

9.2 Install Dependencies

<code># Install systemtap (often pre-installed)
yum install systemtap systemtap-runtime

# Kernel debuginfo packages must match the running kernel:
#   kernel-debuginfo-$(uname -r)
#   kernel-devel-$(uname -r)
#   kernel-debuginfo-common-$(uname -r)

# Install kernel and glibc debuginfo from the debuginfo repo
debuginfo-install --enablerepo=debuginfo kernel
debuginfo-install --enablerepo=debuginfo glibc</code>

9.3 Clone Tools

<code>git clone https://github.com/lidaohang/quick_location.git
cd quick_location</code>

9.4 On‑CPU Flame Graph

High CPU usage can be pinpointed to specific functions using on‑CPU flame graphs.

9.4.1 on‑CPU

<code># on-CPU user mode
sh ngx_on_cpu_u.sh <pid>
cd ngx_on_cpu_u
python -m SimpleHTTPServer 8088   # Python 3: python3 -m http.server 8088
# Open 127.0.0.1:8088/pid.svg in a browser</code>

Demo C program used for generating the graph:

<code>#include <stdio.h>
#include <stdlib.h>

void foo3() {}

void foo2() {
  int i;
  for(i=0 ; i < 10; i++)
    foo3();
}

void foo1() {
  int i;
  for(i = 0; i< 1000; i++)
    foo3();
}

int main(void) {
  int i;
  for(i =0; i< 1000000000; i++) {
    foo1();
    foo2();
  }
}
</code>

9.4.2 off‑CPU

Off‑CPU graphs reveal time spent waiting (I/O, locks, paging, etc.).

<code># off-CPU user mode
sh ngx_off_cpu_u.sh <pid>
cd ngx_off_cpu_u
python -m SimpleHTTPServer 8088   # Python 3: python3 -m http.server 8088
# Open 127.0.0.1:8088/pid.svg</code>

9.5 Memory Flame Graph

Useful for locating memory leaks.

<code>sh ngx_on_memory.sh <pid>
cd ngx_on_memory
python -m SimpleHTTPServer 8088
# Open 127.0.0.1:8088/pid.svg</code>

9.6 Differential (Red‑Blue) Flame Graphs

Compare two profiles to spot performance regressions; red indicates increase, blue indicates decrease.

<code># Capture baseline profile
perf record -F 99 -p <pid> -g -- sleep 30
perf script > out.stacks1

# Capture new profile
perf record -F 99 -p <pid> -g -- sleep 30
perf script > out.stacks2

# Generate diff flame graph
./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2
./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff2.svg</code>

Demo C programs (original vs. modified) illustrate the diff graph.

10. Case Study: Nginx Cluster Anomaly

10.1 Symptoms

On 2017‑09‑25 at 19:00, monitoring showed a surge of 499 and 5xx responses from the Nginx cluster, accompanied by rising CPU usage.

10.2 Nginx Metrics Analysis

Traffic graphs indicated no spike; traffic actually decreased, ruling out a traffic surge.

Response‑time graphs showed increased latency, possibly due to Nginx itself or upstream services.

Upstream response‑time graphs confirmed that backend latency contributed to the issue.

10.3 System CPU Analysis

Running <code>top</code> revealed high CPU usage on the Nginx worker processes.

Running <code>perf top -p <pid></code> showed most of the overhead in memory allocation, free, and JSON parsing.

10.4 Flame‑Graph CPU Analysis

On‑CPU user flame graph highlighted heavy JSON parsing as the dominant consumer.

10.5 Summary

Two root causes were identified: upstream latency affecting request flow, and inefficient JSON parsing within Nginx causing high CPU usage. The latter was mitigated by disabling the problematic module, which lowered CPU load and restored normal request rates.

11. References

http://www.brendangregg.com/index.html

http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html

http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html

http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html

https://github.com/openresty/openresty-systemtap-toolkit

https://github.com/brendangregg/FlameGraph

https://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs

Author: Lucien_168 – Source: 简书 (https://www.jianshu.com/p/0bbac570fa4c)
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.