Mastering Linux Performance: From CPU to Flame Graphs
This article is a comprehensive guide to Linux performance analysis: background, methodology, and tooling, with step-by-step case studies covering CPU, memory, disk I/O, network, system load, and flame-graph techniques for quickly locating and resolving bottlenecks.
Background
When monitoring plugins cannot immediately reveal the root cause of issues, deeper server‑side analysis is required; this demands technical experience and a broad knowledge base to pinpoint problems efficiently.
Explanation
The article introduces various problem‑location tools and demonstrates their use with real‑world cases.
Problem‑analysis Methodology
Applying the 5W2H method helps formulate performance‑analysis questions:
What – what is the phenomenon?
When – when does it occur?
Why – why does it happen?
Where – where does it happen?
How much – how many resources are consumed?
How – how do we solve it?
CPU
Explanation
For applications, we usually focus on kernel CPU scheduler functionality and performance.
Thread‑state analysis classifies time spent as:
on‑CPU: execution time, divided into user‑mode (user) and kernel‑mode (sys).
off‑CPU: time spent waiting for the next CPU slice, for I/O, locks, or paging, with sub‑states such as runnable, anonymous paging, sleep, lock, and idle.
If most time is on‑CPU, CPU profiling can quickly explain the cause; if time is off‑CPU, diagnosis becomes more time‑consuming.
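A first pass at the on‑CPU/off‑CPU split is to look at thread states: `ps -eLo stat=` prints one STAT code per thread (R = runnable, S = sleeping, D = uninterruptible wait, often disk I/O). A minimal sketch of tallying those codes, run here against canned sample output rather than a live `ps`:

```shell
# Tally ps-style STAT codes; the sample stands in for `ps -eLo stat=` output.
# R = runnable (competing for CPU), S = sleeping, D = uninterruptible wait.
sample='R
S
S
D
R
S'
summary=$(printf '%s\n' "$sample" |
  awk '{ c[substr($1, 1, 1)]++ } END { for (s in c) printf "%s=%d\n", s, c[s] }' |
  sort)
printf '%s\n' "$summary"
```

Many threads stuck in D state point toward off‑CPU analysis (I/O, locks); a pile-up of R states points toward CPU profiling.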
Key CPU concepts include:
Processor
Core
Hardware thread
CPU cache
Clock frequency
CPI / IPC
CPU instructions
Utilization
User / kernel time
Scheduler
Run queue
Preemption
Multi‑process
Multi‑thread
Word length
Analysis Tools
Typical tools:
uptime, vmstat, mpstat, top, pidstat – monitor CPU and load.
perf – profile CPU usage per function, including kernel functions.
Usage
```shell
# View system CPU usage
top
# Show per-CPU info
mpstat -P ALL 1
# Show CPU usage and average load
vmstat 1
# Process CPU stats
pidstat -u 1 -p <pid>
# Trace process functions
perf top -p <pid> -e cpu-clock
```
Memory
Explanation
Memory issues affect not only performance but also service availability; key concepts include:
Main memory
Virtual memory
Resident memory
Address space
OOM
Page cache
Page faults
Swapping
Allocators (glibc malloc, jemalloc, tcmalloc)
Linux SLUB allocator
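The page-cache concept above is visible directly in `free -m` output: memory in buff/cache is reclaimable, so "available" is the number that matters, not "free". A sketch parsing the Mem: line, using made-up sample output in the modern procps-ng column layout instead of a live run:

```shell
# Parse the Mem: line of `free -m` (procps-ng layout; sample values are made up).
sample='              total        used        free      shared  buff/cache   available
Mem:          15888        4521        2345         612        9021       10432
Swap:          2047           0        2047'
available=$(printf '%s\n' "$sample" | awk '/^Mem:/ { print $7 }')
cached=$(printf '%s\n' "$sample" | awk '/^Mem:/ { print $6 }')
echo "available=${available}MiB buff/cache=${cached}MiB"
```

Here only 2345 MiB is "free", yet 10432 MiB is actually available because the 9021 MiB of page cache can be reclaimed on demand.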
Analysis Tools
Common utilities:
free, vmstat, top, pidstat, pmap – show memory usage.
valgrind – detect memory leaks.
dtrace – dynamic tracing of kernel functions via D scripts.
Usage
```shell
# Show system memory usage
free -m
# Virtual memory stats
vmstat 1
# System memory view
top
# Per-process memory stats
pidstat -p <pid> -r 1
# Process memory map
pmap -d <pid>
# Detect memory leaks
valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program
```
Disk I/O
Explanation
Disk is the slowest subsystem and a common performance bottleneck due to mechanical latency; understanding basic concepts such as file system, VFS, caches, inode, and I/O schedulers is essential.
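When scanning `iostat -x` output, %util close to 100% means the device was busy nearly the whole sampling interval, a strong saturation signal. A small sketch that flags busy devices; column layout varies across sysstat versions, so this assumes %util is the last field and runs on made-up sample data:

```shell
# Flag devices whose %util (last field) exceeds 80% in iostat -x style output.
# Sample data stands in for a live `iostat -x` run; numbers are made up.
sample='Device  r/s   w/s   rkB/s  wkB/s   await  %util
sda     12.0  80.5  480.0  3220.0  15.2   92.3
sdb     1.0   2.0   8.0    64.0    0.8    3.1'
busy=$(printf '%s\n' "$sample" | awk 'NR > 1 && $NF + 0 > 80 { print $1 }')
echo "saturated: $busy"
```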
Analysis Tools
Typical tools include iostat, iotop, and related utilities.
Usage
```shell
# View I/O stats
iotop
# Detailed I/O
iostat -d -x -k 1 10
# Process-level I/O
pidstat -d 1 -p <pid>
# Block request tracing
perf record -e block:block_rq_issue -ag
# Report block traces
perf report
```
Network
Explanation
Network monitoring is complex due to latency, blocking, collisions, packet loss, and external equipment influences; adaptive NICs adjust to varying network conditions.
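A workhorse idiom for TCP health is counting connections per state from `netstat -a` output (the state is the last field of each tcp line); a spike in TIME_WAIT or CLOSE_WAIT is often the first visible symptom. The same awk pipeline, demonstrated here on canned lines rather than live output:

```shell
# Count TCP connections per state; sample stands in for `netstat -a` output.
sample='tcp  0  0 10.0.0.1:80  10.0.0.2:5100  ESTABLISHED
tcp  0  0 10.0.0.1:80  10.0.0.3:5101  TIME_WAIT
tcp  0  0 10.0.0.1:80  10.0.0.4:5102  TIME_WAIT
tcp  0  0 10.0.0.1:22  0.0.0.0:*      LISTEN'
counts=$(printf '%s\n' "$sample" |
  awk '/^tcp/ { ++S[$NF] } END { for (a in S) print a, S[a] }' | sort)
printf '%s\n' "$counts"
```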
Analysis Tools
Common commands: netstat, ss, sar, tcpdump, tcpflow.
Usage
```shell
# Show network stats
netstat -s
# Show UDP connections
netstat -nu
# Show UDP port usage
netstat -apu
# Count connections per state
netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
# Show sockets summary
ss -s
# Show all UDP sockets
ss -u -a
# Show all TCP sockets
ss -t -a
# TCP connection stats
sar -n TCP,ETCP 1
# Network I/O stats
sar -n DEV 1
# Packet capture
tcpdump -i eth1 host 192.168.1.1 and port 80
# Flow capture
tcpflow -cp host 192.168.1.1
```
System Load
Explanation
Load measures the amount of work the system is doing; Load Average is the average number of runnable (and, on Linux, uninterruptible) tasks over the last 1, 5, and 15 minutes, i.e. the length of the run‑queue.
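Load only means something relative to the core count: a 1‑minute load above the number of CPUs means tasks are queueing for CPU time. A sketch of the arithmetic, with sample `/proc/loadavg` content and a fixed core count of 4 standing in for a live read and `nproc`:

```shell
# Compute 1-minute load per core; loadavg content and core count are made up.
loadavg='6.20 3.10 1.05 2/713 12345'
cores=4
per_core=$(printf '%s %s\n' "$loadavg" "$cores" | awk '{ printf "%.2f", $1 / $NF }')
echo "1-min load per core: $per_core"
```

A per-core value above 1.0, as here, indicates CPU contention; sustained values well below 1.0 suggest the load is elsewhere (I/O, locks).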
Analysis Tools
Typical tools: uptime, top, vmstat, strace, dmesg.
Usage
```shell
# View load
uptime
# System overview
top
# VM statistics
vmstat
# System call timing
strace -c -p <pid>
# Trace a specific syscall (e.g., epoll_wait)
strace -T -e epoll_wait -p <pid>
# Kernel log
dmesg
```
Flame Graphs
Explanation
Flame graphs visualize CPU call stacks; the y‑axis represents stack depth, the x‑axis represents sample count. Wider bars indicate functions that consume more CPU time.
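Under the hood, flamegraph.pl consumes "folded" stacks: one line per unique call stack, frames joined by semicolons, followed by a sample count. Collapsing raw stacks into that form is essentially counting duplicates, sketched here on made-up stack traces:

```shell
# Fold raw stack samples into flamegraph.pl's "stack count" input format.
# The sample stacks are made up for illustration.
stacks='main;parse_json;malloc
main;parse_json;malloc
main;parse_json
main;handle_io'
folded=$(printf '%s\n' "$stacks" | sort | uniq -c |
  awk '{ print $2, $1 }' | sort)
printf '%s\n' "$folded"
```

In the rendered graph, `main;parse_json;malloc` would be the widest leaf because it has the most samples.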
Installation
```shell
# Install systemtap and its runtime
# Install the kernel debuginfo/devel packages matching your running kernel
git clone https://github.com/lidaohang/quick_location.git && cd quick_location
```
On‑CPU Flame Graph
Generate and view on‑CPU flame graphs to locate hot functions.
```shell
# Generate the on-CPU flame graph
sh ngx_on_cpu_u.sh <pid>
cd ngx_on_cpu_u
# Serve the SVG (Python 2's built-in server)
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/<pid>.svg
```
Off‑CPU Flame Graph
```shell
# Generate the off-CPU flame graph
sh ngx_off_cpu_u.sh <pid>
cd ngx_off_cpu_u
# Serve the SVG (Python 2's built-in server)
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/<pid>.svg
```
Memory‑Level Flame Graph
```shell
# Generate the memory flame graph
sh ngx_on_memory.sh <pid>
cd ngx_on_memory
# Serve the SVG (Python 2's built-in server)
python -m SimpleHTTPServer 8088
```
Differential Flame Graph (Red‑Blue)
Compare two profiles to highlight performance regressions.
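The core of the comparison is a join on the folded stack: emit each stack with its before and after counts, so the renderer can color growth red and shrinkage blue. A minimal sketch of that idea (not the actual difffolded.pl implementation), using two made-up folded profiles:

```shell
# Join two folded-stack profiles on the stack name, printing "stack before after".
# The folded data is made up; missing stacks default to a before count of 0.
before='main;parse_json 120
main;handle_io 40'
after='main;parse_json 300
main;handle_io 35'
diff=$({ printf '%s\n' "$before" | sed 's/^/B /'
         printf '%s\n' "$after"  | sed 's/^/A /'; } |
  awk '$1 == "B" { b[$2] = $3; next } { print $2, b[$2] + 0, $3 }' | sort)
printf '%s\n' "$diff"
```

Here `main;parse_json` grew from 120 to 300 samples, exactly the kind of regression the red-blue graph highlights.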
```shell
# Capture before the change
perf record -F 99 -p <pid> -g -- sleep 30 && perf script > out.stacks1
# Capture after the change
perf record -F 99 -p <pid> -g -- sleep 30 && perf script > out.stacks2
# Collapse stacks
./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2
# Generate the differential flame graph
./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff.svg
```
Case Study: Nginx Cluster Anomaly
Symptoms
On 2017‑09‑25, the Nginx cluster showed many 499 and 5xx responses, with rising CPU usage.
Analysis Steps
Check request traffic – no spike, traffic actually decreased.
Inspect Nginx response time – increased, possibly due to Nginx or upstream latency.
Examine upstream response time – also increased, suggesting backend delay.
Observe system CPU via top – Nginx workers consume high CPU.
Profile Nginx process with perf top – heavy cost in free, malloc, JSON parsing.
Generate on‑CPU flame graph – identified frequent JSON parsing as CPU hotspot.
Conclusion
The high CPU usage stemmed from an inefficient JSON parsing module within Nginx; disabling the module reduced CPU load and normalized request traffic.
References
http://www.brendangregg.com/
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html
http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
https://github.com/openresty/openresty-systemtap-toolkit
https://github.com/brendangregg/FlameGraph
https://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs
Efficient Ops
This public account is run by Xiaotianguo and friends and regularly publishes original technical articles. We focus on the operations field and aim to accompany you throughout your operations career, growing together.