
Mastering Linux Performance: From CPU/Memory Profiling to Flame Graphs

This guide explains how to systematically diagnose Linux performance issues with tools such as top, vmstat, perf, and flame graphs. It covers CPU, memory, disk I/O, network, and load analysis, and walks through a real‑world nginx case study with step‑by‑step commands and visualizations.

Open Source Linux

1. Background

Sometimes difficult problems arise that monitoring plugins cannot immediately pinpoint, requiring server login for deeper analysis. Effective problem analysis demands technical experience and broad domain knowledge to locate root causes.

Having good analysis tools accelerates troubleshooting and saves time for deeper work.

2. Explanation

This article introduces various problem‑location tools and combines them with case studies.

3. Problem‑analysis Methodology

Applying the 5W2H method raises key performance analysis questions:

What – what is the phenomenon?

When – when does it occur?

Why – why does it happen?

Where – where does it happen?

How much – how many resources are consumed?

How – how do we solve it?

4. CPU

4.1 Explanation

For applications, we usually focus on the functionality and performance of the kernel's CPU scheduler.

Thread‑state analysis examines where thread time is spent. Thread states are generally classified as:

on‑CPU: time spent executing on a CPU, split into user mode (user) and kernel mode (sys).

off‑CPU: waiting for the next CPU slice, I/O, locks, paging, etc., with sub‑states such as runnable, anonymous paging, sleep, lock, idle.

If most time is on‑CPU, CPU profiling quickly reveals reasons; if most time is off‑CPU, locating the issue can be time‑consuming. Key concepts include processor, core, hardware thread, CPU cache, clock frequency, CPI, IPC, instruction set, utilization, user/kernel time, scheduler, run queue, preemption, multi‑process, multi‑thread, word size, etc.
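As a rough first pass, the on‑CPU/off‑CPU split can be estimated with standard tools. A minimal sketch (`$PID` is a placeholder for the process under study):

```shell
# On-CPU share: %usr + %system from pidstat. If the process is slow but this
# total is low, most of its time is being spent off-CPU (waiting).
pidstat -u -p "$PID" 1 5

# Off-CPU hints from thread states: R = running/runnable, D = uninterruptible
# wait (usually disk I/O), S = interruptible sleep (locks, timers, sockets).
# wchan shows the kernel function a sleeping thread is blocked in.
ps -L -o pid,tid,stat,wchan:24,comm -p "$PID"
```

A process with low CPU usage but many threads stuck in `D` state points at disk I/O rather than compute as the bottleneck.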

4.2 Analysis Tools

uptime, vmstat, mpstat, top, pidstat – query CPU and load usage.

perf – trace per‑function CPU time inside processes and can target specific kernel functions.

4.3 Usage

// View system CPU usage
 top
// View per‑CPU information
 mpstat -P ALL 1
// View CPU usage and average load
 vmstat 1
// Process‑level CPU statistics
 pidstat -u 1 -p <pid>
// Trace function‑level CPU usage in a process
 perf top -p <pid> -e cpu-clock

5. Memory

5.1 Explanation

Memory issues affect not only performance but also service availability. Important concepts include main memory, virtual memory, resident memory, address space, OOM, page cache, page faults, swapping, allocators (libc, glibc, libmalloc, mtmalloc), and the kernel SLUB allocator.
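Page faults in particular are easy to observe directly. A small sketch using per‑process counters (minor faults are resolved from memory with no disk I/O; major faults require reading from disk or swap):

```shell
# Per-process fault counters: min_flt (minor) and maj_flt (major),
# plus resident set size in KB. $$ here is just the current shell as a demo.
ps -o pid,min_flt,maj_flt,rss,comm -p $$

# System-wide paging activity (requires the sysstat package):
# sar -B 1 3
```

A steadily climbing maj_flt count usually means the working set no longer fits in RAM and the process is paging.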

5.2 Analysis Tools

free, vmstat, top, pidstat, pmap – report memory usage.

valgrind – detect memory leaks.

dtrace – dynamic tracing of kernel functions via D scripts.

5.3 Usage

// View system memory usage
 free -m
// Virtual memory statistics
 vmstat 1
// System memory view
 top
// Per‑process memory statistics (1 s interval)
 pidstat -p <pid> -r 1
// Process memory map
 pmap -d <pid>
// Detect memory leaks
 valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program

6. Disk I/O

6.1 Explanation

Disk is the slowest subsystem and a common performance bottleneck. Understanding file systems, VFS, caches (page cache, buffer cache, directory cache), inodes, and I/O scheduling strategies is essential.
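The page cache described above can be watched directly: writes land in cache first as dirty pages, and the kernel flushes them to disk asynchronously. A quick sketch using `/proc/meminfo`:

```shell
# Current cache state: Cached = page cache, Buffers = buffer cache,
# Dirty = pages modified but not yet written back, Writeback = in flight.
grep -E '^(Cached|Buffers|Dirty|Writeback):' /proc/meminfo

# Force dirty pages out to disk, then watch the Dirty counter drop.
sync
grep '^Dirty:' /proc/meminfo
```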

6.2 Analysis Tools

iotop, iostat, pidstat – report disk I/O usage at the system, device, and process level.

perf – trace block‑layer events (e.g. block:block_rq_issue) to investigate I/O anomalies.

6.3 Usage

// View system I/O information
 iotop
// Detailed I/O statistics
 iostat -d -x -k 1 10
// Process‑level I/O information
 pidstat -d 1 -p <pid>
// Investigate I/O anomalies
 perf record -e block:block_rq_issue -ag
 perf report

7. Network

7.1 Explanation

Network monitoring is complex due to latency, blocking, collisions, packet loss, and interactions with routers, switches, and wireless signals. Modern NICs are adaptive, adjusting speed and mode automatically.
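One concrete signal worth extracting from the counters is the TCP retransmission rate, a good proxy for packet loss. A sketch (field wording varies slightly between kernel versions, including a historical "retransmited" misspelling, so the pattern below matches both):

```shell
# Compute retransmitted segments as a percentage of segments sent,
# using the cumulative counters from netstat -s.
netstat -s | awk '/segments sent out/ {sent=$1}
                  /segments retransmi/ {re=$1}
                  END {if (sent) printf "retransmit rate: %.3f%%\n", 100*re/sent}'
```

Sustained rates above roughly 1% usually merit investigation of the network path.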

7.2 Analysis Tools

7.3 Usage

// Show network statistics
 netstat -s
// Show current UDP connections
 netstat -nu
// Show UDP port usage
 netstat -apu
// Count connections per state
 netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
// Show TCP connections
 ss -t -a
// Show socket summary
 ss -s
// Show all UDP sockets
 ss -u -a
// TCP/ETCP state statistics
 sar -n TCP,ETCP 1
// Network I/O statistics
 sar -n DEV 1
// Capture packets (packet‑wise)
 tcpdump -i eth1 host 192.168.1.1 and port 80
// Capture streams and display content
 tcpflow -cp host 192.168.1.1

8. System Load

8.1 Explanation

Load measures the amount of work a computer is doing; on Linux it counts tasks that are running, runnable, or in uninterruptible sleep (typically waiting on disk I/O). Load Average is this count averaged over the last 1, 5, and 15 minutes.
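The kernel exposes these averages directly, so they can be checked against the CPU count without parsing uptime output. A minimal sketch:

```shell
# Read the three load averages straight from the kernel.
read load1 load5 load15 rest < /proc/loadavg
cores=$(nproc)
echo "1-min load: $load1 across $cores CPUs"
# Rule of thumb: a sustained 1-min load above the CPU count means tasks are queuing.
```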

8.2 Analysis Tools

uptime, top, vmstat – view load averages and run‑queue length.

strace – trace system calls and their latency.

dmesg – inspect kernel logs for OOM kills, hardware errors, and other anomalies.

8.3 Usage

// View load information
 uptime
 top
 vmstat
// Trace system call latency
 strace -c -p <pid>
// Trace specific syscalls (e.g., epoll_wait)
 strace -T -e epoll_wait -p <pid>
// View kernel logs
 dmesg

9. Flame Graphs

9.1 Explanation

Flame Graphs, created by Brendan Gregg, visualize sampled call stacks. The Y‑axis shows stack depth (each frame is a function, with its callers below it); the X‑axis spans the sample population, so a frame's width is proportional to how often it appeared in the samples. Note that the X‑axis is sorted alphabetically, not by time. Wider frames indicate functions consuming more CPU time.
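The basic pipeline that produces such a graph can be sketched with perf and the FlameGraph repository (referenced later in this article); this assumes perf is installed and the repository is cloned into ./FlameGraph:

```shell
# Sample all CPUs at 99 Hz with call graphs for 30 seconds (needs root).
perf record -F 99 -a -g -- sleep 30
# Dump the raw sampled stacks as text.
perf script > out.perf
# Fold each stack into a single semicolon-separated line with a sample count.
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
# Render the folded stacks as an interactive SVG.
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg
```

The helper scripts used in the following sections wrap an equivalent capture‑fold‑render sequence (via systemtap) for specific targets.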

9.2 Installing Dependencies

// Install systemtap and runtime
 yum install systemtap systemtap-runtime
// Install matching kernel debug packages
 debuginfo-install --enablerepo=debuginfo kernel
 // (additional glibc debug packages as needed)

9.3 Installation

git clone https://github.com/lidaohang/quick_location.git
 cd quick_location

9.4 CPU‑level Flame Graphs

High CPU usage can be pinpointed with flame graphs, revealing which functions dominate.

9.4.1 on‑CPU

// Generate on‑CPU flame graph for a user process
 sh ngx_on_cpu_u.sh <pid>
 cd ngx_on_cpu_u
 // Serve the SVG via a temporary HTTP server
 python -m SimpleHTTPServer 8088   // (Python 3: python -m http.server 8088)
 // Open http://127.0.0.1:8088/<pid>.svg in a browser

Demo code used for profiling:

#include <stdio.h>

void foo3() {}

void foo2() {
  int i;
  for (i = 0; i < 10; i++)
    foo3();
}

void foo1() {
  int i;
  for (i = 0; i < 1000; i++)
    foo3();
}

int main(void) {
  int i;
  for (i = 0; i < 1000000000; i++) {
    foo1();
    foo2();
  }
  return 0;
}

9.4.2 off‑CPU

// Generate off‑CPU flame graph for a user process
 sh ngx_off_cpu_u.sh <pid>
 cd ngx_off_cpu_u
 python -m SimpleHTTPServer 8088
 // Open http://127.0.0.1:8088/<pid>.svg

9.5 Memory‑level Flame Graphs

// Generate memory‑level flame graph
 sh ngx_on_memory.sh <pid>
 cd ngx_on_memory
 python -m SimpleHTTPServer 8088
 // Open http://127.0.0.1:8088/<pid>.svg

9.6 Differential (Red‑Blue) Flame Graphs

When performance regresses, differential flame graphs highlight functions that have increased (red) or decreased (blue) between two profiles.

// Capture baseline profile
 perf record -F 99 -p <pid> -g -- sleep 30
 perf script > out.stacks1
// Capture new profile
 perf record -F 99 -p <pid> -g -- sleep 30
 perf script > out.stacks2
// Generate folded stacks
 ./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
 ./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2
// Create differential flame graph
 ./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff.svg

10. Case Study: Nginx Cluster Anomaly

10.1 Observed Symptoms

On 2017‑09‑25 at 19:00 the nginx cluster showed many 499 and 5xx responses, with rising CPU usage.

10.2 Nginx Metrics Analysis

a) Request traffic – the traffic actually decreased, so the spike is not traffic‑related.

b) Response time – increased, possibly due to nginx itself or upstream latency.

c) Upstream response time – also increased, suggesting backend slowdown.

10.3 System CPU Analysis

a) Top output showed high CPU usage by nginx workers.

b) perf top -p revealed most overhead in free, malloc, and JSON parsing.

10.4 Flame Graph CPU Analysis

// Generate on‑CPU flame graph for nginx worker
 sh ngx_on_cpu_u.sh <pid>
 cd ngx_on_cpu_u
 python -m SimpleHTTPServer 8088
 // Open http://127.0.0.1:8088/<pid>.svg

Conclusion: Frequent JSON parsing by a low‑performance JSON library consumed significant CPU.

10.5 Summary

a) Traffic anomaly traced to prolonged upstream response time.

b) High CPU usage caused by costly JSON parsing and memory allocation in nginx.

10.5.1 Deep Dive

Note that the upstream delay and the CPU‑intensive JSON parsing are separate issues: the parsing runs on every request regardless of upstream latency, so the delay alone does not explain the CPU spike.

10.5.2 Resolution

After disabling the high‑CPU module, CPU usage dropped and traffic returned to normal. The upstream delay turned out to be partly caused by a loopback call back into nginx itself.

11. References

http://www.brendangregg.com/index.html

http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html

http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html

http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html

https://github.com/openresty/openresty-systemtap-toolkit

https://github.com/brendangregg/FlameGraph

https://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs

Author: Lucien_168 Link: https://www.jianshu.com/p/0bbac570fa4c