Master Linux Perf: From Event Subsystem to Flame Graphs
This article provides a comprehensive guide to Linux Perf, covering its architecture, counting and sampling modes, event classifications, the full suite of Perf tools, and how to generate and interpret CPU and Off‑CPU flame graphs for deep performance analysis.
Perf Event Subsystem
Perf is the profiling tool that ships with the Linux kernel. It counts and samples hardware and software events to collect performance metrics, helping locate bottlenecks and hot code paths.
The overall Perf architecture consists of two parts: Perf Tools (user‑space utilities for data collection and analysis) and the Perf Event Subsystem (kernel component that gathers raw data, also used by the Linux Hard Lockup Detector).
Perf Working Modes
1. Counting Mode
Counts how many times each event occurs over a monitoring interval. Perf tools program the appropriate performance registers, read them when the interval ends, and report the totals. Typical tool: perf stat.
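Raw counts are often most useful when combined into a ratio such as instructions per cycle (IPC). The following is a minimal sketch, assuming the CSV layout that perf stat produces with -x, (count in field 1, event name in field 3). The helper name ipc_from_csv and the sample numbers are made up for illustration.

```shell
# Compute instructions-per-cycle (IPC) from `perf stat -x,` style CSV on stdin.
# Assumed layout: field 1 = counter value, field 3 = event name.
ipc_from_csv() {
    awk -F, '
        $3 ~ /^cycles/       { cycles = $1 + 0 }        # +0 forces a numeric value
        $3 ~ /^instructions/ { instructions = $1 + 0 }
        END {
            if (cycles > 0) printf "%.2f\n", instructions / cycles
            else            print "n/a"
        }'
}

# Hypothetical sample lines, not taken from a real run.
sample='2000000,,cycles,1000000,100.00,,
3000000,,instructions,1000000,100.00,1.50,insn per cycle'

printf '%s\n' "$sample" | ipc_from_csv
```

On a real system the input would come from something like perf stat -x, -e cycles,instructions -a sleep 5 2>&1 | ipc_from_csv, since perf stat writes its report to standard error. Event names can differ between machines, so the patterns may need adjusting.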
2. Sampling Mode
Periodically samples performance data. Each time a PMU counter overflows after a configured number of events, an interrupt fires and the kernel records a sample containing the instruction pointer, registers, and flags. Typical tool: perf record.
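When sampling at a fixed frequency (the -F option of perf record), a quick sanity check on a recording is that the sample count cannot exceed frequency times duration times the number of CPUs sampled. A small sketch of that arithmetic follows. The helper name expected_samples and the 4-CPU machine are assumptions for illustration, and idle CPUs typically yield fewer samples than this bound.

```shell
# Upper bound on samples for frequency-based sampling:
# frequency (Hz) x duration (s) x number of CPUs sampled.
expected_samples() {
    freq_hz=$1
    seconds=$2
    cpus=$3
    echo $(( freq_hz * seconds * cpus ))
}

# 99 Hz for 10 s on a hypothetical 4-CPU machine
expected_samples 99 10 4
```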
Perf Events Classification
Counting Events
# CPU counter statistics for a command
perf stat command
# CPU counter statistics for a PID until Ctrl‑C
perf stat -p PID
# System‑wide statistics for 5 seconds
perf stat -a sleep 5
# Various basic CPU statistics system‑wide for 10 seconds
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
# L1‑dcache statistics for a command
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
# Syscall count per second system‑wide
perf stat -e raw_syscalls:sys_enter -I 1000 -a
# Scheduler events for a PID until Ctrl‑C
perf stat -e 'sched:*' -p PID
# Ext4 events system‑wide for 10 seconds
perf stat -e 'ext4:*' -a sleep 10
# Block I/O events system‑wide for 10 seconds
perf stat -e 'block:*' -a sleep 10
Profiling Events
# Sample on‑CPU functions for a command at 99 Hz
perf record -F 99 command
# Sample on‑CPU functions for a PID at 99 Hz until Ctrl‑C
perf record -F 99 -p PID
# Sample CPU stack traces for a PID at 99 Hz for 10 s
perf record -F 99 -p PID -g -- sleep 10
# Force cpu‑clock event if default fails
perf record -F 99 -e cpu-clock -ag -- sleep 10
# Sample on‑CPU kernel instructions for 5 s
perf record -e cycles:k -a -- sleep 5
# Sample on‑CPU user instructions for 5 s
perf record -e cycles:u -a -- sleep 5
Static Tracing Events
# Trace new processes
perf record -e sched:sched_process_exec -a
# Sample context‑switches
perf record -e context-switches -a
# Sample context‑switches with stack traces (add -c 1 to trace every switch)
perf record -e context-switches -ag
# Trace block device requests with stack traces
perf record -e block:block_rq_insert -ag
# Sample minor faults with stack traces (add -c 1 to trace every fault)
perf record -e minor-faults -ag
# Trace ext4 calls and write to non‑ext4 location
perf record -e 'ext4:*' -o /tmp/perf.data -a
Dynamic Tracing Events
# Add a tracepoint for kernel tcp_sendmsg entry
perf probe --add tcp_sendmsg
# Remove the tracepoint
perf probe -d tcp_sendmsg
# Add a tracepoint for tcp_sendmsg return
perf probe 'tcp_sendmsg%return'
# Show available variables for tcp_sendmsg
perf probe -V tcp_sendmsg
# Add a user‑level probe for malloc in libc
perf probe -x /lib64/libc.so.6 malloc
Perf Tool Suite Overview
Perf provides 22 sub‑tools. Key utilities include:
perf list : Shows supported events. Events accept modifiers (e.g., :u for user‑space only, :k for kernel only).
perf stat : System‑wide or per‑process statistical analysis.
perf top : Real‑time view of hottest functions (default event: cycles).
perf record : Captures raw performance samples.
perf script : Decodes perf record output; supports filtering, symbol resolution, and various output fields.
Flame Graphs
Flame graphs visualize sampled call stacks as stacked bars, where the X‑axis width reflects sample frequency. Five common types exist; this article focuses on CPU and Off‑CPU flame graphs.
CPU Flame Graph
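Shows the functions that were executing on the CPU during the sampling window. Each rectangle represents a function in a sampled call stack. The Y‑axis is stack depth: the root is at the bottom, callees stack upward, and the top edge shows the function that was on‑CPU when the sample was taken. Wider rectangles indicate more samples.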
Off‑CPU Flame Graph
Highlights why a thread spent time off the CPU (e.g., waiting for I/O, a lock, or the scheduler). It is generated by recording scheduling‑related static tracepoints with perf record and merging them with perf inject.
Related Tracepoints
sched:sched_switch: Fires on every context switch and records the outgoing task's state, which indicates why it left the CPU (time‑slice expiry, I/O wait, lock wait, voluntary yield).
sched:sched_stat_sleep: Measures time a task spends in interruptible sleep (S state) after voluntarily giving up the CPU.
sched:sched_stat_iowait: Measures time a task waits for I/O to complete.
sched:sched_stat_blocked: Measures time a task spends in uninterruptible sleep (D state), for example while waiting for a kernel lock.
sched:sched_stat_wait: Measures time a task spends runnable in the run‑queue before it executes.
Example: Measuring Sleep Time of S‑State Processes
# Enable scheduler statistics
echo 1 > /proc/sys/kernel/sched_schedstats
# Record sleep and switch events system‑wide
perf record -e sched:sched_stat_sleep -e sched:sched_switch -a -g -o perf.data.raw sleep 1
# Merge events into a single perf.data file
perf inject -s -i perf.data.raw -o perf.data
After injection, perf script -i perf.data shows combined entries where each sched_switch line is followed by the corresponding sched_stat_sleep line, revealing both the call stack that led to the sleep and the duration of the sleep.
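An Off‑CPU flame graph weights each stack by time spent asleep rather than by sample count. The sketch below shows only that weighting step. It assumes the stacks and their sleep durations in nanoseconds have already been extracted from the perf script output into a hypothetical one-line-per-event format. The helper name fold_sleep_ms and the sample data are made up.

```shell
# Hypothetical pre-extracted input: "<folded stack> <sleep time in ns>"
sleeps='app;read;vfs_read;io_schedule 4000000
app;lock;futex_wait;schedule 2000000
app;read;vfs_read;io_schedule 6000000'

# Sum sleep time per unique stack and convert nanoseconds to milliseconds.
fold_sleep_ms() {
    awk '
        { ns[$1] += $2 }
        END { for (s in ns) printf "%s %d\n", s, ns[s] / 1000000 }' | sort
}

printf '%s\n' "$sleeps" | fold_sleep_ms
```

The output is in folded format with milliseconds as the weight, so it could be passed to flamegraph.pl with --countname=ms to label the graph in time units instead of samples.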
Conclusion
The article introduced the Linux Perf Event subsystem architecture, explained Perf's counting and sampling modes, detailed the various event categories, described the full Perf tool suite, and demonstrated how to generate and interpret CPU and Off‑CPU flame graphs, giving readers a deeper understanding of Linux performance profiling.
Liangxu Linux
