Inside Linux Perf: How the Kernel’s Powerful Tracing Tool Works
The article introduces Linux’s built‑in performance analysis tool perf, explains its event‑driven sampling, tracing and profiling capabilities, shows how to install it on various distributions, demonstrates common commands with real code examples, and discusses practical scenarios for locating and optimizing kernel and application performance issues.
1. Perf Tool Overview
1.1 What is Perf
Perf (short for Performance) is a profiling tool integrated into the Linux kernel. It uses an event‑driven mechanism to capture hardware, software, and kernel‑level performance events, allowing developers to inspect system behavior in detail.
1.2 Powerful Features of Perf
Perf has evolved alongside the Linux kernel, expanding from simple monitoring to a rich set of capabilities. It can monitor hardware events such as CPU cycles and cache hits, software events like context switches and page faults, as well as kernel tracepoints and dynamic tracing.
(1) Lightweight Event Sampling
Perf can sample processor events via hardware performance counters, reporting metrics such as the number of executed instructions, cache hit/miss counts, and branch‑prediction success rates. These metrics act as key signals that reveal the underlying performance characteristics of a program.
(2) Trace Functionality
Perf’s trace feature records function‑call chains for processes or the kernel, producing call graphs and flame graphs that visualize execution time and stack depth. By examining a flame graph, developers can quickly spot functions that dominate CPU usage.
(3) Profiling Functionality
Perf can profile a specific application to identify the most time‑consuming functions and even individual source lines. It provides per‑line statistics, enabling developers to see precisely which code paths affect performance the most.
(4) Benchmarking
Perf can be combined with external benchmark tools such as sysbench and fio to evaluate overall system performance under different workloads, including CPU‑bound, memory‑bound, and I/O‑bound scenarios.
(5) Debugging Functionality
When used together with debuggers like gdb, Perf can capture trace data and correlate it with source code, showing function call frequencies, execution times, and variable changes, which helps pinpoint performance bottlenecks.
2. Installing and Using Perf
2.1 Installing Perf
On Debian‑based systems, Perf is provided by the linux-tools-common and linux-tools-`uname -r` packages:
sudo apt-get install linux-tools-common linux-tools-`uname -r`
On Red Hat or CentOS, install it with yum:
sudo yum install perf
If a package manager is unavailable, Perf can be built from the Linux kernel source:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux/tools/perf
make
After compilation, copy the binary to a directory in $PATH:
sudo cp perf /usr/local/bin/
Verify the installation with:
perf --version
2.2 Basic Usage of Perf
The default performance event is cpu-cycles, representing the smallest time unit the CPU can measure. Perf commands follow the pattern perf COMMAND [-e event …] PROGRAM, where COMMAND can be top, stat, record, report, etc., and multiple events are specified with repeated -e options.
Perf bundles more than twenty sub-tools; the most commonly used include:
perf-list
perf-stat
perf-top
perf-record
perf-report
perf-trace
(1) perf-list
Lists all symbolic event types supported by the system, including hardware, software, and tracepoint events.
perf list
(2) perf-stat
Example program t1.c demonstrates a CPU-bound workload. Compile it with:
gcc -o t1 -g t1.c
Running perf stat ./t1 yields output such as:
Performance counter stats for './t1':
218.584169 task-clock # 0.997 CPUs utilized
18 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
82 page-faults # 0.000 M/sec
771,180,100 cycles # 3.528 GHz
550,703,114 instructions # 0.71 insns per cycle
110,117,522 branches # 503.776 M/sec
5,009 branch-misses # 0.00% of all branches
0.219155248 seconds time elapsed
Program t1 is CPU-bound: the task-clock line reports roughly 0.997 CPUs utilized, meaning the process kept a CPU busy for nearly its entire run. The default statistics include task-clock, context-switches, CPU-migrations, page-faults, cycles, instructions (with IPC), branches, branch-misses, and more.
(3) perf-top
Running perf top while a simple infinite‑loop program t2.c is executing shows real‑time percentages, e.g.:
Events: 8K cycles
98.67% t2 [.] main
1.10% [kernel] [k] __do_softirq
0.07% [kernel] [k] _raw_spin_unlock_irqrestore
This quickly identifies the hot function (main in the example).
(4) perf-record and perf-report
To obtain a call‑graph, record with -g and then report:
perf record -e cpu-clock -g ./t1
perf report
The report shows that 100% of the sampled time is spent in longa(), with 91.85% of that time attributed to foo1() (which calls longa() 100 times) and 8.15% to foo2().
(5) Tracepoint Example
Counting system calls for ls using the raw_syscalls:sys_enter tracepoint:
perf stat -e raw_syscalls:sys_enter ls
Performance counter stats for 'ls':
111 raw_syscalls:sys_enter
0.001557549 seconds time elapsed
2.3 Detailed Explanation of Common Commands
(1) List all measurable events perf list displays hardware events (e.g., cycles, instructions, cache-misses), software events (e.g., context-switches, page-faults), and tracepoints.
(2) Show statistics
Running perf stat ls produces a breakdown of task‑clock, cycles, instructions, branches, cache‑misses, etc., helping users understand the program’s performance profile.
(3) Real‑time system view perf top displays the functions consuming the most CPU cycles in descending order. Options such as -p <PID> limit the view to a specific process, -e cache-misses focuses on a particular event, -a aggregates all CPUs, and -K hides kernel symbols.
(4) Record and generate report perf record -g ls captures call‑graph data; perf report -i perf.data opens an interactive view where users can navigate functions, view assembly, and filter by DSO, command, or symbol.
3. Application Scenarios and Importance of Perf
3.1 Locating Performance Problems
Perf helps diagnose high CPU utilization by revealing which processes and functions dominate CPU cycles. In cache‑miss heavy workloads, perf stat can display L1/L2 cache statistics to guide data‑layout optimizations. For memory‑I/O bottlenecks, events such as page-faults, dTLB-loads, and dTLB-load-misses expose inefficient memory access patterns.
3.2 Keys to Performance Optimization
Analyzing IPC (instructions per cycle) indicates how well code utilizes the processor; low IPC suggests opportunities for instruction‑level improvements. Monitoring memory usage, allocation patterns, and cache behavior informs decisions about data structures and allocation strategies. Call‑stack tracing pinpoints hotspot functions, allowing targeted algorithmic or code‑level refinements.
4. Common Performance Problem Analysis
Typical performance testing follows these steps:
Requirement analysis
Script preparation
Test execution
Result aggregation
Problem analysis
A service loads a 1 GB word list into memory, performs fuzzy matching on incoming requests, forwards matches to a backend HTTP service, returns the response, and records a request identifier and count in MySQL.
Key functions: fuzzyMatching, sendingRequest, buildResponse, signNum (MySQL counter)
Four test groups illustrate typical bottlenecks and Perf‑guided analyses:
Group 1: Random requests at 1 k QPS show no CPU, memory, or bandwidth saturation; the limitation is likely the backend service. Perf can confirm that the backend is the bottleneck.
Group 2: After fixing the backend, 30‑character requests cause CPU load to max out at 400 QPS. Perf + flame graphs reveal that fuzzyMatching consumes the majority of CPU cycles, suggesting code‑level optimization.
Group 3: With backend and matching optimized, random requests reach 3 k QPS but then drop to 1 k QPS intermittently. Perf shows increased MySQL latency, indicating that direct database writes at high concurrency are a scalability issue; a cache layer (e.g., Redis) would be advisable.
Group 4: Replacing the backend with a real service caps QPS at 300 due to network bandwidth saturation. Perf helps verify that the network interface is the limiting factor and guides decisions about bandwidth upgrades or request sharding.
Using Perf together with flame graphs provides a visual way to locate time‑consuming functions:
Record data while stressing the program:
perf record -e cpu-clock -g -p 11110 -o data/perf.data sleep 30
Unfold the raw data:
perf script -i data/perf.data > perf.unfold
Collapse symbols:
./stackcollapse-perf.pl perf.unfold > perf.folded
Generate an SVG flame graph:
./flamegraph.pl perf.folded > perf.svg
The resulting flame graph highlights the functions that dominate execution time, enabling developers to focus optimization effort where it matters most. Native Perf works best with C/C++ programs compiled with debug symbols; other languages (Java, Go, etc.) require language-specific tooling that produces compatible perf data.
Linux Code Review Hub
A professional Linux technology community and learning platform covering the kernel, memory management, process management, file system and I/O, performance tuning, device drivers, virtualization, and cloud computing.