How Perf Works: Inside Linux Kernel’s Powerful Tracing and Profiling Tool
This article explains the Linux kernel’s perf utility, covering its architecture, key features such as lightweight event sampling, tracing, profiling and debugging, step‑by‑step installation, common commands with real code examples, and how to use perf and flame graphs to locate and optimise performance bottlenecks.
In the vast landscape of the Linux kernel, performance optimisation and fault diagnosis are two towering challenges for developers. The perf tool stands out as a "magic" instrument that can precisely probe every corner of the kernel, from CPU cycles and cache‑hit rates to function call frequencies and context switches.
1. Perf Overview
Perf (short for Performance) is an event‑driven profiling suite built into the Linux kernel. It captures hardware, software and kernel tracepoint events, allowing developers to observe the system’s inner workings.
1.1 Core Capabilities
Lightweight event sampling: Uses hardware performance counters to sample processor events such as instruction count, cache hits/misses and branch‑prediction success, providing a fine‑grained view of program behaviour.
Tracing: Records function‑call chains for processes or the kernel, producing call‑graph visualisations (e.g., flame graphs) that reveal where time is spent.
Profiling: Generates per‑function statistics, including per‑line instruction counts, enabling developers to pinpoint hot code paths.
Benchmarking: Can be combined with tools like sysbench or fio to evaluate overall system performance under load.
Debugging: Works with gdb to attach detailed performance data to source lines, helping to locate the root cause of slowdowns.
2. Installing and Using Perf
2.1 Installation
On Debian/Ubuntu the tool is provided by linux-tools-common and linux-tools-$(uname -r):
sudo apt-get install linux-tools-common linux-tools-`uname -r`
On Red Hat/CentOS the package is called perf:
sudo yum install perf
If a package manager is unavailable, perf can be built from the kernel source:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux/tools/perf
make
sudo cp perf /usr/local/bin/
Verify the installation with perf --version.
2.2 Basic Usage
The default event is cpu-cycles. The generic command format is:
perf <COMMAND> [-e <event> ...] <PROGRAM>
Perf bundles 22 sub‑tools; the most commonly used are:
perf‑list – show all supported events.
perf‑stat – collect summary statistics after a program finishes.
perf‑top – live view of the hottest functions.
perf‑record – capture detailed sampling data.
perf‑report – analyse the data produced by perf‑record.
perf‑trace – trace system calls and kernel events.
Example: perf‑stat on a simple program
// t1.c
void longa() {
    int i, j;
    for (i = 0; i < 1000000; i++)
        j = i;   /* busy work; compile without -O so the loop is kept */
}

void foo1() { for (int i = 0; i < 100; i++) longa(); }

void foo2() { for (int i = 0; i < 10; i++) longa(); }

int main() { foo1(); foo2(); return 0; }
Compile and run:
gcc -o t1 -g t1.c
root@ubuntu-test:~# perf stat ./t1
Sample output (excerpt):
Performance counter stats for './t1':
218.584169 task-clock # 0.997 CPUs utilized
18 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
82 page-faults # 0.000 M/sec
771,180,100 cycles # 3.528 GHz
550,703,114 instructions # 0.71 insns per cycle
110,117,522 branches # 503.776 M/sec
5,009 branch-misses # 0.00% of all branches
0.219155248 seconds time elapsed
Program t1 is CPU‑bound: task‑clock (218.58 ms) accounts for almost all of the 219.16 ms elapsed time, i.e. 0.997 CPUs utilized. The report's metrics (task‑clock, context‑switches, cache‑misses, etc.) together indicate whether a program is CPU‑bound or I/O‑bound.
perf‑top example
$ perf top
Events: 8K cycles
98.67% t2 [.] main
1.10% [kernel] [k] __do_softirq
...
This instantly shows that the infinite‑loop program t2 dominates CPU usage.
perf‑record and perf‑report with call‑graph
perf record -e cpu-clock -g ./t1
perf report
The output highlights the hot function longa and, after folding the call graph, reveals that 91.85 % of the time is spent via foo1 (which calls longa 100 times) while foo2 contributes only 8.15 %.
Using a tracepoint
# Count sys_enter events while running ls
perf stat -e raw_syscalls:sys_enter ls
Result: 111 sys_enter events in 0.001557 s.
3. Common Commands in Detail
perf list – lists all hardware, software and tracepoint events (e.g., cpu-cycles, cache-misses, context-switches).
perf stat – reports a suite of counters after program execution. Example:
$ perf stat ls
Performance counter stats for 'ls':
0.653782 task-clock (msec) # 0.691 CPUs utilized
0 context-switches # 0.000 K/sec
0 CPU-migrations # 0.000 K/sec
247 page-faults # 0.378 M/sec
1,625,426 cycles # 2.486 GHz
1,050,293 stalled-cycles-frontend # 64.62% frontend cycles idle
838,781 stalled-cycles-backend # 51.60% backend cycles idle
1,055,735 instructions # 0.65 insns per cycle
210,587 branches # 322.106 M/sec
10,809 branch-misses # 5.13% of all branches
0.000945883 seconds time elapsed
perf top – live view of the hottest symbols. Useful options:
-p <PID> – focus on a specific process.
-e <event> – monitor a particular metric such as cache-misses.
-a – aggregate data from all CPUs.
-K – hide kernel symbols.
perf record – captures sampling data to perf.data. Example:
$ perf record -g ls
perf report – analyses perf.data. Interactive navigation lets you drill into functions, view assembly, and explore call stacks. Additional filters such as -d, -C, -S, -U and -g refine the view.
4. Application Scenarios and Importance
Perf is essential for locating performance problems:
High CPU utilisation – use perf top or perf stat to find hot functions.
Excessive cache misses – inspect cache-misses and related events to optimise data access patterns.
Memory‑I/O bottlenecks – monitor page-faults, dTLB-loads, dTLB-load-misses to assess memory efficiency.
Function‑level hotspots – call‑graph visualisation (flame graphs) reveals which call paths dominate execution time.
Potential memory leaks – prolonged growth of memory‑related counters can hint at leaks.
By analysing IPC (instructions per cycle), cache utilisation, and call stacks, developers can decide whether to refactor algorithms, reduce unnecessary allocations, or adjust data structures.
5. Typical Performance‑Testing Workflow
Requirement analysis.
Prepare test scripts.
Execute tests.
Collect results.
Analyse problems.
6. Using Perf + Flame Graphs to Pinpoint Hot Functions
Steps to generate a flame graph:
Run a stress test until the program reaches its performance knee.
Convert the raw data to an unfolded text file:
perf script -i perf.data > perf.unfold
Fold identical call stacks:
./stackcollapse-perf.pl perf.unfold > perf.folded
Render the SVG flame graph:
./flamegraph.pl perf.folded > perf.svg
The resulting SVG visualises which functions consume the most CPU time, allowing quick identification of optimisation targets.
Perf works directly with native C/C++ binaries; compiling with debug symbols yields richer information. For languages such as Java or Go, language‑specific agents generate compatible perf data.