How Perf Works: Inside Linux Kernel’s Powerful Tracing and Profiling Tool
This article explains the Linux kernel’s perf utility, covering its architecture, key features such as lightweight event sampling, tracing, profiling and debugging, step‑by‑step installation, common commands with real code examples, and how to use perf and flame graphs to locate and optimise performance bottlenecks.
In the vast landscape of the Linux kernel, performance optimisation and fault diagnosis are two towering challenges for developers. The perf tool stands out as a "magic" instrument that can precisely probe every corner of the kernel, from CPU cycles and cache‑hit rates to function call frequencies and context switches.
1. Perf Overview
Perf (short for Performance) is an event‑driven profiling suite built into the Linux kernel. It captures hardware, software and kernel tracepoint events, allowing developers to observe the system’s inner workings.
1.1 Core Capabilities
Lightweight event sampling: Uses hardware performance counters to sample processor events such as instruction count, cache hits/misses and branch‑prediction success, providing a fine‑grained view of program behaviour.
Tracing: Records function‑call chains for processes or the kernel, producing call‑graph visualisations (e.g., flame graphs) that reveal where time is spent.
Profiling: Generates per‑function statistics, including per‑line instruction counts, enabling developers to pinpoint hot code paths.
Benchmarking: Can be combined with tools like sysbench or fio to evaluate overall system performance under load.
Debugging: Works with gdb to attach detailed performance data to source lines, helping to locate the root cause of slowdowns.
2. Installing and Using Perf
2.1 Installation
On Debian/Ubuntu the tool is provided by linux-tools-common and linux-tools-$(uname -r):
sudo apt-get install linux-tools-common linux-tools-`uname -r`
On Red Hat/CentOS the package is called perf:
sudo yum install perf
If a package manager is unavailable, perf can be built from the kernel source:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux/tools/perf
make
sudo cp perf /usr/local/bin/
Verify the installation with perf --version.
2.2 Basic Usage
The default event is cpu-cycles. The generic command format is:
perf <COMMAND> [-e <event> ...] <PROGRAM>
Perf bundles 22 sub‑tools; the most commonly used are:
perf‑list – show all supported events.
perf‑stat – collect summary statistics after a program finishes.
perf‑top – live view of the hottest functions.
perf‑record – capture detailed sampling data.
perf‑report – analyse the data produced by perf‑record.
perf‑trace – trace system calls and kernel events.
Example: perf‑stat on a simple program
// t1.c
void longa() {
    int i, j;
    for (i = 0; i < 1000000; i++)
        j = i;   /* busy work; compile without -O so the loop is kept */
}

void foo1() { for (int i = 0; i < 100; i++) longa(); }

void foo2() { for (int i = 0; i < 10; i++) longa(); }

int main() { foo1(); foo2(); return 0; }
Compile and run:
gcc -o t1 -g t1.c
root@ubuntu-test:~# perf stat ./t1
Sample output (excerpt):
Performance counter stats for './t1':
218.584169 task-clock # 0.997 CPUs utilized
18 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
82 page-faults # 0.000 M/sec
771,180,100 cycles # 3.528 GHz
550,703,114 instructions # 0.71 insns per cycle
110,117,522 branches # 503.776 M/sec
5,009 branch-misses # 0.00% of all branches
0.219155248 seconds time elapsed
Program t1 is CPU‑bound: task‑clock (218.58 ms) accounts for almost all of the 219.16 ms elapsed time, i.e. 0.997 CPUs utilized. The report's metrics (task‑clock, context‑switches, cache‑misses, etc.) together indicate whether a program is CPU‑bound or I/O‑bound.
perf‑top example
$ perf top
Events: 8K cycles
98.67% t2 [.] main
1.10% [kernel] [k] __do_softirq
...
This instantly shows that the infinite‑loop program t2 dominates CPU usage.
perf‑record and perf‑report with call‑graph
perf record -e cpu-clock -g ./t1
perf report
The output highlights the hot function longa and, after folding the call graph, reveals that 91.85 % of the time is spent via foo1 (which calls longa 100 times) while foo2 contributes only 8.15 %.
Using a tracepoint
# Count sys_enter events while running ls
perf stat -e raw_syscalls:sys_enter ls
Result: 111 sys_enter events in 0.001557 s.
3. Common Commands in Detail
perf list – lists all hardware, software and tracepoint events (e.g., cpu-cycles, cache-misses, context-switches).
perf stat – reports a suite of counters after program execution. Example:
$ perf stat ls
Performance counter stats for 'ls':
0.653782 task-clock (msec) # 0.691 CPUs utilized
0 context-switches # 0.000 K/sec
0 CPU-migrations # 0.000 K/sec
247 page-faults # 0.378 M/sec
1,625,426 cycles # 2.486 GHz
1,050,293 stalled-cycles-frontend # 64.62% frontend cycles idle
838,781 stalled-cycles-backend # 51.60% backend cycles idle
1,055,735 instructions # 0.65 insns per cycle
210,587 branches # 322.106 M/sec
10,809 branch-misses # 5.13% of all branches
0.000945883 seconds time elapsed
perf top – live view of the hottest symbols. Useful options:
-p <PID> – focus on a specific process.
-e <event> – monitor a particular metric such as cache-misses.
-a – aggregate data from all CPUs.
-K – hide kernel symbols.
perf record – captures sampling data to perf.data. Example:
$ perf record -g ls
perf report – analyses perf.data. Interactive navigation lets you drill into functions, view assembly, and explore call stacks. Additional filters such as -d, -C, -S, -U and -g refine the view.
4. Application Scenarios and Importance
Perf is essential for locating performance problems:
High CPU utilisation – use perf top or perf stat to find hot functions.
Excessive cache misses – inspect cache-misses and related events to optimise data access patterns.
Memory‑I/O bottlenecks – monitor page-faults, dTLB-loads, dTLB-load-misses to assess memory efficiency.
Function‑level hotspots – call‑graph visualisation (flame graphs) reveals which call paths dominate execution time.
Potential memory leaks – prolonged growth of memory‑related counters can hint at leaks.
By analysing IPC (instructions per cycle), cache utilisation, and call stacks, developers can decide whether to refactor algorithms, reduce unnecessary allocations, or adjust data structures.
5. Typical Performance‑Testing Workflow
Requirement analysis.
Prepare test scripts.
Execute tests.
Collect results.
Analyse problems.
6. Using Perf + Flame Graphs to Pinpoint Hot Functions
Steps to generate a flame graph:
Run a stress test until the program reaches its performance knee.
Convert the raw data to an unfolded text file:
perf script -i perf.data > perf.unfold
Fold identical call stacks:
./stackcollapse-perf.pl perf.unfold > perf.folded
Render the SVG flame graph:
./flamegraph.pl perf.folded > perf.svg
The resulting SVG visualises which functions consume the most CPU time, allowing quick identification of optimisation targets.
Perf works directly with native C/C++ binaries; compiling with debug symbols yields richer information. For languages such as Java or Go, language‑specific agents generate compatible perf data.