
Inside Linux Perf: How the Kernel’s Powerful Tracing Tool Works

The article introduces Linux’s built‑in performance analysis tool perf, explains its event‑driven sampling, tracing and profiling capabilities, shows how to install it on various distributions, demonstrates common commands with real code examples, and discusses practical scenarios for locating and optimizing kernel and application performance issues.

Linux Code Review Hub

1. Perf Tool Overview

1.1 What is Perf

Perf (short for Performance) is a profiling tool integrated into the Linux kernel. It uses an event‑driven mechanism to capture hardware, software, and kernel‑level performance events, allowing developers to inspect system behavior in detail.

1.2 Powerful Features of Perf

Perf has evolved alongside the Linux kernel, expanding from simple monitoring to a rich set of capabilities. It can monitor hardware events such as CPU cycles and cache hits, software events like context switches and page faults, as well as kernel tracepoints and dynamic tracing.

(1) Lightweight Event Sampling

Perf can sample processor events via hardware performance counters, reporting metrics such as the number of executed instructions, cache hit/miss counts, and branch‑prediction success rates. These metrics act as key signals that reveal the underlying performance characteristics of a program.

(2) Trace Functionality

Perf’s trace feature records function‑call chains for processes or the kernel, producing call graphs and flame graphs that visualize execution time and stack depth. By examining a flame graph, developers can quickly spot functions that dominate CPU usage.

(3) Profiling Functionality

Perf can profile a specific application to identify the most time‑consuming functions and even individual source lines. It provides per‑line statistics, enabling developers to see precisely which code paths affect performance the most.

(4) Benchmarking

Perf can be combined with external benchmark tools such as sysbench and fio to evaluate overall system performance under different workloads, including CPU‑bound, memory‑bound, and I/O‑bound scenarios.

(5) Debugging Functionality

When used together with debuggers like gdb, Perf can capture trace data and correlate it with source code, showing function call frequencies, execution times, and variable changes, which helps pinpoint performance bottlenecks.

2. Installing and Using Perf

2.1 Installing Perf

On Debian‑based systems, Perf is provided by the linux-tools-common and linux-tools-`uname -r` packages:

sudo apt-get install linux-tools-common linux-tools-`uname -r`

On Red Hat or CentOS, install it with yum:

sudo yum install perf

If a package manager is unavailable, Perf can be built from the Linux kernel source:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux/tools/perf
make

After compilation, copy the binary to a directory in $PATH:

sudo cp perf /usr/local/bin/

Verify the installation with:

perf --version

2.2 Basic Usage of Perf

The default performance event is cpu-cycles, representing the smallest time unit the CPU can measure. Perf commands follow the pattern perf COMMAND [-e event …] PROGRAM, where COMMAND can be top, stat, record, report, etc., and multiple events are specified with repeated -e options.

Perf bundles more than twenty sub‑tools; the most commonly used are:

perf-list

perf-stat

perf-top

perf-record

perf-report

perf-trace

(1) perf-list

Lists all symbolic event types supported by the system, including hardware, software, and tracepoint events:

perf list

(2) perf-stat

Example program t1.c demonstrates a CPU‑bound workload. Compile it with debug symbols:

gcc -o t1 -g t1.c

Running perf stat ./t1 yields output such as:

Performance counter stats for './t1':
        218.584169 task-clock # 0.997 CPUs utilized
                18 context-switches # 0.000 M/sec
                 0 CPU-migrations # 0.000 M/sec
                82 page-faults # 0.000 M/sec
        771,180,100 cycles # 3.528 GHz
        550,703,114 instructions # 0.71 insns per cycle
        110,117,522 branches # 503.776 M/sec
                5,009 branch-misses # 0.00% of all branches
        0.219155248 seconds time elapsed
Program t1 is CPU‑bound: task‑clock divided by elapsed time gives 0.997 CPUs utilized, meaning the process kept one CPU busy for almost the entire run.

The default statistics include task‑clock, context‑switches, cache‑misses, CPU‑migrations, cycles, IPC, cache references, and more.

(3) perf-top

Running perf top while a simple infinite‑loop program t2.c is executing shows real‑time percentages, e.g.:

Events: 8K cycles
 98.67% t2 [.] main
  1.10% [kernel] [k] __do_softirq
  0.07% [kernel] [k] _raw_spin_unlock_irqrestore

This quickly identifies the hot function ( main in the example).

(4) perf-record and perf-report

To obtain a call‑graph, record with -g and then report:

perf record -e cpu-clock -g ./t1
perf report

The report shows that 100 % of the sampled time is spent in longa(), with 91.85 % of that time attributed to foo1() (which calls longa() 100 times) and 8.15 % to foo2().

(5) Tracepoint Example

Counting system calls for ls using the raw_syscalls:sys_enter tracepoint:

perf stat -e raw_syscalls:sys_enter ls
Performance counter stats for 'ls':
        111 raw_syscalls:sys_enter
        0.001557549 seconds time elapsed

2.3 Detailed Explanation of Common Commands

(1) List all measurable events

perf list displays hardware events (e.g., cycles, instructions, cache-misses), software events (e.g., context-switches, page-faults), and tracepoints.

(2) Show statistics

Running perf stat ls produces a breakdown of task‑clock, cycles, instructions, branches, cache‑misses, etc., helping users understand the program’s performance profile.

(3) Real‑time system view

perf top displays the functions consuming the most CPU cycles in descending order. Monitoring is system‑wide by default; options such as -p <PID> limit the view to a specific process, -e cache-misses focuses on a particular event, and -K hides kernel symbols.

(4) Record and generate a report

perf record -g ls captures call‑graph data; perf report -i perf.data opens an interactive view where users can navigate functions, view assembly, and filter by DSO, command, or symbol.

3. Application Scenarios and Importance of Perf

3.1 Locating Performance Problems

Perf helps diagnose high CPU utilization by revealing which processes and functions dominate CPU cycles. In cache‑miss heavy workloads, perf stat can display L1/L2 cache statistics to guide data‑layout optimizations. For memory‑I/O bottlenecks, events such as page-faults, dTLB-loads, and dTLB-load-misses expose inefficient memory access patterns.

3.2 Keys to Performance Optimization

Analyzing IPC (instructions per cycle) indicates how well code utilizes the processor; low IPC suggests opportunities for instruction‑level improvements. Monitoring memory usage, allocation patterns, and cache behavior informs decisions about data structures and allocation strategies. Call‑stack tracing pinpoints hotspot functions, allowing targeted algorithmic or code‑level refinements.

4. Common Performance Problem Analysis

Typical performance testing follows these steps:

Requirement analysis

Script preparation

Test execution

Result aggregation

Problem analysis

A service loads a 1 GB word list into memory, performs fuzzy matching on incoming requests, forwards matches to a backend HTTP service, returns the response, and records a request identifier and count in MySQL.

Key functions: fuzzyMatching, sendingRequest, buildResponse, signNum (MySQL counter)

Four test groups illustrate typical bottlenecks and Perf‑guided analyses:

Group 1: Random requests at 1 k QPS show no CPU, memory, or bandwidth saturation; the limitation is likely the backend service. Perf can confirm that the backend is the bottleneck.

Group 2: After fixing the backend, 30‑character requests cause CPU load to max out at 400 QPS. Perf + flame graphs reveal that fuzzyMatching consumes the majority of CPU cycles, suggesting code‑level optimization.

Group 3: With backend and matching optimized, random requests reach 3 k QPS but then drop to 1 k QPS intermittently. Perf shows increased MySQL latency, indicating that direct database writes at high concurrency are a scalability issue; a cache layer (e.g., Redis) would be advisable.

Group 4: Replacing the backend with a real service caps QPS at 300 due to network bandwidth saturation. Perf helps verify that the network interface is the limiting factor and guides decisions about bandwidth upgrades or request sharding.

Using Perf together with flame graphs provides a visual way to locate time‑consuming functions:

Record data while stressing the program:

perf record -e cpu-clock -g -p 11110 -o data/perf.data sleep 30

Unfold the raw data:

perf script -i perf.data > perf.unfold

Collapse the stacks:

./stackcollapse-perf.pl perf.unfold > perf.folded

Generate an SVG flame graph:

./flamegraph.pl perf.folded > perf.svg

The resulting flame graph highlights the functions that dominate execution time, enabling developers to focus optimization effort where it matters most. Native Perf works best with C/C++ programs compiled with debug symbols; other languages (Java, Go, etc.) require language‑specific tooling that produces compatible perf data.

Tags: system optimization, performance profiling, benchmarking, kernel tracing, perf, flame graph
Written by Linux Code Review Hub, a professional Linux technology community and learning platform covering the kernel, memory management, process management, file system and I/O, performance tuning, device drivers, virtualization, and cloud computing.