Using Linux perf for Performance Profiling and Analysis
This article introduces Linux perf, explains how to install it, demonstrates basic commands such as perf‑list, perf‑stat, perf‑top, perf‑record and perf‑report, and shows how to combine perf with flame‑graphs to locate CPU‑bound hotspots and other performance bottlenecks in applications and the kernel.
Introduction: Perf is a powerful Linux performance analysis tool used to collect and analyze system performance data, helping developers locate bottlenecks and optimize code.
1. Perf Overview
Since kernel 2.6.31, Linux includes perf, which can perform function‑level and instruction‑level hotspot detection using PMU, tracepoints, and special kernel counters. It can profile user applications, the kernel, or both, providing a comprehensive view of performance issues.
Perf is built into the Linux kernel source tree and works on an event‑sampling principle, focusing on performance events for both processor‑related and OS‑related metrics.
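The sampling principle can be illustrated without perf itself: each sample records which function was running when the event fired, and a function's share of the samples estimates its share of the cost. The sketch below is only a toy model of that idea. The helper name sample_percentages and the function names hot_loop and setup are made up for illustration, and awk and sort are assumed to be available.

```shell
# sample_percentages: read one function name per line (one line per sample)
# and print each function's share of the samples, largest first.
# Hypothetical helper for illustration; not part of perf.
sample_percentages() {
    awk '{ hits[$1]++; total++ }
         END { for (f in hits) printf "%.1f%% %s\n", 100 * hits[f] / total, f }' |
        sort -rn
}

# Four made-up samples: three landed in hot_loop, one in setup.
printf '%s\n' hot_loop hot_loop hot_loop setup | sample_percentages
# prints:
# 75.0% hot_loop
# 25.0% setup
```

More samples give a better estimate, which is why short runs produce noisy profiles.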
1.1 Installing Perf
Installation is straightforward on kernel 2.6.31 or newer. First install the kernel source:
apt-get install linux-source
Extract the source in /usr/src, navigate to tools/perf, and run:
make
make install
If required, install development packages first:
apt-get install -y binutils-dev
apt-get install -y libdw-dev
apt-get install -y python-dev
apt-get install -y libnewt-dev
1.2 Basic Usage of Perf
The default event is CPU cycles (cpu‑cycles); one cycle is the smallest unit of time the processor works in.
The generic command format is perf COMMAND [-e event ...] PROGRAM, where COMMAND can be top, stat, record, report, etc., and multiple events are specified with multiple -e options.
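As a small illustration of that format, the sketch below assembles a perf command line with one -e option per event and prints it instead of running it, so it works even where perf is not installed. The helper name build_perf_cmd is hypothetical; the event names are standard perf events, but check perf list for what your machine supports.

```shell
# build_perf_cmd SUBCOMMAND PROGRAM EVENT...
# Hypothetical helper: prints a perf command line with one -e per event.
build_perf_cmd() {
    subcommand=$1
    program=$2
    shift 2
    cmd="perf $subcommand"
    for event in "$@"; do
        cmd="$cmd -e $event"
    done
    printf '%s %s\n' "$cmd" "$program"
}

build_perf_cmd stat ./t1 cycles instructions cache-misses
# prints: perf stat -e cycles -e instructions -e cache-misses ./t1
```

Printing the command first is a cheap way to review it before running it on a busy machine.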
Perf includes 22 sub‑tools (the exact number varies between versions); the most commonly used are:
perf‑list
perf‑stat
perf‑top
perf‑record
perf‑report
perf‑trace
perf‑list shows all supported events (hardware, software, cache, tracepoint, etc.).
perf list [hw | sw | cache | tracepoint | event_glob]
The perf‑stat example below uses a sample program, t1.c, that contains a long loop:
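// t1.c
void longa()
{
    int i, j;
    for (i = 0; i < 1000000; i++)
        j = i;   /* busy work that burns CPU cycles */
}
void foo2()   /* calls longa() 10 times */
{
    int i;
    for (i = 0; i < 10; i++)
        longa();
}
void foo1()   /* calls longa() 100 times */
{
    int i;
    for (i = 0; i < 100; i++)
        longa();
}
int main(void) { foo1(); foo2(); return 0; }
Compile and run: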
gcc -o t1 -g t1.c
Running perf stat ./t1 yields statistics such as task‑clock, cycles, instructions, IPC, cache‑misses, etc.
Performance counter stats for './t1':
218.584169 task-clock # 0.997 CPUs utilized
771,180,100 cycles # 3.528 GHz
550,703,114 instructions # 0.71 insns per cycle
5,009 branch-misses # 0.00% of all branches
0.219155248 seconds time elapsed
Key metrics include task‑clock (CPU utilization), context‑switches, cache‑misses, CPU‑migrations, cycles, IPC, and cache references.
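The derived values in the right-hand column can be checked by hand from the raw counters. The sketch below recomputes three of them from the numbers in the output above (with thousands separators removed). The helper names ipc, clock_ghz, and cpus_utilized are made up for illustration, and awk is assumed to be available.

```shell
# Values copied from the perf stat output above.
task_clock_ms=218.584169
cycles=771180100
instructions=550703114
elapsed_s=0.219155248

# Instructions per cycle = instructions / cycles.
ipc() { awk -v i="$1" -v c="$2" 'BEGIN { printf "%.2f\n", i / c }'; }

# Clock rate in GHz = cycles / CPU time; task-clock is in milliseconds.
clock_ghz() { awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.3f\n", c / (ms * 1e6) }'; }

# CPUs utilized = CPU time / wall-clock time.
cpus_utilized() { awk -v ms="$1" -v s="$2" 'BEGIN { printf "%.3f\n", (ms / 1000) / s }'; }

echo "IPC:           $(ipc "$instructions" "$cycles")"              # 0.71
echo "Clock (GHz):   $(clock_ghz "$cycles" "$task_clock_ms")"       # 3.528
echo "CPUs utilized: $(cpus_utilized "$task_clock_ms" "$elapsed_s")" # 0.997
```

An IPC below 1 on a loop this simple suggests the CPU is stalling rather than retiring instructions every cycle, which is the kind of hint these derived metrics are meant to give.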
Using -e you can change the default events to focus on specific metrics.
Programs can be CPU‑bound (high CPU utilization) or I/O‑bound (low CPU utilization); optimization strategies differ accordingly.
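A rough way to apply this distinction is to compare CPU time (task‑clock) with elapsed wall-clock time. The sketch below is a heuristic only: the helper name classify_workload and the default threshold of 0.8 are arbitrary assumptions for illustration, not values defined by perf, and a low ratio can also come from sleeping or lock waits rather than I/O.

```shell
# classify_workload TASK_CLOCK_MS ELAPSED_S [THRESHOLD]
# Hypothetical helper: labels a run by its CPU utilization.
# THRESHOLD defaults to 0.8, an arbitrary cut-off chosen for this sketch.
classify_workload() {
    awk -v ms="$1" -v s="$2" -v t="${3:-0.8}" 'BEGIN {
        util = (ms / 1000) / s          # CPU seconds per wall-clock second
        if (util >= t) print "CPU-bound"; else print "I/O-bound"
    }'
}

classify_workload 218.584169 0.219155248   # the t1 run above: CPU-bound
classify_workload 12.5 2.0                 # made-up run, mostly waiting: I/O-bound
```

For multi-threaded programs the ratio can exceed 1, so the threshold would need adjusting to the number of threads.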
perf‑top provides a real‑time view of the most CPU‑intensive functions, similar to top but for profiling.
Events: 8K cycles
98.67% t2 [.] main
1.10% [kernel] __do_softirq
0.07% [kernel] _raw_spin_unlock_irqrestore
perf‑record and perf‑report allow deeper analysis: perf record saves per‑function samples to perf.data, and perf report displays them. Example:
perf record -e cpu-clock ./t1
perf report
The output shows the hotspot (here longa()) and, when the data was recorded with perf record -g, the call graph:
Events: 270 cpu-clock
- 100.00% t1 t1 [.] longa
- longa
+ 91.85% foo1
+ 8.15% foo2
Tracepoints can be used to sample kernel behavior, such as counting system calls with raw_syscalls:sys_enter:
perf stat -e raw_syscalls:sys_enter ls
2. Common Performance Problem Analysis
Typical performance testing steps: requirement analysis, script preparation, test execution, result collection, and problem analysis.
Examples of real‑world issues include backend service bottlenecks, CPU saturation caused by inefficient fuzzy‑matching functions, MySQL write latency, and network bandwidth saturation.
2.1 Using Perf + Flamegraph to Locate Hot Functions
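Perf data can be rendered as a flame graph to visualize where time is spent. The stackcollapse-perf.pl and flamegraph.pl scripts used here come from Brendan Gregg's FlameGraph repository. The workflow is: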
Run a stress test and record data: sudo perf record -e cpu-clock -g -p <pid> (where <pid> is the ID of the process under test)
Convert the data: perf script -i perf.data > perf.unfold
Collapse the stacks: ./stackcollapse-perf.pl perf.unfold > perf.folded
Generate SVG: ./flamegraph.pl perf.folded > perf.svg
In the resulting flame graph, each frame's width is proportional to the number of samples in which that function appears, so the widest frames mark the hot functions to optimize first.
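The folded file is plain text, so it can also be inspected without rendering an SVG. Each line holds a semicolon-separated stack, root first, followed by a sample count. The sketch below totals the samples per leaf function, i.e. the function that was actually on the CPU. The helper name leaf_totals is hypothetical, and the two input lines are made-up counts chosen to be consistent with the 270-sample report of t1 shown earlier.

```shell
# leaf_totals: read folded stacks ("frame;frame;... count") on stdin and
# print the total sample count per leaf function, largest first.
# Hypothetical helper for illustration; not part of perf or FlameGraph.
leaf_totals() {
    awk '{
        count = $NF                      # last field is the sample count
        sub(/ [0-9]+$/, "")              # drop the count, leaving the stack
        n = split($0, frames, ";")       # frames run from root to leaf
        total[frames[n]] += count
    } END {
        for (f in total) print total[f], f
    }' | sort -rn
}

# Made-up folded stacks for t1: longa reached through foo1 and foo2.
printf '%s\n' \
    't1;main;foo1;longa 248' \
    't1;main;foo2;longa 22' | leaf_totals
# prints: 270 longa
```

With a real capture, the same function could be fed perf.folded directly, for example leaf_totals < perf.folded, to get a quick text ranking before opening the SVG.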
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.