
Using Linux perf for Performance Profiling and Analysis

This article introduces Linux perf, explains how to install it, demonstrates basic commands such as perf‑list, perf‑stat, perf‑top, perf‑record and perf‑report, and shows how to combine perf with flame‑graphs to locate CPU‑bound hotspots and other performance bottlenecks in applications and the kernel.

Deepin Linux

May 22, 2024


Introduction: Perf is a powerful Linux performance analysis tool used to collect and analyze system performance data, helping developers locate bottlenecks and optimize code.

1. Perf Overview

Perf has been part of the Linux kernel since version 2.6.31. It performs function-level and instruction-level hotspot detection using the PMU (the processor's performance monitoring unit), tracepoints, and special kernel counters. It can profile user applications, the kernel, or both, providing a comprehensive view of performance issues.

Perf is built into the Linux kernel source tree and works on an event‑sampling principle, focusing on performance events for both processor‑related and OS‑related metrics.

1.1 Installing Perf

Installation is straightforward on kernel 2.6.31 or newer. First install the kernel source:

apt-get install linux-source

Then extract the source in /usr/src, change to the tools/perf directory, and run:

make
make install

If required, install development packages first:

apt-get install -y binutils-dev
apt-get install -y libdw-dev
apt-get install -y python-dev
apt-get install -y libnewt-dev

1.2 Basic Usage of Perf

The default event is cpu-cycles. A CPU cycle is the smallest unit of time the processor can recognize.

The generic command format is perf COMMAND [-e event ...] PROGRAM, where COMMAND can be top, stat, record, report, etc., and multiple events are specified with multiple -e options.
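As a rough illustration of that format, the shell sketch below assembles a perf command line from a comma-separated event list, emitting one -e option per event. The helper name build_perf_cmd is hypothetical, and it only prints the command rather than running it, so it also works on a machine where perf is not installed.

```shell
#!/bin/sh
# build_perf_cmd is a hypothetical helper, not part of perf.
# Usage: build_perf_cmd COMMAND EVENT[,EVENT...] PROGRAM [ARGS...]
# It prints the perf command line with one -e option per event.
build_perf_cmd() {
    subcmd=$1
    events=$2
    shift 2
    cmd="perf $subcmd"
    # Split the comma-separated event list into separate -e options.
    for ev in $(printf '%s' "$events" | tr ',' ' '); do
        cmd="$cmd -e $ev"
    done
    printf '%s %s\n' "$cmd" "$*"
}

build_perf_cmd stat cycles,instructions ./t1
# prints: perf stat -e cycles -e instructions ./t1
```

perf also accepts a comma-separated list after a single -e (for example -e cycles,instructions), so the two spellings are interchangeable.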

Perf includes 22 sub-tools; the six most common are:

perf‑list

perf‑stat

perf‑top

perf‑record

perf‑report

perf‑trace

perf‑list shows all supported events (hardware, software, cache, tracepoint, etc.).

perf list [hw | sw | cache | tracepoint | event_glob]

perf-stat runs a program and prints a summary of event counts when it exits. The sample program t1.c below contains a long loop:

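//t1.c
// Hot spot: a busy loop that does no useful work.
void longa(void)
{
  int i, j;
  for (i = 0; i < 1000000; i++)
    j = i;
}

// Calls longa() 10 times.
void foo2(void)
{
  int i;
  for (i = 0; i < 10; i++)
    longa();
}

// Calls longa() 100 times.
void foo1(void)
{
  int i;
  for (i = 0; i < 100; i++)
    longa();
}

int main(void) { foo1(); foo2(); return 0; }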

Compile the program:

gcc -o t1 -g t1.c

Running perf stat ./t1 yields statistics such as task-clock, cycles, instructions, IPC, cache-misses, etc.

Performance counter stats for './t1':
        218.584169 task-clock # 0.997 CPUs utilized
        771,180,100 cycles # 3.528 GHz
        550,703,114 instructions # 0.71 insns per cycle
        5,009 branch-misses # 0.00% of all branches
        0.219155248 seconds time elapsed

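Key metrics include task-clock (CPU time consumed, in milliseconds; the "CPUs utilized" figure is task-clock divided by elapsed time), context-switches, CPU-migrations, cycles, instructions, IPC (instructions per cycle), cache-references, and cache-misses.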
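The derived columns in the sample output can be recomputed by hand. The sketch below calls awk from a shell function, takes the task-clock, cycles, and instructions values copied from the perf stat output for ./t1, and derives IPC and the effective clock rate. The function name derive_metrics is made up for this example.

```shell
#!/bin/sh
# derive_metrics is a hypothetical helper.
# Arguments: task-clock in milliseconds, cycles, instructions.
derive_metrics() {
    LC_ALL=C awk -v ms="$1" -v cyc="$2" -v ins="$3" 'BEGIN {
        # IPC: instructions retired per CPU cycle.
        # GHz: cycles divided by CPU time in nanoseconds (ms * 1000000).
        printf "ipc=%.2f ghz=%.3f\n", ins / cyc, cyc / (ms * 1000000)
    }'
}

# Values copied from the sample perf stat output for ./t1.
derive_metrics 218.584169 771180100 550703114
# prints: ipc=0.71 ghz=3.528
```

Both results match the annotations perf printed next to the cycles and instructions counters.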

Using -e you can change the default events to focus on specific metrics.

Programs can be CPU‑bound (high CPU utilization) or I/O‑bound (low CPU utilization); optimization strategies differ accordingly.
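The "CPUs utilized" figure that perf stat prints is task-clock divided by elapsed wall-clock time, which gives a quick first guess at which category a run falls into. The sketch below recomputes it from the sample output for ./t1. The helper name cpu_utilization and the 0.5 cut-off are arbitrary assumptions chosen for illustration, not perf conventions.

```shell
#!/bin/sh
# cpu_utilization is a hypothetical helper.
# Arguments: task-clock in milliseconds, elapsed time in seconds.
cpu_utilization() {
    LC_ALL=C awk -v ms="$1" -v sec="$2" 'BEGIN {
        util = (ms / 1000) / sec
        # The 0.5 threshold is an arbitrary cut-off chosen for this sketch.
        kind = (util >= 0.5) ? "CPU-bound" : "I/O-bound"
        printf "%.3f %s\n", util, kind
    }'
}

# Values copied from the sample perf stat output for ./t1.
cpu_utilization 218.584169 0.219155248
# prints: 0.997 CPU-bound
```

A low ratio only says the process spent most of its time off the CPU; whether it was waiting on I/O, locks, or sleep still has to be confirmed with other tools.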

perf-top provides a real-time view of the functions consuming the most CPU, much as top does for processes. In the sample output below, a program named t2 accounts for nearly all samples:

Events: 8K cycles
 98.67% t2 [.] main
  1.10% [kernel] __do_softirq
  0.07% [kernel] _raw_spin_unlock_irqrestore

perf‑record and perf‑report allow deeper analysis by recording per‑function statistics and displaying them with perf report. Example:

perf record -e cpu-clock ./t1
perf report

The report shows the hotspot, longa(). If the data is recorded with the -g option (perf record -e cpu-clock -g ./t1), the report also shows the call graph:

Events: 270 cpu-clock
- 100.00% t1 t1 [.] longa
   - longa
      + 91.85% foo1
      + 8.15% foo2
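The split between foo1 and foo2 can be sanity-checked against the source: foo1 calls longa() 100 times and foo2 calls it 10 times, so roughly 100/110 of the samples should arrive through foo1. A minimal sketch of that arithmetic follows; the function name expected_share is hypothetical.

```shell
#!/bin/sh
# expected_share is a hypothetical helper.
# Arguments: calls made by this caller, calls made by all callers.
expected_share() {
    LC_ALL=C awk -v n="$1" -v total="$2" 'BEGIN { printf "%.2f%%\n", 100 * n / total }'
}

expected_share 100 110   # foo1: prints 90.91%
expected_share 10 110    # foo2: prints 9.09%
```

The sampled 91.85% and 8.15% are close to these values, and with only 270 samples a gap of about one percentage point is consistent with sampling noise.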

Tracepoints can be used to sample kernel behavior, such as counting system calls with raw_syscalls:sys_enter:

perf stat -e raw_syscalls:sys_enter ls

2. Common Performance Problem Analysis

Typical performance testing steps: requirement analysis, script preparation, test execution, result collection, and problem analysis.

Examples of real‑world issues include backend service bottlenecks, CPU saturation caused by inefficient fuzzy‑matching functions, MySQL write latency, and network bandwidth saturation.

2.1 Using Perf + Flamegraph to Locate Hot Functions

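Perf data can be rendered as a flame graph to visualize where time is spent, using the stackcollapse-perf.pl and flamegraph.pl scripts from the FlameGraph project. The workflow is:

Run a stress test and record data while it is running (stop with Ctrl+C once enough samples have been collected):

sudo perf record -e cpu-clock -g -p <pid>

Convert perf.data to text:

perf script -i perf.data > perf.unfold

Collapse the stacks:

./stackcollapse-perf.pl perf.unfold > perf.folded

Generate the SVG:

./flamegraph.pl perf.folded > perf.svg

In the resulting flame graph, the widest frames are the functions that appear in the most samples, which makes them the first candidates for optimization.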
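To make the collapse step less opaque, the following sketch approximates in awk what stackcollapse-perf.pl does to the text form of the recorded samples: each sample's stack (listed innermost frame first) is reversed, joined with semicolons, and counted. It is a simplified stand-in, not the real script. The function name fold_stacks, the addresses, and the sample input are made up for illustration, and real perf output has more variations than this handles.

```shell
#!/bin/sh
# fold_stacks is a hypothetical, simplified stand-in for stackcollapse-perf.pl.
# It reads perf-script-style text on stdin and prints "comm;root;...;leaf count".
fold_stacks() {
    awk '
    function emit() {
        if (stack != "") count[comm ";" stack]++
        stack = ""
    }
    # Sample header (no leading whitespace): first field is the command name.
    /^[^ \t]/ { emit(); comm = $1; next }
    # Frame line: address, symbol+offset, (dso). Frames arrive leaf first.
    /^[ \t]+[0-9a-f]+[ \t]/ {
        sym = $2
        sub(/\+0x[0-9a-f]+$/, "", sym)               # drop the +offset suffix
        stack = (stack == "") ? sym : sym ";" stack  # prepend so the root comes first
        next
    }
    # A blank line ends the current sample.
    /^[ \t]*$/ { emit() }
    END {
        emit()
        for (k in count) print k, count[k]
    }' | sort
}

# Hand-written input that mimics three samples from the t1 program.
fold_stacks <<'EOF'
t1 1234 100.000000: 250000 cpu-clock:
    401106 longa+0x14 (/tmp/t1)
    40114f foo1+0x1c (/tmp/t1)
    401189 main+0x9 (/tmp/t1)

t1 1234 100.001000: 250000 cpu-clock:
    401106 longa+0x14 (/tmp/t1)
    40114f foo1+0x1c (/tmp/t1)
    401189 main+0x9 (/tmp/t1)

t1 1234 100.002000: 250000 cpu-clock:
    401106 longa+0x14 (/tmp/t1)
    40116c foo2+0x1c (/tmp/t1)
    401189 main+0x9 (/tmp/t1)
EOF
# prints:
# t1;main;foo1;longa 2
# t1;main;foo2;longa 1
```

Each output line is one distinct stack with its sample count, which is the folded format that flamegraph.pl turns into frame widths.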

