
Understanding CPU Hardware Efficiency and Using Linux perf for Performance Monitoring

CPU efficiency depends on a low CPI and a high cache‑hit rate. On Linux, both can be measured with the high‑level perf utility or the low‑level perf_event_open syscall, which read hardware performance counters for cycles, instructions, and cache misses and reveal how often the processor falls back to slower memory.

Java Tech Enthusiast

When discussing CPU performance, most people focus on CPU utilization, but instruction execution efficiency is equally important. Low efficiency means the CPU does a lot of work without producing useful results.

1. CPU Hardware Efficiency

The CPU consists of multiple cores, each with its own registers and private caches (L1 data, L1 instruction, and L2), plus an L3 cache shared by all cores. A running program fetches instructions and data through this hierarchy: registers, then the L1/L2/L3 caches, then main memory.

The classic pipeline stages are fetch, decode, execute, and memory access. Memory latency (10‑40 ns) is far higher than a single CPU cycle (≈0.3 ns), so the caches form a storage pyramid: the closer a level sits to the core, the faster but smaller it is.

Fetch: load instruction into the instruction register.

Decode: translate the instruction and load operands into registers.

Execute: perform the operation and store the result in a register.

Memory access: read/write data between registers and memory.

Two main metrics evaluate efficiency:

CPI (cycles per instruction) – average number of clock cycles each instruction consumes.

Cache hit rate – proportion of memory accesses satisfied by caches; lower cache‑miss rates mean higher performance.

2. How to Evaluate CPU Hardware Efficiency

2.1 Using the perf Tool

Linux ships with the perf utility. Running perf list hw cache prints the hardware events that can be monitored:

# perf list hw cache
List of pre-defined events (to be used in -e):
  branch-instructions OR branches            [Hardware event]
  branch-misses                              [Hardware event]
  bus-cycles                                 [Hardware event]
  cache-misses                               [Hardware event]
  cache-references                           [Hardware event]
  cpu-cycles OR cycles                       [Hardware event]
  instructions                               [Hardware event]
  ref-cycles                                 [Hardware event]
  L1-dcache-load-misses                      [Hardware cache event]
  L1-dcache-loads                            [Hardware cache event]
  L1-dcache-stores                           [Hardware cache event]
  L1-icache-load-misses                      [Hardware cache event]
  dTLB-load-misses                           [Hardware cache event]
  dTLB-loads                                 [Hardware cache event]
  ...

Key counters for performance analysis are:

cpu-cycles : total CPU cycles consumed.

instructions : number of retired instructions (used with cycles to compute CPI).

L1-dcache-loads / L1-dcache-load-misses : L1 data‑cache accesses and misses.

dTLB-loads / dTLB-load-misses : data TLB accesses and misses.

Example: measuring a simple workload.

# perf stat sleep 5
Performance counter stats for 'sleep 5':
    1,758,466 cycles            # 2.575 GHz
      871,474 instructions      # 0.50 insn per cycle

From the output, IPC = 0.50, therefore CPI = 1 / 0.50 = 2 cycles per instruction.

Measuring cache‑miss rates:

# perf stat -e L1-dcache-load-misses,L1-dcache-loads,dTLB-load-misses,dTLB-loads sleep 5
    22,578 L1-dcache-load-misses # 10.22% of all L1-dcache accesses
   220,911 L1-dcache-loads
     2,101 dTLB-load-misses     # 0.95% of all dTLB cache accesses
   220,911 dTLB-loads

The miss percentages (10.22 % for L1‑data, 0.95 % for dTLB) indicate how often the CPU had to fall back to slower memory levels.

2.2 Directly Using Kernel System Calls

For custom scenarios you can bypass perf and use the perf_event_open syscall to create a perf file descriptor and read counters yourself.

// glibc provides no perf_event_open wrapper, so the raw syscall is used
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    // Step 1: create perf fd
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));           // unused fields must be zero
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;           // monitor hardware events
    attr.config = PERF_COUNT_HW_INSTRUCTIONS; // count retired instructions
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

    // Step 2: periodic read loop
    uint64_t instructions;
    while (1) {
        read(fd, &instructions, sizeof(instructions));
        // process the value …
    }
}

The syscall parameters pid=0 (current process) and cpu=-1 (all CPUs) are typical. The kernel returns a file descriptor that can be read like any regular file.

3. perf Internals

Under the hood, Linux registers a Performance Monitoring Unit (PMU) for each architecture. For x86 the PMU is defined in arch/x86/events/core.c and registered with perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW) .

static struct pmu pmu = {
    .pmu_enable = x86_pmu_enable,
    .read       = x86_pmu_read,
    ...
};

The perf_event_open syscall allocates a file descriptor, creates a perf_event object, binds it to the appropriate PMU, and installs a file with perf_fops operations (read, ioctl, mmap).

SYSCALL_DEFINE5(perf_event_open,
        struct perf_event_attr __user *, attr_uptr,
        pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
{
    // 1. allocate fd
    event_fd = get_unused_fd_flags(f_flags);
    // 2. allocate and initialise event based on attr
    event = perf_event_alloc(&attr, cpu, task, group_leader, NULL, NULL, NULL, cgroup_fd);
    // 3. create anon inode file with perf_fops
    event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
    // 4. install into context and fd table
    fd_install(event_fd, event_file);
    return event_fd;
}

The read path goes through perf_fops.read → perf_read → __perf_read → perf_event_read → x86_pmu_read , which finally calls the rdpmcl() helper (the RDPMC instruction) to fetch the counter value from the hardware PMC register.

static inline u64 x86_perf_event_update(struct perf_event *event)
{
    rdpmcl(hwc->event_base_rdpmc, new_raw_count);
    return new_raw_count;
}

Summary

CPU performance is governed by CPI and cache‑hit rates. Linux provides two practical ways to observe these metrics: the high‑level perf tool and the low‑level perf_event_open syscall. Both rely on hardware Performance Monitoring Counters (PMCs) built into modern CPUs, allowing precise, low‑overhead measurement of cycles, instructions, and cache behavior.

Tags: performance-monitoring, CPU, Cache Miss Rate, CPI, Linux perf, perf_event_open, PMU
Written by

Java Tech Enthusiast

Sharing computer programming language knowledge, focusing on Java fundamentals, data structures, related tools, Spring Cloud, IntelliJ IDEA... Book giveaways, red‑packet rewards and other perks await!
