Profiling NFS I/O with eBPF: From Perf Events to Pyroscope Flame Graphs
This article explains how to use eBPF and perf events to trace NFS read and write requests inside the kernel, capture kernel and user stack traces, resolve them to source locations via DWARF, and generate flame‑graph data that is pushed to Pyroscope for performance analysis.
Overall Architecture
NFS Profiler is built on eBPF and uses a two‑module design to trace the full lifecycle of NFS requests and collect performance data. The request‑tracking module attaches kprobes or tracepoints to the critical NFS functions (nfs_initiate_read/write and nfs_readpage_done/writeback_done) to record request start and end events; the sampling module, described below, uses perf events to capture stack traces while a request is in flight. A unique request_id is generated by left‑shifting the timestamp 32 bits and OR‑ing it with the process ID, then stored in a BPF_MAP_TYPE_HASH named active_requests (key = PID, value = request_id).
Core Data Structures
Kernel‑side
```c
struct stack_sample {
    u64 request_id;      // NFS request unique identifier
    u64 timestamp;       // sampling timestamp
    u32 pid;             // process ID
    u32 cpu;             // CPU ID
    s64 kernel_stack_id; // kernel-mode stack ID
    s64 user_stack_id;   // user-mode stack ID
};
```
Contains complete request context.
Links to actual stack data via stack_id.
Supports both kernel‑mode and user‑mode stack tracing.
```c
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);   // pid
    __type(value, u64); // request_id
    __uint(max_entries, 10000);
} active_requests SEC(".maps");
```
Uses PID as the key for fast lookup.
Request ID is a timestamp‑PID combination.
Supports up to 10,000 concurrent requests.
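The timestamp‑PID composition can be mirrored in plain Go; a minimal sketch (the helper names are illustrative, not part of the profiler). Note that the left shift discards the timestamp's upper 32 bits, so an ID is unique per PID only within a ~4.3 s window of the nanosecond clock:

```go
package main

import "fmt"

// makeRequestID mirrors the kernel scheme: timestamp shifted into the
// high 32 bits, OR-ed with the PID in the low 32 bits.
func makeRequestID(ts uint64, pid uint32) uint64 {
	return ts<<32 | uint64(pid)
}

// splitRequestID recovers the PID and the surviving timestamp bits.
func splitRequestID(id uint64) (pid uint32, tsLow uint32) {
	return uint32(id), uint32(id >> 32)
}

func main() {
	id := makeRequestID(0x1122334455667788, 1234)
	pid, tsLow := splitRequestID(id)
	fmt.Println(pid, tsLow == 0x55667788) // 1234 true
}
```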
```c
struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(key_size, sizeof(u32));
    __uint(value_size, PERF_MAX_STACK_DEPTH * sizeof(u64));
    __uint(max_entries, 10000);
} stack_traces SEC(".maps");
```
Dedicated STACK_TRACE map type.
Maximum stack depth of 127 frames.
Efficient storage and retrieval of stack information.
User‑side
```go
type StackFrame struct {
	FuncName   string // e.g., "nfs_write_page"
	FileName   string // e.g., "fs/nfs/write.c"
	LineNumber uint   // source line number
	Offset     uint64 // offset inside the function
	Samples    int    // number of samples for this frame
}

type Stack struct {
	Frames []StackFrame // top-to-bottom stack frames
	Count  int          // total samples for the whole stack
}

type Symbol struct {
	Name    string
	Address uint64
	Size    uint64
}

type LineInfo struct {
	File string
	Line uint
	Func string
}

type SymbolCache struct {
	cache map[uint64]*Symbol
	mu    sync.RWMutex
}

type SampleEvent struct {
	Timestamp   uint64
	ProcessID   uint32
	ThreadID    uint32
	CPU         uint32
	KernelStack []uint64
	UserStack   []uint64
	RequestID   uint64
}

type ProfileData struct {
	StartTime  int64
	EndTime    int64
	Samples    map[string]*Stack // aggregated stack data
	TotalCount int               // total number of samples
}
```
Workflow
Request Lifecycle Management
Request start
Record request at nfs_initiate_read/write.
Generate unique request_id.
Update active_requests map.
Request end
Clear entry at nfs_readpage_done/writeback_done.
Delete the record from active_requests to release resources.
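The lifecycle above can be modeled on the user side with a plain map standing in for the BPF hash; a minimal Go sketch (function names are illustrative, not part of the profiler):

```go
package main

import "fmt"

// activeRequests stands in for the BPF_MAP_TYPE_HASH keyed by PID.
var activeRequests = map[uint32]uint64{}

// onInitiate mirrors the probe on nfs_initiate_read/write: generate a
// request_id (timestamp<<32 | pid) and record the in-flight request.
func onInitiate(pid uint32, ts uint64) uint64 {
	id := ts<<32 | uint64(pid)
	activeRequests[pid] = id
	return id
}

// onDone mirrors the probe on nfs_readpage_done/writeback_done:
// delete the entry so later samples for this PID are discarded.
func onDone(pid uint32) {
	delete(activeRequests, pid)
}

func main() {
	onInitiate(42, 1_000_000)
	fmt.Println(len(activeRequests)) // 1: request in flight
	onDone(42)
	fmt.Println(len(activeRequests)) // 0: entry released
}
```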
Performance Sampling Process
Trigger mechanism: a perf event fires at a configurable frequency, providing a low‑overhead sampling strategy.
Sampling handling:
Check whether the current PID has an active NFS request.
Obtain kernel‑mode and user‑mode stack IDs via bpf_get_stackid().
Construct a stack_sample containing request context.
Data transfer: samples are sent to user space through a BPF_MAP_TYPE_PERF_EVENT_ARRAY named cpu_profiler_events, supporting efficient batched transmission and data integrity.
Sampling Rate
A rate of 1000 Hz means one sample every millisecond; 100 Hz means one sample every 10 ms. Higher rates capture finer‑grained performance data but increase system overhead and data volume. Typical choices are 100 Hz for regular monitoring and 1000 Hz for detailed analysis.
```go
params := client.PyroscopeParams{
	Name:       "nfs.cpu",
	SampleRate: 1000, // 1000 Hz = 1 ms interval
}
```
Filtering Non‑NFS Traffic
The eBPF program looks up the PID in active_requests. If no entry is found, the sample is discarded, ensuring that only NFS‑related traffic is processed.
```c
SEC("perf_event")
int do_perf_event(struct bpf_perf_event_data *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *request_id = bpf_map_lookup_elem(&active_requests, &pid);
    if (!request_id)
        return 0; // not an NFS request

    // Build the sample with full request context.
    struct stack_sample sample = {};
    sample.request_id = *request_id;
    sample.timestamp = bpf_ktime_get_ns();
    sample.pid = pid;
    sample.cpu = bpf_get_smp_processor_id();
    sample.kernel_stack_id = bpf_get_stackid(ctx, &stack_traces, 0);
    sample.user_stack_id = bpf_get_stackid(ctx, &stack_traces, BPF_F_USER_STACK);

    // Forward the sample to user space via the perf event array.
    bpf_perf_event_output(ctx, &cpu_profiler_events, BPF_F_CURRENT_CPU,
                          &sample, sizeof(sample));
    return 0;
}
```
Stack Reconstruction
Raw kernel addresses are translated to human‑readable symbols, file names, and line numbers using DWARF debug information. The conversion relies on the Asphaltt/bpflbr and Asphaltt/addr2line repositories.
Address location: a binary search finds the greatest symbol address ≤ the target, which marks the function start.
Function name resolution: retrieve the original name from the symbol table; demangle C++ names when needed.
DWARF processing: locate the compilation unit, detect inlined subroutines, and handle special tags.
Line information: use a line reader to map the address to a source line, handling inlined functions and zero‑line cases.
File handling: normalize paths, apply build‑directory prefixes, and produce absolute file names.
Result assembly: combine the address, resolved function name, normalized file path, line number, and inlined‑function flag into a complete entry.
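The address‑location step can be sketched as a binary search over a symbol table sorted by address (a simplified stand‑in for the addr2line machinery; names are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

type symbol struct {
	Name string
	Addr uint64
	Size uint64
}

// resolve finds the greatest symbol address <= target, i.e. the
// function whose range contains the address. syms must be sorted
// by Addr in ascending order.
func resolve(syms []symbol, target uint64) (symbol, uint64, bool) {
	// Index of the first symbol whose Addr is strictly greater than target.
	i := sort.Search(len(syms), func(i int) bool { return syms[i].Addr > target })
	if i == 0 {
		return symbol{}, 0, false // target precedes all symbols
	}
	s := syms[i-1]
	return s, target - s.Addr, true // symbol plus offset into the function
}

func main() {
	syms := []symbol{
		{"nfs_initiate_read", 0xffff0000, 0x200},
		{"nfs_readpage_done", 0xffff0200, 0x180},
	}
	s, off, ok := resolve(syms, 0xffff0250)
	fmt.Println(s.Name, off, ok) // nfs_readpage_done 80 true
}
```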
The accuracy of this process depends on the availability of kernel debug symbols (vmlinux with DWARF), correct KASLR offset calculation, and proper build‑directory identification.
Installing vmlinux Debug Packages (Ubuntu 24.04 example)
```bash
# Show current kernel version
uname -r   # e.g., 6.5.0-21-generic

# Install matching debug symbols
sudo apt install linux-image-$(uname -r)-dbgsym

# If the package is not found, enable the dbgsym repository
sudo apt install ubuntu-dbgsym-keyring
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | \
  sudo tee /etc/apt/sources.list.d/ddebs.list
sudo apt update
sudo apt install linux-image-$(uname -r)-dbgsym

# Verify the debug vmlinux is present
ls /usr/lib/debug/boot/vmlinux-$(uname -r)
```
KASLR Offset Calculation
KASLR randomizes the kernel load address at boot. To map runtime addresses to compile‑time symbols, the offset is computed as textAddr - stext, where textAddr is the .text base from the vmlinux file and stext is read from /proc/kallsyms.
```go
// 1. Read compile-time .text address from vmlinux
textAddr, err := bpf.ReadTextAddrFromVmlinux(vmlinux)
if err != nil {
	return fmt.Errorf("read .text addr: %w", err)
}

// 2. Read runtime address from /proc/kallsyms
stext := kallsyms.Stext()

// 3. Compute KASLR offset
kaslrOffset := textAddr - stext
```
References
[1] Asphaltt/bpflbr: https://github.com/Asphaltt/bpflbr/blob/bpflbr/internal/bpflbr/addr2line.go
[2] Asphaltt/addr2line: https://github.com/Asphaltt/addr2line