Profiling NFS I/O with eBPF: From Perf Events to Pyroscope Flame Graphs
This article explains how to use eBPF and perf events to trace NFS read and write requests inside the kernel, capture kernel and user stack traces, resolve them to source locations via DWARF, and generate flame‑graph data that is pushed to Pyroscope for performance analysis.
Overall Architecture
NFS Profiler is built on eBPF and uses a two‑module design to trace the full lifecycle of NFS requests and collect performance data. The request‑tracking module attaches kprobes or tracepoints to the critical NFS functions (nfs_initiate_read/write and nfs_readpage_done/writeback_done) to record request start and end events; the sampling module, described below, uses perf events to capture stack traces while a request is in flight. A unique request_id is generated by left‑shifting the timestamp 32 bits and OR‑ing it with the process ID, then stored in a BPF_MAP_TYPE_HASH named active_requests (key = PID, value = request_id).
Core Data Structures
Kernel‑side
```c
struct stack_sample {
    u64 request_id;      // NFS request unique identifier
    u64 timestamp;       // sampling timestamp
    u32 pid;             // process ID
    u32 cpu;             // CPU ID
    s64 kernel_stack_id; // kernel-mode stack ID
    s64 user_stack_id;   // user-mode stack ID
};
```
Contains complete request context.
Links to actual stack data via stack_id.
Supports both kernel‑mode and user‑mode stack tracing.
```c
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);   // pid
    __type(value, u64); // request_id
    __uint(max_entries, 10000);
} active_requests SEC(".maps");
```
Uses PID as the key for fast lookup.
Request ID is a timestamp‑PID combination.
Supports up to 10,000 concurrent requests.
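The timestamp‑PID composition can be mirrored in plain Go; a minimal sketch (the helper names are illustrative, not part of the profiler). Note that the left shift discards the timestamp's upper 32 bits, so an ID is unique per PID only within a ~4.3 s window of the nanosecond clock:

```go
package main

import "fmt"

// makeRequestID mirrors the kernel scheme: timestamp shifted into the
// high 32 bits, OR-ed with the PID in the low 32 bits.
func makeRequestID(ts uint64, pid uint32) uint64 {
	return ts<<32 | uint64(pid)
}

// splitRequestID recovers the PID and the surviving timestamp bits.
func splitRequestID(id uint64) (pid uint32, tsLow uint32) {
	return uint32(id), uint32(id >> 32)
}

func main() {
	id := makeRequestID(0x1122334455667788, 1234)
	pid, tsLow := splitRequestID(id)
	fmt.Println(pid, tsLow == 0x55667788) // 1234 true
}
```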
```c
struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(key_size, sizeof(u32));
    __uint(value_size, PERF_MAX_STACK_DEPTH * sizeof(u64));
    __uint(max_entries, 10000);
} stack_traces SEC(".maps");
```
Dedicated STACK_TRACE map type.
Maximum stack depth of 127 frames.
Efficient storage and retrieval of stack information.
User‑side
```go
type StackFrame struct {
	FuncName   string // e.g., "nfs_write_page"
	FileName   string // e.g., "fs/nfs/write.c"
	LineNumber uint   // source line number
	Offset     uint64 // offset inside the function
	Samples    int    // number of samples for this frame
}

type Stack struct {
	Frames []StackFrame // top-to-bottom stack frames
	Count  int          // total samples for the whole stack
}

type Symbol struct {
	Name    string
	Address uint64
	Size    uint64
}

type LineInfo struct {
	File string
	Line uint
	Func string
}

type SymbolCache struct {
	cache map[uint64]*Symbol
	mu    sync.RWMutex
}

type SampleEvent struct {
	Timestamp   uint64
	ProcessID   uint32
	ThreadID    uint32
	CPU         uint32
	KernelStack []uint64
	UserStack   []uint64
	RequestID   uint64
}

type ProfileData struct {
	StartTime  int64
	EndTime    int64
	Samples    map[string]*Stack // aggregated stack data
	TotalCount int               // total number of samples
}
```
Workflow
Request Lifecycle Management
Request start
Record request at nfs_initiate_read/write.
Generate unique request_id.
Update active_requests map.
Request end
Clear entry at nfs_readpage_done/writeback_done.
Delete the record from active_requests to release resources.
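The lifecycle above can be modeled on the user side with a plain map standing in for the BPF hash; a minimal Go sketch (function names are illustrative, not part of the profiler):

```go
package main

import "fmt"

// activeRequests stands in for the BPF_MAP_TYPE_HASH keyed by PID.
var activeRequests = map[uint32]uint64{}

// onInitiate mirrors the probe on nfs_initiate_read/write: generate a
// request_id (timestamp<<32 | pid) and record the in-flight request.
func onInitiate(pid uint32, ts uint64) uint64 {
	id := ts<<32 | uint64(pid)
	activeRequests[pid] = id
	return id
}

// onDone mirrors the probe on nfs_readpage_done/writeback_done:
// delete the entry so later samples for this PID are discarded.
func onDone(pid uint32) {
	delete(activeRequests, pid)
}

func main() {
	onInitiate(42, 1_000_000)
	fmt.Println(len(activeRequests)) // 1: request in flight
	onDone(42)
	fmt.Println(len(activeRequests)) // 0: entry released
}
```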
Performance Sampling Process
Trigger mechanism: a perf event fires at a configurable frequency, providing a low‑overhead sampling strategy.
Sampling handling:
Check whether the current PID has an active NFS request.
Obtain kernel‑mode and user‑mode stack IDs via bpf_get_stackid().
Construct a stack_sample containing request context.
Data transfer: samples are sent to user space through a BPF_MAP_TYPE_PERF_EVENT_ARRAY named cpu_profiler_events, supporting efficient batched transmission and data integrity.
Sampling Rate
A rate of 1000 Hz means one sample every millisecond; 100 Hz means one sample every 10 ms. Higher rates capture finer‑grained performance data but increase system overhead and data volume. Typical choices are 100 Hz for regular monitoring and 1000 Hz for detailed analysis.
```go
params := client.PyroscopeParams{
	Name:       "nfs.cpu",
	SampleRate: 1000, // 1000 Hz = 1 ms interval
}
```
Filtering Non‑NFS Traffic
The eBPF program looks up the PID in active_requests. If no entry is found, the sample is discarded, ensuring that only NFS‑related traffic is processed.
```c
SEC("perf_event")
int do_perf_event(struct bpf_perf_event_data *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *request_id = bpf_map_lookup_elem(&active_requests, &pid);
    if (!request_id)
        return 0; // not an NFS request

    // Build the sample with full request context.
    struct stack_sample sample = {};
    sample.request_id = *request_id;
    sample.timestamp = bpf_ktime_get_ns();
    sample.pid = pid;
    sample.cpu = bpf_get_smp_processor_id();
    sample.kernel_stack_id = bpf_get_stackid(ctx, &stack_traces, 0);
    sample.user_stack_id = bpf_get_stackid(ctx, &stack_traces, BPF_F_USER_STACK);

    // Forward the sample to user space via the perf event array.
    bpf_perf_event_output(ctx, &cpu_profiler_events, BPF_F_CURRENT_CPU,
                          &sample, sizeof(sample));
    return 0;
}
```
Stack Reconstruction
Raw kernel addresses are translated to human‑readable symbols, file names, and line numbers using DWARF debug information. The conversion relies on the Asphaltt/bpflbr and Asphaltt/addr2line repositories.
Address location: a binary search finds the greatest symbol address ≤ the target, which marks the function start.
Function name resolution: retrieve the original name from the symbol table; demangle C++ names when needed.
DWARF processing: locate the compilation unit, detect inlined subroutines, and handle special tags.
Line information: use a line reader to map the address to a source line, handling inlined functions and zero‑line cases.
File handling: normalize paths, apply build‑directory prefixes, and produce absolute file names.
Result assembly: combine the address, resolved function name, normalized file path, line number, and inlined‑function flag into a complete entry.
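The address‑location step can be sketched as a binary search over a symbol table sorted by address (a simplified stand‑in for the addr2line machinery; names are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

type symbol struct {
	Name string
	Addr uint64
	Size uint64
}

// resolve finds the greatest symbol address <= target, i.e. the
// function whose range contains the address. syms must be sorted
// by Addr in ascending order.
func resolve(syms []symbol, target uint64) (symbol, uint64, bool) {
	// Index of the first symbol whose Addr is strictly greater than target.
	i := sort.Search(len(syms), func(i int) bool { return syms[i].Addr > target })
	if i == 0 {
		return symbol{}, 0, false // target precedes all symbols
	}
	s := syms[i-1]
	return s, target - s.Addr, true // symbol plus offset into the function
}

func main() {
	syms := []symbol{
		{"nfs_initiate_read", 0xffff0000, 0x200},
		{"nfs_readpage_done", 0xffff0200, 0x180},
	}
	s, off, ok := resolve(syms, 0xffff0250)
	fmt.Println(s.Name, off, ok) // nfs_readpage_done 80 true
}
```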
The accuracy of this process depends on the availability of kernel debug symbols (vmlinux with DWARF), correct KASLR offset calculation, and proper build‑directory identification.
Installing vmlinux Debug Packages (Ubuntu 24.04 example)
```bash
# Show current kernel version
uname -r   # e.g., 6.5.0-21-generic

# Install matching debug symbols
sudo apt install linux-image-$(uname -r)-dbgsym

# If the package is not found, enable the dbgsym repository
sudo apt install ubuntu-dbgsym-keyring
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | \
  sudo tee /etc/apt/sources.list.d/ddebs.list
sudo apt update
sudo apt install linux-image-$(uname -r)-dbgsym

# Verify the debug vmlinux is present
ls /usr/lib/debug/boot/vmlinux-$(uname -r)
```
KASLR Offset Calculation
KASLR randomizes the kernel load address at boot. To map runtime addresses to compile‑time symbols, the offset is computed as textAddr - stext, where textAddr is the .text base from the vmlinux file and stext is read from /proc/kallsyms.
```go
// 1. Read compile-time .text address from vmlinux
textAddr, err := bpf.ReadTextAddrFromVmlinux(vmlinux)
if err != nil {
	return fmt.Errorf("read .text addr: %w", err)
}

// 2. Read runtime address from /proc/kallsyms
stext := kallsyms.Stext()

// 3. Compute KASLR offset
kaslrOffset := textAddr - stext
```
References
[1] Asphaltt/bpflbr: https://github.com/Asphaltt/bpflbr/blob/bpflbr/internal/bpflbr/addr2line.go
[2] Asphaltt/addr2line: https://github.com/Asphaltt/addr2line