How to Trace Intel NPU Kernel Driver Operations Using eBPF and bpftrace
This tutorial explains how to use eBPF and bpftrace to monitor the Intel NPU kernel driver on Lunar Lake and Meteor Lake CPUs, mapping Level Zero API calls to kernel ioctls, tracking memory allocation and IPC communication, and identifying performance bottlenecks through detailed function-call statistics.
Neural Processing Units (NPUs) are the next frontier for AI acceleration, embedded directly in modern CPUs. Intel's Lunar Lake and Meteor Lake processors include dedicated NPU hardware, but when AI models run slowly, inference fails, or memory allocation crashes, debugging is nearly impossible because the NPU driver is a black box.
Intel NPU Driver Architecture
The driver follows a two‑layer design similar to GPU drivers. The kernel module intel_vpu lives in drivers/accel/ivpu/ and exposes /dev/accel/accel0 as a device node. It handles hardware communication, MMU‑based memory management, and IPC with the NPU firmware.
The user-space library libze_intel_vpu.so implements the Level Zero API. Calls such as zeMemAllocHost() or zeCommandQueueExecuteCommandLists() are translated into DRM ioctls; the kernel validates them, maps memory, submits work to the firmware, and polls for completion.
The firmware runs autonomously on the accelerator, receiving command buffers, scheduling compute kernels, managing on‑chip memory, and signalling completion via interrupts. Correct coordination among the application, kernel driver, and firmware is essential for successful inference.
Mapping Level Zero API to Kernel Operations
Using a simple matrix-multiplication workload, the tutorial shows how each Level Zero call maps to kernel functions (a short bpftrace sketch for watching these handlers fire follows the list):
zeMemAllocHost triggers DRM_IOCTL_IVPU_BO_CREATE, which calls ivpu_bo_create_ioctl(), then ivpu_gem_create_object(), ivpu_mmu_context_map_page() (page mapping), and finally ivpu_bo_pin() to pin the buffer.
Three allocations for matrices A, B, and C result in three zeMemAllocHost() calls and about 1,377 ivpu_mmu_context_map_page() invocations each, totaling 4,131 page mappings.
zeCommandQueueCreate maps to DRM_IOCTL_IVPU_GET_PARAM via ivpu_get_param_ioctl() to query queue capabilities.
zeCommandListCreate builds the command list entirely in user space; no kernel call occurs.
zeCommandQueueExecuteCommandLists triggers DRM_IOCTL_IVPU_SUBMIT → ivpu_submit_ioctl(), which validates the command buffer, sets up DMA, and sends an IPC message to the firmware. The firmware wakes, schedules the compute kernel, and generates IPC interrupts.
zeFenceHostSynchronize repeatedly calls ivpu_get_param_ioctl() to poll fence status; when the firmware signals completion, ivpu_ipc_irq_handler() is invoked.
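To watch this mapping live, a minimal bpftrace sketch can attach probes to just the three ioctl handlers named above and print one line per call (extend the probe list as needed):

sudo bpftrace -e '
kprobe:intel_vpu:ivpu_bo_create_ioctl,
kprobe:intel_vpu:ivpu_get_param_ioctl,
kprobe:intel_vpu:ivpu_submit_ioctl {
    // One line per ioctl handler: time in ms, calling process, function name
    printf("%-10llu %-16s %s\n", nsecs / 1000000, comm, probe);
}'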
Using bpftrace to Trace NPU Operations
A complete bpftrace script attaches kprobes to every function in the intel_vpu module (all symbols prefixed with ivpu_). It prints timestamps and function names, and aggregates call counts in a map @calls for later ranking.
#!/usr/bin/env bpftrace

BEGIN {
    printf("Tracing Intel NPU kernel driver... Press Ctrl-C to stop.\n");
    printf("%-10s %-40s\n", "Time(ms)", "Function");
}

kprobe:intel_vpu:ivpu_* {
    printf("%-10llu %-40s\n", nsecs / 1000000, probe);
    @calls[probe] = count();
}

END {
    printf("\n=== Intel NPU function call statistics ===\n");
    printf("\nTop 20 functions by call count:\n");
    print(@calls, 20);
}

Running the script while the matrix-multiplication workload executes produces a chronological trace of kernel-side activity, including device open, MMU context initialization, parameter queries, buffer creation, page mapping, command submission, IPC handling, and cleanup.
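If other processes also open /dev/accel/accel0, the same probes can be restricted to the workload under test. The process name below is a placeholder for your own binary's comm; note that interrupt-driven handlers such as ivpu_ipc_irq_handler() run outside the process context and would be filtered out by this predicate:

sudo bpftrace -e '
kprobe:intel_vpu:ivpu_* /comm == "my_npu_app"/ {
    // Count only driver calls made on behalf of the traced workload
    @calls[probe] = count();
}'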
Interpreting the Trace
Typical output shows:
Device initialization via ivpu_open() and ivpu_mmu_context_init().
Memory allocation pattern: each zeMemAllocHost() leads to one ivpu_bo_create_ioctl() and ~1,377 ivpu_mmu_context_map_page() calls.
Command submission via ivpu_submit_ioctl(), followed by a burst of IPC activity: ~946 ivpu_ipc_irq_handler(), 945 ivpu_ipc_receive(), and 951 ivpu_hw_ip_ipc_rx_count_get() calls.
Cleanup calls such as ivpu_postclose(), ivpu_ms_cleanup(), and ivpu_pgtable_free_page() (517 calls to free the 4,131 page mappings).
Aggregating the call counts across a full run (8,198 total calls) reveals three dominant categories:
Memory management: 4,648 calls (57%), primarily ivpu_mmu_context_map_page().
IPC communication: 2,842 calls (35%), consisting of the three IPC-related functions above.
Buffer management: 74 calls (<1%), covering ivpu_bo_create_ioctl(), ivpu_gem_create_object(), and ivpu_bo_pin().
Deviations from these ratios can indicate problems: an unusually high IPC count may mean the firmware is stuck in a retry loop; excessive page‑mapping calls suggest inefficient memory allocation.
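One way to verify the per-allocation mapping count is to count ivpu_mmu_context_map_page() calls between entry and return of ivpu_bo_create_ioctl(). The sketch below assumes the mappings happen synchronously in the same thread inside the ioctl, as the call chain above suggests:

sudo bpftrace -e '
kprobe:intel_vpu:ivpu_bo_create_ioctl { @in_create[tid] = 1; @pages[tid] = 0; }
kprobe:intel_vpu:ivpu_mmu_context_map_page /@in_create[tid]/ { @pages[tid]++; }
kretprobe:intel_vpu:ivpu_bo_create_ioctl /@in_create[tid]/ {
    // Report how many pages this buffer needed and accumulate a histogram
    $n = @pages[tid];
    printf("buffer created: %lld pages mapped\n", $n);
    @pages_hist = hist($n);
    delete(@in_create[tid]);
    delete(@pages[tid]);
}'

For the matrix-multiplication example above, each of the three buffers should report roughly 1,377 mapped pages; substantially larger counts point at oversized or fragmented allocations.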
Running the Tracing Tool
Prerequisites:
Linux kernel 6.2+ with the intel_vpu driver built.
Intel NPU hardware (Meteor Lake or Lunar Lake).
bpftrace installed (e.g., apt install bpftrace).
Root privileges to attach kprobes.
Typical workflow:
# Verify driver is loaded
lsmod | grep intel_vpu
ls -l /dev/accel/accel0
modinfo intel_vpu
# Run the full script
sudo bpftrace trace_npu.bt
# Or a quick one‑liner
sudo bpftrace -e 'kprobe:intel_vpu:ivpu_* { printf("%s\n", probe); }' > trace.txt

While the script runs, launch any Level Zero, OpenVINO, or other NPU workload that uses /dev/accel/accel0. After stopping the trace, pipe the output through sort | uniq -c | sort -rn | head -20 to list the hottest kernel functions, as shown below.
Advanced Analysis Techniques
You can also filter for specific operations or measure latency. For example, to profile buffer allocation time:

sudo bpftrace -e '
kprobe:intel_vpu:ivpu_bo_create_ioctl { @alloc_time[tid] = nsecs; }
kretprobe:intel_vpu:ivpu_bo_create_ioctl /@alloc_time[tid]/ {
    $lat = (nsecs - @alloc_time[tid]) / 1000;
    printf("Allocation latency %llu us\n", $lat);
    delete(@alloc_time[tid]);
    @lat_hist = hist($lat);
}
END { print(@lat_hist); }
'

Similarly, monitor IPC message rate to detect firmware stalls:
sudo bpftrace -e '
kprobe:intel_vpu:ivpu_ipc_receive { @ipc_count++; }
interval:s:1 { printf("IPC msgs/sec: %llu\n", @ipc_count); @ipc_count = 0; }
'

Use uprobes on libze_intel_vpu.so to correlate user-space API calls with kernel ioctls and firmware IPC events, revealing the full control flow across all three layers.
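A minimal correlation sketch is shown below. The library path and the assumption that libze_intel_vpu.so exposes zeCommandQueueExecuteCommandLists as a probeable symbol are system-dependent; verify them with ldconfig -p and nm -D before relying on them:

sudo bpftrace -e '
uprobe:/usr/lib/x86_64-linux-gnu/libze_intel_vpu.so:zeCommandQueueExecuteCommandLists {
    printf("[user  ] %-10llu submit requested by %s\n", nsecs / 1000000, comm);
}
kprobe:intel_vpu:ivpu_submit_ioctl {
    printf("[kernel] %-10llu ivpu_submit_ioctl\n", nsecs / 1000000);
}
kprobe:intel_vpu:ivpu_ipc_irq_handler {
    // Firmware-side completions arrive as IPC interrupts
    @ipc_irqs = count();
}'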
Understanding Intel VPU Kernel Symbols
The module exports 1,312 symbols in /proc/kallsyms, categorized as:
t (text): functions such as ivpu_submit_ioctl, ivpu_mmu_context_map_page.
d (data): global variables and structures.
r (read-only): constant data and strings.
b (BSS): uninitialized data allocated at load time.
The driver provides functionality via:
DRM device file interface (/dev/accel/accel0).
Standard DRM ioctls for buffer management.
Custom ioctls for NPU‑specific operations.
IPC protocol with the firmware.
Key function families include ivpu_bo_* (buffer objects), ivpu_mmu_* (MMU), ivpu_ipc_* (IPC), ivpu_hw_* (hardware), ivpu_fw_* (firmware), and ivpu_pm_* (power management). The full list is in intel_vpu_symbols.txt for targeted tracing.
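To regenerate a list like intel_vpu_symbols.txt on your own machine and see how the families break down, the module's entries can be pulled from /proc/kallsyms (a sketch; exact symbol counts vary with kernel version):

# Collect the module's text (function) symbols
sudo grep '\[intel_vpu\]' /proc/kallsyms | awk '$2 == "t" || $2 == "T" { print $3 }' | sort > intel_vpu_symbols.txt
# Count symbols per family prefix (ivpu_bo, ivpu_mmu, ivpu_ipc, ...)
awk -F_ '{ print $1 "_" $2 }' intel_vpu_symbols.txt | sort | uniq -c | sort -rn | head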
References
Intel NPU driver source: https://github.com/intel/linux-npu-driver
Linux kernel accelerator subsystem: drivers/accel/
Intel VPU kernel module: drivers/accel/ivpu/
DRM subsystem documentation: Documentation/gpu/drm-uapi.rst
bpftrace reference guide: https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
Tutorial repository: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/xpu/npu-kernel-driver