How to Use Kernel Tracepoints for Zero‑Overhead GPU Driver Monitoring
This tutorial shows how to use Linux kernel tracepoints with eBPF and bpftrace to capture real-time GPU driver activity—job scheduling, memory management, and command submission—across Intel, AMD, Nouveau, and NVIDIA GPUs, with complete scripts and analysis of the resulting data.
Zero‑overhead GPU driver observability
GPU stalls in games or machine‑learning workloads often originate inside the kernel driver. Linux kernel tracepoints emit nanosecond‑resolution events exactly when the driver schedules a job, allocates memory, or emits a fence, capturing 100 % of activity with negligible overhead.
DRM scheduler tracepoints (vendor‑neutral)
The gpu_scheduler tracepoint group is a stable uAPI present in Intel i915, AMD AMDGPU, Nouveau, and other DRM drivers. It provides three events:
drm_run_job – a job moves from the software queue to hardware execution.
drm_sched_process_job – a job finishes and its fence is signaled.
drm_sched_job_wait_dep – a job blocks waiting on a dependency.
Full bpftrace script drm_scheduler.bt
#!/usr/bin/env bpftrace

BEGIN {
    printf("Tracing DRM GPU scheduler... Press Ctrl-C to end.\n");
    printf("%-18s %-12s %-16s %-12s %-8s %s\n",
           "Time(ms)", "Event", "JobID", "Ring", "Queued", "Details");
}

tracepoint:gpu_scheduler:drm_run_job {
    $job_id = args->id;
    $ring = str(args->name);
    $queue = args->job_count;
    $hw_q = args->hw_job_count;
    @start[$job_id] = nsecs;
    printf("%-18llu %-12s %-16llu %-12s %-8u hw=%d\n",
           nsecs/1000000, "RUN", $job_id, $ring, $queue, $hw_q);
    @jobs_per_ring[$ring] = count();
}

tracepoint:gpu_scheduler:drm_sched_process_job {
    $fence = args->fence;
    printf("%-18llu %-12s %-16p\n", nsecs/1000000, "COMPLETE", $fence);
    @completion_count = count();
}

tracepoint:gpu_scheduler:drm_sched_job_wait_dep {
    $job_id = args->id;
    $ring = str(args->name);
    $ctx = args->ctx;
    $seq = args->seqno;
    printf("%-18llu %-12s %-16llu %-12s %-8s ctx=%llu seq=%u\n",
           nsecs/1000000, "WAIT_DEP", $job_id, $ring, "-", $ctx, $seq);
    @wait_count = count();
    @waits_per_ring[$ring] = count();
}

END {
    printf("\n=== DRM Scheduler Statistics ===\n");
    printf("\nJobs per ring:\n");
    print(@jobs_per_ring);
    printf("\nWaits per ring:\n");
    print(@waits_per_ring);
}

The script records the start timestamp of each job (@start[$job_id]) so that a duration can later be computed as nsecs - @start[$job_id]. Per-ring counters (@jobs_per_ring and @waits_per_ring) reveal workload distribution and dependency stalls.
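The printed event stream can also be analyzed offline. As a minimal sketch, the Python function below estimates per-ring job throughput from RUN lines in the script's output; the field positions assume the "Time(ms) RUN job_id ring queued hw=N" format shown above, and the sample lines are illustrative, not real trace data.

```python
# Sketch: post-process drm_scheduler.bt output to estimate per-ring job
# throughput. Field positions assume the script's printf format above.
from collections import defaultdict

def jobs_per_second(lines):
    """Group RUN events by ring and divide by the observed time span."""
    times = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "RUN":
            ring = fields[3]
            times[ring].append(int(fields[0]))  # timestamp in ms
    rates = {}
    for ring, ts in times.items():
        span_ms = max(ts) - min(ts)
        # With a single sample the span is zero; fall back to the raw count.
        rates[ring] = len(ts) / (span_ms / 1000) if span_ms else float(len(ts))
    return rates

sample = [
    "296119090 RUN 12345 gfx 5 hw=2",
    "296119590 RUN 12346 gfx 4 hw=2",
    "296120090 RUN 12347 gfx 3 hw=1",
]
print(jobs_per_second(sample))  # -> {'gfx': 3.0} (3 jobs over 1000 ms)
```

This is only an approximation over the captured window, but it quickly shows which rings carry the workload.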
Intel i915 low‑level tracepoints
Enable CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y to expose:
i915_gem_object_create – GEM object allocation (e.g. obj=0xffff888... size=0x100000).
i915_vma_bind – binding the object to a GPU virtual address (e.g. obj=0xffff888... offset=0x100000 size=0x10000).
i915_gem_shrink – driver-initiated memory reclamation (e.g. dev=0 target=0x1000000 flags=0x3).
i915_gem_object_fault – page fault on a GEM object (e.g. obj=0xffff888... GTT index=128 writable).
Tracking allocation peaks, frequent re‑bindings, or shrink activity helps correlate memory pressure with frame‑rate drops.
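As a rough sketch of that correlation step, the helper below sums the size= field of i915_gem_object_create events to track how much memory a capture allocated. The exact trace-line formatting varies by kernel version, so the key=value parsing here is an assumption based on the example fields above, and the sample lines are illustrative.

```python
# Sketch: tally i915_gem_object_create allocation sizes to spot peaks.
# The "obj=... size=0x..." key=value layout is assumed from the examples
# above; real trace output may differ between kernel versions.
import re

def total_allocated(trace_lines):
    """Sum the size= field of every i915_gem_object_create event."""
    total = 0
    for line in trace_lines:
        if "i915_gem_object_create" not in line:
            continue
        m = re.search(r"size=(0x[0-9a-fA-F]+)", line)
        if m:
            total += int(m.group(1), 16)
    return total

events = [
    "i915_gem_object_create: obj=0xffff888000000000 size=0x100000",
    "i915_gem_object_create: obj=0xffff888000100000 size=0x10000",
]
print(hex(total_allocated(events)))  # -> 0x110000
```

Plotting this running total against frame times makes allocation spikes visible alongside frame-rate drops.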
AMD AMDGPU tracepoints
amdgpu_cs_ioctl – user-space command submission (e.g. sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2).
amdgpu_sched_run_job – the kernel scheduler starts executing the submitted job.
amdgpu_bo_create – buffer-object allocation (e.g. bo=0xffff888... pages=256 type=2 preferred=4 allowed=7 visible=1).
amdgpu_bo_move – migration between VRAM and GTT, indicating PCIe bandwidth consumption.
amdgpu_iv – interrupt record (e.g. ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890).
Comparing timestamps of amdgpu_cs_ioctl and amdgpu_sched_run_job yields submission latency; values > 100 µs suggest kernel scheduling overhead. High amdgpu_bo_move frequency signals memory churn.
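That comparison can be sketched as a simple event-pairing pass. The function below matches amdgpu_cs_ioctl with amdgpu_sched_run_job by (ring, seqno) and flags submissions slower than the 100 µs threshold; the event tuples are an illustrative in-memory representation, not a parser for the real trace format.

```python
# Sketch: pair amdgpu_cs_ioctl with amdgpu_sched_run_job by (ring, seqno)
# to measure submission latency. The tuple format is an assumption for
# illustration; timestamps are in microseconds.
def submission_latencies(events, threshold_us=100):
    """events: iterable of (name, ring, seqno, timestamp_us) tuples.
    Returns (ring, seqno, latency_us) for submissions over the threshold."""
    submitted = {}
    slow = []
    for name, ring, seqno, ts in events:
        if name == "amdgpu_cs_ioctl":
            submitted[(ring, seqno)] = ts
        elif name == "amdgpu_sched_run_job":
            start = submitted.pop((ring, seqno), None)
            if start is not None and ts - start > threshold_us:
                slow.append((ring, seqno, ts - start))
    return slow

events = [
    ("amdgpu_cs_ioctl", "gfx_0.0.0", 567, 1000),
    ("amdgpu_sched_run_job", "gfx_0.0.0", 567, 1250),  # 250 us > 100 us
]
print(submission_latencies(events))  # -> [('gfx_0.0.0', 567, 250)]
```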
DRM vblank tracepoints (display synchronization)
drm_vblank_event – vblank occurrence (e.g. crtc=0 seq=12345 time=1234567890 high-prec=true).
drm_vblank_event_queued and drm_vblank_event_delivered – queue-to-user-space latency. Delays > 1 ms indicate compositor problems and dropped frames.
Counting vblank events validates the expected refresh rate (e.g. 60 Hz = 60 events per second).
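As a minimal sketch of that check, the function below derives the average vblank frequency from a list of event timestamps; the microsecond values are illustrative, not real trace data.

```python
# Sketch: validate the observed vblank rate against the expected refresh
# rate. Timestamps are illustrative microsecond values.
def vblank_rate_hz(timestamps_us):
    """Average vblank frequency over the captured interval."""
    span_s = (timestamps_us[-1] - timestamps_us[0]) / 1_000_000
    return (len(timestamps_us) - 1) / span_s if span_s else 0.0

# Four events ~16,667 us apart correspond to a 60 Hz display.
ts = [0, 16_667, 33_334, 50_001]
print(round(vblank_rate_hz(ts)))  # -> 60
```

A sustained rate below the panel's refresh rate means vblanks (and therefore frames) are being missed.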
NVIDIA proprietary driver
The NVIDIA proprietary driver (nvidia.ko) lives outside the DRM subsystem and exposes only a single tracepoint, nvidia:nvidia_dev_xid, for hardware errors. To observe regular activity, the script nvidia_driver.bt attaches kprobes to functions such as nvidia_open, nvidia_unlocked_ioctl, and nvidia_isr. The script installs 18 probes covering:
Device operations – open, close, ioctl (sampled at 1 % to limit overhead).
Memory management – mmap, page faults, VMA actions.
Interrupt handling – ISR, MSI‑X, bottom‑half processing with latency histograms.
P2P communication – GPU‑to‑GPU page requests and DMA mappings.
Power management – suspend/resume cycles.
Error reporting – Xid errors via nvidia:nvidia_dev_xid.
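The latency histograms mentioned above use bpftrace's hist(), which buckets values into power-of-two ranges. As a sketch for offline data, the Python function below reproduces that bucketing; the boundary labeling is an approximation of hist()'s output, not an exact replica.

```python
# Sketch: the power-of-two bucketing behind bpftrace's hist(), applied to
# interrupt-handling latencies (in microseconds) collected offline.
def log2_hist(values_us):
    """Count values into [2^k, 2^(k+1)) buckets, as hist() does.
    Values below 1 are clamped into the first bucket."""
    buckets = {}
    for v in values_us:
        k = max(v, 1).bit_length() - 1  # floor(log2(v)) for v >= 1
        lo = 1 << k
        buckets[(lo, lo * 2)] = buckets.get((lo, lo * 2), 0) + 1
    return dict(sorted(buckets.items()))

print(log2_hist([3, 5, 6, 17]))  # -> {(2, 4): 1, (4, 8): 2, (16, 32): 1}
```

Log-scaled buckets keep the map small while still exposing tail latencies, which is why bpftrace uses them for in-kernel aggregation.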
Running the NVIDIA monitor
# Verify the driver is loaded
lsmod | grep nvidia
# List available probes
sudo bpftrace -l 'kprobe:nvidia_*' | head -20
sudo bpftrace -l 'tracepoint:nvidia:*'
# Execute the monitor during a workload
cd /path/to/bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/nvidia_driver.bt

Sample output (captured while an LLM server, nvtop, and a CUDA app were running):
Attaching 18 probes...
Tracing NVIDIA GPU driver activity... Hit Ctrl-C to end.
TIME(ms) EVENT COMM PID GPU_ID DETAILS
2627 IOCTL nvtop 759434 - cmd=0xc020462a
72427 OPEN llama-server 800150 - GPU device opened
... (39 opens, 26 mmaps during initialization)
--- Device Operations ---
@opens[llama-server]: 39
@closes[llama-server]: 1
@ioctl_count: 2779
@ioctls_per_process[llama-server]: 422
@ioctls_per_process[nvtop]: 2357
--- Async Operations ---
@poll_count: 24254

Analysis of this trace shows that the LLM server opens the device 39 times during initialization and issues 422 ioctls for inference work, while nvtop issues 2,357 ioctls for status polling. Zero page faults and zero Xid errors indicate healthy memory allocation and hardware operation.
Running the monitoring scripts
# For DRM‑based GPUs (Intel, AMD, Nouveau)
cd /path/to/bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/drm_scheduler.bt
# For NVIDIA proprietary driver
cd /path/to/bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/nvidia_driver.bt

Typical DRM output displays job counts per ring and wait counts, e.g.:
TIME(ms) EVENT JOB_ID RING QUEUED DETAILS
296119090 RUN 12345 gfx 5 hw=2
...
=== DRM Scheduler Statistics ===
Jobs per ring:
@jobs_per_ring[gfx]: 1523
@jobs_per_ring[compute]: 89
Waits per ring:
@waits_per_ring[gfx]: 12

These numbers indicate whether graphics or compute workloads dominate and whether dependency stalls are present.
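A quick way to quantify that judgment is a wait-to-job ratio per ring. The helper below is a minimal sketch working from the totals in the statistics section; the dictionaries mirror the sample numbers above, and a "high" ratio threshold is workload-dependent.

```python
# Sketch: flag dependency stalls from the per-ring totals printed in the
# statistics section. A high wait-to-job ratio suggests pipeline bubbles.
def stall_ratio(jobs_per_ring, waits_per_ring):
    """Fraction of jobs on each ring that blocked on a dependency."""
    return {ring: waits_per_ring.get(ring, 0) / n
            for ring, n in jobs_per_ring.items() if n}

jobs = {"gfx": 1523, "compute": 89}
waits = {"gfx": 12}
ratios = stall_ratio(jobs, waits)
print({r: round(v, 4) for r, v in ratios.items()})  # -> {'gfx': 0.0079, 'compute': 0.0}
```

Here under 1% of graphics jobs waited on a dependency, so the sample workload is not stall-bound.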
Limitations of kernel‑side tracing
Kernel tracepoints reveal when a job starts (drm_run_job) and finishes, but they cannot observe inside the GPU: thread-level execution, memory-access patterns, warp divergence, or instruction-level behavior. Such fine-grained metrics are required to diagnose issues like memory-coalescing failures or low warp occupancy.
GPU‑side eBPF (e.g., the bpftime project) compiles eBPF bytecode to PTX, injects it into CUDA binaries, and instruments kernel entry/exit points. This approach can capture block indices, thread indices, global timers, and warp‑level counters, complementing the driver‑side tracepoints for end‑to‑end visibility.
Summary
GPU kernel tracepoints give zero‑overhead insight into driver internals. The stable gpu_scheduler tracepoints work across vendors, while vendor‑specific points (Intel i915, AMD AMDGPU, NVIDIA) expose detailed memory‑management and command‑submission pipelines. The provided bpftrace scripts demonstrate how to trace job scheduling, measure latency, and detect dependency stalls—essential steps for troubleshooting performance problems in games, machine‑learning training, and cloud GPU workloads. For deeper, GPU‑internal observability, explore the bpftime GPU eBPF capabilities.