
How to Use Kernel Tracepoints for Zero‑Overhead GPU Driver Monitoring

This tutorial explains how to leverage Linux kernel tracepoints with eBPF and bpftrace to capture real‑time GPU driver activity—including job scheduling, memory management, and command submission—across Intel, AMD, Nouveau, and NVIDIA GPUs, providing detailed examples, scripts, and analysis of the resulting data.


Zero‑overhead GPU driver observability

GPU stalls in games or machine‑learning workloads often originate inside the kernel driver. Linux kernel tracepoints emit nanosecond‑resolution events exactly when the driver schedules a job, allocates memory, or emits a fence, capturing 100 % of activity with negligible overhead.

DRM scheduler tracepoints (vendor‑neutral)

The gpu_scheduler tracepoint group is a stable uAPI present in the Intel i915, AMD AMDGPU, Nouveau, and other DRM drivers. It provides three events:

drm_run_job – a job moves from the software queue to hardware execution.

drm_sched_process_job – a job finishes and its fence is signaled.

drm_sched_job_wait_dep – a job blocks waiting for a dependency.
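
Before diving into the full script, confirm that these vendor-neutral tracepoints exist on your kernel and inspect their exact field names (they can vary slightly between kernel versions):

# List the gpu_scheduler tracepoints
sudo bpftrace -l 'tracepoint:gpu_scheduler:*'
# Show the fields exposed by drm_run_job
sudo bpftrace -lv 'tracepoint:gpu_scheduler:drm_run_job'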

Full bpftrace script drm_scheduler.bt

#!/usr/bin/env bpftrace
// Vendor-neutral DRM scheduler tracing: job starts, completions, and dependency waits.
BEGIN {
    printf("Tracing DRM GPU scheduler... Press Ctrl-C to end.\n");
    printf("%-18s %-12s %-16s %-12s %-8s %s\n",
           "Time(ms)", "Event", "JobID", "Ring", "Queued", "Details");
}
// A job moves from the software queue to hardware execution
tracepoint:gpu_scheduler:drm_run_job {
    $job_id = args->id;
    $ring   = str(args->name);
    $queue  = args->job_count;
    $hw_q   = args->hw_job_count;
    @start[$job_id] = nsecs;
    printf("%-18llu %-12s %-16llu %-12s %-8u hw=%d\n",
           nsecs/1000000, "RUN", $job_id, $ring, $queue, $hw_q);
    @jobs_per_ring[$ring] = count();
}
// A job finishes and its fence is signaled
tracepoint:gpu_scheduler:drm_sched_process_job {
    $fence = args->fence;
    printf("%-18llu %-12s %-16p\n", nsecs/1000000, "COMPLETE", $fence);
    @completion_count = count();
}
// A job blocks waiting on a dependency
tracepoint:gpu_scheduler:drm_sched_job_wait_dep {
    $job_id = args->id;
    $ring   = str(args->name);
    $ctx    = args->ctx;
    $seq    = args->seqno;
    printf("%-18llu %-12s %-16llu %-12s %-8s ctx=%llu seq=%u\n",
           nsecs/1000000, "WAIT_DEP", $job_id, $ring, "-", $ctx, $seq);
    @wait_count = count();
    @waits_per_ring[$ring] = count();
}
END {
    printf("\n=== DRM Scheduler Statistics ===\n");
    printf("\nJobs per ring:\n");
    print(@jobs_per_ring);
    printf("\nWaits per ring:\n");
    print(@waits_per_ring);
}

The script records the start timestamp of each job (@start[$job_id]) so that the duration can be computed later as nsecs - @start[$job_id]. Per‑ring counters (@jobs_per_ring and @waits_per_ring) reveal workload distribution and dependency stalls.
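
The stock script stores @start but never prints durations. Below is a minimal sketch of how run‑to‑completion latency could be aggregated into a histogram; it assumes that both drm_run_job and drm_sched_process_job expose a fence field that can serve as the correlation key, which holds on many kernels but should be verified with bpftrace -lv before relying on it:

#!/usr/bin/env bpftrace
// Sketch: per-job run-to-completion latency (assumes args->fence exists on
// both events; verify with: sudo bpftrace -lv 'tracepoint:gpu_scheduler:drm_run_job')
tracepoint:gpu_scheduler:drm_run_job {
    @run_ts[args->fence] = nsecs;
}
tracepoint:gpu_scheduler:drm_sched_process_job /@run_ts[args->fence]/ {
    @job_latency_us = hist((nsecs - @run_ts[args->fence]) / 1000);
    delete(@run_ts[args->fence]);
}
END {
    clear(@run_ts);   // drop any jobs that never completed during the trace
}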

Intel i915 low‑level tracepoints

Enable CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y to expose:

i915_gem_object_create – GEM object allocation (e.g. obj=0xffff888... size=0x100000).

i915_vma_bind – binding the object to a GPU virtual address (e.g. obj=0xffff888... offset=0x100000 size=0x10000).

i915_gem_shrink – driver‑initiated memory reclamation (e.g. dev=0 target=0x1000000 flags=0x3).

i915_gem_object_fault – page fault on a GEM object (e.g. obj=0xffff888... GTT index=128 writable).

Tracking allocation peaks, frequent re‑bindings, or shrink activity helps correlate memory pressure with frame‑rate drops.
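
A minimal sketch of such tracking, assuming the field names shown in the examples above (size, target); these low‑level tracepoints only exist when CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS is set, and the exact fields should be confirmed with sudo bpftrace -lv 'tracepoint:i915:i915_gem_object_create':

#!/usr/bin/env bpftrace
// Sketch: i915 memory-pressure indicators (field names assumed; verify with -lv)
tracepoint:i915:i915_gem_object_create {
    @alloc_bytes = sum(args->size);        // total bytes allocated
    @alloc_sizes = hist(args->size);       // allocation size distribution
}
tracepoint:i915:i915_vma_bind {
    @binds_per_process[comm] = count();    // frequent re-binding per process
}
tracepoint:i915:i915_gem_shrink {
    @shrink_events = count();              // reclaim activity under memory pressure
    @shrink_target_bytes = sum(args->target);
}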

AMD AMDGPU tracepoints

amdgpu_cs_ioctl – user‑space command submission (e.g. sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2).

amdgpu_sched_run_job – kernel scheduler starts execution of the submitted job.

amdgpu_bo_create – buffer allocation (e.g. bo=0xffff888... pages=256 type=2 preferred=4 allowed=7 visible=1).

amdgpu_bo_move – migration between VRAM and GTT, indicating PCIe bandwidth consumption.

amdgpu_iv – interrupt record (e.g. ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890).

Comparing timestamps of amdgpu_cs_ioctl and amdgpu_sched_run_job yields submission latency; values > 100 µs suggest kernel scheduling overhead. High amdgpu_bo_move frequency signals memory churn.
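
A sketch of that latency measurement, assuming both events expose a sched_job_id field to correlate on (the sched_job value in the example output above suggests they do, but verify the exact name with sudo bpftrace -lv 'tracepoint:amdgpu:amdgpu_cs_ioctl'):

#!/usr/bin/env bpftrace
// Sketch: submission-to-execution latency and memory churn on AMDGPU
// (sched_job_id field assumed; verify with -lv)
tracepoint:amdgpu:amdgpu_cs_ioctl {
    @submit_ts[args->sched_job_id] = nsecs;
}
tracepoint:amdgpu:amdgpu_sched_run_job /@submit_ts[args->sched_job_id]/ {
    $lat_us = (nsecs - @submit_ts[args->sched_job_id]) / 1000;
    @submit_latency_us = hist($lat_us);
    if ($lat_us > 100) {
        @slow_submissions = count();       // > 100 µs suggests scheduling overhead
    }
    delete(@submit_ts[args->sched_job_id]);
}
tracepoint:amdgpu:amdgpu_bo_move {
    @bo_moves = count();                   // VRAM <-> GTT migrations (memory churn)
}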

DRM vblank tracepoints (display synchronization)

drm_vblank_event – vblank occurrence (e.g. crtc=0 seq=12345 time=1234567890 high-prec=true).

drm_vblank_event_queued and drm_vblank_event_delivered – measure queue‑to‑user‑space latency. Delays > 1 ms indicate compositor problems and dropped frames.

Counting vblank events validates the expected refresh rate (e.g. 60 Hz = 60 events per second).
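
A sketch that checks both properties, assuming the crtc and seq fields shown above are present on the queued/delivered events (verify with sudo bpftrace -lv 'tracepoint:drm:drm_vblank_event'):

#!/usr/bin/env bpftrace
// Sketch: per-second vblank rate and queue-to-delivery latency
// (crtc/seq fields assumed; verify with -lv)
tracepoint:drm:drm_vblank_event {
    @vblanks_per_crtc[args->crtc] = count();
}
tracepoint:drm:drm_vblank_event_queued {
    @queued_ts[args->crtc, args->seq] = nsecs;
}
tracepoint:drm:drm_vblank_event_delivered /@queued_ts[args->crtc, args->seq]/ {
    @delivery_latency_us = hist((nsecs - @queued_ts[args->crtc, args->seq]) / 1000);
    delete(@queued_ts[args->crtc, args->seq]);
}
interval:s:1 {
    print(@vblanks_per_crtc);              // expect ~60 per second on a 60 Hz display
    clear(@vblanks_per_crtc);
}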

NVIDIA proprietary driver

The NVIDIA driver (nvidia.ko) lives outside the DRM subsystem and provides only a single tracepoint, nvidia:nvidia_dev_xid, for hardware errors. To observe regular activity, the script nvidia_driver.bt attaches kprobes to functions such as nvidia_open, nvidia_unlocked_ioctl, and nvidia_isr. The script installs 18 probes covering the following areas (a minimal standalone sketch of the kprobe approach follows the list):

Device operations – open, close, ioctl (sampled at 1 % to limit overhead).

Memory management – mmap, page faults, VMA actions.

Interrupt handling – ISR, MSI‑X, bottom‑half processing with latency histograms.

P2P communication – GPU‑to‑GPU page requests and DMA mappings.

Power management – suspend/resume cycles.

Error reporting – Xid errors via nvidia:nvidia_dev_xid.
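
Because the proprietary module exposes no scheduler tracepoints, per-function kprobes are the fallback. The following is a minimal standalone sketch of that approach, not the full nvidia_driver.bt from the repository; it counts opens and Xid errors and samples the hot ioctl path at roughly 1 % to keep overhead low (probe names come from the loaded module and should resolve via the listing commands below):

#!/usr/bin/env bpftrace
// Minimal sketch of the kprobe approach (not the full nvidia_driver.bt)
kprobe:nvidia_open {
    @opens[comm] = count();                // device opens per process
}
kprobe:nvidia_unlocked_ioctl {
    @ioctl_seen++;
    if (@ioctl_seen % 100 == 0) {          // ~1% sampling to limit overhead
        @sampled_ioctls[comm] = count();
    }
}
tracepoint:nvidia:nvidia_dev_xid {
    @xid_errors = count();                 // hardware error reports
}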

Running the NVIDIA monitor

# Verify the driver is loaded
lsmod | grep nvidia
# List available probes
sudo bpftrace -l 'kprobe:nvidia_*' | head -20
sudo bpftrace -l 'tracepoint:nvidia:*'
# Execute the monitor during a workload
cd /path/to/bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/nvidia_driver.bt

Sample output (LLM server, nvtop, and a CUDA app) shows:

Attaching 18 probes...
Tracing NVIDIA GPU driver activity... Hit Ctrl-C to end.
TIME(ms)  EVENT  COMM          PID   GPU_ID  DETAILS
2627      IOCTL  nvtop         759434 -      cmd=0xc020462a
72427     OPEN   llama-server  800150 -      GPU device opened
... (39 opens, 26 mmaps during initialization)
--- Device Operations ---
@opens[llama-server]: 39
@closes[llama-server]: 1
@ioctl_count: 2779
@ioctls_per_process[llama-server]: 422
@ioctls_per_process[nvtop]: 2357
--- Async Operations ---
@poll_count: 24254

Analysis of this trace reveals that the LLM server opens the device many times during initialization and generates 422 ioctls for inference work, while nvtop issues 2,357 ioctls for status polling. Zero page faults and zero Xid errors indicate healthy memory allocation and hardware operation.

Running the monitoring scripts

# For DRM‑based GPUs (Intel, AMD, Nouveau)
cd /path/to/bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/drm_scheduler.bt

# For NVIDIA proprietary driver
cd /path/to/bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/nvidia_driver.bt

Typical DRM output displays job counts per ring and wait counts, e.g.:

TIME(ms)   EVENT   JOB_ID   RING   QUEUED  DETAILS
296119090  RUN     12345    gfx    5       hw=2
... 
=== DRM Scheduler Statistics ===
Jobs per ring:
@jobs_per_ring[gfx]: 1523
@jobs_per_ring[compute]: 89
Waits per ring:
@waits_per_ring[gfx]: 12

These numbers indicate whether graphics or compute workloads dominate and whether dependency stalls are present.

Limitations of kernel‑side tracing

Kernel tracepoints reveal when a job starts (drm_run_job) and finishes, but they cannot observe inside the GPU: thread‑level execution, memory‑access patterns, warp divergence, or instruction‑level behavior. Such fine‑grained metrics are required to diagnose issues like memory‑coalescing failures or warp occupancy problems.

GPU‑side eBPF (e.g., the bpftime project) compiles eBPF bytecode to PTX, injects it into CUDA binaries, and instruments kernel entry/exit points. This approach can capture block indices, thread indices, global timers, and warp‑level counters, complementing the driver‑side tracepoints for end‑to‑end visibility.

Summary

GPU kernel tracepoints give zero‑overhead insight into driver internals. The stable gpu_scheduler tracepoints work across vendors, while vendor‑specific points (Intel i915, AMD AMDGPU, NVIDIA) expose detailed memory‑management and command‑submission pipelines. The provided bpftrace scripts demonstrate how to trace job scheduling, measure latency, and detect dependency stalls—essential steps for troubleshooting performance problems in games, machine‑learning training, and cloud GPU workloads. For deeper, GPU‑internal observability, explore the bpftime GPU eBPF capabilities.
