
Bringing eBPF Inside GPU Kernels: The bpftime for GPU Breakthrough

This article introduces bpftime for GPU, a tool that extends eBPF's programmable, low-overhead observation capabilities into GPU kernels, explains its implementation pipeline, compares its overhead with NVBit, and outlines future enhancements for GPU profiling.


01 Introduction

At the fourth eBPF Developer Conference (eBPFDC 2026) on April 18, 2026, bpftime for GPU was presented as a way to bring eBPF's programmable observation into GPU kernel internals.

02 Why GPUs Need eBPF

Existing tools such as Nsight Systems, Nsight Compute, and CUPTI provide external, coarse‑grained metrics (kernel runtime, SM occupancy, cache hit rate, warp scheduling, CUDA API traces) but cannot answer per‑thread or per‑instruction questions like where a thread stalls or what happens on a kernel's return path. bpftime for GPU aims to fill this gap.

03 How bpftime for GPU Works

The core implementation, nv_attach_impl, builds a dynamic instrumentation pipeline consisting of the following steps:

Intercept CUDA module loading: hook __cudaRegisterFatBinary() via Frida-gum to gain control when a GPU module is loaded (a simplified sketch appears below).

Extract PTX: use cuobjdump --extract-ptx to pull the PTX assembly out of the fat binary.

Apply PTX passes: rewrite the PTX according to the probe type (kprobe, kretprobe, memcapture).

Compile eBPF probes to PTX: user-written eBPF C programs are JIT-compiled by bpftime's LLVM backend and embedded into the rewritten kernel.

Automatic register protection: generate save/restore logic so probe execution does not corrupt the original kernel's state.

Recompile to cubin: use nvPTXCompiler to turn the modified PTX into an executable cubin.

Replace the original GPU module: transparently swap in the new module; the application runs unchanged.

The first attach costs about 100 ms, mainly for PTX rewriting and recompilation; subsequent launches reuse the loaded module without extra overhead.
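To make the interception step concrete, here is a minimal sketch of taking control at __cudaRegisterFatBinary(). bpftime hooks this symbol with Frida-gum; the sketch below uses a plain LD_PRELOAD-style shim instead, which illustrates the same "gain control at module load" idea while eliding the PTX extraction, rewriting, and recompilation that bpftime performs at this point.

```c
/* Simplified stand-in for step 1 of the pipeline. bpftime hooks this
 * symbol with Frida-gum; this sketch uses an LD_PRELOAD shim instead.
 * Build as a shared object and load it with LD_PRELOAD before a CUDA app. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

void **__cudaRegisterFatBinary(void *fat_cubin)
{
    /* Forward to the real CUDA runtime implementation. */
    void **(*real_register)(void *) =
        (void **(*)(void *))dlsym(RTLD_NEXT, "__cudaRegisterFatBinary");

    /* Here bpftime would extract the PTX, apply its passes, and recompile;
     * this sketch only logs that a GPU module was seen. */
    fprintf(stderr, "intercepted CUDA fat binary at %p\n", fat_cubin);

    return real_register(fat_cubin);
}
```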

04 Core Technical Highlight: PTX‑Level Instrumentation

Placing instrumentation at the PTX level offers two main advantages: PTX is more portable than SASS, allowing the same probe logic to run on different GPU architectures (e.g., sm_75, sm_80, sm_90), and it is better suited for compile‑time and intermediate‑level programmatic rewrites.
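As a rough illustration of that portability, the "recompile to cubin" step only needs a different target option to produce a binary for another architecture. The sketch below, assumed to link against NVIDIA's nvPTXCompiler library, compiles a placeholder PTX module for sm_90; swapping the --gpu-name option retargets the same PTX to sm_75 or sm_80. It is a minimal sketch, not bpftime's actual recompilation code, and omits error handling.

```c
/* Sketch of recompiling (rewritten) PTX to a cubin with nvPTXCompiler.
 * The PTX string is a trivial placeholder, not the output of bpftime's
 * passes; changing --gpu-name retargets the same PTX to another arch. */
#include <nvPTXCompiler.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *ptx =
        ".version 7.0\n.target sm_75\n.address_size 64\n"
        ".visible .entry noop() { ret; }\n";
    const char *opts[] = { "--gpu-name=sm_90" };   /* or sm_75, sm_80, ... */

    nvPTXCompilerHandle compiler;
    size_t cubin_size = 0;

    nvPTXCompilerCreate(&compiler, strlen(ptx), ptx);
    if (nvPTXCompilerCompile(compiler, 1, opts) != NVPTXCOMPILE_SUCCESS) {
        fprintf(stderr, "PTX compilation failed\n");
        return 1;
    }

    nvPTXCompilerGetCompiledProgramSize(compiler, &cubin_size);
    void *cubin = malloc(cubin_size);
    nvPTXCompilerGetCompiledProgram(compiler, cubin);
    printf("cubin: %zu bytes\n", cubin_size);  /* load via cuModuleLoadData() */

    free(cubin);
    nvPTXCompilerDestroy(&compiler);
    return 0;
}
```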

05 GPU‑Specific Maps and Helpers

bpftime extends the eBPF map and helper ecosystem for GPU use. Map types include GPU_HASH_MAP, PERGPUTD_ARRAY_MAP, GPU_ARRAY_MAP, GPU_KERNEL_SHARED_ARRAY_MAP, PERGPUTD_ARRAY_HOST_MAP, GPU_ARRAY_HOST_MAP, and the key GPU_RINGBUF_MAP, a per‑thread lock‑free ring buffer that uses UVA for zero‑copy GPU→Host data transfer.
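The zero-copy path behind GPU_RINGBUF_MAP rests on mapped pinned memory and unified virtual addressing: the GPU writes records into memory the host can read directly, with no cudaMemcpy on the hot path. The sketch below shows only that underlying mechanism (the allocation and the shared host/device pointers); it is not bpftime's ring-buffer implementation, and the buffer size is arbitrary.

```c
/* Illustration of the UVA zero-copy mechanism behind GPU_RINGBUF_MAP:
 * pinned, mapped host memory is visible to the GPU through a device
 * pointer, so probe records written on the GPU can be consumed by the
 * host without copies. Not bpftime's actual ring-buffer code. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    unsigned long long *host_buf;   /* consumed by the host-side reader */
    void *dev_ptr;                  /* passed to GPU probes for writing  */

    cudaHostAlloc((void **)&host_buf, 4096, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dev_ptr, host_buf, 0);

    /* A GPU probe would append records through dev_ptr; the host polls
     * host_buf and sees the same bytes without any explicit transfer. */
    printf("host %p and device %p map the same pages\n",
           (void *)host_buf, dev_ptr);

    cudaFreeHost(host_buf);
    return 0;
}
```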

Supported helpers include:

bpf_get_globaltimer(): nanosecond-level global clock

bpf_get_thread_idx(): retrieve threadIdx

bpf_get_sm_id(): retrieve the SM ID

bpf_get_warp_id(): retrieve the warp ID

bpf_get_lane_id(): retrieve the lane ID

ebpf_puts(): print from the GPU to the host

bpf_gpu_membar(): insert a memory barrier

bpf_gpu_exit(): terminate the current thread

These enable developers to write familiar eBPF C probes that run inside GPU kernels.
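For illustration, a GPU entry probe that uses these helpers might look like the sketch below. The helper prototypes, the SEC() attach string, and the kernel symbol are assumptions made for this example; bpftime's bundled samples (e.g., cuda-counter) show the real headers and attach syntax.

```c
/* Hypothetical GPU kprobe written in eBPF C. The prototypes, SEC() macro,
 * and attach string below are illustrative assumptions, not bpftime's
 * exact declarations. */
typedef unsigned long long u64;

extern u64 bpf_get_globaltimer(void);   /* nanosecond-level global clock */
extern u64 bpf_get_sm_id(void);         /* SM the block was scheduled on */
extern u64 bpf_get_warp_id(void);
extern int ebpf_puts(const char *str);  /* print from the GPU to the host */

#define SEC(name) __attribute__((section(name), used))

SEC("kprobe/vectorAdd")                 /* illustrative CUDA kernel symbol */
int probe_kernel_entry(void *ctx)
{
    u64 t_entry = bpf_get_globaltimer();
    u64 sm      = bpf_get_sm_id();
    u64 warp    = bpf_get_warp_id();

    /* A real probe would publish (t_entry, sm, warp) through a GPU map or
     * GPU_RINGBUF_MAP; this sketch only announces the entry. */
    ebpf_puts("entered vectorAdd\n");

    (void)t_entry; (void)sm; (void)warp; (void)ctx;
    return 0;
}
```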

06 Performance Comparison

On the CPU side, bpftime shows significant speedups over kernel eBPF (up to 16.3× for combined uprobe/uretprobe and 14–21× for user‑space memory reads) because it avoids kernel‑mode switches.

On the GPU side, benchmarks of 10,000 vector additions show:

NVIDIA P40: baseline 51.8 µs, bpftime 81.1 µs, NVBit 174.4 µs

NVIDIA RTX 5090: baseline 4.1 µs, bpftime 8.2 µs, NVBit 55.8 µs

Thus bpftime's probe overhead is markedly lower than NVBit's, with up to 6.8× better performance on the newer architecture (8.2 µs vs. NVBit's 55.8 µs on the RTX 5090).

07 Current Probe Support

bpftime for GPU currently supports three probe categories:

GPU kprobe: attaches at kernel entry, fires once per block, useful for entry timestamps, call counts, blockIdx, etc.

GPU kretprobe: attaches before each ret, fires per thread, captures exit time and thread state (see the sketch after this list).

Memory capture: samples load/store instructions to analyze memory access patterns and hotspots.
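Continuing the hypothetical style of the earlier probe sketch, a per-thread exit probe could be written as below; again the prototypes, the assumed return type of bpf_get_thread_idx(), and the attach string are illustrative assumptions rather than bpftime's exact API.

```c
/* Hypothetical GPU kretprobe: per the description above it runs once per
 * thread just before the kernel's ret. Prototypes and attach syntax are
 * illustrative assumptions. */
typedef unsigned long long u64;

extern u64 bpf_get_globaltimer(void);
extern u64 bpf_get_thread_idx(void);    /* assumed to encode threadIdx */
extern u64 bpf_get_warp_id(void);

#define SEC(name) __attribute__((section(name), used))

SEC("kretprobe/vectorAdd")              /* illustrative CUDA kernel symbol */
int probe_kernel_exit(void *ctx)
{
    u64 t_exit = bpf_get_globaltimer(); /* per-thread exit timestamp */
    u64 tid    = bpf_get_thread_idx();
    u64 warp   = bpf_get_warp_id();

    /* A real probe would push (t_exit, tid, warp) into GPU_RINGBUF_MAP so
     * the host can reconstruct per-thread latency and exit order. */
    (void)t_exit; (void)tid; (void)warp; (void)ctx;
    return 0;
}
```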

More than 20 example programs (e.g., cuda-counter, kernelretsnoop, threadhist, mem_trace, pytorch-test, llama-cpp-test, faiss-test, cutlass, cudagraph, gpu_shared_map, rocm-counter) demonstrate these capabilities.

08 Unique Value of bpftime

Compared with Nsight, CUPTI, and NVBit, bpftime offers:

Programmable probes rather than fixed metrics.

Per‑thread data collection, moving from aggregated statistics to thread‑level diagnostics.

Instrumentation inside the kernel body, not just at API or external layers.

Memory‑access‑pattern tracing.

Its PTX‑level approach provides cross‑architecture portability, crucial for cloud and heterogeneous GPU deployments.

09 Future Roadmap

Planned work includes extending memcapture (register‑value tracking, shared‑memory analysis), adding an AMD ROCm/HIP backend (supporting GCN/RDNA), building real‑time streaming profiling with an online visual UI, and integrating bpftrace for one‑liner probe expressions.

This work has already been presented at several academic venues, indicating a transition from engineering prototype to systematic methodology.

10 Conclusion

bpftime for GPU demonstrates that eBPF can be extended from CPU kernels to GPU kernels, providing programmable, per‑thread, zero‑copy observation without modifying application code or requiring root.

Repository: https://github.com/eunomia-bpf/bpftime

