How to Trace CUDA GPU Operations with eBPF
This tutorial explains how to build an eBPF‑based tracing tool that intercepts CUDA runtime API calls via uprobes, captures detailed event data such as memory sizes, transfer directions, kernel launches and errors, and presents it in a readable format for debugging and performance analysis.
Introduction
CUDA programs run on GPUs with separate memory spaces, making debugging and performance analysis difficult. By attaching eBPF uprobes to the CUDA runtime library (libcudart.so), API calls can be captured before they reach the device, exposing memory allocation sizes, data transfer directions, kernel launch parameters, error codes, and timestamps.
CUDA and GPU Tracing Overview
Typical CUDA execution involves:
Host allocates memory on the device.
Data is transferred from host to device.
A kernel is launched to process the data.
Results are copied back to the host.
Device memory is released.
Key API functions include cudaMalloc, cudaFree, cudaMemcpy, cudaLaunchKernel, stream and event functions, and device management calls.
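The five steps above can be sketched as a minimal host program. This is an illustrative example only: the scale kernel and its launch configuration are invented for this sketch, and building it requires nvcc and a CUDA-capable GPU.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* hypothetical kernel: double every element in place */
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int n = 1000;
    size_t bytes = n * sizeof(float);
    float host[1000], *dev;

    for (int i = 0; i < n; i++) host[i] = (float)i;

    cudaMalloc((void **)&dev, bytes);                      /* 1. allocate device memory */
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  /* 2. host-to-device copy */
    scale<<<(n + 255) / 256, 256>>>(dev, n);               /* 3. kernel launch */
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  /* 4. device-to-host copy */
    cudaFree(dev);                                         /* 5. release device memory */

    printf("host[1] = %.1f\n", host[1]);
    return 0;
}
```

Every one of these runtime calls is a uprobe attachment point for the tracer described below.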
Key CUDA Functions Traced
cudaMalloc: allocation size and success.
cudaFree: detection of memory leaks and double frees.
cudaMemcpy: size and kind (host‑to‑device, device‑to‑host, device‑to‑device).
cudaLaunchKernel: kernel start and result.
cudaStreamCreate / cudaStreamSynchronize: stream creation and synchronization.
cudaEventCreate, cudaEventRecord, cudaEventSynchronize: event timing.
cudaGetDevice / cudaSetDevice: device selection.
Architecture
The tracer consists of three components:
Header (cuda_events.h): defines data structures shared between kernel and user space.
eBPF program (cuda_events.bpf.c): implements uprobes for each CUDA function.
User‑space application (cuda_events.c): loads the eBPF object, processes events from a ring buffer, and prints them.
Core Data Structures
The central structure is struct event defined in cuda_events.h:
struct event {
int pid; // Process ID
char comm[TASK_COMM_LEN]; // Process name
enum cuda_event_type type; // Event type
union {
struct { size_t size; } mem; // malloc / memcpy
struct { void *ptr; } free_data; // free
struct { size_t size; int kind; } memcpy_data; // memcpy
struct { void *func; } launch; // kernel launch
struct { int device; } device; // device ops
struct { void *handle; } handle; // stream/event ops
};
bool is_return; // true for return probe
int ret_val; // return value
char details[MAX_DETAILS_LEN]; // human‑readable details
};
An enum cuda_event_type enumerates all traced operations, e.g., CUDA_EVENT_MALLOC, CUDA_EVENT_MEMCPY, CUDA_EVENT_LAUNCH_KERNEL, etc.
eBPF Program Implementation
A ring buffer (rb) is created to pass events to user space:
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");
For each CUDA API, an entry and a return probe are defined. Example for cudaMalloc:
static inline int submit_malloc_event(size_t size, bool is_return, int ret_val) {
struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) return 0;
e->pid = bpf_get_current_pid_tgid() >> 32;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
e->type = CUDA_EVENT_MALLOC;
e->is_return = is_return;
if (is_return) { e->ret_val = ret_val; }
else { e->mem.size = size; }
bpf_ringbuf_submit(e, 0);
return 0;
}
Entry probe:
SEC("uprobe")
int BPF_KPROBE(cuda_malloc_enter, void **ptr, size_t size) {
return submit_malloc_event(size, false, 0);
}
Return probe:
SEC("uretprobe")
int BPF_KRETPROBE(cuda_malloc_exit, int ret) {
return submit_malloc_event(0, true, ret);
}
Similar patterns are used for cudaMemcpy, cudaLaunchKernel, streams, events, and device management calls.
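For illustration, the cudaMemcpy pair could follow the same pattern. This is a sketch, not the tutorial's actual code: the probe argument names mirror the public cudaMemcpy(dst, src, count, kind) signature, and submit_memcpy_event is a hypothetical helper analogous to submit_malloc_event above.

```c
SEC("uprobe")
int BPF_KPROBE(cuda_memcpy_enter, void *dst, const void *src,
               size_t count, int kind)
{
    /* capture transfer size and direction on entry */
    return submit_memcpy_event(count, kind, false, 0);
}

SEC("uretprobe")
int BPF_KRETPROBE(cuda_memcpy_exit, int ret)
{
    /* capture the cudaError_t result on return */
    return submit_memcpy_event(0, 0, true, ret);
}
```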
User‑Space Application Details
Command‑line options are parsed into an env struct:
static struct env {
bool verbose;
bool print_timestamp;
char *cuda_library_path;
bool include_returns;
int target_pid;
} env = { .print_timestamp = true, .include_returns = true, .cuda_library_path = NULL, .target_pid = -1 };
The program loads the eBPF object with libbpf, attaches each probe to the specified library, and polls the ring buffer:
while (!exiting) {
err = ring_buffer__poll(rb, 100); // 100 ms timeout
/* error handling */
}
When an event arrives, handle_event formats and prints it, optionally adding a timestamp:
static int handle_event(void *ctx, void *data, size_t data_sz) {
const struct event *e = data;
if (e->is_return && !env.include_returns) return 0;
time_t t = time(NULL);
struct tm *tm = localtime(&t);
char ts[32]; strftime(ts, sizeof(ts), "%H:%M:%S", tm);
char details[MAX_DETAILS_LEN];
get_event_details(e, details, sizeof(details));
if (env.print_timestamp) printf("%-8s ", ts);
printf("%-16s %-7d %-20s %8s %s\n", e->comm, e->pid,
event_type_str(e->type), e->is_return ? "[EXIT]" : "[ENTER]", details);
return 0;
}
get_event_details switches on e->type and builds a human‑readable string, e.g., size=4000 bytes for a malloc entry or returned=OutOfMemory for a failed return.
Compilation and Execution
Run make to build two binaries:
cuda_events: the tracing tool.
basic02: a simple CUDA example.
Typical usage:
sudo ./cuda_events -p ./basic02
./basic02
PID filtering:
./basic02 &
PID=$!
sudo ./cuda_events -p ./basic02 -d $PID
Sample output shows each API call with entry/exit markers, timestamps, process name, PID, event type, and details.
Benchmark
Running a synthetic benchmark without tracing yields:
cudaMalloc: 113.14 µs
cudaMemcpyH2D: 365.85 µs
cudaLaunchKernel: 7.82 µs
cudaMemcpyD2H: 393.55 µs
cudaFree: 0.00 µs
With the tracer attached, the overhead is roughly +2 µs per call, which is negligible for most workloads. The author notes that the bpftime user‑space runtime can reduce this overhead further.
Command‑Line Options
-v: enable verbose debug output.
-t: suppress timestamps.
-r: hide return probes.
-p PATH: path to the CUDA library or target binary.
-d PID: trace only the specified process.
Next Steps
Add support for more CUDA APIs.
Record timestamps to locate performance bottlenecks.
Correlate related operations (e.g., match malloc/free).
Build visualizations for CUDA activity.
Extend to other GPU frameworks such as OpenCL or ROCm.
For full source code see the tutorial repository https://github.com/eunomia-bpf/basic-cuda-tutorial and the developer tutorial at
https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/47-cuda-events.