
How to Trace CUDA GPU Operations with eBPF

This tutorial explains how to build an eBPF‑based tracing tool that intercepts CUDA runtime API calls via uprobes, captures detailed event data such as memory sizes, transfer directions, kernel launches and errors, and presents it in a readable format for debugging and performance analysis.

Linux Kernel Journey

Introduction

CUDA programs run on GPUs with separate memory spaces, making debugging and performance analysis difficult. By attaching eBPF uprobes to the CUDA runtime library (libcudart.so), API calls can be captured before they reach the device, exposing memory allocation sizes, data transfer directions, kernel launch parameters, error codes, and timestamps.

CUDA and GPU Tracing Overview

Typical CUDA execution involves:

Host allocates memory on the device.

Data is transferred from host to device.

A kernel is launched to process the data.

Results are copied back to the host.

Device memory is released.

Key API functions include cudaMalloc, cudaFree, cudaMemcpy, cudaLaunchKernel, stream and event functions, and device management calls.
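The lifecycle above maps directly onto these runtime calls. The following minimal CUDA program is an illustrative sketch (it is not part of the tutorial's traced code); every numbered call is exactly what the uprobes described below intercept:

```cuda
// Sketch of the typical CUDA lifecycle (illustrative only).
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int n = 1000;
    size_t bytes = n * sizeof(float);
    float host[1000], *dev;

    for (int i = 0; i < n; i++) host[i] = (float)i;

    cudaMalloc((void **)&dev, bytes);                     // 1. allocate device memory
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice); // 2. host -> device
    scale<<<(n + 255) / 256, 256>>>(dev, n);              // 3. launch kernel
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost); // 4. device -> host
    cudaFree(dev);                                        // 5. release device memory

    printf("host[1] = %f\n", host[1]);
    return 0;
}
```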

Key CUDA Functions Traced

cudaMalloc: allocation size and success.

cudaFree: detection of memory leaks and double frees.

cudaMemcpy: size and kind (host-to-device, device-to-host, device-to-device).

cudaLaunchKernel: kernel start and result.

cudaStreamCreate / cudaStreamSynchronize: stream creation and synchronization.

cudaEventCreate, cudaEventRecord, cudaEventSynchronize: event timing.

cudaGetDevice / cudaSetDevice: device selection.

Architecture

The tracer consists of three components:

Header (cuda_events.h): defines data structures shared between kernel and user space.

eBPF program (cuda_events.bpf.c): implements uprobes for each CUDA function.

User-space application (cuda_events.c): loads the eBPF object, processes events from a ring buffer, and prints them.

Core Data Structures

The central structure is struct event defined in cuda_events.h:

struct event {
    int pid;               // Process ID
    char comm[TASK_COMM_LEN]; // Process name
    enum cuda_event_type type; // Event type
    union {
        struct { size_t size; } mem;          // malloc
        struct { void *ptr; } free_data;      // free
        struct { size_t size; int kind; } memcpy_data; // memcpy
        struct { void *func; } launch;       // kernel launch
        struct { int device; } device;        // device ops
        struct { void *handle; } handle;       // stream/event ops
    };
    bool is_return;        // true for return probe
    int ret_val;           // return value
    char details[MAX_DETAILS_LEN]; // human‑readable details
};

An enum cuda_event_type enumerates all traced operations, e.g., CUDA_EVENT_MALLOC, CUDA_EVENT_MEMCPY, CUDA_EVENT_LAUNCH_KERNEL, etc.

eBPF Program Implementation

A ring buffer (rb) is created to pass events to user space:

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");

For each CUDA API, an entry and a return probe are defined. Example for cudaMalloc:

static inline int submit_malloc_event(size_t size, bool is_return, int ret_val) {
    struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e) return 0;
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    e->type = CUDA_EVENT_MALLOC;
    e->is_return = is_return;
    if (is_return) { e->ret_val = ret_val; }
    else { e->mem.size = size; }
    bpf_ringbuf_submit(e, 0);
    return 0;
}

Entry probe:

SEC("uprobe")
int BPF_KPROBE(cuda_malloc_enter, void **ptr, size_t size) {
    return submit_malloc_event(size, false, 0);
}

Return probe:

SEC("uretprobe")
int BPF_KRETPROBE(cuda_malloc_exit, int ret) {
    return submit_malloc_event(0, true, ret);
}

Similar patterns are used for cudaMemcpy, cudaLaunchKernel, streams, events, and device management calls.
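As an illustration, the cudaMemcpy pair would likely mirror the malloc probes above. This fragment is a sketch, not the source code: the helper submit_memcpy_event is assumed by analogy with submit_malloc_event, and the probe arguments follow the cudaMemcpy(dst, src, count, kind) signature:

```c
/* Sketch only: follows the cudaMalloc pattern shown earlier. */
SEC("uprobe")
int BPF_KPROBE(cuda_memcpy_enter, void *dst, const void *src,
               size_t count, int kind) {
    /* Record the transfer size and direction on entry. */
    return submit_memcpy_event(count, kind, false, 0);
}

SEC("uretprobe")
int BPF_KRETPROBE(cuda_memcpy_exit, int ret) {
    /* Record the cudaError_t return code on exit. */
    return submit_memcpy_event(0, 0, true, ret);
}
```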

User‑Space Application Details

Command‑line options are parsed into an env struct:

static struct env {
    bool verbose;
    bool print_timestamp;
    char *cuda_library_path;
    bool include_returns;
    int target_pid;
} env = { .print_timestamp = true, .include_returns = true, .cuda_library_path = NULL, .target_pid = -1 };

The program loads the eBPF object with libbpf, attaches each probe to the specified library, and polls the ring buffer:

while (!exiting) {
    err = ring_buffer__poll(rb, 100); // 100 ms timeout
    /* error handling */
}

When an event arrives, handle_event formats and prints it, optionally adding a timestamp:

static int handle_event(void *ctx, void *data, size_t data_sz) {
    const struct event *e = data;
    if (e->is_return && !env.include_returns) return 0;
    time_t t = time(NULL);
    struct tm *tm = localtime(&t);
    char ts[32];
    strftime(ts, sizeof(ts), "%H:%M:%S", tm);
    char details[MAX_DETAILS_LEN];
    get_event_details(e, details, sizeof(details));
    if (env.print_timestamp) printf("%-8s ", ts);
    printf("%-16s %-7d %-20s %8s %s\n", e->comm, e->pid,
           event_type_str(e->type), e->is_return ? "[EXIT]" : "[ENTER]", details);
    return 0;
}
get_event_details switches on e->type and builds a human-readable string, e.g., size=4000 bytes for a malloc entry or returned=OutOfMemory for a failed return.

Compilation and Execution

Run make to build two binaries:

cuda_events: the tracing tool.

basic02: a simple CUDA example.

Typical usage:

sudo ./cuda_events -p ./basic02   # terminal 1: start the tracer
./basic02                        # terminal 2: run the traced program

PID filtering:

./basic02 &
PID=$!
sudo ./cuda_events -p ./basic02 -d $PID

Sample output shows each API call with entry/exit markers, timestamps, process name, PID, event type, and details.

Benchmark

Running a synthetic benchmark without tracing yields:

cudaMalloc:        113.14 µs
cudaMemcpyH2D:     365.85 µs
cudaLaunchKernel:    7.82 µs
cudaMemcpyD2H:     393.55 µs
cudaFree:            0.00 µs

With the tracer attached the overhead is roughly +2 µs per call, which is negligible for most workloads. The author mentions that the bpftime user‑space runtime can further reduce overhead.

Command‑Line Options

-v: enable verbose debug output.

-t: suppress timestamps.

-r: hide return probes.

-p PATH: path to the CUDA library or target binary.

-d PID: trace only the specified process.

Next Steps

Add support for more CUDA APIs.

Record timestamps to locate performance bottlenecks.

Correlate related operations (e.g., match malloc/free).

Build visualizations for CUDA activity.

Extend to other GPU frameworks such as OpenCL or ROCm.

For the full source code, see the tutorial repository at https://github.com/eunomia-bpf/basic-cuda-tutorial and the developer tutorial at https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/47-cuda-events.
