
GPUprobe: Using eBPF to Monitor CUDA Memory Leaks

This article introduces GPUprobe, an eBPF-based tool that provides lightweight, continuous, application-level monitoring of CUDA memory allocations, leaks, and kernel launches. It compares GPUprobe with Nsight Systems and DCGM, then demonstrates near-zero-overhead integration with Prometheus and Grafana through detailed code examples and real-world output analysis.


GPUprobe is a new observability tool that leverages eBPF uprobes to attach to the CUDA runtime library (libcudart.so) and collect fine‑grained data about memory allocations, deallocations, and kernel launches without modifying the target program.

Why Existing GPU Monitoring Falls Short

Traditional approaches have notable drawbacks:

CUDA runtime calls report errors only through a cudaError_t return value, so developers end up sprinkling printf checks through their code and scanning stdout for failures (see the sketch after this list).

Nsight Systems relies on CUPTI, needs an explicit profiling session, adds a 2‑10× slowdown, and is designed for post‑run analysis rather than continuous production monitoring.

DCGM (NVIDIA Data Center GPU Manager) gathers system‑level metrics such as GPU utilization, memory usage, temperature, and health, but cannot show per‑process allocation patterns, track individual CUDA kernel launches, or reliably detect memory leaks.
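
For illustration, here is a minimal sketch of the manual error-checking pattern GPUprobe aims to replace; the buffer name and size are hypothetical:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    double *d_buf = NULL;  /* hypothetical device buffer */
    cudaError_t err = cudaMalloc((void **)&d_buf, 1000000 * sizeof(double));
    if (err != cudaSuccess) {  /* the return code is the only signal */
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    /* ... kernel launches, copies, etc. ... */
    err = cudaFree(d_buf);
    if (err != cudaSuccess)
        printf("cudaFree failed: %s\n", cudaGetErrorString(err));
    return 0;
}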

GPUprobe fills these gaps by offering near‑zero‑overhead runtime monitoring, an application‑level view, and modern observability integration.

GPUprobe Features

Almost zero overhead (< 4% impact in benchmark tests).

Zero‑intrusion – no code changes required in the target application.

Per‑process memory‑allocation tracking, including leak detection.

Visibility into each CUDA kernel launch and the actual function name.

Export of metrics to Prometheus and ready‑made Grafana dashboards.

eBPF Basics Used by GPUprobe

eBPF can attach uprobe (entry) and uretprobe (return) probes to any user‑space function. When the probed function is called, the eBPF program receives a notification and can read or record arguments and return values. This mechanism incurs minimal overhead because the program runs in the kernel without requiring binary instrumentation.
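
To make the mechanism concrete, here is a hedged user-space sketch of attaching a uprobe to cudaMalloc inside libcudart.so with recent libbpf; the function name attach_cuda_malloc_uprobe and the library path are assumptions, and GPUprobe's own loader may do this differently:

#include <bpf/libbpf.h>

/* Attach the eBPF program named "memleak_cuda_malloc" (shown later in this
 * article) as a uprobe on cudaMalloc. The libcudart.so path and version are
 * installation-specific assumptions. */
int attach_cuda_malloc_uprobe(struct bpf_object *obj)
{
    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "memleak_cuda_malloc");
    if (!prog)
        return -1;

    LIBBPF_OPTS(bpf_uprobe_opts, opts,
        .func_name = "cudaMalloc",  /* libbpf resolves the symbol's offset */
        .retprobe = false);         /* true would plant a uretprobe instead */

    struct bpf_link *link = bpf_program__attach_uprobe_opts(
        prog, -1 /* any process */,
        "/usr/local/cuda/lib64/libcudart.so",
        0 /* offset unused when func_name is set */, &opts);

    return link ? 0 : -1;
}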

Uprobe Usage in GPUprobe

GPUprobe attaches uprobes to key CUDA runtime APIs such as cudaMalloc(), cudaFree() and cudaLaunchKernel(). The probes capture allocation size, device pointer, process ID, timestamps, and return codes, then push a memleak_event structure into a BPF queue for user‑space processing.

struct memleak_event {
    __u64 start;                     /* timestamp at function entry (ns) */
    __u64 end;                       /* timestamp at function return (ns) */
    void *device_addr;               /* device pointer allocated or freed */
    __u64 size;                      /* requested allocation size in bytes */
    __u32 pid;                       /* calling process */
    __s32 ret;                       /* cudaError_t return code */
    enum memleak_event_t event_type; /* e.g. CUDA_MALLOC */
};

/* Queue that hands completed events to the user-space daemon. */
struct {
    __uint(type, BPF_MAP_TYPE_QUEUE);
    __uint(key_size, 0);
    __type(value, struct memleak_event);
    __uint(max_entries, 1024);
} memleak_events_queue SEC(".maps");

To correlate cudaMalloc calls with their return values, GPUprobe stores the user‑space pointer devPtr in a hash map keyed by PID. The uretprobe later reads the actual device address from user memory using bpf_probe_read_user.
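
The two per-PID maps referenced in the probes below are not shown in the article; a plausible definition, assuming plain hash maps keyed by the 32-bit PID, could look like this (the real GPUprobe layouts may differ):

/* Assumed map definitions for correlating entry and return probes. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u32);                  /* pid */
    __type(value, struct memleak_event); /* event under construction */
    __uint(max_entries, 10240);
} memleak_pid_to_event SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u32);                  /* pid */
    __type(value, void **);              /* user-space devPtr argument */
    __uint(max_entries, 10240);
} memleak_pid_to_dev_ptr SEC(".maps");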

/// uprobe triggered by a call to `cudaMalloc`
SEC("uprobe/cudaMalloc")
int memleak_cuda_malloc(struct pt_regs *ctx) {
    struct memleak_event e = {};
    void **dev_ptr;
    u32 pid = bpf_get_current_pid_tgid();
    /* cudaMalloc(void **devPtr, size_t size): PARM1 = devPtr, PARM2 = size */
    e.size = (size_t)PT_REGS_PARM2(ctx);
    dev_ptr = (void **)PT_REGS_PARM1(ctx);
    e.event_type = CUDA_MALLOC;
    e.start = bpf_ktime_get_ns();
    e.pid = pid;
    /* stash the partial event and the devPtr argument until the uretprobe fires */
    if (bpf_map_update_elem(&memleak_pid_to_event, &pid, &e, 0))
        return -1;
    return bpf_map_update_elem(&memleak_pid_to_dev_ptr, &pid, &dev_ptr, 0);
}

/// uretprobe triggered when `cudaMalloc` returns
SEC("uretprobe/cudaMalloc")
int memleak_cuda_malloc_ret(struct pt_regs *ctx) {
    int ret = (int)PT_REGS_RC(ctx);   /* cudaError_t return value */
    u32 pid = bpf_get_current_pid_tgid();
    struct memleak_event *e = bpf_map_lookup_elem(&memleak_pid_to_event, &pid);
    if (!e) return -1;
    e->ret = ret;
    void ***map_ptr = bpf_map_lookup_elem(&memleak_pid_to_dev_ptr, &pid);
    if (!map_ptr) return -1;
    void **dev_ptr = *map_ptr;
    /* the device address is only valid once cudaMalloc has written it */
    if (bpf_probe_read_user(&e->device_addr, sizeof(void *), dev_ptr))
        return -1;
    e->end = bpf_ktime_get_ns();
    return bpf_map_push_elem(&memleak_events_queue, e, 0);
}
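
The article shows only the cudaMalloc pair. For completeness, here is a hedged sketch of what the matching cudaFree probes could look like under the same assumed map layout; the CUDA_FREE enum value and the whole sketch are assumptions, and the actual GPUprobe code may differ:

/// uprobe triggered by a call to `cudaFree` (hypothetical sketch)
SEC("uprobe/cudaFree")
int memleak_cuda_free(struct pt_regs *ctx) {
    struct memleak_event e = {};
    u32 pid = bpf_get_current_pid_tgid();
    /* cudaFree(void *devPtr): PARM1 is the device pointer being released */
    e.device_addr = (void *)PT_REGS_PARM1(ctx);
    e.event_type = CUDA_FREE;          /* assumed enum value */
    e.start = bpf_ktime_get_ns();
    e.pid = pid;
    return bpf_map_update_elem(&memleak_pid_to_event, &pid, &e, 0);
}

/// uretprobe that records the return code and emits the event
SEC("uretprobe/cudaFree")
int memleak_cuda_free_ret(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid();
    struct memleak_event *e = bpf_map_lookup_elem(&memleak_pid_to_event, &pid);
    if (!e) return -1;
    e->ret = (int)PT_REGS_RC(ctx);
    e->end = bpf_ktime_get_ns();
    return bpf_map_push_elem(&memleak_events_queue, e, 0);
}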

Memory‑Leak Detection Logic

GPUprobe maintains a per‑process map of allocated device pointers. On cudaMalloc it records the pointer and size; on cudaFree it removes the entry. When a process exits, all its entries are cleared. The user‑space daemon reads events from the queue, updates the in‑memory map, and periodically exports aggregated metrics.

End‑to‑End Example

A small CUDA program allocates three buffers, launches two kernels repeatedly, and intentionally forgets to free one buffer. Running the program with the gpuprobe‑daemon produces output such as:
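
The original test program is not reproduced in the article; a minimal stand-in with the same allocation pattern (three 8 MB buffers, one never freed, kernel launches omitted) might look like this:

#include <cuda_runtime.h>

int main(void) {
    double *a, *b, *c;
    size_t n = 1000000;  /* 1,000,000 doubles = 8,000,000 bytes per buffer */

    cudaMalloc((void **)&a, n * sizeof(double));
    cudaMalloc((void **)&b, n * sizeof(double));
    cudaMalloc((void **)&c, n * sizeof(double));

    /* ... launch the two convolution kernels in a loop ... */

    cudaFree(b);
    cudaFree(c);
    /* `a` is intentionally never freed: this is the leak GPUprobe reports */

    return 0;
}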

2024-12-21 16:32:46

num_successful_mallocs:  3
num_failed_mallocs:      0
num_successful_frees:    0
num_failed_frees:         0
per-process memory maps:
process 365159
        0x0000793a44000000: 8000000 Bytes
        0x0000793a48c00000: 8000000 Bytes
        0x0000793a49400000: 8000000 Bytes

total kernel launches: 1470
pid: 365159
        0x5de98f9fba50 (_Z27optimized_convolution_part1PdS_i) -> 735
        0x5de98f9fbb30 (_Z27optimized_convolution_part2PdS_i) -> 735

===

2024-12-21 16:32:51

num_successful_mallocs:  3
num_failed_mallocs:      0
num_successful_frees:    2
num_failed_frees:         0
per-process memory maps:
process 365159
        0x0000793a44000000: 8000000 Bytes
        0x0000793a48c00000: 0 Bytes
        0x0000793a49400000: 0 Bytes

total kernel launches: 2000
pid: 365159
        0x5de98f9fba50 (_Z27optimized_convolution_part1PdS_i) -> 1000
        0x5de98f9fbb30 (_Z27optimized_convolution_part2PdS_i) -> 1000

The first interval shows three successful allocations and 735 launches of each kernel; the second interval shows that two of the three buffers were freed and the kernels were launched 1,000 times each. These metrics are then exported to Prometheus and visualized in Grafana, providing real‑time views of per‑process CUDA memory usage and kernel activity.

GPUprobe also offers a "memleak" dashboard that aggregates memory usage per process (shown in orange) and a "cudatrace" view that lists kernel launches with human‑readable names.

In summary, GPUprobe demonstrates how eBPF uprobes can provide a lightweight, continuous, and application‑centric observability layer for CUDA programs, overcoming the limitations of existing NVIDIA tooling while keeping performance impact below 4 %.

