
Mastering CUDA GPU Performance Analysis and Tracing

This guide walks you through a complete workflow for profiling CUDA applications, covering GPU performance fundamentals, key metrics, NVIDIA Nsight tools, CUPTI programming, example code, common bottlenecks, and best‑practice recommendations to identify and eliminate performance limits.


Overview

This guide explains how to profile CUDA applications, identify performance bottlenecks, and optimise GPU execution. The accompanying source code is available at https://github.com/eunomia-bpf/basic-cuda-tutorial.

Analysis Tools

NVIDIA Nsight Systems

System‑level tracing : records CPU, GPU, memory and I/O activity.

Timeline visualisation : shows kernel launches, memory transfers and host work.

CUDA API tracing : captures each API call and its duration.

Low overhead : suitable for profiling production binaries.

NVIDIA Nsight Compute

Detailed kernel metrics : SM utilisation, memory throughput, instruction mix.

Guided analysis : provides optimisation hints.

Roofline analysis : compares measured performance against hardware limits.

Kernel comparison : contrasts runs on different hardware or with different code versions.

Legacy Tools

nvprof : command‑line profiler with minimal overhead.

Visual Profiler : GUI front‑end for nvprof data.

CUDA Profiling API : programmatic access to profiling information.

Additional Utilities

Compute Sanitizer : checks memory accesses and data‑race conditions.

CUPTI : CUDA Profiling Tools Interface for custom analysers.

Framework analysers : PyTorch and TensorFlow profiling extensions.

Key Performance Metrics

Execution

SM occupancy : ratio of active warps to the maximum; target >50% for most workloads.

Warp execution efficiency : percentage of active threads; target >80% for compute‑intensive kernels.

Instruction throughput : includes instructions per cycle (IPC), arithmetic intensity, and instruction‑type distribution.
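
The occupancy metric above can be made concrete with a small host-side model: resident blocks per SM are limited by warp slots, registers, and shared memory, and occupancy is the resulting warp count over the hardware maximum. The limits below are illustrative placeholders (roughly Ampere-class), not queried from a device; real values come from cudaGetDeviceProperties or Nsight Compute's occupancy section.

```cpp
#include <algorithm>
#include <climits>

// Illustrative per-SM limits; substitute your GPU's real values.
constexpr int kMaxWarpsPerSM  = 64;     // warp slots per SM
constexpr int kRegistersPerSM = 65536;  // 32-bit registers per SM
constexpr int kSharedMemPerSM = 102400; // bytes of shared memory per SM
constexpr int kWarpSize       = 32;

// Simplified occupancy estimate in percent, assuming positive inputs.
inline int occupancyPercent(int blockSize, int regsPerThread, int smemPerBlock) {
    int warpsPerBlock = (blockSize + kWarpSize - 1) / kWarpSize;
    int blocksByRegs  = kRegistersPerSM / (regsPerThread * blockSize);
    int blocksBySmem  = smemPerBlock ? kSharedMemPerSM / smemPerBlock
                                     : INT_MAX;  // no shared-memory limit
    int blocksByWarps = kMaxWarpsPerSM / warpsPerBlock;
    int blocks        = std::min({blocksByRegs, blocksBySmem, blocksByWarps});
    return 100 * blocks * warpsPerBlock / kMaxWarpsPerSM;
}
```

With these assumed limits, a 256-thread block at 32 registers per thread reaches full occupancy, while raising register use to 128 per thread drops it to 25% — exactly the kind of shift the profiler's occupancy section flags.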

Memory

Memory throughput : global‑memory read/write bandwidth, shared‑memory bandwidth, L1/L2 cache hit rates; aim to approach peak hardware bandwidth.

Access patterns : load/store efficiency, global‑memory coalescing, shared‑memory bank conflicts.

Host‑device transfers : PCIe utilisation, NVLink utilisation (if present), and overall transfer bandwidth.
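
Coalescing can be reasoned about numerically: count how many 32-byte memory sectors one warp's loads touch. The sketch below is a simplified host-side model (it ignores caches and assumes 4-byte elements); the function name is illustrative, but the sector arithmetic matches what the profiler reports as load efficiency.

```cpp
#include <set>
#include <cstdint>

// Number of distinct 32-byte sectors touched when each of a warp's 32 lanes
// loads the 4-byte element at index base/4 + lane * stride. Stride 1 is fully
// coalesced (4 sectors); stride 32 needs one sector per lane (32 sectors).
inline int sectorsTouched(std::uint64_t base, int stride) {
    std::set<std::uint64_t> sectors;
    for (int lane = 0; lane < 32; ++lane) {
        std::uint64_t addr = base + std::uint64_t(lane) * stride * 4;
        sectors.insert(addr / 32);  // 32-byte sector index
    }
    return static_cast<int>(sectors.size());
}
```

A stride-32 pattern (e.g. traversing a matrix column in row-major storage) moves 8x more sectors than the coalesced stride-1 pattern for the same useful data, which is why restructuring the layout pays off.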

Compute

Compute utilisation : SM activity, Tensor/RT core usage (when applicable), and instruction mix (FP32, FP64, INT, …).

Compute efficiency : achieved FLOPS vs theoretical peak, classification of kernels as compute‑bound or memory‑bound, and position on the roofline model.
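
The roofline classification reduces to a two-line calculation: attainable throughput is the lesser of peak compute and arithmetic intensity times peak bandwidth. The peak figures below are placeholders, not measured values; substitute your GPU's datasheet numbers.

```cpp
#include <algorithm>

constexpr double kPeakGflops    = 10000.0;  // assumed peak FP32 throughput, GFLOP/s
constexpr double kPeakBandwidth = 500.0;    // assumed memory bandwidth, GB/s

// Roofline model: performance is capped by compute or by memory traffic,
// whichever bound is lower at the kernel's arithmetic intensity.
inline double attainableGflops(double flops, double bytes) {
    double intensity = flops / bytes;  // FLOP per byte moved
    return std::min(kPeakGflops, intensity * kPeakBandwidth);
}
```

For this guide's computeKernel, each element does roughly 3 FLOPs against 8 bytes of traffic (one 4-byte load, one 4-byte store), an intensity of about 0.375 FLOP/byte — far below the ridge point on these assumed peaks, so the kernel is memory-bound.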

Performance‑Analysis Methodology

Preliminary assessment : use Nsight Systems to obtain a high‑level view, locate time spent in CPU, GPU and data transfers, and spot obvious issues such as excessive synchronisation or transfers.

Kernel analysis : switch to Nsight Compute, identify the most time‑consuming kernels, and collect the metrics listed above for those kernels.

Bottleneck identification : decide whether a kernel is compute‑bound or memory‑bound, apply roofline reasoning, and inspect specific inefficiencies (warp divergence, non‑coalesced accesses, bank conflicts).

Guided optimisation : address the dominant bottleneck first, change one parameter at a time, and re‑measure.

Iterative improvement : repeat the process for subsequent bottlenecks, periodically re‑profile the whole application, and continue until performance goals are met.

Typical Bottlenecks and Remedies

Memory‑related

Non‑coalesced accesses : symptom – low global‑memory load/store efficiency; remedy – restructure data layout or access pattern.

Shared‑memory bank conflicts : symptom – reduced shared‑memory bandwidth; remedy – add padding or change access pattern.

Excessive global traffic : symptom – high memory dependency; remedy – increase data reuse via shared memory or registers.
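
The bank-conflict remedy above can be checked with a small host-side model of the 32 four-byte-wide shared-memory banks; the function is illustrative, but the bank mapping (word index mod 32) matches the hardware rule.

```cpp
#include <map>
#include <algorithm>

// Worst-case conflict degree when lane i of a warp reads 4-byte word i*stride:
// the access replays once per extra lane mapped to the same bank, so 1 means
// conflict-free and 32 means fully serialised.
inline int conflictDegree(int stride) {
    std::map<int, int> lanesPerBank;
    for (int lane = 0; lane < 32; ++lane)
        ++lanesPerBank[(lane * stride) % 32];  // bank = word index mod 32
    int worst = 0;
    for (const auto& entry : lanesPerBank)
        worst = std::max(worst, entry.second);
    return worst;
}
```

Column access in a 32-wide shared-memory tile corresponds to stride 32 — a 32-way conflict — while padding each row to 33 elements (stride 33) restores a conflict-free pattern, which is exactly why the one-element padding trick works.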

Execution‑related

Warp divergence : symptom – low warp execution efficiency; remedy – redesign algorithm to minimise divergent branches.

Low occupancy : symptom – SM occupancy <50%; remedy – reduce register/shared‑memory usage or adjust block size.

Kernel launch overhead : symptom – many short‑lived kernels; remedy – kernel fusion or persistent kernels.
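
The kernel-fusion remedy can be sketched as follows. The kernels are hypothetical, not from the sample code: two elementwise passes become one, halving both launch overhead and global-memory round trips.

```cuda
// Two separate passes: each reads and writes every element once.
__global__ void scaleKernel(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
__global__ void biasKernel(float *x, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

// Fused version: one launch, one read and one write per element.
__global__ void scaleBiasKernel(float *x, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * a + b;
}
```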

System‑level

Excessive host‑device transfers : symptom – high PCIe utilisation; remedy – batch transfers, use pinned or unified memory.

CPU‑GPU synchronisation : symptom – GPU idle periods between kernels; remedy – employ CUDA streams and asynchronous operations.

Under‑utilised GPU resources : symptom – low overall GPU utilisation; remedy – launch concurrent kernels, use streams, or increase problem size.
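
The stream-based overlap remedy might look like the sketch below. Here h_in, h_out, d_in, d_out, CHUNK, numChunks and myKernel are placeholders, and the host buffers are assumed to be pinned (allocated with cudaMallocHost) so that cudaMemcpyAsync can genuinely overlap with kernel execution.

```cuda
// Pipeline chunks of work across two streams so copies overlap with compute.
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

for (int c = 0; c < numChunks; ++c) {
    cudaStream_t st = streams[c % 2];
    size_t off = (size_t)c * CHUNK;
    cudaMemcpyAsync(d_in + off, h_in + off, CHUNK * sizeof(float),
                    cudaMemcpyHostToDevice, st);
    myKernel<<<(CHUNK + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, CHUNK);
    cudaMemcpyAsync(h_out + off, d_out + off, CHUNK * sizeof(float),
                    cudaMemcpyDeviceToHost, st);
}
cudaDeviceSynchronize();
for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
```

In an Nsight Systems timeline this shows up as transfer and kernel rows interleaving instead of serialising, closing the GPU idle gaps that the symptom describes.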

Tracing Techniques

CUDA events

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernel<<<grid, block>>>(data);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel execution time: %f ms\n", ms);

NVTX markers and ranges

#include <nvtx3/nvToolsExt.h>

// Instantaneous marker
nvtxMark("Interesting point");

// Simple named range
nvtxRangePushA("Data preparation");
// ... code ...
nvtxRangePop();

// Range with custom colour and message
nvtxEventAttributes_t attr = {0};
attr.version = NVTX_VERSION;
attr.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
attr.colorType = NVTX_COLOR_ARGB;
attr.color = 0xFF00FF00; // green
attr.messageType = NVTX_MESSAGE_TYPE_ASCII;
attr.message.ascii = "Kernel Execution";
nvtxRangePushEx(&attr);
myKernel<<<grid, block>>>(data);
nvtxRangePop();

CUPTI programming

#include <cupti.h>

void CUPTIAPI callbackHandler(void *userdata, CUpti_CallbackDomain domain,
                              CUpti_CallbackId cbid, const void *cbInfo) {
    // Process callback
}

void initTracing() {
    CUpti_SubscriberHandle subscriber;
    cuptiSubscribe(&subscriber, callbackHandler, NULL);
    // Receive a callback around every cudaLaunch runtime API call
    cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API,
                        CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020);
}

Example Application (basic08.cu)

The sample demonstrates:

Kernel timing with CUDA events.

NVTX annotations for custom profiling regions.

Analysis of host‑device memory transfers.

Comparison of alternative kernel implementations.

Interpretation of profiling data to guide optimisation decisions.

Key kernel‑timing code

__global__ void computeKernel(float *input, float *output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float x = input[idx];
        float result = x * x + x + 1.0f;
        output[idx] = result;
    }
}

void timeKernel() {  // SIZE is defined elsewhere in basic08.cu
    float *d_input, *d_output;
    cudaMalloc(&d_input, SIZE * sizeof(float));
    cudaMalloc(&d_output, SIZE * sizeof(float));

    float *h_input = new float[SIZE];
    for (int i = 0; i < SIZE; ++i) h_input[i] = i;
    cudaMemcpy(d_input, h_input, SIZE * sizeof(float), cudaMemcpyHostToDevice);

    // Warm‑up launch
    computeKernel<<<(SIZE + 255) / 256, 256>>>(d_input, d_output, SIZE);

    // Timed launch
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    computeKernel<<<(SIZE + 255) / 256, 256>>>(d_input, d_output, SIZE);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel execution time: %f ms\n", ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    delete[] h_input;
    cudaFree(d_input);
    cudaFree(d_output);
}

Best‑Practice Workflow

Start with system‑level profiling, then drill down to kernel metrics.

Establish baselines for critical kernels.

Analyse regularly throughout development, not only at the end.

Automate profiling where possible to support regression testing.

Compare results across hardware generations to verify portability.

Further Reading

NVIDIA Nsight Systems documentation

NVIDIA Nsight Compute documentation

CUDA Profiling Tools Interface (CUPTI) guide

Nsight Compute CLI reference

NVTX documentation

CUDA C++ Best Practices Guide – Performance Metrics

Parallel Thread Execution (PTX) specification

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

CUDA · performance analysis · GPU profiling · CUPTI · Nsight Compute · Nsight Systems
Written by Linux Kernel Journey