
Essential CUDA Learning Guide: Basics, Compilation, and Profiling

This article walks through a practical APOD workflow for CUDA development—assessing bottlenecks, parallelizing with cuBLAS/cuFFT/Thrust, optimizing iteratively, and deploying—while covering nvcc compilation flags, PTX virtual ISA, nvprof profiling, core terminology (SP, SM, warp, grid, block, thread), indexing patterns, and unified memory references.

Infra Learning Club

APOD development model

The APOD cycle has four stages. Assess identifies the bottlenecks that would benefit from GPU acceleration; Parallelize applies GPU-accelerated libraries such as cuBLAS, cuFFT, and Thrust (or custom kernels); Optimize performs iterative optimization passes guided by profiling; Deploy builds and ships the accelerated executable, after which the cycle repeats on the next bottleneck.
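As a sketch of the Parallelize stage, the snippet below replaces a hypothetical host-side scaling loop with a single Thrust call; the data size and transformation are illustrative, not from the original article.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main()
{
    // Hypothetical data: 1000 values to be scaled on the GPU.
    thrust::device_vector<float> v(1000, 1.0f);

    // Parallelize: one library call replaces an explicit CPU loop;
    // Thrust launches the kernel and picks the launch configuration.
    thrust::transform(v.begin(), v.end(), v.begin(),
                      2.0f * thrust::placeholders::_1);
    return 0;
}
```

Because Thrust manages device memory and kernel launches itself, this stage often requires no hand-written CUDA at all.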

Compilation with nvcc

nvcc compiles both host (CPU) and device (GPU) code. Device code is first compiled to PTX, a virtual ISA that provides forward compatibility; the virtual architecture is selected with a flag such as -arch=compute_50. To embed code for several GPU generations in one binary, pass one -gencode clause per target. Example command:

nvcc cuda.cu -o cuda -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60

Profiling with nvprof

nvprof records CUDA API calls and kernel execution. Example profiling of the deviceQuery sample shows API‑time distribution:

$ nvprof Samples/1_Utilities/deviceQuery/deviceQuery
==1158895== Profiling application: Samples/1_Utilities/deviceQuery/deviceQuery
==1158895== Profiling result:
No kernels were profiled.
            Type  Time(%)    Time   Calls    Avg    Min    Max  Name
API calls:  65.25%  201.60us   101  1.9950us 103ns 141.89us cuDeviceGetAttribute
            22.09%   68.254us    1  68.254us 68.254us 68.254us cudaGetDeviceProperties
            ...

CUDA terminology

SP (Streaming Processor) : basic execution unit, also called a CUDA core.

SM (Streaming Multiprocessor) : collection of SPs plus warp scheduler, registers, shared memory, etc.

Warp : group of 32 threads scheduled together.

Grid : set of all blocks launched by a kernel; can be 1‑D, 2‑D or 3‑D.

Block : group of threads within a grid; can be 1‑D, 2‑D or 3‑D and may synchronize and share memory.
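The hierarchy above maps directly onto a launch configuration. A sketch for a hypothetical 2-D workload (a 1024x768 image, one thread per pixel; the sizes are assumptions for illustration):

```cuda
#include <cuda_runtime.h>

// Each block is 16x16 = 256 threads, i.e. 8 warps of 32.
dim3 block(16, 16);
dim3 grid((1024 + block.x - 1) / block.x,   // enough blocks to cover the width
          (768  + block.y - 1) / block.y);  // ...and the height
```

Rounding the grid size up means the last blocks may contain idle threads, which is why kernels typically guard with a bounds check.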

Function qualifiers

__device__ : runs on the GPU, callable only from other __device__ or __global__ functions.

__global__ : kernel entry point; called from the host, executes on the GPU, and must return void.

__host__ : runs on the CPU (the default when no qualifier is given).

Kernel launch syntax

Kernels are launched with triple‑chevron syntax func_name<<<grid, block>>>(args…), where grid and block specify launch dimensions.
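A minimal launch sketch, assuming a hypothetical element-wise add kernel and arrays already resident on the device:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: element-wise addition of two device arrays.
__global__ void add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard against the padded last block
}

void launch(const float* a, const float* b, float* c, int n)
{
    int block = 256;                     // threads per block
    int grid  = (n + block - 1) / block; // blocks, rounded up to cover n
    add<<<grid, block>>>(a, b, c, n);    // asynchronous launch
    cudaDeviceSynchronize();             // wait for the kernel to finish
}
```

Note that the launch itself returns immediately; synchronization (or a subsequent cudaMemcpy) is what makes the results visible to the host.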

Thread indexing

threadIdx.x / y / z : thread index within its block.

blockIdx.x / y / z : block index within the grid.

gridDim.x / y / z : number of blocks in each grid dimension.

blockDim.x / y / z : number of threads in each block dimension.

Indexing helper functions

// 1-D grid of 1-D blocks
__device__ int getGlobalIdx_1D_1D()
{
    return blockIdx.x * blockDim.x + threadIdx.x;
}

// 3-D grid of 3-D blocks
__device__ int getGlobalIdx_3D_3D()
{
    // Linearize the block's position within the grid...
    int blockId = blockIdx.x + blockIdx.y * gridDim.x +
                  gridDim.x * gridDim.y * blockIdx.z;
    // ...then the thread's position within that block.
    int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) +
                   (threadIdx.z * (blockDim.x * blockDim.y)) +
                   (threadIdx.y * blockDim.x) + threadIdx.x;
    return threadId;
}

Unified memory references

Further reading: CUDA C Best Practices Guide ( https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html ) and CUDA Driver API learning ( https://www.notion.so/CUDA-Driver-API-99fc7fab0d09433e92532895c8b35efa?pvs=21 ).

Sample host‑device program

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    const int DATA_SIZE = 1000;
    FILE* file = fopen("data.txt", "r");
    if (file == NULL) {
        fprintf(stderr, "failed to open data.txt\n");
        return 1;
    }
    float* hostData = (float*)malloc(DATA_SIZE * sizeof(float));
    for (int i = 0; i < DATA_SIZE; ++i) {
        fscanf(file, "%f", &hostData[i]);
    }
    fclose(file);

    float* deviceData;
    cudaMalloc((void**)&deviceData, DATA_SIZE * sizeof(float));
    cudaMemcpy(deviceData, hostData, DATA_SIZE * sizeof(float),
               cudaMemcpyHostToDevice);

    /* ... launch kernels on deviceData here ... */

    cudaFree(deviceData);
    free(hostData);
    return 0;
}

Function qualifier examples

__device__ float DeviceFunc() : runs on the device, callable only from device code.

__global__ void KernelFunc() : kernel executed on the device, callable from the host (and, on devices of compute capability 3.5 or higher, from the device via dynamic parallelism); must return void.

__host__ float HostFunc() : runs on the host, callable only from host code.
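A function may also carry both __host__ and __device__, so one definition is compiled for both sides. A minimal sketch with a hypothetical clamp helper (the names and the 0..1 range are assumptions):

```cuda
#include <cuda_runtime.h>

// Compiled twice: once for the CPU, once for the GPU.
__host__ __device__ float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

// Device-side caller of the shared helper.
__global__ void clampKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clampf(data[i], 0.0f, 1.0f);
}
```

This avoids maintaining separate host and device copies of small utility functions.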


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
