Essential CUDA Learning Guide: Basics, Compilation, and Profiling
This article walks through a practical APOD workflow for CUDA development—assessing bottlenecks, parallelizing with cuBLAS/cuFFT/Thrust, optimizing iteratively, and deploying—while covering nvcc compilation flags, PTX virtual ISA, nvprof profiling, core terminology (SP, SM, warp, grid, block, thread), indexing patterns, and unified memory references.
APOD development model
Assess : identifies the bottlenecks that can be accelerated on the GPU.
Parallelize : applies parallel libraries such as cuBLAS, cuFFT, and Thrust (a minimal Thrust sketch follows below).
Optimize : performs iterative optimization passes.
Deploy : builds the final executable.
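As a minimal sketch of the Parallelize step, the snippet below uses Thrust to move a host array to the GPU and reduce it there; the array size and contents are placeholder assumptions, not from the original article.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    // Hypothetical input: 1,000 ones on the host.
    thrust::host_vector<float> h(1000, 1.0f);

    // Copy to the GPU and reduce there; Thrust generates the kernels.
    thrust::device_vector<float> d = h;
    float sum = thrust::reduce(d.begin(), d.end(), 0.0f);

    printf("sum = %f\n", sum);
    return 0;
}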
Compilation with nvcc
nvcc compiles both host (CPU) and device (GPU) code. Device code is first lowered to PTX, a virtual ISA that provides forward compatibility across GPU generations. The target virtual architecture is selected with -arch (e.g. -arch=compute_50), and multiple generations can be covered by passing one -gencode arch=compute_XX,code=sm_XX pair per target. Example command targeting the compute_50 and compute_60 generations:
nvcc cuda.cu -o cuda -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60
Profiling with nvprof
nvprof records CUDA API calls and kernel execution. Example profiling of the deviceQuery sample shows API‑time distribution:
$ nvprof Samples/1_Utilities/deviceQuery/deviceQuery
==1158895== Profiling application: Samples/1_Utilities/deviceQuery/deviceQuery
==1158895== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 65.25% 201.60us 101 1.9950us 103ns 141.89us cuDeviceGetAttribute
22.09% 68.254us 1 68.254us 68.254us 68.254us cudaGetDeviceProperties
...
CUDA terminology
SP (Streaming Processor) : basic execution unit, also called a CUDA core.
SM (Streaming Multiprocessor) : collection of SPs plus warp scheduler, registers, shared memory, etc.
Warp : group of 32 threads scheduled together.
Grid : set of all blocks launched by a kernel; can be 1‑D, 2‑D or 3‑D.
Block : group of threads within a grid; can be 1‑D, 2‑D or 3‑D and may synchronize and share memory.
Function qualifiers
__device__ – runs on the GPU, callable only from other device or global functions.
__global__ – kernel entry point, called from the host, executes on the GPU, must return void.
__host__ – runs on the CPU (the default when no qualifier is given).
Kernel launch syntax
Kernels are launched with the triple‑chevron syntax func_name<<<grid, block>>>(args…), where grid and block specify the launch dimensions (either plain integers or dim3 values).
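A minimal launch sketch; the kernel name, grid and block sizes below are illustrative assumptions rather than values from the original article.
#include <cstdio>

// Illustrative __global__ kernel: each thread prints its global index.
__global__ void hello_kernel()
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    printf("hello from thread %d\n", idx);
}

int main()
{
    dim3 grid(2);    // 2 blocks in the grid
    dim3 block(4);   // 4 threads per block
    hello_kernel<<<grid, block>>>();   // triple-chevron launch
    cudaDeviceSynchronize();           // wait for the kernel to finish
    return 0;
}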
Thread indexing
threadIdx.x / y / z – index of a thread within its block.
blockIdx.x / y / z – index of a block within the grid.
gridDim.x / y / z – number of blocks in each grid dimension.
blockDim.x / y / z – number of threads in each block dimension.
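For the common 2‑D case (a 2‑D grid of 2‑D blocks), a kernel typically derives a pixel-style coordinate as sketched below; the kernel name and image dimensions are assumptions for illustration.
// Sketch of 2-D indexing; width/height bounds check guards partial blocks.
__global__ void process_image(float* img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height) {
        img[y * width + x] *= 2.0f;                  // example per-element work
    }
}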
Indexing helper functions
// Global index for a 1-D grid of 1-D blocks.
__device__ int getGlobalIdx_1D_1D()
{
    return blockIdx.x * blockDim.x + threadIdx.x;
}

// Global index for a 3-D grid of 3-D blocks.
__device__ int getGlobalIdx_3D_3D()
{
    // Linearize the block coordinate within the grid.
    int blockId = blockIdx.x + blockIdx.y * gridDim.x
                + gridDim.x * gridDim.y * blockIdx.z;
    // Linearize the thread coordinate within the block and offset by the block.
    int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
                 + (threadIdx.z * (blockDim.x * blockDim.y))
                 + (threadIdx.y * blockDim.x)
                 + threadIdx.x;
    return threadId;
}
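A hedged usage sketch (the kernel and buffer below are assumptions, not part of the original helpers): the global index typically selects which element a thread works on.
// Example: each thread scales one element, using the 1-D helper above.
__global__ void scale(float* data, int n, float factor)
{
    int i = getGlobalIdx_1D_1D();
    if (i < n) {
        data[i] *= factor;
    }
}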
Unified memory references
Further reading: CUDA C Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) and CUDA Driver API learning notes (https://www.notion.so/CUDA-Driver-API-99fc7fab0d09433e92532895c8b35efa?pvs=21).
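As a brief illustration of unified memory (a sketch, not drawn from the links above), cudaMallocManaged allocates a single pointer reachable from both host and device; the kernel and sizes are assumptions.
#include <cstdio>

__global__ void increment(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    int n = 256;
    int* data;
    // Unified (managed) memory: one allocation visible to host and device.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;

    increment<<<(n + 127) / 128, 128>>>(data, n);
    cudaDeviceSynchronize();    // make device writes visible to the host

    printf("data[0] = %d\n", data[0]);
    cudaFree(data);
    return 0;
}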
Sample host‑device program
#include <cstdio>
#include <cstdlib>

int main()
{
    const int DATA_SIZE = 1000;

    // Read the input values on the host.
    FILE* file = fopen("data.txt", "r");
    float* hostData = (float*)malloc(DATA_SIZE * sizeof(float));
    for (int i = 0; i < DATA_SIZE; ++i) {
        fscanf(file, "%f", &hostData[i]);
    }
    fclose(file);

    // Allocate device memory and copy the host buffer onto the GPU.
    float* deviceData;
    cudaMalloc((void**)&deviceData, DATA_SIZE * sizeof(float));
    cudaMemcpy(deviceData, hostData, DATA_SIZE * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch kernels on deviceData here ...

    cudaFree(deviceData);
    free(hostData);
    return 0;
}
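A hedged extension of the sample above: in a full program a kernel would typically operate on deviceData and the result would be copied back before freeing. The kernel name and the squaring operation below are assumptions for illustration.
// Hypothetical kernel that squares each element in place.
__global__ void square(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

// Inside main(), between the cudaMemcpy and cudaFree calls above:
//   square<<<(DATA_SIZE + 255) / 256, 256>>>(deviceData, DATA_SIZE);
//   cudaMemcpy(hostData, deviceData, DATA_SIZE * sizeof(float), cudaMemcpyDeviceToHost);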
Function qualifier examples
__device__ float DeviceFunc() – runs on the device, callable only from device code.
__global__ void KernelFunc() – kernel executed on the device, callable from the host (and from the device via dynamic parallelism on compute capability 3.5 and later).
__host__ float HostFunc() – runs on the host, callable from host code.
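A small sketch tying the qualifiers together; the function bodies and launch configuration are illustrative assumptions.
// __device__ helper: only callable from device/global code.
__device__ float DeviceFunc(float x)
{
    return x * x;
}

// __global__ kernel: launched from the host, runs on the device.
__global__ void KernelFunc(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = DeviceFunc(out[i]);
}

// __host__ function (the qualifier is implicit when omitted).
__host__ void HostFunc(float* d_out, int n)
{
    KernelFunc<<<(n + 255) / 256, 256>>>(d_out, n);
}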
