Understanding NVCC Compilation: A Step‑by‑Step Technical Guide
This article walks through the NVCC compilation pipeline, explaining how CUDA source files are transformed into host and device binaries, detailing file extensions, compilation stages, command‑line options, intermediate artifacts, and the role of registration functions such as __nv_cudaEntityRegisterCallback and __sti____cudaRegisterAll.
NVCC (the NVIDIA CUDA Compiler driver) translates CUDA source files (.cu) into a combination of host C/C++ code and device code for NVIDIA GPUs. Device code is packaged into a fat binary (fatbin) that can hold PTX and cubin images for several architectures; which images are included determines which GPU architectures the resulting binary can run on.
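The compatibility point can be made concrete with a small host-side model. The sketch below is not the CUDA runtime's actual selection logic; it assumes the commonly documented rules that a cubin runs on devices with the same major compute capability and an equal or higher minor version, and that PTX can be JIT-compiled for any device at or above its target. The Image struct and function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of one entry in a fat binary (not the real file format).
struct Image {
    bool isPtx;  // true: PTX for a virtual architecture, false: cubin for a real one
    int major;   // compute capability major, e.g. 5 for sm_52
    int minor;   // compute capability minor, e.g. 2 for sm_52
};

// Assumed rule: a cubin runs on the same major version with minor >= its own.
bool cubinRunsOn(const Image& img, int devMajor, int devMinor) {
    return !img.isPtx && img.major == devMajor && img.minor <= devMinor;
}

// Assumed rule: PTX can be JIT-compiled for any capability >= its target.
bool ptxRunsOn(const Image& img, int devMajor, int devMinor) {
    if (!img.isPtx) return false;
    return img.major < devMajor ||
           (img.major == devMajor && img.minor <= devMinor);
}

// Returns the index of the chosen image, or -1 if nothing is usable.
// A compatible cubin is preferred; PTX is the fallback.
int pickImage(const std::vector<Image>& fat, int devMajor, int devMinor) {
    for (std::size_t i = 0; i < fat.size(); ++i)
        if (cubinRunsOn(fat[i], devMajor, devMinor)) return static_cast<int>(i);
    for (std::size_t i = 0; i < fat.size(); ++i)
        if (ptxRunsOn(fat[i], devMajor, devMinor)) return static_cast<int>(i);
    return -1;
}
```

Under this model, a fat binary holding only an sm_52 cubin and compute_52 PTX still serves a newer device through the PTX fallback, while an older device gets no usable image.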
File Extensions
.cu – CUDA source containing host and device code
.c, .cc, .cpp – C/C++ source files
.ptx – PTX intermediate assembly
.cubin – GPU binary for a single architecture
.fatbin – Fat binary that may contain multiple PTX and cubin images
.o / .obj – Host object files
.a / .lib – Static libraries
.so – Shared objects
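As a rough illustration of how a driver might dispatch on these suffixes, the following sketch maps a file name to one of the categories listed above. The InputKind enum and classify function are hypothetical helpers written for this article, not part of NVCC.

```cpp
#include <string>

// Hypothetical categories mirroring the extension list above.
enum class InputKind {
    CudaSource, HostSource, Ptx, Cubin, Fatbin,
    Object, StaticLib, SharedLib, Unknown
};

// Classify a path by its final extension only.
InputKind classify(const std::string& path) {
    const std::string::size_type dot = path.rfind('.');
    if (dot == std::string::npos) return InputKind::Unknown;
    const std::string ext = path.substr(dot);
    if (ext == ".cu") return InputKind::CudaSource;
    if (ext == ".c" || ext == ".cc" || ext == ".cpp") return InputKind::HostSource;
    if (ext == ".ptx") return InputKind::Ptx;
    if (ext == ".cubin") return InputKind::Cubin;
    if (ext == ".fatbin") return InputKind::Fatbin;
    if (ext == ".o" || ext == ".obj") return InputKind::Object;
    if (ext == ".a" || ext == ".lib") return InputKind::StaticLib;
    if (ext == ".so") return InputKind::SharedLib;
    return InputKind::Unknown;
}
```

Only the last suffix is inspected, so a name such as compare-simple.sm_52.cubin is still classified as a cubin.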
Compilation Stages and Options
NVCC defines a set of compilation stages, each selected by a command-line option and each with a default output name. Key stages include:
--cuda – Compiles CUDA source to a C/C++ source file (.cu.cpp.ii)
--ptx – Emits PTX (.ptx)
--cubin – Emits a cubin binary (.cubin)
--fatbin – Creates a fat binary (.fatbin)
--device-link – Links relocatable device code
--lib – Builds a static library
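To make the idea of a default output name concrete, here is a small sketch that derives one from a stage option and an input path. The names used (x.cu.cpp.ii, a_dlink.o, a.a) are assumptions based on the Linux defaults described in NVCC's documentation, and stem and defaultOutput are hypothetical helpers.

```cpp
#include <string>

// Strip the directory and the final extension: "src/x.cu" -> "x".
std::string stem(const std::string& path) {
    const std::string::size_type slash = path.find_last_of("/\\");
    const std::string base =
        (slash == std::string::npos) ? path : path.substr(slash + 1);
    const std::string::size_type dot = base.rfind('.');
    return (dot == std::string::npos) ? base : base.substr(0, dot);
}

// Assumed Linux default output name for each stage option; "" if unknown.
std::string defaultOutput(const std::string& option, const std::string& input) {
    const std::string s = stem(input);
    if (option == "--cuda") return s + ".cu.cpp.ii";
    if (option == "--ptx") return s + ".ptx";
    if (option == "--cubin") return s + ".cubin";
    if (option == "--fatbin") return s + ".fatbin";
    if (option == "--device-link") return "a_dlink.o";  // not derived from input
    if (option == "--lib") return "a.a";                // not derived from input
    return "";
}
```

The per-file stages derive the name from the input, while the linking and archiving stages fall back to a fixed name because they may combine several inputs.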
Demo: Compiling a Simple CUDA Program
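#include <cstdio>

// Device kernel: each launched thread prints one line.
__global__ void cudaKernel(void) {
    printf("GPU says hello.\n");
}

int main(void) {
    // Launch one block of one thread; no kernel arguments, no shared memory, default stream.
    cudaLaunchKernel((void*)cudaKernel, 1, 1, NULL, 0, NULL);
    // Wait for the kernel to finish so its output is flushed before exit.
    cudaDeviceSynchronize();
    return 0;
}
Running the full compilation with the dry-run and keep options:
nvcc compare-simple.cu -o compare-simple --cudart=shared -keep --dryrun
The dry-run output reveals a multi-step process: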
Step 1: Preprocess .cu to .cpp.ii using gcc -E.
Step 2: Invoke cicc (NVVM compiler) to generate PTX and auxiliary files ( .ptx, .cudafe1.c, .cudafe1.gpu, etc.).
Step 3: Assemble PTX to a cubin binary with ptxas -arch=sm_52 -o compare-simple.sm_52.cubin.
Step 4: Package cubin and PTX into a fat binary using fatbinary --create=compare-simple.fatbin ....
Step 5: Compile the host side with gcc to produce compare-simple.o.
Step 6: Link host objects and device binaries with nvlink and g++ to produce the final executable compare-simple.
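The six steps above can be summarised as data. The sketch below only models the order of tools and the artifact each one produces, using the file names quoted in the steps; it does not reproduce the real command lines, and Step and planPipeline are hypothetical names.

```cpp
#include <string>
#include <vector>

// One stage of the pipeline: the tool that runs and the artifact it produces.
struct Step {
    std::string tool;
    std::string output;
};

// Simplified plan for a single .cu file and a single target architecture.
std::vector<Step> planPipeline(const std::string& stem, const std::string& arch) {
    return {
        {"gcc -E", stem + ".cpp.ii"},             // Step 1: preprocess
        {"cicc", stem + ".ptx"},                  // Step 2: generate PTX
        {"ptxas", stem + "." + arch + ".cubin"},  // Step 3: assemble to cubin
        {"fatbinary", stem + ".fatbin"},          // Step 4: package PTX and cubin
        {"gcc", stem + ".o"},                     // Step 5: compile host side
        {"nvlink + g++", stem},                   // Step 6: link executable
    };
}
```

Calling planPipeline("compare-simple", "sm_52") yields six entries whose third output is compare-simple.sm_52.cubin, matching the ptxas invocation in Step 3.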
Key Registration Functions
The generated host file compare-simple.cpp.ii contains two hidden functions:
static void __nv_cudaEntityRegisterCallback(void **__T0) {
    ...
    __cudaRegisterFunction(__T0, (void(*)(void))cudaKernel, "_Z10cudaKernelv", "_Z10cudaKernelv", -1);
}
This function registers the device kernel name with the runtime.
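The string _Z10cudaKernelv is the mangled C++ name of cudaKernel. For a free function in the global namespace with no parameters, the Itanium C++ ABI scheme is "_Z", the length of the name, the name itself, then "v" for void. The helper below is a hypothetical illustration that covers only this trivial case; real mangling also handles namespaces, templates, and parameter types.

```cpp
#include <string>

// Mangle a global, parameterless function name under the Itanium C++ ABI.
// Covers only this simple case, not general C++ declarations.
std::string mangleNoArgFunction(const std::string& name) {
    return "_Z" + std::to_string(name.size()) + name + "v";
}
```

Since cudaKernel has ten characters, mangleNoArgFunction("cudaKernel") gives _Z10cudaKernelv, the name passed to __cudaRegisterFunction.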
static void __sti____cudaRegisterAll(void) {
    __cudaFatCubinHandle = __cudaRegisterFatBinary(&__fatDeviceText);
    __cudaRegisterFunction(...);
    atexit(__cudaUnregisterBinaryUtil);
}
It registers the fat binary with the CUDA runtime, enabling the host to locate and launch the device code.
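The general pattern here, registration that runs before main and is undone at exit, can be imitated in plain C++. The sketch below is a mock under stated assumptions: every name in it is hypothetical, and it stands in for the runtime's registry with a simple map from host function pointer to device symbol name.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// All names here are hypothetical stand-ins, not CUDA runtime entry points.
using HostStub = void (*)();

// Registry from host-side function pointer to device symbol name.
// A function-local static guarantees it exists before first use.
std::map<HostStub, std::string>& kernelRegistry() {
    static std::map<HostStub, std::string> registry;
    return registry;
}

// Stands in for the host-side stub generated for cudaKernel.
void mockKernelStub() {}

// Counterpart of the unregister hook installed with atexit.
void mockUnregisterAll() { kernelRegistry().clear(); }

// Mirrors the shape of __sti____cudaRegisterAll: the constructor of a
// file-scope object runs during static initialization, before main.
struct MockRegisterAll {
    MockRegisterAll() {
        kernelRegistry()[&mockKernelStub] = "_Z10cudaKernelv";
        std::atexit(mockUnregisterAll);
    }
};
static MockRegisterAll mockRegisterAllInstance;

// Conceptual launch-time lookup: host pointer in, device symbol name out.
std::string lookupDeviceName(HostStub stub) {
    const auto it = kernelRegistry().find(stub);
    return it == kernelRegistry().end() ? std::string() : it->second;
}
```

By the time main starts, lookupDeviceName(&mockKernelStub) already returns _Z10cudaKernelv. This mirrors why a kernel launch can pass only a host function pointer and still reach the right device code.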
Conclusion
The NVCC toolchain orchestrates a series of transformations—from high‑level CUDA source to host C++ preprocessing, PTX generation, cubin creation, fat binary packaging, and final linking—while embedding registration callbacks that bind host and device symbols. Understanding each stage and the associated command‑line flags is essential for debugging, optimizing, and extending CUDA applications.
Infra Learning Club
Infra Learning Club shares study notes, cutting-edge technology, and career discussions.
