Inside NVCC: How CUDA Code Is Compiled and Linked
The article dissects NVCC’s compilation pipeline, showing how internal registration functions from host_runtime.h are injected into the host binary, how a simple CUDA demo is processed with --dryrun, and how the generated fatbin, PTX, and cubin files are linked and registered for GPU execution.
NVCC Compilation Overview
CUDA’s nvcc toolchain defines several internal registration macros in /usr/local/cuda-11.8/include/crt/host_runtime.h. These macros (__cudaRegisterBinary, __cudaRegisterVariable, __cudaRegisterManagedVariable, __cudaRegisterGlobalTexture, __cudaRegisterGlobalSurface, __cudaRegisterEntry) expand into code that runs in the host application before main. Their role is to register the fatbin and cubin that contain the device code so that the CUDA runtime can locate and launch kernels on the GPU.
#define __cudaRegisterBinary(X) \
__cudaFatCubinHandle = __cudaRegisterFatBinary((void*)&__fatDeviceText); \
{ void (*callback_fp)(void **) = (void (*)(void **))(X); (*callback_fp)(__cudaFatCubinHandle); __cudaRegisterFatBinaryEnd(__cudaFatCubinHandle); } \
atexit(__cudaUnregisterBinaryUtil)
#define __cudaRegisterVariable(handle, var, ext, size, constant, global) \
__cudaRegisterVar(handle, (char*)&__host##var, (char*)__device##var, __name##var, ext, size, constant, global)
#define __cudaRegisterManagedVariable(handle, var, ext, size, constant, global) \
__cudaRegisterManagedVar(handle, (void **)&__host##var, (char*)__device##var, __name##var, ext, size, constant, global)
#define __cudaRegisterGlobalTexture(handle, tex, dim, norm, ext) \
__cudaRegisterTexture(handle, (const struct textureReference*)&tex, (const void**)(void*)__device##tex, __name##tex, dim, norm, ext)
#define __cudaRegisterGlobalSurface(handle, surf, dim, ext) \
__cudaRegisterSurface(handle, (const struct surfaceReference*)&surf, (const void**)(void*)__device##surf, __name##surf, dim, ext)
#define __cudaRegisterEntry(handle, funptr, fun, thread_limit) \
__cudaRegisterFunction(handle, (const char*)funptr, (char*)__device_fun(fun), #fun, -1, (uint3*)0, (uint3*)0, (dim3*)0, (dim3*)0, (int*)0)

To illustrate the process, a minimal demo compare-simple.cu is created:
#include <cstdio>
__global__ void cudaKernel(void) {
printf("GPU says hello.\n");
}
int main(void) {
cudaLaunchKernel((void*)cudaKernel, 1, 1, NULL, 0, NULL);
cudaDeviceSynchronize();
return 0;
}

Running

nvcc compare-simple.cu -o compare-simple --cudart=shared -keep --dryrun

prints the full compilation workflow without executing it. The --dryrun flag shows each internal command; by contrast, --cuda would stop after translating the .cu source into host-compilable C++.
Command to obtain the full process:

nvcc compare-simple.cu -o compare-simple --cudart=shared -keep --dryrun

Here --cudart=shared links the CUDA runtime library dynamically (the default is static linking), and -keep preserves the intermediate files for inspection.
The output reveals a series of steps:
$ nvcc compare-simple.cu -o compare-simple --cudart=shared -keep --dryrun
# Environment variables and include/library paths are set
# gcc preprocessing generates compare-simple.cpp1.ii
# cicc compiles to PTX and produces intermediate files:
# compare-simple.cudafe1.c, .gpu, .stub.c, .module_id, .ptx
# ptxas assembles PTX to a cubin
# fatbinary creates a fatbin containing both PTX and cubin
# nvlink links device objects
# final host linking produces the executable "compare-simple"

Key generated files include compare-simple.cudafe1.gpu (device code with the mangled kernel name _Z10cudaKernelv), compare-simple.cudafe1.stub.c (host stubs), and the fatbin that bundles the PTX and cubin.
Inspecting compare-simple.cudafe1.stub.c reveals two crucial registration functions:
static void __nv_cudaEntityRegisterCallback(void **__T0) {
__nv_save_fatbinhandle_for_managed_rt(__T0);
__cudaRegisterFunction(__T0, (void(*)(void))cudaKernel, "_Z10cudaKernelv", "_Z10cudaKernelv", -1);
}
static void __sti____cudaRegisterAll(void) {
__cudaRegisterBinary(__nv_cudaEntityRegisterCallback);
}

__nv_cudaEntityRegisterCallback registers the kernel function with the CUDA runtime by calling __cudaRegisterFunction. __sti____cudaRegisterAll registers the entire fat binary via the __cudaRegisterBinary macro shown earlier, which calls __cudaRegisterFatBinary on the embedded fatbin and then invokes the callback. These functions are executed during static initialization, before main, ensuring the GPU runtime can locate and launch the kernel.
In summary, NVCC compiles CUDA source by:
1. Preprocessing the host code with gcc.
2. Generating device code (PTX) with cicc and assembling it to a cubin with ptxas.
3. Packaging the PTX and cubin into a fatbin.
4. Injecting registration functions into the host binary that register the fatbin and kernel symbols with the CUDA runtime.
5. Linking everything into the final executable.
This pipeline explains how the CUDA runtime discovers and launches kernels at runtime.
Infra Learning Club
Infra Learning Club shares study notes, cutting-edge technology, and career discussions.