Inside NVCC: How CUDA Code Is Compiled and Linked
The article dissects NVCC’s compilation pipeline, showing how internal registration functions from host_runtime.h are injected into the host binary, how a simple CUDA demo is processed with --dryrun, and how the generated fatbin, PTX, and cubin files are linked and registered for GPU execution.
NVCC Compilation Overview
CUDA’s nvcc toolchain defines several internal registration macros in /usr/local/cuda-11.8/include/crt/host_runtime.h. These macros (__cudaRegisterBinary, __cudaRegisterVariable, __cudaRegisterManagedVariable, __cudaRegisterGlobalTexture, __cudaRegisterGlobalSurface, __cudaRegisterEntry) expand into code that runs in the host application before main. Their role is to register the fatbin and cubin that contain the device code so that the CUDA runtime can locate and launch kernels on the GPU.
#define __cudaRegisterBinary(X) \
__cudaFatCubinHandle = __cudaRegisterFatBinary((void*)&__fatDeviceText); \
{ void (*callback_fp)(void **) = (void (*)(void **))(X); (*callback_fp)(__cudaFatCubinHandle); __cudaRegisterFatBinaryEnd(__cudaFatCubinHandle); } \
atexit(__cudaUnregisterBinaryUtil)
#define __cudaRegisterVariable(handle, var, ext, size, constant, global) \
__cudaRegisterVar(handle, (char*)&__host##var, (char*)__device##var, __name##var, ext, size, constant, global)
#define __cudaRegisterManagedVariable(handle, var, ext, size, constant, global) \
__cudaRegisterManagedVar(handle, (void **)&__host##var, (char*)__device##var, __name##var, ext, size, constant, global)
#define __cudaRegisterGlobalTexture(handle, tex, dim, norm, ext) \
__cudaRegisterTexture(handle, (const struct textureReference*)&tex, (const void**)(void*)__device##tex, __name##tex, dim, norm, ext)
#define __cudaRegisterGlobalSurface(handle, surf, dim, ext) \
__cudaRegisterSurface(handle, (const struct surfaceReference*)&surf, (const void**)(void*)__device##surf, __name##surf, dim, ext)
#define __cudaRegisterEntry(handle, funptr, fun, thread_limit) \
__cudaRegisterFunction(handle, (const char*)funptr, (char*)__device_fun(fun), #fun, -1, (uint3*)0, (uint3*)0, (dim3*)0, (dim3*)0, (int*)0)

To illustrate the process, a minimal demo compare-simple.cu is created:
#include <cstdio>
__global__ void cudaKernel(void) {
printf("GPU says hello.\n");
}
int main(void) {
cudaLaunchKernel((void*)cudaKernel, 1, 1, NULL, 0, NULL);
cudaDeviceSynchronize();
return 0;
}

Running

nvcc compare-simple.cu -o compare-simple --cudart=shared -keep --dryrun

prints the full compilation workflow without executing it. The --dryrun flag shows each internal command; by contrast, --cuda would stop after translating the .cu source into host-compilable C++.
Command to obtain the full process:

nvcc compare-simple.cu -o compare-simple --cudart=shared -keep --dryrun

Here --cudart=shared links the CUDA runtime library dynamically (the default is static linking), and -keep preserves the intermediate files for inspection.
The output reveals a series of steps:
$ nvcc compare-simple.cu -o compare-simple --cudart=shared -keep --dryrun
# Environment variables and include/library paths are set
# gcc preprocessing generates compare-simple.cpp1.ii
# cicc compiles to PTX and produces intermediate files:
# compare-simple.cudafe1.c, .gpu, .stub.c, .module_id, .ptx
# ptxas assembles PTX to a cubin
# fatbinary creates a fatbin containing both PTX and cubin
# nvlink links device objects
# final host linking produces the executable "compare-simple"

Key generated files include compare-simple.cudafe1.gpu (device code with the mangled kernel name _Z10cudaKernelv), compare-simple.cudafe1.stub.c (host stubs), and the fatbin that bundles the PTX and cubin.
Inspecting compare-simple.cudafe1.stub.c reveals two crucial registration functions:
static void __nv_cudaEntityRegisterCallback(void **__T0) {
__nv_save_fatbinhandle_for_managed_rt(__T0);
__cudaRegisterFunction(__T0, (void(*)(void))cudaKernel, "_Z10cudaKernelv", "_Z10cudaKernelv", -1);
}
static void __sti____cudaRegisterAll(void) {
__cudaRegisterBinary(__nv_cudaEntityRegisterCallback);
}

__nv_cudaEntityRegisterCallback registers the kernel function with the CUDA runtime by calling __cudaRegisterFunction. __sti____cudaRegisterAll registers the entire fat binary via the __cudaRegisterBinary macro shown earlier, which calls __cudaRegisterFatBinary on the embedded fatbin and then invokes the callback. These functions are executed during static initialization, before main, ensuring the GPU runtime can locate and launch the kernel.
In summary, NVCC compiles CUDA source by:
1. Preprocessing the host code with gcc.
2. Generating device code (PTX) with cicc and assembling it to a cubin with ptxas.
3. Packaging the PTX and cubin into a fatbin.
4. Injecting registration functions into the host binary that register the fatbin and kernel symbols with the CUDA runtime.
5. Linking everything into the final executable.
This pipeline explains how the CUDA runtime discovers and launches kernels at runtime.
Infra Learning Club
Infra Learning Club shares study notes, cutting-edge technology, and career discussions.