
Understanding NVCC Compilation: A Step‑by‑Step Technical Guide

This article walks through the NVCC compilation pipeline, explaining how CUDA source files are transformed into host and device binaries, detailing file extensions, compilation stages, command‑line options, intermediate artifacts, and the role of registration functions such as __nv_cudaEntityRegisterCallback and __sti____cudaRegisterAll.

Infra Learning Club

Feb 22, 2025


NVCC (the NVIDIA CUDA Compiler driver) translates CUDA source files (.cu) into a combination of host C/C++ code and device code for NVIDIA GPUs. Device code can be emitted as PTX, as a cubin for a single architecture, or as a fat binary (fatbin) that bundles several PTX and cubin images. Which images are included determines which GPU architectures the resulting binary can run on.

File Extensions

.cu – CUDA source containing host and device code

.c, .cc, .cpp – C/C++ source files

.ptx – PTX intermediate assembly

.cubin – GPU binary for a single architecture

.fatbin – Fat binary that may contain multiple PTX and cubin images

.o / .obj – Host object files

.a / .lib – Static libraries

.so – Shared objects
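The difference between .ptx, .cubin, and .fatbin matters most when a binary meets a GPU it was not built for. The sketch below models how a loader could choose an image from a fat binary. It is a simplified illustration, not NVIDIA's implementation: the Image struct and select_image function are hypothetical names, and the rules assumed are that a cubin runs only on devices with the same major compute capability and an equal or higher minor version, while PTX can be JIT-compiled for any device of equal or higher capability.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one entry in a fat binary (not an NVIDIA data structure).
enum class ImageKind { Cubin, Ptx };

struct Image {
    ImageKind kind;
    int major;  // compute capability major, e.g. 5 for sm_52
    int minor;  // compute capability minor, e.g. 2 for sm_52
};

// True if image a targets a newer compute capability than image b.
bool is_newer(const Image& a, const Image& b) {
    return a.major > b.major || (a.major == b.major && a.minor > b.minor);
}

// Returns the index of the image to load for a device of capability
// dev_major.dev_minor, or -1 if no image in the list is usable.
// A matching cubin is preferred over PTX because it needs no JIT step.
int select_image(const std::vector<Image>& images, int dev_major, int dev_minor) {
    assert(dev_major > 0 && dev_minor >= 0);
    int best_cubin = -1;
    int best_ptx = -1;
    for (int i = 0; i < static_cast<int>(images.size()); ++i) {
        const Image& img = images[i];
        if (img.kind == ImageKind::Cubin) {
            // Assumed rule: same major version, image minor <= device minor.
            bool runs = img.major == dev_major && img.minor <= dev_minor;
            if (runs && (best_cubin < 0 || is_newer(img, images[best_cubin]))) {
                best_cubin = i;
            }
        } else {
            // Assumed rule: PTX is usable on equal or higher capability.
            bool can_jit = img.major < dev_major ||
                           (img.major == dev_major && img.minor <= dev_minor);
            if (can_jit && (best_ptx < 0 || is_newer(img, images[best_ptx]))) {
                best_ptx = i;
            }
        }
    }
    return best_cubin >= 0 ? best_cubin : best_ptx;
}
```

Under these assumptions, a fat binary holding only an sm_52 cubin and compute_52 PTX would use the cubin on an sm_52 device and fall back to the PTX on a newer architecture.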

Compilation Stages and Options

NVCC divides compilation into stages, each selected by a command-line option and each with a default output name. Key stages include:

--cuda – Preprocesses CUDA code and generates a .cpp.ii file

--ptx – Emits PTX (.ptx)

--cubin – Emits a cubin binary (.cubin)

--fatbin – Creates a fat binary (.fatbin)

--device-link – Links relocatable device code

--lib – Builds a static library
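To make the idea of a default output name concrete, here is a small sketch that maps a stage option and an input file to the name NVCC would use when no -o is given. The helper names default_output and replace_extension are hypothetical, and the names returned are the Linux defaults as I understand them from the NVCC documentation (Windows uses .obj and .lib variants), so treat the table as an assumption to check against your toolkit version.

```cpp
#include <cassert>
#include <string>

// Replace the extension of `source` (e.g. "x.cu") with `ext` (e.g. ".ptx").
// Assumes a bare file name; a dot inside a directory name is not handled.
std::string replace_extension(const std::string& source, const std::string& ext) {
    std::string::size_type dot = source.rfind('.');
    return source.substr(0, dot) + ext;  // npos keeps the whole name
}

// Default output name on Linux for a given stage option and input file.
// Returns an empty string for options this sketch does not cover.
std::string default_output(const std::string& stage, const std::string& source) {
    assert(!source.empty());
    if (stage == "--cuda")        return source + ".cpp.ii";  // appended, not replaced
    if (stage == "--ptx")         return replace_extension(source, ".ptx");
    if (stage == "--cubin")       return replace_extension(source, ".cubin");
    if (stage == "--fatbin")      return replace_extension(source, ".fatbin");
    if (stage == "--device-link") return "a_dlink.o";         // independent of input name
    if (stage == "--lib")         return "a.a";               // independent of input name
    return "";
}
```

Note that the link-oriented stages ignore the input name entirely, which is why an explicit -o is usually given for them.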

Demo: Compiling a Simple CUDA Program

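#include <cstdio>

// Device kernel: the single launched thread prints one line.
__global__ void cudaKernel(void) {
    printf("GPU says hello.\n");
}

int main(void) {
    // Launch one block of one thread: no kernel arguments,
    // no dynamic shared memory, default stream.
    cudaLaunchKernel((void*)cudaKernel, dim3(1), dim3(1), NULL, 0, NULL);
    // Wait for the kernel to finish so its printf output is flushed.
    cudaDeviceSynchronize();
    return 0;
}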

Running the full compilation with dry‑run and keep options:

nvcc compare-simple.cu -o compare-simple --cudart=shared -keep --dryrun

The dry‑run output reveals a multi‑step process:

Step 1: Preprocess .cu to .cpp.ii using gcc -E.

Step 2: Invoke cicc (NVVM compiler) to generate PTX and auxiliary files ( .ptx, .cudafe1.c, .cudafe1.gpu, etc.).

Step 3: Assemble PTX to a cubin binary with ptxas -arch=sm_52 -o compare-simple.sm_52.cubin.

Step 4: Package cubin and PTX into a fat binary using fatbinary --create=compare-simple.fatbin ....

Step 5: Compile the host side with gcc to produce compare-simple.o.

Step 6: Link host objects and device binaries with nvlink and g++ to produce the final executable compare-simple.
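When the dry-run output gets long, it can help to reduce it to the sequence of tools invoked. The sketch below does that under two assumptions: that each dry-run line starts with the "#$ " prefix nvcc normally prints, and that a line whose first token contains "=" is an environment assignment rather than a command. The function name tools_in_dryrun is hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Extract the tool name (gcc, cicc, ptxas, ...) from each command line of
// `nvcc --dryrun` output, in order of appearance.
std::vector<std::string> tools_in_dryrun(const std::string& log) {
    const std::string prefix = "#$ ";
    std::vector<std::string> tools;
    std::istringstream lines(log);
    std::string line;
    while (std::getline(lines, line)) {
        // Skip anything that is not a dry-run line.
        if (line.compare(0, prefix.size(), prefix) != 0) continue;

        std::istringstream words(line.substr(prefix.size()));
        std::string first;
        if (!(words >> first)) continue;

        // Lines such as PATH=... set environment variables; skip them.
        if (first.find('=') != std::string::npos) continue;

        // Drop surrounding quotes and any leading directory part.
        std::string cleaned;
        for (char c : first) {
            if (c != '"') cleaned += c;
        }
        std::string::size_type slash = cleaned.rfind('/');
        if (slash != std::string::npos) cleaned = cleaned.substr(slash + 1);
        if (cleaned.empty()) continue;

        assert(cleaned.find('/') == std::string::npos);  // only the base name remains
        tools.push_back(cleaned);
    }
    return tools;
}
```

Fed the dry-run output for the demo, the result should mirror the steps listed above, which makes it easy to spot where a different toolkit version inserts or reorders a tool.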

Key Registration Functions

The generated host file compare-simple.cpp.ii contains two hidden functions:

static void __nv_cudaEntityRegisterCallback(void **__T0) {
    ...
    __cudaRegisterFunction(__T0, (void(*)(void))cudaKernel, "_Z10cudaKernelv", "_Z10cudaKernelv", -1);
}

This callback registers the kernel with the runtime, associating the host-side function pointer cudaKernel with the mangled device symbol name _Z10cudaKernelv.
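The string _Z10cudaKernelv is the C++ mangled name of cudaKernel(void) under the Itanium C++ ABI: the prefix _Z, the length of the identifier, the identifier itself, and v for an empty parameter list. The sketch below reproduces only that one case, a free function in the global namespace with no parameters; mangle_nullary is a hypothetical helper, and real mangling covers far more (namespaces, templates, argument types).

```cpp
#include <cassert>
#include <string>

// Build the Itanium C++ ABI mangled name for a global free function that
// takes no parameters: "_Z" + <name length> + <name> + "v".
std::string mangle_nullary(const std::string& name) {
    assert(!name.empty());
    return "_Z" + std::to_string(name.size()) + name + "v";
}
```

For the demo kernel, the identifier cudaKernel has 10 characters, which is where the 10 in the registered name comes from.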

static void __sti____cudaRegisterAll(void) {
    __cudaFatCubinHandle = __cudaRegisterFatBinary(&__fatDeviceText);
    __cudaRegisterFunction(...);
    atexit(__cudaUnregisterBinaryUtil);
}

This function registers the fat binary with the CUDA runtime, which lets the host locate and launch the device code, and it schedules the matching unregistration through atexit.
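The registration pattern can be imitated in plain host C++ to see why a launch by function pointer works. The sketch below is a mock, not the CUDA runtime: kernel_registry, mock_register_function, mock_resolve, and RegisterAll are hypothetical names, and the empty cudaKernel function merely stands in for the host-side stub that nvcc generates. A static object's constructor plays the role of the registration routine by running before main.

```cpp
#include <cassert>
#include <map>
#include <string>

using HostStub = void (*)();

// Mock registry: host stub address -> device symbol name.
// A function-local static guarantees it exists before first use.
std::map<HostStub, std::string>& kernel_registry() {
    static std::map<HostStub, std::string> registry;
    return registry;
}

// Mock stand-in for the runtime's function registration call.
void mock_register_function(HostStub stub, const std::string& device_name) {
    assert(stub != nullptr);
    kernel_registry()[stub] = device_name;
}

// Stand-in for the host-side stub generated for the kernel.
void cudaKernel() {}

// Registration runs during static initialization, before main() starts.
struct RegisterAll {
    RegisterAll() { mock_register_function(&cudaKernel, "_Z10cudaKernelv"); }
};
static RegisterAll register_all_at_startup;

// Mock of what a launch must do first: translate the host function pointer
// into the device symbol to look up in the loaded module.
// Returns an empty string if the stub was never registered.
std::string mock_resolve(HostStub stub) {
    std::map<HostStub, std::string>::const_iterator it = kernel_registry().find(stub);
    return it == kernel_registry().end() ? std::string() : it->second;
}
```

Because the table is filled before main runs, a call such as mock_resolve(&cudaKernel) succeeds on the first line of main, which mirrors how the demo can pass cudaKernel to cudaLaunchKernel without any explicit setup.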

Conclusion

The NVCC toolchain orchestrates a series of transformations, from high-level CUDA source through host C++ preprocessing, PTX generation, cubin creation, and fat binary packaging to final linking, while embedding registration callbacks that bind host and device symbols. Understanding each stage and the associated command-line flags is essential for debugging, optimizing, and extending CUDA applications.
