How to Dynamically Decompress CUDA Fatbin Files Compressed by NVCC
This article explains why enabling NVCC's --fatbin-options -compress-all breaks remote GPU calls, describes the fatbin file layout, shows how to extract and analyze the binary with objcopy, and provides a step‑by‑step implementation of a decompression routine for both ELF and PTX sections.
NVCC compression flag
When building the cuda-sample project with --fatbin-options -compress-all, NVCC compresses the generated fatbin. In remote GPU invocation scenarios the compressed fatbin cannot be executed, as shown by the test results.
fatbin file structure
A fatbin is a simple container format, similar in spirit to a pcap capture file. It begins with a global header followed by a sequence of sections. Each section has its own header and a payload that contains either an ELF image or PTX text; either payload may be compressed.
Extracting a fatbin
Example using the vectorAdd sample from the cuda-sample project. Building with a shared CUDA runtime produces vectorAdd.o:
make EXTRA_NVCCFLAGS=--cudart=shared
The embedded fatbin lives in the .nv_fatbin section and can be extracted with objcopy:
objcopy -O binary -j .nv_fatbin vectorAdd.out vectorAdd.fatbin
The parsed structure can then be printed with:
./fatbinary vectorAdd.fatbin
fatbin size composition
The size of the fatbin body is the sum of all section headers and all section payloads. In the example the recorded size is 0x6b90 (27536) bytes and the global header is 0x10 (16) bytes. The body breaks down as follows: PTX header 0x48 (72) bytes; ELF headers 10 × 0x40 = 640 bytes; PTX payload 0x210 (528) bytes; ELF payloads 0x8c8 × 4 + 0xae8 × 2 + 0xb68 × 3 + 0xb90 = 26296 bytes. These four components sum to exactly 27536 bytes, so the 0x6b90 figure does not include the 16-byte global header. After decompression the size becomes 0x6d9a (28058) bytes.
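The arithmetic above can be checked mechanically. The following is only a sketch that restates the figures quoted in the example; the names `body_size`, `elf_payload_total`, and `check_example` are made up for illustration and do not come from the article's code.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in the example above. */
enum {
    GLOBAL_HEADER = 0x10,  /* fat_elf_header */
    PTX_HEADER    = 0x48,  /* one PTX section header */
    ELF_HEADER    = 0x40,  /* one ELF section header */
    ELF_COUNT     = 10,    /* 4 + 2 + 3 + 1 ELF sections */
    PTX_PAYLOAD   = 0x210
};

/* Sum of the ten ELF payload sizes listed in the example. */
static uint64_t elf_payload_total(void) {
    return 0x8c8u * 4 + 0xae8u * 2 + 0xb68u * 3 + 0xb90u;
}

/* Section headers plus payloads, without the global header. */
static uint64_t body_size(void) {
    return PTX_HEADER + (uint64_t)ELF_COUNT * ELF_HEADER
         + PTX_PAYLOAD + elf_payload_total();
}

/* Hypothetical self-check tying the sum back to the quoted 0x6b90. */
static void check_example(void) {
    assert(elf_payload_total() == 26296);
    assert(body_size() == 0x6b90);                  /* 27536 */
    assert(body_size() + GLOBAL_HEADER == 0x6ba0);  /* 27552 including the global header */
}
```

Calling `check_example()` passes without tripping an assertion, which confirms that 0x6b90 covers the section headers and payloads only.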
Key structures
typedef struct __attribute__((__packed__)) {
    uint32_t magic;
    uint16_t version;
    uint16_t header_size;
    uint64_t size;
} fat_elf_header;

typedef struct __attribute__((__packed__)) {
    uint16_t kind;
    uint16_t unknown1;
    uint32_t header_size;
    uint64_t size;
    uint32_t compressed_size;
    uint32_t unknown2;
    uint16_t minor;
    uint16_t major;
    uint32_t arch;
    uint32_t obj_name_offset;
    uint32_t obj_name_len;
    uint64_t flags;
    uint64_t zero;
    uint64_t decompressed_size;
} fat_text_header;
Flag constants used in the flags field:
#define FATBIN_TEXT_MAGIC    0xBA55ED50
#define FATBIN_FLAG_64BIT    0x0000000000000001LL
#define FATBIN_FLAG_DEBUG    0x0000000000000002LL
#define FATBIN_FLAG_LINUX    0x0000000000000010LL
#define FATBIN_FLAG_COMPRESS 0x0000000000002000LL
kind = 2 indicates an ELF section; kind = 1 indicates a PTX section. The FATBIN_FLAG_COMPRESS bit signals that the payload is compressed.
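To show how these structures fit together, here is a minimal sketch that walks the sections of an in-memory fatbin and counts ELF, PTX, and compressed sections. It rests on several assumptions that the article does not state outright: sections are packed back to back, so the next header starts `header_size + size` bytes after the current one (this is consistent with the size breakdown in the example); the host is little-endian; and the compiler accepts the GCC/Clang packed attribute. `section_counts` and `count_sections` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define FATBIN_TEXT_MAGIC    0xBA55ED50u
#define FATBIN_FLAG_COMPRESS 0x2000ull

/* Same layouts as the structures above, restated so the sketch compiles alone. */
typedef struct __attribute__((__packed__)) {
    uint32_t magic; uint16_t version; uint16_t header_size; uint64_t size;
} fat_elf_header;

typedef struct __attribute__((__packed__)) {
    uint16_t kind; uint16_t unknown1; uint32_t header_size; uint64_t size;
    uint32_t compressed_size; uint32_t unknown2; uint16_t minor; uint16_t major;
    uint32_t arch; uint32_t obj_name_offset; uint32_t obj_name_len;
    uint64_t flags; uint64_t zero; uint64_t decompressed_size;
} fat_text_header;

_Static_assert(sizeof(fat_elf_header) == 0x10, "global header is 16 bytes");
_Static_assert(sizeof(fat_text_header) == 0x40, "section header is 64 bytes");

typedef struct { int elf, ptx, compressed; } section_counts; /* hypothetical */

/* Returns 0 on success, -1 if the buffer is truncated or the magic is wrong. */
static int count_sections(const uint8_t *buf, size_t len, section_counts *out) {
    fat_elf_header eh;
    assert(buf != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    if (len < sizeof eh) return -1;
    memcpy(&eh, buf, sizeof eh);               /* copy out: buf may be unaligned */
    if (eh.magic != FATBIN_TEXT_MAGIC) return -1;
    if (eh.header_size < sizeof eh || eh.header_size > len) return -1;
    if (eh.size > len - eh.header_size) return -1;

    size_t pos = eh.header_size;
    size_t end = eh.header_size + (size_t)eh.size;
    while (pos < end) {
        fat_text_header th;
        if (end - pos < sizeof th) return -1;
        memcpy(&th, buf + pos, sizeof th);
        /* header_size may exceed sizeof th (the PTX header in the example is 0x48). */
        if (th.header_size < sizeof th || th.header_size > end - pos) return -1;
        if (th.size > end - pos - th.header_size) return -1;
        if (th.kind == 2) out->elf++;
        else if (th.kind == 1) out->ptx++;
        if (th.flags & FATBIN_FLAG_COMPRESS) out->compressed++;
        pos += th.header_size + (size_t)th.size;   /* assumed stride to next section */
    }
    return 0;
}
```

The walker advances by the header size each section declares rather than by `sizeof(fat_text_header)`, because the example shows a PTX header that is 8 bytes longer than the ELF headers.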
Decompression algorithm
The following C function implements the NVCC fatbin decompression. It reads control bytes, copies literal segments, then copies back‑references for compressed runs.
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the number of bytes written to output, or 0 on malformed input. */
size_t decompress(const uint8_t *input, size_t input_size,
                  uint8_t *output, size_t output_size) {
    size_t ipos = 0, opos = 0;
    while (ipos < input_size) {
        /* Control byte: high nibble = literal length, low nibble = match length - 4. */
        size_t next_nclen = (input[ipos] & 0xf0) >> 4;
        size_t next_clen = 4 + (input[ipos] & 0x0f);
        ipos++;
        /* A literal length of 15 is extended by following bytes until one is not 0xff. */
        if (next_nclen == 0xf) {
            uint8_t b;
            do {
                if (ipos >= input_size) return 0;
                b = input[ipos++];
                next_nclen += b;
            } while (b == 0xff);
        }
        if (next_nclen > input_size - ipos || next_nclen > output_size - opos) return 0;
        memcpy(output + opos, input + ipos, next_nclen);
        ipos += next_nclen;
        opos += next_nclen;
        /* The final sequence carries literals only. */
        if (ipos >= input_size || opos >= output_size) break;
        /* 16-bit little-endian distance back into the output already written. */
        if (input_size - ipos < 2) return 0;
        size_t back_offset = input[ipos] | ((size_t)input[ipos + 1] << 8);
        ipos += 2;
        /* A match length nibble of 15 is extended the same way as the literal length. */
        if (next_clen == 0xf + 4) {
            uint8_t b;
            do {
                if (ipos >= input_size) return 0;
                b = input[ipos++];
                next_clen += b;
            } while (b == 0xff);
        }
        if (back_offset == 0 || back_offset > opos || next_clen > output_size - opos) return 0;
        /* Byte-wise copy: source and destination overlap when next_clen > back_offset. */
        for (size_t i = 0; i < next_clen; i++) {
            output[opos + i] = output[opos + i - back_offset];
        }
        opos += next_clen;
    }
    return opos;
}
Registration stub
NVCC injects a call to __cudaRegisterFatBinary into the host binary. The replacement stub shown here locates the fatbin header, calls the decompression routine, and replaces bin->data with the uncompressed byte stream before registering it with the GPU.
extern "C" __host__ void **__cudaRegisterFatBinary(void *fatCubin) {
    __fatBinC_Wrapper_t *bin = (__fatBinC_Wrapper_t *)fatCubin;
    char *tmp_data = (char *)bin->data;
    fatBinaryHeader *header = (fatBinaryHeader *)tmp_data;
    // Total length = global header + everything that follows it.
    char *buffer = CudaUtil::DecompressedFatBin(tmp_data,
                                                header->header_size + header->size);
    bin->data = (unsigned long long *)buffer;
    // data now points to the decompressed byte stream
    // (registration and the return value are not shown in this excerpt)
}
Before registration the code parses each section to obtain kernel argument counts and types, which are required for subsequent kernel launches.
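The article does not show what CudaUtil::DecompressedFatBin does internally. As a rough sketch of how a single compressed section might be inflated, the block below pairs a compact, bounds-checked restatement of the control-byte decoder with a helper that sizes its output from the section header. It assumes that `compressed_size` is the length of the compressed payload and `decompressed_size` the expected output length, as the field names in fat_text_header suggest; `decode` and `inflate_section` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Compact restatement of the decoder described earlier, with bounds checks.
   Returns bytes written, or 0 on malformed input. */
static size_t decode(const uint8_t *in, size_t in_len, uint8_t *out, size_t out_cap) {
    size_t ip = 0, op = 0;
    while (ip < in_len) {
        size_t lit = in[ip] >> 4;            /* literal length */
        size_t match = 4 + (in[ip] & 0x0f);  /* back-reference length */
        ip++;
        if (lit == 15) {                     /* extended literal length */
            uint8_t b;
            do { if (ip >= in_len) return 0; b = in[ip++]; lit += b; } while (b == 0xff);
        }
        if (lit > in_len - ip || lit > out_cap - op) return 0;
        memcpy(out + op, in + ip, lit);
        ip += lit;
        op += lit;
        if (ip >= in_len || op >= out_cap) break;  /* last sequence: literals only */
        if (in_len - ip < 2) return 0;
        size_t dist = in[ip] | ((size_t)in[ip + 1] << 8);  /* little-endian distance */
        ip += 2;
        if (match == 19) {                   /* extended match length */
            uint8_t b;
            do { if (ip >= in_len) return 0; b = in[ip++]; match += b; } while (b == 0xff);
        }
        if (dist == 0 || dist > op || match > out_cap - op) return 0;
        for (size_t i = 0; i < match; i++) out[op + i] = out[op + i - dist];
        op += match;
    }
    return op;
}

/* Hypothetical helper: inflate one section payload into a freshly allocated
   buffer. Returns NULL if allocation fails or the stream does not decode to
   exactly decompressed_size bytes. The caller frees the result. */
static uint8_t *inflate_section(const uint8_t *payload, uint32_t compressed_size,
                                uint64_t decompressed_size) {
    assert(payload != NULL);
    uint8_t *out = malloc(decompressed_size ? (size_t)decompressed_size : 1);
    if (out == NULL) return NULL;
    if (decode(payload, compressed_size, out, (size_t)decompressed_size)
            != decompressed_size) {
        free(out);
        return NULL;
    }
    return out;
}
```

As a usage example, the hand-encoded 8-byte stream `{0x35, 'a', 'b', 'c', 0x03, 0x00, 0x10, '!'}` holds three literals, a 9-byte back-reference at distance 3, and one trailing literal. Passing it to `inflate_section` with a `decompressed_size` of 13 yields the bytes `abcabcabcabc!`. A full rebuild of the fatbin would also need to rewrite each section header (size fields and the FATBIN_FLAG_COMPRESS bit), which this sketch leaves out.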
Parsing logic
Typical handling of a section:
if (fatTextHeader.kind == 2) {
    // ELF payload
} else if (fatTextHeader.kind == 1) {
    // PTX payload
}
if (fatTextHeader.flags & FATBIN_FLAG_COMPRESS) {
    // payload is compressed: call decompress()
} else {
    // payload is uncompressed: read directly
}
References
GVirtuS project provides a non‑compressed parser: https://github.com/gvirtus/GVirtuS
