How to Dynamically Decompress CUDA Fatbin Files Compressed by NVCC

This article explains why enabling NVCC's --fatbin-options -compress-all breaks remote GPU calls, describes the fatbin file layout, shows how to extract and analyze the binary with objcopy, and provides a step‑by‑step implementation of a decompression routine for both ELF and PTX sections.

Infra Learning Club

Feb 23, 2025

NVCC compression flag

When building the cuda-sample project with --fatbin-options -compress-all, NVCC compresses the generated fatbin. In remote GPU invocation scenarios the compressed fatbin cannot be executed, as shown by the test results.

fatbin file structure

A fatbin is a simple container format, similar in layout to a pcap capture file. It begins with a global header followed by a sequence of sections. Each section has its own header and a payload that contains either an ELF image or PTX code; either payload may be compressed.

Extracting a fatbin

Example using the vectorAdd sample from the cuda-sample project: building with make EXTRA_NVCCFLAGS=--cudart=shared produces vectorAdd.o. The following command extracts the embedded fatbin:

objcopy -O binary -j .nv_fatbin vectorAdd.out vectorAdd.fatbin

Running ./fatbinary vectorAdd.fatbin then prints the parsed structure.

fatbin size composition

The total size is the sum of the PTX header, the ELF headers, the PTX payload, and the ELF payloads. Example calculation: global header 0x10 (16) bytes; PTX header 0x48 (72) bytes; ELF headers 10 × 0x40 (640) bytes; PTX payload 0x210 (528) bytes; ELF payloads 0x8c8 × 4 + 0xae8 × 2 + 0xb68 × 3 + 0xb90 = 26296 bytes. The quoted total of 0x6b90 (27536) bytes equals 72 + 640 + 528 + 26296, so it counts the section headers and payloads but not the 16-byte global header. After decompression the size becomes 0x6d9a (28058) bytes.
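The arithmetic above can be checked mechanically. The following is a minimal sketch that only re-adds the figures quoted in the example; the names (fatbin_body_size, elf_payload_total, check_example) are hypothetical and not part of any NVIDIA tool.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes quoted in the example, in bytes. */
enum {
    PTX_HEADER  = 0x48,  /* one PTX section header */
    ELF_HEADER  = 0x40,  /* one ELF section header */
    ELF_COUNT   = 10,    /* 4 + 2 + 3 + 1 ELF sections */
    PTX_PAYLOAD = 0x210
};

/* Sum of the ten ELF payload sizes listed in the example. */
static uint32_t elf_payload_total(void) {
    return 0x8c8 * 4 + 0xae8 * 2 + 0xb68 * 3 + 0xb90;
}

/* Section headers plus payloads; the 16-byte global header is left out. */
static uint32_t fatbin_body_size(void) {
    return PTX_HEADER + ELF_COUNT * ELF_HEADER + PTX_PAYLOAD + elf_payload_total();
}

/* Aborts if the quoted figures do not add up. */
static void check_example(void) {
    assert(elf_payload_total() == 26296);
    assert(fatbin_body_size() == 0x6b90);
}
```

Calling check_example() passes silently, which confirms that 0x6b90 is reached without adding the 0x10-byte global header.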

Key structures

typedef struct __attribute__((__packed__)) {
    uint32_t magic;
    uint16_t version;
    uint16_t header_size;
    uint64_t size;
} fat_elf_header;

typedef struct __attribute__((__packed__)) {
    uint16_t kind;
    uint16_t unknown1;
    uint32_t header_size;
    uint64_t size;
    uint32_t compressed_size;
    uint32_t unknown2;
    uint16_t minor;
    uint16_t major;
    uint32_t arch;
    uint32_t obj_name_offset;
    uint32_t obj_name_len;
    uint64_t flags;
    uint64_t zero;
    uint64_t decompressed_size;
} fat_text_header;

The magic number, and the flag constants used in the flags field:

#define FATBIN_TEXT_MAGIC     0xBA55ED50
#define FATBIN_FLAG_64BIT     0x0000000000000001LL
#define FATBIN_FLAG_DEBUG     0x0000000000000002LL
#define FATBIN_FLAG_LINUX     0x0000000000000010LL
#define FATBIN_FLAG_COMPRESS  0x0000000000002000LL

kind = 2 indicates an ELF section; kind = 1 indicates a PTX section. The FATBIN_FLAG_COMPRESS bit in flags signals that the payload is compressed.
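To make the kind and flags fields easier to inspect while debugging, a small helper can turn them into text. This is a sketch under the assumption that only the four flag bits listed above matter; describe_flags and kind_name are hypothetical helper names, and any other bits are silently ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FATBIN_FLAG_64BIT     0x0000000000000001ULL
#define FATBIN_FLAG_DEBUG     0x0000000000000002ULL
#define FATBIN_FLAG_LINUX     0x0000000000000010ULL
#define FATBIN_FLAG_COMPRESS  0x0000000000002000ULL

/* Maps the section kind to a short label. */
static const char *kind_name(uint16_t kind) {
    switch (kind) {
    case 1:  return "PTX";
    case 2:  return "ELF";
    default: return "unknown";
    }
}

/* Writes a space-separated list of the known flag names set in `flags`. */
static void describe_flags(uint64_t flags, char *out, size_t out_size) {
    static const struct { uint64_t bit; const char *name; } known[] = {
        { FATBIN_FLAG_64BIT,    "64bit"    },
        { FATBIN_FLAG_DEBUG,    "debug"    },
        { FATBIN_FLAG_LINUX,    "linux"    },
        { FATBIN_FLAG_COMPRESS, "compress" },
    };
    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (flags & known[i].bit) {
            size_t used = strlen(out);
            /* snprintf truncates rather than overflowing a small buffer. */
            snprintf(out + used, out_size - used, "%s%s",
                     used ? " " : "", known[i].name);
        }
    }
}
```

For a flags value of 0x2011, describe_flags fills the buffer with "64bit linux compress".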

Decompression algorithm

The following C function implements the NVCC fatbin decompression. It reads control bytes, copies literal segments, then copies back‑references for compressed runs.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the number of bytes written, or 0 if the input is malformed. */
size_t decompress(const uint8_t *input, size_t input_size,
                  uint8_t *output, size_t output_size) {
    size_t ipos = 0, opos = 0;
    while (ipos < input_size) {
        /* Control byte: high nibble = literal length, low nibble = match length - 4. */
        size_t next_nclen = (input[ipos] & 0xf0) >> 4;
        size_t next_clen = 4 + (input[ipos] & 0x0f);
        ipos++;
        /* A nibble of 0xf is extended by following bytes; a byte below 0xff ends the run. */
        if (next_nclen == 0xf) {
            uint8_t b;
            do {
                if (ipos >= input_size) return 0;
                b = input[ipos++];
                next_nclen += b;
            } while (b == 0xff);
        }
        /* Copy the literal segment straight from the input. */
        if (next_nclen > input_size - ipos || next_nclen > output_size - opos) return 0;
        memcpy(output + opos, input + ipos, next_nclen);
        ipos += next_nclen;
        opos += next_nclen;
        /* The final sequence carries literals only. */
        if (ipos >= input_size || opos >= output_size) break;
        /* Back-reference distance: 16 bits, little-endian. */
        if (input_size - ipos < 2) return 0;
        size_t back_offset = (size_t)input[ipos] | ((size_t)input[ipos + 1] << 8);
        ipos += 2;
        if (next_clen == 0xf + 4) {
            uint8_t b;
            do {
                if (ipos >= input_size) return 0;
                b = input[ipos++];
                next_clen += b;
            } while (b == 0xff);
        }
        if (back_offset == 0 || back_offset > opos || next_clen > output_size - opos) return 0;
        /* Copy byte by byte: when the distance is shorter than the match length,
           source and destination overlap, so memcpy would be undefined. */
        for (size_t i = 0; i < next_clen; i++) {
            output[opos + i] = output[opos + i - back_offset];
        }
        opos += next_clen;
    }
    return opos;
}
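The variable-length encoding that the decoder applies to both the literal length and the match length can be isolated in one helper. The sketch below mirrors the loop in the function above; read_length is a hypothetical name, and the truncation check is an addition not present in the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Expands a 4-bit length field taken from the control byte.
 * `*ipos` indexes the first possible extension byte and is advanced past
 * every byte consumed. Returns 0 on success, -1 if the input ends early.
 */
static int read_length(uint8_t nibble, const uint8_t *input, size_t input_size,
                       size_t *ipos, size_t *length) {
    assert(nibble <= 0xf);
    size_t n = nibble;
    if (nibble == 0xf) {
        uint8_t b;
        do {
            if (*ipos >= input_size) return -1;  /* truncated stream */
            b = input[(*ipos)++];
            n += b;
        } while (b == 0xff);  /* 0xff means another extension byte follows */
    }
    *length = n;
    return 0;
}
```

A literal length of 280, for example, is stored as the nibble 0xf followed by the bytes 0xff 0x0a (15 + 255 + 10). For a match, the caller adds 4 to the returned value, as the decoder does.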

Registration stub

NVCC injects a call to __cudaRegisterFatBinary into the host binary. The replacement stub shown here locates the fatbin header, calls the decompression routine, and replaces bin->data with the uncompressed byte stream before registering it with the GPU.

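extern "C" __host__ void **__cudaRegisterFatBinary(void *fatCubin) {
    // The wrapper's data field points at the (possibly compressed) fatbin.
    __fatBinC_Wrapper_t *bin = (__fatBinC_Wrapper_t *)fatCubin;
    char *tmp_data = (char *)bin->data;
    fatBinaryHeader *header = (fatBinaryHeader *)tmp_data;
    // Total length = global header size + size of everything after it.
    char *buffer = CudaUtil::DecompressedFatBin(tmp_data,
                     header->header_size + header->size);
    bin->data = (unsigned long long *)buffer;
    // data now points to the decompressed byte stream
    // ... registration and the returned handle are not shown in this excerpt
}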

Before registration the code parses each section to obtain kernel argument counts and types, which are required for subsequent kernel launches.

Parsing logic

Typical handling of a section:

if (fatTextHeader.kind == 2) {
    // ELF payload
} else if (fatTextHeader.kind == 1) {
    // PTX payload
}
if (fatTextHeader.flags & FATBIN_FLAG_COMPRESS) {
    // payload is compressed – call decompress()
} else {
    // payload is uncompressed – read directly
}
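Putting the layout and the per-section checks together, a section walker might look like the sketch below. It rests on several assumptions that the article does not spell out: that FATBIN_TEXT_MAGIC is the value of the global header's magic field, that sections sit back to back, and that each section occupies header_size + size bytes. count_sections and section_counts are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FATBIN_TEXT_MAGIC    0xBA55ED50u
#define FATBIN_FLAG_COMPRESS 0x0000000000002000ULL

typedef struct __attribute__((__packed__)) {
    uint32_t magic;
    uint16_t version;
    uint16_t header_size;
    uint64_t size;
} fat_elf_header;

typedef struct __attribute__((__packed__)) {
    uint16_t kind;
    uint16_t unknown1;
    uint32_t header_size;
    uint64_t size;
    uint32_t compressed_size;
    uint32_t unknown2;
    uint16_t minor;
    uint16_t major;
    uint32_t arch;
    uint32_t obj_name_offset;
    uint32_t obj_name_len;
    uint64_t flags;
    uint64_t zero;
    uint64_t decompressed_size;
} fat_text_header;

/* The packed layouts match the 0x10 and 0x40 header sizes quoted earlier. */
static_assert(sizeof(fat_elf_header) == 0x10, "global header is 16 bytes");
static_assert(sizeof(fat_text_header) == 0x40, "section header is 64 bytes");

typedef struct { int elf, ptx, compressed; } section_counts;

/* Returns 0 on success, -1 if the magic is wrong or the buffer is truncated. */
static int count_sections(const uint8_t *buf, size_t len, section_counts *out) {
    fat_elf_header fh;
    memset(out, 0, sizeof *out);
    if (len < sizeof fh) return -1;
    memcpy(&fh, buf, sizeof fh);  /* memcpy avoids unaligned access */
    if (fh.magic != FATBIN_TEXT_MAGIC || fh.header_size < sizeof fh) return -1;
    if (fh.header_size > len || fh.size > len - fh.header_size) return -1;

    size_t pos = fh.header_size;
    size_t end = fh.header_size + (size_t)fh.size;
    while (pos < end) {
        fat_text_header th;
        if (end - pos < sizeof th) return -1;
        memcpy(&th, buf + pos, sizeof th);
        /* header_size may exceed sizeof th (the PTX header above is 0x48). */
        if (th.header_size < sizeof th || th.header_size > end - pos) return -1;
        if (th.size > end - pos - th.header_size) return -1;
        if (th.kind == 2) out->elf++;
        else if (th.kind == 1) out->ptx++;
        if (th.flags & FATBIN_FLAG_COMPRESS) out->compressed++;
        pos += th.header_size + (size_t)th.size;  /* assumed section stride */
    }
    return 0;
}
```

The walker advances by the header_size stored in each section rather than by sizeof(fat_text_header), because the PTX header in the size example is 0x48 bytes while the struct is 0x40.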

References

The GVirtuS project provides a parser for uncompressed fatbins: https://github.com/gvirtus/GVirtuS
