Fundamentals 14 min read

How to Build a gzip Decompressor in Rust – Inside DEFLATE, Huffman & LZ77

The article walks through creating a compact 250‑line Rust gzip decompressor from scratch, explaining gzip’s header, DEFLATE block types, bit‑reading logic, Huffman coding, and LZ77 back‑references, and shows how to test it with real compressed data, highlighting key implementation challenges and trade‑offs.

Code Mala Tang
Code Mala Tang
Code Mala Tang
How to Build a gzip Decompressor in Rust – Inside DEFLATE, Huffman & LZ77

I wanted to understand how compression works, so I wrote a gzip decompressor from scratch in Rust. The resulting program is about 250 lines and can decompress gzip data from a file or standard input.

Why bother?

gzip is everywhere – it compresses network traffic, log files, documentation pages, and is the format used by CommonCrawl to store billions of web pages. It silently reduces bytes between disk, network, and CPU, making it a fundamental tool in the software ecosystem.

Instead of digging through the 25,569‑line C implementation of zlib, I looked for a smaller reference. The Rust wrapper zlib‑rs is even larger (36,003 lines). The minimal C library miniz is only 1,261 lines when header and source are combined, but it still contains many optimizations and pre‑processor tricks that obscure the core ideas.

gzip wrapper

gzip is essentially a thin wrapper around the DEFLATE algorithm (the same algorithm used in .zip and .png). A gzip file starts with the magic bytes 0x1F 0x8B, followed by flags and metadata, then the compressed data.

assert!(buf[0] == 0x1F && buf[1] == 0x8B, "not a gzip file");
let mut offset = 10; // skip fixed header
if buf[3] == 8 { // FNAME flag – filename present
    offset += buf[10..].iter().position(|&b| b == 0).unwrap() + 1;
}
// other flags are ignored

The only flag I care about is FNAME , which contains the original filename. After skipping it, the real work happens in the inflate function that processes the DEFLATE stream. Other gzip fields (OS, original size, etc.) are ignored for decompression.

DEFLATE blocks

DEFLATE data is organized into blocks. Each block begins with a 1‑bit flag indicating whether it is the last block, followed by a 2‑bit type:

Type 0 : stored (uncompressed) data

Type 1 : compressed with fixed Huffman codes

Type 2 : compressed with dynamic Huffman codes

loop {
    let last = s.bits(1);
    match s.bits(2) {
        0 => stored(&mut s, &mut out),
        1 => fixed(&mut s, &mut out),
        2 => dynamic(&mut s, &mut out),
        _ => panic!("invalid block type"),
    }
    if last != 0 { break; }
}

Stored blocks are trivial – just copy bytes. The interesting work is in fixed and dynamic blocks.

Reading bits

DEFLATE packs bits from the least‑significant to the most‑significant within each byte. The bits helper maintains a buffer of unread bits and refills it as needed.

fn bits(&mut self, need: i32) -> i32 {
    let mut val = self.bit_buffer;
    while self.bit_count < need {
        val |= (self.nextbyte() as i32) << self.bit_count;
        self.bit_count += 8;
    }
    self.bit_buffer = val >> need;
    self.bit_count -= need;
    val & ((1i32 << need) - 1)
}

Huffman coding

Huffman coding assigns shorter bit patterns to frequent symbols and longer patterns to rare symbols. DEFLATE uses canonical Huffman codes, where the code lengths fully determine the actual bit patterns.

fn construct(h: &mut Huffman, length: &[u16], n: usize) -> i32 {
    h.count[..=MAXBITS].fill(0);
    for i in 0..n { h.count[length[i] as usize] += 1; }
    let mut offs = [0i16; MAXBITS + 1];
    for len in 1..MAXBITS { offs[len + 1] = offs[len] + h.count[len]; }
    for sym in 0..n {
        if length[sym] != 0 {
            h.symbol[offs[length[sym] as usize] as usize] = sym as i16;
            offs[length[sym] as usize] += 1;
        }
    }
    // ...
}

Decoding proceeds bit‑by‑bit: read a bit, extend the current code, and check whether it falls within the range of a valid symbol.

fn decode(s: &mut State, h: &Huffman) -> i32 {
    let (mut len, mut code, mut first, mut index) = (1, 0, 0, 0);
    loop {
        code |= bitbuf & 1;
        bitbuf >>= 1;
        let count = h.count[next_idx] as i32;
        if code < first + count { return h.symbol[(index + code - first) as usize] as i32; }
        index += count;
        first = (first + count) << 1;
        code <<= 1;
        len += 1;
        // ...
    }
}

Fixed vs. dynamic codes

Fixed Huffman blocks use a predefined set of code lengths (e.g., literals 0‑143 get 8 bits, 144‑255 get 9 bits). They are simple but not optimal for all data.

Dynamic blocks embed their own Huffman tables at the start of the block, allowing the compressor to tailor the codes to the actual data. Most real‑world gzip files use dynamic blocks.

LZ77 back‑references

Beyond Huffman coding, DEFLATE achieves compression with LZ77 back‑references. Literal symbols (0‑255) are emitted directly; symbols 257‑285 encode length/distance pairs that say “copy len bytes from dist bytes ago”.

if sym < 256 {
    s.output(out, sym as u8);
} else if sym > 256 {
    let idx = (sym - 257) as usize;
    let len = LENS[idx] as i32 + s.bits(LEXT[idx] as i32);
    let dsym = decode(s, dc) as usize;
    let dist = DISTS[dsym] as u32 + s.bits(DEXT[dsym] as i32) as u32;
    for _ in 0..len {
        let c = s.window[s.next.wrapping_sub(dist as usize) & (MAXDIST - 1)];
        s.output(out, c);
    }
}

The length/distance tables include extra bits for fine‑grained values, allowing lengths up to 258 bytes and distances up to 32 KB. A 32 KB sliding window holds recently output bytes, and the distance is masked with & (MAXDIST - 1) for fast wrap‑around.

Forward references

LZ77 can also reference data that has not yet been output, enabling “copy‑while‑producing”. For example, the string aaaaaaaaa can be encoded as a literal a followed by a back‑reference “copy 8 bytes from 1 byte ago”. This avoids storing the entire uncompressed data in memory.

Putting it all together

The complete decompressor is roughly 250 lines of Rust. To use it, compile the program and run it on a gzip file or pipe data through curl:

curl -s https://www.google.com | ./minigzip

The implementation successfully decompresses all valid gzip files, though it omits CRC checks and will panic on malformed input.

What I learned

gzip compression is layered – a gzip wrapper around Huffman‑encoded data, Huffman around LZ77 back‑references, and back‑references around raw bytes. Implementing each layer separately is possible, but the reference implementation does everything in a single pass.

Getting the first working version is the hard part . Starting from a tiny, readable implementation is far less intimidating than diving into a massive, highly optimized code base.

Overall, the project demonstrates that a functional gzip decompressor can be built with a modest amount of code while exposing the essential concepts of DEFLATE, Huffman coding, and LZ77.

RustgzipCompressionHuffmanLZ77DEFLATEdecompressor
Code Mala Tang
Written by

Code Mala Tang

Read source code together, write articles together, and enjoy spicy hot pot together.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.