Backend Development 5 min read

Detecting and Handling Gzip Bombs in Web Crawling with Python Requests

This article explains how to identify gzip‑compressed responses that may be gzip bombs and how to inspect HTTP headers and raw response data with Python's requests library, with command‑line and code examples for measuring compressed and uncompressed sizes without loading the decompressed data into memory.


A previous article described how a backend can return extremely high‑compression gzip files to crash crawlers; this piece flips the perspective and shows how a crawler can avoid falling into such gzip bombs.

The simplest defense is to keep the crawler undetected, because the bomb is only served when the server identifies a crawler. If staying undetected is not possible, the crawler can inspect the HTTP response headers before touching the body. If resp.headers contains a Content-Encoding field whose value includes gzip or deflate, the body is compressed and could be a bomb; if the header is absent, the body is not compressed, which rules out a gzip bomb specifically (though an uncompressed response can still be oversized).
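As a minimal sketch (the helper name is my own, not from the article), the header check might look like this; note that requests exposes headers through a case-insensitive dictionary, so looking up Content-Encoding by that exact name works regardless of how the server capitalized it:

```python
def looks_compressed(headers):
    """Return True if the response body is declared gzip- or deflate-compressed."""
    encoding = headers.get('Content-Encoding', '').lower()
    return 'gzip' in encoding or 'deflate' in encoding

# Usage: fetch with stream=True so the body is not downloaded yet,
# then inspect the headers before deciding whether to read anything.
#
# resp = requests.get(url, stream=True)
# if looks_compressed(resp.headers):
#     ...treat the response with suspicion...
```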

It is also important to note that when the request is made with stream=True, requests does not download or decompress the body until you access resp.content or resp.text. Examining the headers therefore never triggers decompression.

If you need to know the size of a compressed .gz file without decompressing it, you can use the command line:

gzip -l xxx.gz

The output shows the compressed size, the uncompressed size, and the compression ratio. The uncompressed size is read from the ISIZE field in the file’s trailer (the last four bytes), not by decompressing the data.
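The same ISIZE field can be read directly in Python: per the gzip format, the last four bytes of a single-member stream hold the original size modulo 2^32, little-endian. A minimal sketch (the function name is mine); keep in mind the trailer is just stored metadata, so a hostile server can lie in it, and sizes of 4 GiB or more wrap around:

```python
import gzip
import struct

def isize_from_trailer(gz_bytes):
    """Read the uncompressed size (mod 2**32) from the gzip ISIZE trailer."""
    return struct.unpack('<I', gz_bytes[-4:])[0]

# quick self-check: compress 10 000 bytes and read the size back
data = b'x' * 10_000
assert isize_from_trailer(gzip.compress(data)) == 10_000
```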

When using requests , you can obtain the raw compressed binary data by streaming the response:

import requests

resp = requests.get(url, stream=True)
compressed = resp.raw.read()  # raw, still-compressed bytes
print(len(compressed))

The printed number is the size of the compressed payload in bytes.
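Often you do not even need to read the body: when the server sends a Content-Length header, it already declares the compressed size on the wire. A sketch under that assumption (the helper name is mine, and not every server sends the header):

```python
def compressed_size_from_headers(headers):
    """Return the declared compressed body size in bytes, or None if absent."""
    value = headers.get('Content-Length')
    return int(value) if value is not None else None

# Usage with a streamed response:
# resp = requests.get(url, stream=True)
# size = compressed_size_from_headers(resp.headers)
```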

To learn the original (uncompressed) size without holding the decompressed data in memory, you can wrap the raw bytes in a BytesIO object and let the gzip module seek to the end of the stream. (Internally the seek still decompresses the data, but the output is discarded as it goes.)

import gzip
import io
import requests

resp = requests.get(url, stream=True)

compressed = resp.raw.read()  # raw gzip bytes, not yet decompressed
with gzip.open(io.BytesIO(compressed), 'rb') as g:
    g.seek(0, 2)              # seek to the end of the decompressed stream
    origin_size = g.tell()
    print(origin_size)

Converted to megabytes, the printed number matches the uncompressed file size (about 10 MB in the author’s test). With this technique, a crawler can detect abnormally large compressed responses and discard them as probable gzip bombs.
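Putting the pieces together, a crawler might decompress in bounded chunks and bail out as soon as the output exceeds a cap, so a bomb can never fill memory. This is a sketch with made-up names and an arbitrary threshold, not the article's exact code:

```python
import gzip
import io

CHUNK = 64 * 1024  # read the decompressed stream 64 KB at a time

def safe_uncompressed_size(compressed, limit=50 * 1024 * 1024):
    """Stream-decompress in bounded chunks; return the total uncompressed
    size, or None once it exceeds `limit` (a likely gzip bomb)."""
    total = 0
    with gzip.open(io.BytesIO(compressed), 'rb') as g:
        while True:
            chunk = g.read(CHUNK)
            if not chunk:
                return total
            total += len(chunk)
            if total > limit:
                return None

# a modest payload passes; anything over the cap returns None
payload = gzip.compress(b'a' * 1_000_000)
assert safe_uncompressed_size(payload) == 1_000_000
assert safe_uncompressed_size(payload, limit=10) is None
```

Chunked reading is the design point here: a single g.read() would materialize the whole bomb, while fixed-size reads keep peak memory at one chunk.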

backend · Python · gzip · compression · web crawling · Requests
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
