How to Eliminate Duplicate URLs in Large-Scale Python Crawlers

This article explains five practical techniques (list storage, in-memory set, MD5 hashing, bitmap compression, and Bloom filter) for deduplicating URLs efficiently in large-scale Python web crawlers, and highlights their trade-offs in speed, memory usage, and collision risk.

Python Crawling & Data Mining

Nov 30, 2018

When crawling a website, extracting all URLs from a start page and recursively following links can lead to cycles where the same page is visited repeatedly, wasting resources and preventing other pages from being crawled.

1. List‑Based Deduplication

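Store each visited URL in a list (or a database table). Before fetching a new URL, check whether it is already stored and skip it if so. This method is simple, but every check is a linear scan of the list (or a database query), so the cost of each lookup grows with the number of URLs already crawled and efficiency drops sharply at scale.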
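
A minimal sketch of the list-based check, assuming a hypothetical `get_links` callable that returns the outgoing links of a page. The `FAKE_SITE` dictionary and its example.com URLs are invented for illustration and stand in for real HTTP fetches.

```python
def crawl_order(start_urls, get_links):
    """Visit URLs breadth-first, skipping any URL already in the visited list."""
    visited = []                 # plain list: each membership test scans it
    queue = list(start_urls)
    while queue:
        url = queue.pop(0)
        if url in visited:       # O(n) linear scan over everything seen so far
            continue
        visited.append(url)
        queue.extend(get_links(url))
    return visited


# Hypothetical link graph; note the cycle a -> b -> a.
FAKE_SITE = {
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/a"],
    "http://example.com/c": [],
}

order = crawl_order(["http://example.com/a"], lambda u: FAKE_SITE.get(u, []))
print(order)
```

Each page is visited once despite the cycle, but the `url in visited` test gets slower as the list grows.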

2. In‑Memory Set

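Keep visited URLs in a set, which provides O(1) average lookup time. This is fast for small to medium crawlers, but the whole set lives in memory: at roughly 100 bytes per URL, 100 million URLs already need about 10 GB before counting Python's per-object overhead, and with billions of URLs the memory consumption becomes prohibitive.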
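
The set-based check might look like the sketch below. The function name `dedupe` and the sample URLs are illustrative, not taken from any crawler framework.

```python
def dedupe(urls):
    """Return the URLs in first-seen order, dropping repeats."""
    seen = set()      # hash-based membership test, O(1) on average
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


print(dedupe([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",
]))
```

The only change from a list-based check is the container type, which is why this is usually the first optimization to try.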

3. MD5 Hashing

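Apply an MD5 hash to each URL and store the 128-bit digest instead of the full string. This reduces memory per URL substantially (e.g., from ~100 bytes to 16 bytes) while still allowing uniqueness checks, since two distinct URLs are extremely unlikely to share a digest. Scrapy's default duplicate filter takes a similar approach, storing a fixed-length hash fingerprint of each request rather than the raw URL.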
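
As a rough illustration, the sketch below stores the raw 16-byte MD5 digest of each URL using the standard `hashlib` module. The class name `Md5Seen` and its `add_if_new` method are hypothetical names chosen for this example.

```python
import hashlib


def url_fingerprint(url: str) -> bytes:
    """Return the raw 16-byte (128-bit) MD5 digest of a URL."""
    return hashlib.md5(url.encode("utf-8")).digest()


class Md5Seen:
    """Remember URLs by digest instead of by full string."""

    def __init__(self):
        self._digests = set()

    def add_if_new(self, url: str) -> bool:
        """Record the URL; return True if it had not been seen before."""
        fp = url_fingerprint(url)
        if fp in self._digests:
            return False
        self._digests.add(fp)
        return True


seen = Md5Seen()
first = seen.add_if_new("http://example.com/some/very/long/path?page=1")
second = seen.add_if_new("http://example.com/some/very/long/path?page=1")
print(first, second)
```

The 16-byte figure refers to the digest itself; in CPython each `bytes` object and set slot adds its own overhead on top of that.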

4. Bitmap Compression

Allocate a bit array and hash each URL to a single bit position; a set bit means the URL has been seen. At one bit per URL, 1 billion URLs need only about 125 MB. This compresses memory further, but collisions are frequent: different URLs can hash to the same bit, and a new URL that lands on an already-set bit is wrongly treated as visited.
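
The bitmap idea can be sketched with a `bytearray`, assuming MD5 reduced modulo the bitmap size as the hash-to-bit mapping. That mapping and the class name `UrlBitmap` are choices made for this example, not something prescribed by the technique.

```python
import hashlib


class UrlBitmap:
    """One bit per hash slot; a set bit means 'some URL hashed here'."""

    def __init__(self, num_bits: int):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)   # 8 bits per byte

    def _index(self, url: str) -> int:
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.num_bits

    def add_if_new(self, url: str) -> bool:
        """Set the URL's bit; return False if that bit was already set."""
        i = self._index(url)
        byte, mask = i // 8, 1 << (i % 8)
        if self.bits[byte] & mask:
            return False     # already seen, or a collision with another URL
        self.bits[byte] |= mask
        return True


bitmap = UrlBitmap(1_000_000)            # 1,000,000 bits = 125,000 bytes
print(len(bitmap.bits))

# Degenerate one-bit bitmap to make the collision problem visible:
tiny = UrlBitmap(1)
print(tiny.add_if_new("http://example.com/a"))   # new URL
print(tiny.add_if_new("http://example.com/b"))   # different URL, same bit
```

In the one-bit case the second, distinct URL is reported as already seen, which is exactly the false positive described above.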

5. Bloom Filter

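A Bloom filter improves on the bitmap idea by using multiple independent hash functions: each URL sets several bits, and it counts as seen only if all of them are already set. This substantially lowers the probability of false positives while retaining the low-memory advantage. False positives remain possible (a new URL is occasionally skipped), but a URL that was added is never reported as unseen. Bloom filters are well suited to massive, distributed crawlers and are often combined with other deduplication strategies.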
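
A small Bloom filter sketch follows. It assumes that MD5 salted with a seed number is an acceptable stand-in for independent hash functions; the class name, the parameter values, and the example URLs are illustrative only. The helper uses the standard approximation for the false-positive rate, (1 - e^(-kn/m))^k, for m bits, k hash functions, and n inserted items.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, url: str):
        # Assumption: a seed-salted MD5 approximates k independent hashes.
        for seed in range(self.num_hashes):
            digest = hashlib.md5(f"{seed}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, url: str) -> None:
        for i in self._indexes(url):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, url: str) -> bool:
        # Seen only if every one of the k bits is set.
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(url))


def estimated_false_positive_rate(num_bits, num_hashes, num_items):
    """Standard approximation (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


bloom = BloomFilter(num_bits=80_000, num_hashes=5)
added = [f"http://example.com/page/{n}" for n in range(1000)]
for url in added:
    bloom.add(url)

print(all(url in bloom for url in added))   # added URLs are always found
print(estimated_false_positive_rate(80_000, 5, 1000))
```

In practice the bit count and number of hashes are chosen from the expected number of URLs and the false-positive rate the crawler can tolerate.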

For multi-process crawlers, the deduplication structure has to be shared explicitly, because an in-memory set in one process is not visible to the others; inter-process communication (e.g., pipes) is one way to do this.
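
One possible arrangement, sketched under the assumption that a single dedicated process owns the set and answers "is this URL new?" over a `multiprocessing.Pipe`. The function names `dedup_server` and `demo` and the `None` shutdown sentinel are conventions invented for this example. With several workers, each would need its own pipe to the server, or a queue instead.

```python
import multiprocessing as mp


def dedup_server(conn):
    """Own the set of seen URLs and answer membership queries over a pipe."""
    seen = set()
    while True:
        url = conn.recv()
        if url is None:              # sentinel: shut down
            break
        is_new = url not in seen
        seen.add(url)
        conn.send(is_new)            # True if the caller should fetch the URL
    conn.close()


def demo():
    """Start the server in a separate process and query it from this one."""
    parent_conn, child_conn = mp.Pipe()
    server = mp.Process(target=dedup_server, args=(child_conn,))
    server.start()
    results = []
    for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
        parent_conn.send(url)
        results.append(parent_conn.recv())
    parent_conn.send(None)
    server.join()
    return results
```

Call `demo()` from under an `if __name__ == "__main__":` guard so that it also works on platforms where child processes are spawned rather than forked.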
