Big Data 10 min read

Deduplicating 6 Billion URLs with a 1 GB Bitmap and Bloom Filter

This article explains how to use a bitmap to store 6 billion URLs within 1 GB of memory and introduces Bloom filters as a space‑efficient probabilistic structure for deduplication, providing memory calculations, usage scenarios, and Java code examples.

Java Architect Essentials
Java Architect Essentials
Java Architect Essentials
Deduplicating 6 Billion URLs with a 1 GB Bitmap and Bloom Filter

Memory requirement for 6 billion URLs

Storing each URL as a 32‑bit unsigned integer would need 4 bytes × 6 × 10⁹ ≈ 22.35 GB, which exceeds a 1 GB memory limit.

Bitmap (BitMap) solution

A bitmap is a contiguous array of bits where each bit represents the presence (1) or absence (0) of an element. By allocating one bit per URL, the memory consumption becomes 6 × 10⁹ bits ≈ 715 MiB (6 × 10⁹ / 8 / 1024 / 1024), which fits into 1 GB.

To insert a URL, compute a deterministic integer index (e.g., via a hash function that maps the URL to the range [0, 6 × 10⁹‑1]) and set the corresponding bit to 1. Duplicate URLs map to the same index, so the bitmap automatically deduplicates.

Bloom filter

A Bloom filter extends the bitmap idea by using k independent hash functions. When adding an element, each hash function produces an index; the bits at all k indices are set to 1. To query, the same k indices are computed; if any of them is 0 the element is definitely absent, otherwise it may be present (false positives are possible, false negatives are impossible).

Typical parameters:

n – expected number of distinct elements (e.g., 100 million).

p – desired false‑positive probability (e.g., 0.01).

m – size of the bit array, calculated as m = -(n * ln(p)) / (ln(2)^2).

k – number of hash functions, k = (m/n) * ln(2).

Bloom filters cannot delete elements without risking false deletions, and they store only binary information.

Common application scenarios

Web crawlers – avoid revisiting already fetched URLs.

Cache systems – filter out keys that are definitely not in cache, mitigating cache‑penetration attacks.

Distributed systems – reduce network traffic by performing local membership checks.

Spam and blacklist filtering – quickly test whether an email address, IP or phone number is in a blocklist.

Using Bloom filters in Java

Google Guava

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterExample {
    public static void main(String[] args) {
        // Expect 100 elements, false‑positive rate 1%
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(), 100, 0.01);
        filter.put("Lynn");
        filter.put("666");
        filter.put("study");
        System.out.println(filter.mightContain("Lynn")); // true
        System.out.println(filter.mightContain("张三")); // false
    }
}

Apache Commons Collections

import org.apache.commons.lang3.StringUtils;
import org.apache.commons.collections4.BloomFilter;
import org.apache.commons.collections4.functors.HashFunctionIdentity;

public class BloomFilterExample {
    public static void main(String[] args) {
        BloomFilter<String> filter = new BloomFilter<>(
                HashFunctionIdentity.hashFunction(StringUtils::hashCode),
                100, 0.01);
        filter.put("Lynn");
        filter.put("666");
        filter.put("study");
        System.out.println(filter.mightContain("Lynn")); // true
        System.out.println(filter.mightContain("张三")); // false
    }
}

Redis with Redisson

Config config = new Config();
config.useSingleServer().setAddress("redis://127.0.0.1:6379");
RedissonClient redisson = Redisson.create(config);
RBloomFilter<String> filter = redisson.getBloomFilter("myfilter");
filter.tryInit(100, 0.01);          // n = 100, false‑positive = 1%
filter.add("Lynn");
filter.add("666");
filter.add("study");
System.out.println(filter.contains("Lynn")); // true
System.out.println(filter.contains("张三")); // false
redisson.shutdown();

Redis with Jedis (Bloom module)

Jedis jedis = new Jedis("localhost");
jedis.bfCreate("myfilter", 100, 0.01);
jedis.bfAdd("myfilter", "Lynn");
jedis.bfAdd("myfilter", "666");
jedis.bfAdd("myfilter", "study");
System.out.println(jedis.bfExists("myfilter", "Lynn")); // 1
System.out.println(jedis.bfExists("myfilter", "张三")); // 0
jedis.close();

Illustrations

Bitmap example
Bitmap example
Bloom filter concept
Bloom filter concept
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

JavaBig DataMemory OptimizationBitmapdeduplicationData Structuresbloom-filter
Java Architect Essentials
Written by

Java Architect Essentials

Committed to sharing quality articles and tutorials to help Java programmers progress from junior to mid-level to senior architect. We curate high-quality learning resources, interview questions, videos, and projects from across the internet to help you systematically improve your Java architecture skills. Follow and reply '1024' to get Java programming resources. Learn together, grow together.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.