Deduplicating 6 Billion URLs with a 1 GB Bitmap and Bloom Filter
This article explains how to use a bitmap to store 6 billion URLs within 1 GB of memory and introduces Bloom filters as a space‑efficient probabilistic structure for deduplication, providing memory calculations, usage scenarios, and Java code examples.
Memory requirement for 6 billion URLs
Storing each URL as a 32‑bit unsigned integer would need 4 bytes × 6 × 10⁹ ≈ 22.35 GB, which exceeds a 1 GB memory limit.
Bitmap (BitMap) solution
A bitmap is a contiguous array of bits where each bit represents the presence (1) or absence (0) of an element. By allocating one bit per URL, the memory consumption becomes 6 × 10⁹ bits ≈ 715 MiB (6 × 10⁹ / 8 / 1024 / 1024), which fits into 1 GB.
To insert a URL, compute a deterministic integer index (e.g., via a hash function that maps the URL to the range [0, 6 × 10⁹‑1]) and set the corresponding bit to 1. Duplicate URLs map to the same index, so the bitmap automatically deduplicates.
Bloom filter
A Bloom filter extends the bitmap idea by using k independent hash functions. When adding an element, each hash function produces an index; the bits at all k indices are set to 1. To query, the same k indices are computed; if any of them is 0 the element is definitely absent, otherwise it may be present (false positives are possible, false negatives are impossible).
Typical parameters:
n – expected number of distinct elements (e.g., 100 million).
p – desired false‑positive probability (e.g., 0.01).
m – size of the bit array, calculated as m = -(n * ln(p)) / (ln(2)^2).
k – number of hash functions, k = (m/n) * ln(2).
Bloom filters cannot delete elements without risking false deletions, and they store only binary information.
Common application scenarios
Web crawlers – avoid revisiting already fetched URLs.
Cache systems – filter out keys that are definitely not in cache, mitigating cache‑penetration attacks.
Distributed systems – reduce network traffic by performing local membership checks.
Spam and blacklist filtering – quickly test whether an email address, IP or phone number is in a blocklist.
Using Bloom filters in Java
Google Guava
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
public class BloomFilterExample {
public static void main(String[] args) {
// Expect 100 elements, false‑positive rate 1%
BloomFilter<String> filter = BloomFilter.create(
Funnels.stringFunnel(), 100, 0.01);
filter.put("Lynn");
filter.put("666");
filter.put("study");
System.out.println(filter.mightContain("Lynn")); // true
System.out.println(filter.mightContain("张三")); // false
}
}Apache Commons Collections
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.collections4.BloomFilter;
import org.apache.commons.collections4.functors.HashFunctionIdentity;
public class BloomFilterExample {
public static void main(String[] args) {
BloomFilter<String> filter = new BloomFilter<>(
HashFunctionIdentity.hashFunction(StringUtils::hashCode),
100, 0.01);
filter.put("Lynn");
filter.put("666");
filter.put("study");
System.out.println(filter.mightContain("Lynn")); // true
System.out.println(filter.mightContain("张三")); // false
}
}Redis with Redisson
Config config = new Config();
config.useSingleServer().setAddress("redis://127.0.0.1:6379");
RedissonClient redisson = Redisson.create(config);
RBloomFilter<String> filter = redisson.getBloomFilter("myfilter");
filter.tryInit(100, 0.01); // n = 100, false‑positive = 1%
filter.add("Lynn");
filter.add("666");
filter.add("study");
System.out.println(filter.contains("Lynn")); // true
System.out.println(filter.contains("张三")); // false
redisson.shutdown();Redis with Jedis (Bloom module)
Jedis jedis = new Jedis("localhost");
jedis.bfCreate("myfilter", 100, 0.01);
jedis.bfAdd("myfilter", "Lynn");
jedis.bfAdd("myfilter", "666");
jedis.bfAdd("myfilter", "study");
System.out.println(jedis.bfExists("myfilter", "Lynn")); // 1
System.out.println(jedis.bfExists("myfilter", "张三")); // 0
jedis.close();Illustrations
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Architect Essentials
Committed to sharing quality articles and tutorials to help Java programmers progress from junior to mid-level to senior architect. We curate high-quality learning resources, interview questions, videos, and projects from across the internet to help you systematically improve your Java architecture skills. Follow and reply '1024' to get Java programming resources. Learn together, grow together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
