Bloom Filter: Principles, False Positive Rate, and Implementations with Guava and Redis
Bloom filters are space‑efficient probabilistic structures that answer “definitely not” or “maybe” membership queries, with a controllable false‑positive rate derived from bit array size, element count, and hash functions, and can be implemented via Guava’s Java library, Redisson’s Redis wrapper, native Redis modules, or custom bitmap code, dramatically reducing memory usage and latency in large‑scale systems such as URL deduplication or user‑product checks.
1. Background
In many Internet scenarios we need to quickly determine whether a target datum exists among massive data sets (e.g., checking if a URL has been crawled). Storing such flags directly in disk leads to heavy I/O, while keeping them in memory with a simple key‑value store (e.g., Redis) becomes prohibitive when the volume reaches tens of millions. Bloom Filter provides a space‑efficient, high‑throughput solution.
2. Understanding Bloom Filter
2.1 What is a Bloom Filter?
A Bloom Filter, proposed by Bloom in 1970, is essentially a long bit array together with several independent hash functions.
It is a probabilistic data structure that can answer the question: an element is definitely not present or possibly present .
2.2 Advantages
Constant space and O(1) insertion/query time (the number of hash functions is fixed).
Hash functions are independent and can be parallelized in hardware.
No need to store the actual elements, which is useful for privacy‑sensitive scenarios.
Can represent a universal set that no other structure can.
2.3 Disadvantages
There is a non‑zero false‑positive rate.
Elements cannot be deleted.
2.4 Typical Use Cases
Preventing cache penetration.
Web crawler URL deduplication.
Email spam filtering.
Black‑/white‑list checks.
3. Bloom Filter Mechanics
3.1 Structure
A Bloom Filter consists of a large bit array of length m and k different hash functions. For example, a 16‑bit array with three hash functions.
3.2 Adding Elements
When adding an element, each of the k hash functions produces a position in the bit array, and those bits are set to 1. Example with two elements:
Hash1(data1)=1
Hash2(data1)=8
Hash3(data1)=13 Hash1(data2)=2
Hash2(data2)=5
Hash3(data2)=13Both elements share the bit at position 13, illustrating why multiple hash functions are needed to minimise collisions.
3.3 Querying Elements
To test membership, compute the k hash positions. If any of those bits is 0, the element is definitely absent; if all are 1, the element may be present.
Hash1(data1)=1
Hash2(data1)=8
Hash3(data1)=13 Hash1(data3)=2
Hash2(data3)=8
Hash3(data3)=13 Hash1(data4)=1
Hash2(data4)=8
Hash3(data4)=12Data4’s third hash maps to bit 12, which is 0, therefore it is definitely not in the filter.
4. False‑Positive Rate
The false‑positive probability depends on three parameters:
m : number of bits in the array.
n : number of inserted elements.
k : number of hash functions.
Derivation (omitted for brevity) leads to the classic formula:
p = (1 - e^{-kn/m})^k
Given m and n , the optimal k that minimises p is k = (m/n) * ln2 . Conversely, for a target false‑positive rate, the required bit length is m = -(n * ln p) / (ln 2)^2 .
5. Implementation Methods
5.1 Guava (Java)
Google Guava provides a ready‑made Bloom Filter implementation.
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>23.0</version>
</dependency>Typical usage:
private static void GuavaBloomFilter() {
// Create a BloomFilter instance
BloomFilter
bloomFilter = BloomFilter.create(
Funnels.stringFunnel(Charset.defaultCharset()),
EXPECTED_INSERTIONS,
FALSE_PROBABILITY);
// Add elements
bloomFilter.put("element001");
bloomFilter.put("element003");
// Query
System.out.println(bloomFilter.mightContain("element001")); // true
System.out.println(bloomFilter.mightContain("element002")); // false
System.out.println(bloomFilter.approximateElementCount()); // 2
System.out.println(bloomFilter.expectedFpp());
}5.2 Redis‑based Implementations
5.2.1 Redisson (Java client)
Redisson offers a Bloom Filter wrapper that works on a Redis cluster.
<dependency>
<groupId>org.redisson</groupId>
<artifactId>redisson</artifactId>
<version>3.22.1</version>
</dependency> private static void RedissonBloomFilter() {
Config config = new Config();
config.useSingleServer().setAddress("redis://" + REDIS_IP + ":" + REDIS_PORT);
config.useSingleServer().setPassword(REDIS_PASSWORD);
RedissonClient client = Redisson.create(config);
RBloomFilter
filter = client.getBloomFilter("myFilter");
filter.tryInit(EXPECTED_INSERTIONS, FALSE_PROBABILITY);
filter.add("element001");
filter.add("element003");
System.out.println(filter.contains("element002")); // false
System.out.println(filter.contains("element001")); // true
System.out.println(filter.count()); // 2
System.out.println(filter.getExpectedInsertions());
System.out.println(filter.getFalseProbability());
System.out.println(filter.getHashIterations());
System.out.println(filter.getSize());
}5.2.2 Redis 4.0 Bloom Filter Module
Redis 4.0 ships with native Bloom Filter commands (BF.RESERVE, BF.ADD, BF.EXISTS, etc.). Example command list is provided in the source.
5.2.3 Custom Bitmap Implementation
A DIY approach using Redis BITMAP and Java hashing (MurMurHash3) is also described. The full source code is included in the original article.
public class BloomFilterUtils {
// ... (fields, constructor, hash calculation, add, contains, main)
}6. Business Application
6.1 Scenario
In a video‑exposure campaign, the system needs to decide whether a user has any “on‑shelf” products. With billions of users and millions of product‑owner users, a Bloom Filter reduces memory from several gigabytes (when using plain Redis sets) to a few dozen megabytes.
6.2 Choice of Implementation
Implementation
Storage
Applicable Scenario
Remarks
Guava
Local machine
Single‑node
No external resources needed
Redisson
Redis
Distributed
Just connect to Redis
Redis plugin
Redis
Distributed
Requires cluster‑wide plugin support
Redis bitmap
Redis
Distributed
Custom implementation required
For a distributed service environment, Guava is unsuitable, the Redis plugin is costly to configure, while Redisson offers the best trade‑off.
6.3 Effectiveness
Memory consumption dropped dramatically after switching to Bloom Filters (e.g., from 9.52 GB to 308.8 MB for the user set, and from 2.717 GB to 2.756 GB for Redis with only 39.4 MB of actual filter data). Additionally, the end‑to‑end latency of the user‑center page decreased by about 5 ms due to fewer service‑to‑service calls.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.