Comprehensive Guide to Bloom Filters: Principles, Implementations, and Business Applications
This article introduces Bloom filters, explains their probabilistic principles, advantages and drawbacks, details how to add and query elements, derives false‑positive formulas, provides Guava, Redisson, Redis‑module, and custom bitmap implementations with code samples, and showcases real‑world business scenarios and performance benefits.
1. Background Introduction
In the Internet we often need to determine whether a target datum exists among massive data, such as checking if a URL has already been visited by a web crawler. Storing such data on disk incurs heavy I/O; keeping it in memory works for small volumes but becomes prohibitive when the number of entries reaches tens of millions. Bloom filter offers a space‑efficient, high‑performance solution.
2. Understanding Bloom Filter
2.1 What is a Bloom Filter
Bloom filter, proposed by Bloom in 1970, is essentially a long binary vector together with a set of random hash functions.
It is a probabilistic data structure that tells you that an element definitely does not exist or may exist.
2.2 Advantages
Compared with other data structures, Bloom filter has huge advantages in both space and time; its storage size and insertion/query time are constant (i.e., the number of hash functions).
Hash functions are independent, which facilitates hardware‑level parallelism.
Bloom filter does not store the elements themselves, making it suitable for highly confidential scenarios.
Bloom filter can represent a universal set, which other data structures cannot.
2.3 Disadvantages
There is a false‑positive rate.
Deletion is not supported.
2.4 Suitable Scenarios
Prevent cache penetration: quickly decide whether data exists to avoid unnecessary database queries.
Web crawling: de‑duplicate already crawled URLs.
Email spam filtering.
Black‑white list checks.
3. Bloom Filter Principle
3.1 Structure
Bloom filter consists of a huge bit array and multiple independent hash functions. Assume the bit array length is m and the number of hash functions is k . The figure below shows a 16‑bit array with three different hash functions; each bit stores 0 or 1 and is initially 0.
Different hash functions
Specified length array
3.2 Adding Elements
When adding an element, compute k hash values, then set the k corresponding bits to 1. Example with two elements data1 and data2 :
Hash1(data1)=1 Hash2(data1)=8 Hash3(data1)=13
Thus positions 1, 8, 13 are set to 1.
For data2 the hash results are 2, 5, 13, so positions 2, 5, 13 are set to 1.
Hash1(data2)=2 Hash2(data2)=5 Hash3(data2)=13
Because different elements may share some hash positions, multiple hash functions are needed to keep the probability of collisions low.
3.3 Querying Elements
To query an element, compute its k hash values and check the corresponding bits. If any bit is 0, the element definitely does not exist; if all bits are 1, the element may exist.
Example queries:
Hash1(data1)=1 Hash2(data1)=8 Hash3(data1)=13
All three bits are 1, so data1 may exist (and we know it actually does).
Hash1(data3)=2 Hash2(data3)=8 Hash3(data3)=13
All three bits are also 1, but they were set by data1 and data2 , so data3 is a false positive.
Hash1(data4)=1 Hash2(data4)=8 Hash3(data4)=12
Bit 12 is 0, therefore data4 definitely does not exist.
4. Bloom Filter False‑Positive Rate
When querying data3 we observed a false positive. False positives are inevitable; we aim to minimize them. The following sections derive the false‑positive probability.
4.1 Parameters
m: length of the bit array. n: number of inserted elements. k: number of hash functions.
4.2 Derivation Process
The probability that a particular bit remains 0 after one hash is (1‑1/m) . After k independent hashes the probability becomes (1‑1/m)^k . Extending to n inserted elements, the probability that a given bit stays 0 is (1‑1/m)^{kn} ≈ e^{-kn/m} . Consequently, the probability that a bit is 1 is 1‑e^{-kn/m} .
4.3 False‑Positive Formula
The false‑positive probability is the chance that all k bits checked for an absent element are 1, i.e., (1‑e^{-kn/m})^k . When m and n are fixed, the optimal number of hash functions that minimizes the false‑positive rate is k = (m/n)·ln2 . Conversely, for a given false‑positive rate and k , the required bit length is m = -(k·n)/ln(1‑p) . The formula shows that increasing n raises the false‑positive rate, while increasing m lowers it.
5. Implementation Methods
5.1 Guava Implementation
Guava (Google’s core libraries) provides a ready‑made Bloom filter implementation, avoiding the need to reinvent the wheel.
Method Name
Function
Parameters
Return Type
put
Add element
put(T object)
boolean
mightContain
Check if element may exist
mightContain(T object)
boolean
copy
Create a new BloomFilter from this instance
copy()
BloomFilter
approximateElementCount
Number of elements added
approximateElementCount()
long
expectedFpp
Return false‑positive probability
expectedFpp()
double
isCompatible
Check compatibility with another filter
isCompatible(BloomFilter that)
boolean
putAll
Merge another Bloom filter via bitwise OR
putAll(BloomFilter that)
void
Dependency
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>23.0</version>
</dependency>Test code
private static void GuavaBloomFilter() {
// Create Bloom filter object
BloomFilter bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charset.defaultCharset()), EXPECTED_INSERTIONS, FALSE_PROBABILITY);
// Add elements
bloomFilter.put("element001");
bloomFilter.put("element003");
// Check existence
System.out.println(bloomFilter.mightContain("element001")); // true
System.out.println(bloomFilter.mightContain("element002")); // false
System.out.println(bloomFilter.approximateElementCount()); // 2
System.out.println(bloomFilter.expectedFpp());
}5.2 Redis Implementation
Open‑source Redisson (RBloomFilter).
Redis 4.0 built‑in Bloom filter module.
Custom implementation using Redis bitmap.
5.2.1 Redisson
Method table
Method Name
Function
Parameters
Return Value
add
Add element
add(T object)
boolean
contains
Check element existence
contains(T object)
boolean
count
Number of elements added
count()
long
getExpectedInsertions
Expected number of insertions
getExpectedInsertions()
long
getFalseProbability
False‑positive probability
getFalseProbability()
double
getHashIterations
Number of hash iterations per element
getHashIterations()
int
getSize
Number of bits required in Redis
getSize()
long
tryInit
Initialize filter parameters
tryInit(long expectedInsertions, double falseProbability)
boolean
delete
Delete object
delete()
boolean
Dependency
<dependency>
<groupId>org.redisson</groupId>
<artifactId>redisson</artifactId>
<version>3.22.1</version>
</dependency>Test code
private static void RedissonBloomFilter() {
Config config = new Config();
config.useSingleServer().setAddress("redis://" + REDIS_IP + ":" + REDIS_PORT);
config.useSingleServer().setPassword(REDIS_PASSWORD);
RedissonClient redissonClient = Redisson.create(config);
RBloomFilter
bloomFilter = redissonClient.getBloomFilter(BLOOM_FILTER_NAME);
bloomFilter.tryInit(EXPECTED_INSERTIONS, FALSE_PROBABILITY);
bloomFilter.add("element001");
bloomFilter.add("element003");
System.out.println(bloomFilter.contains("element002")); // false
System.out.println(bloomFilter.contains("element001")); // true
System.out.println(bloomFilter.count()); // 2
System.out.println(bloomFilter.getExpectedInsertions()); // 1000000
System.out.println(bloomFilter.getFalseProbability()); // 0.01
System.out.println(bloomFilter.getHashIterations());
System.out.println(bloomFilter.getSize());
}5.2.2 Redis 4.0 Built‑in Bloom Filter Module
Basic commands
Command
Function
Parameters
BF.RESERVE
Create an empty Bloom filter with given capacity and error rate
BF.RESERVE {key} {error_rate} {capacity} [EXPANSION {expansion}] [NONSCALING]
BF.ADD
Add an item to the filter
BF.ADD {key} {item}
BF.MADD
Add multiple items
BF.MADD {key} {item ...}
BF.INSERT
Add items with optional capacity and error rate, optionally auto‑create
BF.INSERT {key} [CAPACITY {cap}] [ERROR {error}] [EXPANSION {expansion}] [NOCREATE] [NONSCALING] ITEMS {item ...}
BF.EXISTS
Check if an item may exist
BF.EXISTS {key} {item}
BF.MEXISTS
Check multiple items
BF.MEXISTS {key} {item ...}
BF.SCANDUMP
Incrementally dump filter data
BF.SCANDUMP {key} {iter}
BF.LOADCHUNK
Load a dumped chunk
BF.LOADCHUNK {key} {iter} {data}
BF.INFO
Get filter information
BF.INFO {key}
BF.DEBUG
Show internal details
BF.DEBUG {key}
Dependency
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>4.2.0</version>
</dependency>Test code
private static void RedisBloomFilter() {
BloomFilterCommands bloomFilterCommands = new JedisPooled(REDIS_IP, REDIS_PORT, "", REDIS_PASSWORD);
BFReserveParams bfReserveParams = new BFReserveParams();
bfReserveParams.expansion(2);
String test = bloomFilterCommands.bfReserve(BLOOM_FILTER_NAME, FALSE_PROBABILITY, EXPECTED_INSERTIONS, bfReserveParams);
bloomFilterCommands.bfAdd(BLOOM_FILTER_NAME, "element001");
bloomFilterCommands.bfAdd(BLOOM_FILTER_NAME, "element003");
System.out.println(bloomFilterCommands.bfExists(BLOOM_FILTER_NAME, "element001")); // true
System.out.println(bloomFilterCommands.bfExists(BLOOM_FILTER_NAME, "element002")); // false
}5.2.3 Custom Bitmap Implementation
Method table
Method Name
Function
Parameters
Return Value
add
Add element
add(String key, String element, int expireSec)
boolean
contains
Check element existence
contains(String key, String element)
boolean
getExpectedInsertions
Expected number of elements
getExpectedInsertions()
long
getFalseProbability
False‑positive probability
getFalseProbability()
double
getNumHashFunctions
Number of hash functions
getNumHashFunctions()
int
getBitmapLength
Length of bitmap
getBitmapLength()
long
BloomFilterUtils
Create Bloom filter object
BloomFilterUtils(long expectedInsertions, double falseProbability)
BloomFilterUtils
public class BloomFilterUtils {
private static final String BF_KEY_PREFIX = "bf_";
private long numApproxElements;
private double falseProbability;
private int numHashFunctions;
private int bitmapLength;
private JedisResourcePool jedisResourcePool;
// Constructor and methods omitted for brevity – see original source for full implementation.
public static void main(String[] args) {
// Example usage – omitted for brevity.
}
}6. Business Use of Bloom Filter
6.1 Business Scenarios
In the "C1 video exposure" activity, we need to quickly determine whether a user has any on‑shelf products to decide whether to show an activity entry on the personal‑center page. With hundreds of millions of users and millions of on‑shelf product users, querying the product service for each user would cause massive overhead. Using a Bloom filter reduces memory consumption to dozens of MB compared with several GB in Redis, greatly improving efficiency.
6.2 Choosing an Implementation
Implementation
Storage Location
Applicable Scenario
Remarks
Guava
Machine memory
Single‑machine
Only needs local memory, no external resources.
Redisson
Redis
Distributed
Connect to Redis and use.
Redis plugin
Redis
Distributed
Requires Redis cluster support for Bloom filter module.
Redis bitmap
Redis
Distributed
Custom implementation needed.
For a distributed service environment, Guava is unsuitable, and the Redis plugin requires complex configuration and high cost. Redisson, which connects to Redis for insertion and query, is the most appropriate choice.
6.3 Effect of Using Bloom Filter
1. Memory consumption
At launch, storing qualified users in Codis increased memory from 1.98 GB to 9.52 GB (7.54 GB used). After optimization, user volume was reduced 25‑fold, bringing memory down to 308.8 MB. Switching to Redisson Bloom filter increased Redis memory from 2.7172 GB to 2.7566 GB, while the data occupied only 39.4 MB.
Memory usage before/after using Codis:
Before insertion: After insertion:
Memory usage before/after using Bloom filter:
Before insertion: After insertion:
2. Service performance
Using Bloom filter reduces the number of calls to the product service, decreasing overall latency by about 5 ms.
Author: Li Shuaqi, backend engineer at Zhuanzhuan, responsible for commercial C‑end systems (ad retrieval, billing, feature engineering, etc.).
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.