Unlocking Fast Set Membership: Bloom & Cuckoo Filters Explained
This article introduces Bloom filters and Cuckoo filters, explains their probabilistic nature, false‑positive behavior, space‑time trade‑offs, provides Go and Java implementation examples, and discusses practical use cases such as Redis extensions and high‑traffic e‑commerce scenarios.
Bloom Filter
Bloom filters are probabilistic data structures used to test whether an element is a member of a set. They do not store the actual elements but a hashed representation, guaranteeing no false negatives while allowing false positives, which makes them space‑efficient and fast.
Internally a Bloom filter consists of a bit array. When an element is added, it is hashed and the bits at positions
[hashval % nbits]are set to 1. To query an element, the same hash functions are applied and the corresponding bits are checked; if any bit is 0 the element is definitely absent.
Multiple hash functions reduce collision risk. The number of bits per element is fixed at creation; more bits lower the false‑positive probability. The fill‑rate of the filter also affects accuracy—over‑filled filters increase false positives.
Redis offers an expandable Bloom filter that creates a new, larger filter when capacity is reached, checking across all filters for membership. Insertion time is O(K) where K is the number of hash functions; checking an element in an expanded filter is O(K) or O(K·(n+1)) with n expanded filters.
<code>type BloomFilter struct {
m uint // number of bits
k uint // number of hash functions
b *bitset.BitSet
}
func NewWithEstimates(n uint, fp float64) *BloomFilter {
m, k := EstimateParameters(n, fp)
return New(m, k)
}
func EstimateParameters(n uint, p float64) (m uint, k uint) {
m = uint(math.Ceil(-1 * float64(n) * math.Log(p) / math.Pow(math.Log(2), 2)))
k = uint(math.Ceil(math.Log(2) * float64(m) / float64(n)))
return
}
func New(m uint, k uint) *BloomFilter {
return &BloomFilter{max(1, m), max(1, k), bitset.New(m)}
}
func (f *BloomFilter) Add(data []byte) *BloomFilter {
h := baseHashes(data)
for i := uint(0); i < f.k; i++ {
f.b.Set(f.location(h, i))
}
return f
}
func (f *BloomFilter) Test(data []byte) bool {
h := baseHashes(data)
for i := uint(0); i < f.k; i++ {
if !f.b.Test(f.location(h, i)) {
return false
}
}
return true
}
func (f *BloomFilter) location(h [4]uint64, i uint) uint {
return uint(location(h, i) % uint64(f.m))
}
func location(h [4]uint64, i uint) uint64 {
ii := uint64(i)
return h[ii%2] + ii*h[2+(((ii+(ii%2))%4)/2)]
}
</code>Using Bloom Filters with Redisson and Guava
Redisson provides a simple API for Bloom filters. Example (Java):
<code>private void redisson() {
RedissonClient redissonClient = Redisson.create();
RBloomFilter<Object> bloomFilter = redissonClient.getBloomFilter("bloomFilter");
// Initialize with capacity 1 billion and false‑positive rate 0.001
bloomFilter.tryInit(1000000000, 0.001);
Object obj = new Object();
bloomFilter.add(obj); // add element
boolean exist = bloomFilter.contains(obj); // check existence
}
</code>Guava also offers a thread‑safe Bloom filter implementation (since version 23.0):
<code>import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
public class BloomFilterExample {
public static void main(String[] args) {
BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), 3000000, 0.01);
for (int i = 0; i < 3000000; i++) {
bloomFilter.put(i);
}
for (int i = 0; i < 3001000; i++) {
if (bloomFilter.mightContain(i)) {
System.out.println(i + " might be in the filter.");
} else {
System.out.println(i + " is definitely not in the filter.");
}
}
}
}
</code>Cuckoo Filter
Cuckoo filters address Bloom filter limitations by supporting deletions. They use a bucket array where each bucket stores a small number of fingerprints (partial hashes). An element maps to two candidate buckets; if both are full, an existing element is evicted and re‑hashed, potentially triggering a chain of relocations.
Insertion time is O(1) amortized, but high load factors can degrade performance. Like Bloom filters, Cuckoo filters also produce false positives due to limited bucket space, fingerprint collisions, hash function properties, and load factor.
Key parameters (as used in the popular Golang implementation): each element has 2 candidate buckets, each bucket stores 4 fingerprints, and a fingerprint size of 8 bits yields a false‑positive rate between 0.00001 and 0.002 while keeping space usage high.
<code>// Delete removes a fingerprint from the filter
func (cf *Filter) Delete(data []byte) bool {
i1, fp := getIndexAndFingerprint(data, cf.bucketPow)
if cf.delete(fp, i1) {
return true
}
i2 := getAltIndex(fp, i1, cf.bucketPow)
return cf.delete(fp, i2)
}
</code>Deletion is safe only for elements that were previously inserted; otherwise it may unintentionally remove shared fingerprints, leading to false positives.
Typical Use Cases
Username existence check : preload registered usernames into a filter to quickly reject duplicates.
Ad targeting : store purchased items per user to filter irrelevant ads.
High‑traffic e‑commerce : during large promotions, cache completed orders in a Bloom filter to block invalid detail‑page requests and protect the database.
References
Redis – Bloom filter
GitHub – bloom (Go implementation)
Redisson – Bloom filter
Redis – Cuckoo filter
Linvon – Cuckoo filter paper
COOLSHELL – Cuckoo Filter: Design and Implementation
木鸟杂技 – Cuckoo hashing and Cuckoo filter
GitHub – cuckoofilter
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.