Understanding Bloom Filter: Concept, Principles, Implementation, and Applications
This article explains the concept, principles, implementation details, and practical applications of Bloom Filters, including formulas for optimal bit array size and hash count, Java code examples using Guava, and common use cases such as deduplication, web crawling, and spam filtering.
Algorithm Background
If you need to determine whether an element belongs to a set, a common approach is to store all elements and compare. Data structures like linked lists, trees, and hash tables follow this idea, trading time for space or vice versa. When the dataset grows, storage and lookup become costly.
In scenarios with strict response time requirements, increasing elements leads to larger memory consumption and slower retrieval. The challenge is to find a data structure that offers both low time and space overhead. Bloom Filter provides such a solution.
Bloom Filter Concept
A Bloom Filter, proposed by Bloom in 1970, is essentially a long binary vector combined with multiple random hash functions. It can test whether an element is in a set with high space efficiency and fast query time, at the cost of a certain false‑positive rate and difficulty in deletion.
Bloom Filter Principle
When an element is added, K hash functions map it to K positions in a bit array and set those bits to 1. To query, the same K positions are checked: if any bit is 0 the element is definitely absent; if all are 1 the element is probably present.
Unlike a single‑hash bitmap, a Bloom Filter uses K independent hash functions, reducing collision probability.
Bloom Filter Drawbacks
Bloom Filters sacrifice accuracy and deletion convenience for efficiency.
False positives may occur; a queried element might not be in the set even though all K bits are 1. A whitelist can mitigate this.
Deletion is hard because clearing bits may affect other elements. Counting Bloom Filters address this issue.
Bloom Filter Implementation
Guava provides a ready‑made Bloom Filter implementation. Two essential parameters are the expected number of items n and the desired false‑positive probability fpp. From these, the bit array size m and the number of hash functions k are derived.
(1) Bit Array Size Selection
The optimal size m is calculated as:
(2) Hash Function Selection
The optimal number of hash functions k is:
Choosing good hash functions greatly impacts performance. A common technique is to use a single high‑quality hash (e.g., MurmurHash3) with different seeds to simulate multiple functions.
Code Example: Optimal Number of Bits
/**
* 计算 Bloom Filter的bit位数m
*
* @param n 预期数据量
* @param p 误判率 (must be 0 < p < 1)
*/
@VisibleForTesting
static long optimalNumOfBits(long n, double p) {
if (p == 0) {
p = Double.MIN_VALUE;
}
return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}Code Example: Optimal Number of Hash Functions
/**
* 计算最佳k值,即在Bloom过滤器中插入的每个元素的哈希数
*
* @param n 预期数据量
* @param m bloom filter中总的bit位数 (must be positive)
*/
@VisibleForTesting
static int optimalNumOfHashFunctions(long n, long m) {
// (m / n) * log(2), but avoid truncation due to division!
return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}Putting data into a Bloom Filter uses MurmurHash3 to obtain a 128‑bit hash, splits it into two 64‑bit values, and combines them to set the appropriate bits.
Code Example: Put Method
public <T> boolean put(T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
// 利用MurmurHash3得到数据的hash值对应的字节数组
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
// 取低8个字节、高8个字节,转成long类型
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
boolean bitsChanged = false;
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
combinedHash += hash2;
}
return bitsChanged;
}The corresponding mightContain method follows the same hashing logic for membership queries.
Bloom Filter Usage
A simple demo shows how to create and query a Bloom Filter using Guava, similar to using a HashMap.
package com.qunar.sage.wang.common.bloom.filter;
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import com.google.common.hash.Funnels;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.ToString;
public class BloomFilterTest {
public static void main(String[] args) {
long expectedInsertions = 10000000;
double fpp = 0.00001;
BloomFilter<CharSequence> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), expectedInsertions, fpp);
bloomFilter.put("aaa");
bloomFilter.put("bbb");
boolean containsString = bloomFilter.mightContain("aaa");
System.out.println(containsString);
BloomFilter<Email> emailBloomFilter = BloomFilter.create((Funnel<Email>) (from, into) -> into.putString(from.getDomain(), Charsets.UTF_8), expectedInsertions, fpp);
emailBloomFilter.put(new Email("sage.wang", "quanr.com"));
boolean containsEmail = emailBloomFilter.mightContain(new Email("sage.wangaaa", "quanr.com"));
System.out.println(containsEmail);
}
@Data
@Builder
@ToString
@AllArgsConstructor
public static class Email {
private String userName;
private String domain;
}
}Bloom Filter Applications
Monitoring systems can use Bloom Filters to deduplicate metric names before storing them.
Crawlers filter already‑visited URLs to avoid re‑crawling.
Spam email filtering: a Bloom Filter can represent billions of addresses using a fraction of the memory required by a traditional hash table.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
