How to Choose the Optimal Bloom Filter Size and Hash Count for Low False Positives
This article walks through the mathematics of Bloom filters, showing how to model false‑positive probability, derive optimal bit array size and hash‑function count, and apply the formulas to a 4‑million‑item dataset with concrete examples and performance tables.
Bloom Filter Basics
A Bloom filter is a probabilistic data structure that tests whether an element is in a set using a bit array and a set of non‑cryptographic hash functions. It can answer "definitely not" with 100% certainty and "probably in" with a configurable false‑positive rate.
Insertion Example
filter.insert(item_1) // => {18, 28, 26}
filter.insert(item_2) // => {4, 7, 24}
filter.insert(item_3) // => {21, 16, 24}Each insertion runs all hash functions, treats the resulting numbers as indices in the bit array, and sets those bits to 1. After inserting three items, the bit array contains a mixture of 0s and 1s reflecting the hashed positions.
Query Example
filter.check(unknown_1) // => NOT IN THE DATASETThe check function runs the same hash functions on the query element. If any of the indexed bits is 0, the element is definitely not in the set. For unknown_1 one hash maps to a 0 bit, so the filter correctly reports absence.
filter.check(unknown_2) // => PROBABLY IN THE DATASETAll three hashed positions for unknown_2 happen to be 1, so the filter reports a possible presence even though the element was never inserted—a false positive caused by bit collisions.
False‑Positive Probability Derivation
Let m be the number of bits in the filter, k the number of hash functions, and n the number of inserted items. After inserting n items, the probability that a particular bit remains 0 is (1 - 1/m)^{kn} ≈ e^{-kn/m}.
Consequently, the probability that a bit is 1 is 1 - e^{-kn/m}. When querying an element not in the set, all k hashed bits must be 1, giving a false‑positive rate of (1 - e^{-kn/m})^{k}.
Optimizing the Number of Hash Functions
Taking the derivative of the false‑positive expression with respect to k and setting it to zero yields the optimal number of hash functions for given m and n: k_opt = (m / n) * ln 2.
Optimizing the Bit Array Size for a Target Rate
Re‑arranging the false‑positive formula to solve for m gives m = -(n * k) / ln(1 - p), where p is the desired false‑positive probability.
Concrete Example: 4 Million Items
Assume n ≈ 4,000,000. Trying a 3.125 Mb filter (≈3,125,000 bits) yields k ≈ 4 and a false‑positive rate of about 5 %.
Increasing the size and adjusting k improves the rate:
3.75 Mb, k = 5 → 2.7 % false‑positive
4.79 Mb, k = 6 → 1 % false‑positive
6.25 Mb, k = 8 → 0.25 % false‑positive
The table above (shown in the image) visualises the trade‑off between memory size, hash count, and error rate.
Final Calculation
For a target false‑positive rate of 1 % with n = 4,000,000, the optimal parameters are roughly m = 4.79 Mb (≈38,340,233 bits) and k = 6. This configuration yields a false‑positive probability of about 1 % while using only a few megabytes of memory.
The article concludes with a thought‑experiment: because Bloom filters cannot delete bits (multiple items share bits), how might one represent deletions?
BirdNest Tech Talk
Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
