Fundamentals 11 min read

How to Choose the Optimal Bloom Filter Size and Hash Count for Low False Positives

This article walks through the mathematics of Bloom filters, showing how to model false‑positive probability, derive optimal bit array size and hash‑function count, and apply the formulas to a 4‑million‑item dataset with concrete examples and performance tables.

BirdNest Tech Talk

Dec 4, 2024

How to Choose the Optimal Bloom Filter Size and Hash Count for Low False Positives

Bloom Filter Basics

A Bloom filter is a probabilistic data structure that tests whether an element is in a set using a bit array and a set of non‑cryptographic hash functions. It can answer "definitely not" with 100% certainty and "probably in" with a configurable false‑positive rate.

Insertion Example

filter.insert(item_1) // => {18, 28, 26}
filter.insert(item_2) // => {4, 7, 24}
filter.insert(item_3) // => {21, 16, 24}

Each insertion runs all hash functions, treats the resulting numbers as indices in the bit array, and sets those bits to 1. After inserting three items, the bit array contains a mixture of 0s and 1s reflecting the hashed positions.

Query Example

filter.check(unknown_1) // => NOT IN THE DATASET

The check function runs the same hash functions on the query element. If any of the indexed bits is 0, the element is definitely not in the set. For unknown_1 one hash maps to a 0 bit, so the filter correctly reports absence.

filter.check(unknown_2) // => PROBABLY IN THE DATASET

All three hashed positions for unknown_2 happen to be 1, so the filter reports a possible presence even though the element was never inserted—a false positive caused by bit collisions.

False‑Positive Probability Derivation

Let m be the number of bits in the filter, k the number of hash functions, and n the number of inserted items. After inserting n items, the probability that a particular bit remains 0 is (1 - 1/m)^{kn} ≈ e^{-kn/m}.

Consequently, the probability that a bit is 1 is 1 - e^{-kn/m}. When querying an element not in the set, all k hashed bits must be 1, giving a false‑positive rate of (1 - e^{-kn/m})^{k}.

Optimizing the Number of Hash Functions

Taking the derivative of the false‑positive expression with respect to k and setting it to zero yields the optimal number of hash functions for given m and n: k_opt = (m / n) * ln 2.

Optimizing the Bit Array Size for a Target Rate

Re‑arranging the false‑positive formula to solve for m gives m = -(n * k) / ln(1 - p), where p is the desired false‑positive probability.

Concrete Example: 4 Million Items

Assume n ≈ 4,000,000. Trying a 3.125 Mb filter (≈3,125,000 bits) yields k ≈ 4 and a false‑positive rate of about 5 %.

Increasing the size and adjusting k improves the rate:

3.75 Mb, k = 5 → 2.7 % false‑positive

4.79 Mb, k = 6 → 1 % false‑positive

6.25 Mb, k = 8 → 0.25 % false‑positive

The table above (shown in the image) visualises the trade‑off between memory size, hash count, and error rate.

Final Calculation

For a target false‑positive rate of 1 % with n = 4,000,000, the optimal parameters are roughly m = 4.79 Mb (≈38,340,233 bits) and k = 6. This configuration yields a false‑positive probability of about 1 % while using only a few megabytes of memory.

The article concludes with a thought‑experiment: because Bloom filters cannot delete bits (multiple items share bits), how might one represent deletions?

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Memory optimization algorithm analysis Bloom Filter false positive hash functions probabilistic data structures

Written by

BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Bloom Filter Basics

Insertion Example

Query Example

False‑Positive Probability Derivation

Optimizing the Number of Hash Functions

Optimizing the Bit Array Size for a Target Rate

Concrete Example: 4 Million Items

Final Calculation

BirdNest Tech Talk

How this landed with the community

Was this worth your time?

0 Comments

Concrete Example: 4 Million Items