Backend Development 28 min read

Comprehensive Guide to Bloom Filters: Principles, Implementations, and Business Applications

This article introduces Bloom filters, explains their probabilistic principles, advantages and drawbacks, details how to add and query elements, derives false‑positive formulas, provides Guava, Redisson, Redis‑module, and custom bitmap implementations with code samples, and showcases real‑world business scenarios and performance benefits.

Zhuanzhuan Tech

Mar 22, 2024

Comprehensive Guide to Bloom Filters: Principles, Implementations, and Business Applications

1. Background Introduction

In the Internet we often need to determine whether a target datum exists among massive data, such as checking if a URL has already been visited by a web crawler. Storing such data on disk incurs heavy I/O; keeping it in memory works for small volumes but becomes prohibitive when the number of entries reaches tens of millions. Bloom filter offers a space‑efficient, high‑performance solution.

2. Understanding Bloom Filter

2.1 What is a Bloom Filter

Bloom filter, proposed by Bloom in 1970, is essentially a long binary vector together with a set of random hash functions.

It is a probabilistic data structure that tells you that an element definitely does not exist or may exist.

2.2 Advantages

Compared with other data structures, Bloom filter has huge advantages in both space and time; its storage size and insertion/query time are constant (i.e., the number of hash functions).

Hash functions are independent, which facilitates hardware‑level parallelism.

Bloom filter does not store the elements themselves, making it suitable for highly confidential scenarios.

Bloom filter can represent a universal set, which other data structures cannot.

2.3 Disadvantages

There is a false‑positive rate.

Deletion is not supported.

2.4 Suitable Scenarios

Prevent cache penetration: quickly decide whether data exists to avoid unnecessary database queries.

Web crawling: de‑duplicate already crawled URLs.

Email spam filtering.

Black‑white list checks.

3. Bloom Filter Principle

3.1 Structure

Bloom filter consists of a huge bit array and multiple independent hash functions. Assume the bit array length is m and the number of hash functions is k. The figure below shows a 16‑bit array with three different hash functions; each bit stores 0 or 1 and is initially 0.

Different hash functions

Specified length array

3.2 Adding Elements

When adding an element, compute k hash values, then set the k corresponding bits to 1. Example with two elements data1 and data2:

Hash1(data1)=1 Hash2(data1)=8 Hash3(data1)=13

Thus positions 1, 8, 13 are set to 1.

For data2 the hash results are 2, 5, 13, so positions 2, 5, 13 are set to 1.

Hash1(data2)=2 Hash2(data2)=5 Hash3(data2)=13

Because different elements may share some hash positions, multiple hash functions are needed to keep the probability of collisions low.

3.3 Querying Elements

To query an element, compute its k hash values and check the corresponding bits. If any bit is 0, the element definitely does not exist; if all bits are 1, the element may exist.

Example queries:

Hash1(data1)=1 Hash2(data1)=8 Hash3(data1)=13

All three bits are 1, so data1 may exist (and we know it actually does).

Hash1(data3)=2 Hash2(data3)=8 Hash3(data3)=13

All three bits are also 1, but they were set by data1 and data2, so data3 is a false positive.

Hash1(data4)=1 Hash2(data4)=8 Hash3(data4)=12

Bit 12 is 0, therefore data4 definitely does not exist.

4. Bloom Filter False‑Positive Rate

When querying data3 we observed a false positive. False positives are inevitable; we aim to minimize them. The following sections derive the false‑positive probability.

4.1 Parameters

m: length of the bit array. n: number of inserted elements. k: number of hash functions.

4.2 Derivation Process

The probability that a particular bit remains 0 after one hash is (1‑1/m). After k independent hashes the probability becomes (1‑1/m)^k. Extending to n inserted elements, the probability that a given bit stays 0 is (1‑1/m)^{kn} ≈ e^{-kn/m}. Consequently, the probability that a bit is 1 is 1‑e^{-kn/m}.

4.3 False‑Positive Formula

The false‑positive probability is the chance that all k bits checked for an absent element are 1, i.e., (1‑e^{-kn/m})^k. When m and n are fixed, the optimal number of hash functions that minimizes the false‑positive rate is k = (m/n)·ln2. Conversely, for a given false‑positive rate and k, the required bit length is m = -(k·n)/ln(1‑p). The formula shows that increasing n raises the false‑positive rate, while increasing m lowers it.

5. Implementation Methods

5.1 Guava Implementation

Guava (Google’s core libraries) provides a ready‑made Bloom filter implementation, avoiding the need to reinvent the wheel.

Method Name

Function

Parameters

Return Type

put

Add element

put(T object)

boolean

mightContain

Check if element may exist

mightContain(T object)

boolean

copy

Create a new BloomFilter from this instance

copy()

BloomFilter

approximateElementCount

Number of elements added

approximateElementCount()

long

expectedFpp

Return false‑positive probability

expectedFpp()

double

isCompatible

Check compatibility with another filter

isCompatible(BloomFilter that)

boolean

putAll

Merge another Bloom filter via bitwise OR

putAll(BloomFilter that)

void

Dependency

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>23.0</version>
</dependency>

Test code

private static void GuavaBloomFilter() {
    // Create Bloom filter object
    BloomFilter bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charset.defaultCharset()), EXPECTED_INSERTIONS, FALSE_PROBABILITY);
    // Add elements
    bloomFilter.put("element001");
    bloomFilter.put("element003");
    // Check existence
    System.out.println(bloomFilter.mightContain("element001")); // true
    System.out.println(bloomFilter.mightContain("element002")); // false
    System.out.println(bloomFilter.approximateElementCount()); // 2
    System.out.println(bloomFilter.expectedFpp());
}

5.2 Redis Implementation

Open‑source Redisson (RBloomFilter).

Redis 4.0 built‑in Bloom filter module.

Custom implementation using Redis bitmap.

5.2.1 Redisson

Method table

Method Name

Function

Parameters

Return Value

add

Add element

add(T object)

boolean

contains

Check element existence

contains(T object)

boolean

count

Number of elements added

count()

long

getExpectedInsertions

Expected number of insertions

getExpectedInsertions()

long

getFalseProbability

False‑positive probability

getFalseProbability()

double

getHashIterations

Number of hash iterations per element

getHashIterations()

int

getSize

Number of bits required in Redis

getSize()

long

tryInit

Initialize filter parameters

tryInit(long expectedInsertions, double falseProbability)

boolean

delete

Delete object

delete()

boolean

Dependency

<dependency>
    <groupId>org.redisson</groupId>
    <artifactId>redisson</artifactId>
    <version>3.22.1</version>
</dependency>

Test code

private static void RedissonBloomFilter() {
    Config config = new Config();
    config.useSingleServer().setAddress("redis://" + REDIS_IP + ":" + REDIS_PORT);
    config.useSingleServer().setPassword(REDIS_PASSWORD);
    RedissonClient redissonClient = Redisson.create(config);
    RBloomFilter<String> bloomFilter = redissonClient.getBloomFilter(BLOOM_FILTER_NAME);
    bloomFilter.tryInit(EXPECTED_INSERTIONS, FALSE_PROBABILITY);
    bloomFilter.add("element001");
    bloomFilter.add("element003");
    System.out.println(bloomFilter.contains("element002")); // false
    System.out.println(bloomFilter.contains("element001")); // true
    System.out.println(bloomFilter.count()); // 2
    System.out.println(bloomFilter.getExpectedInsertions()); // 1000000
    System.out.println(bloomFilter.getFalseProbability()); // 0.01
    System.out.println(bloomFilter.getHashIterations());
    System.out.println(bloomFilter.getSize());
}

5.2.2 Redis 4.0 Built‑in Bloom Filter Module

Basic commands

Command

Function

Parameters

BF.RESERVE

Create an empty Bloom filter with given capacity and error rate

BF.RESERVE {key} {error_rate} {capacity} [EXPANSION {expansion}] [NONSCALING]

BF.ADD

Add an item to the filter

BF.ADD {key} {item}

BF.MADD

Add multiple items

BF.MADD {key} {item ...}

BF.INSERT

Add items with optional capacity and error rate, optionally auto‑create

BF.INSERT {key} [CAPACITY {cap}] [ERROR {error}] [EXPANSION {expansion}] [NOCREATE] [NONSCALING] ITEMS {item ...}

BF.EXISTS

Check if an item may exist

BF.EXISTS {key} {item}

BF.MEXISTS

Check multiple items

BF.MEXISTS {key} {item ...}

BF.SCANDUMP

Incrementally dump filter data

BF.SCANDUMP {key} {iter}

BF.LOADCHUNK

Load a dumped chunk

BF.LOADCHUNK {key} {iter} {data}

BF.INFO

Get filter information

BF.INFO {key}

BF.DEBUG

Show internal details

BF.DEBUG {key}

Dependency

<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>4.2.0</version>
</dependency>

Test code

private static void RedisBloomFilter() {
    BloomFilterCommands bloomFilterCommands = new JedisPooled(REDIS_IP, REDIS_PORT, "", REDIS_PASSWORD);
    BFReserveParams bfReserveParams = new BFReserveParams();
    bfReserveParams.expansion(2);
    String test = bloomFilterCommands.bfReserve(BLOOM_FILTER_NAME, FALSE_PROBABILITY, EXPECTED_INSERTIONS, bfReserveParams);
    bloomFilterCommands.bfAdd(BLOOM_FILTER_NAME, "element001");
    bloomFilterCommands.bfAdd(BLOOM_FILTER_NAME, "element003");
    System.out.println(bloomFilterCommands.bfExists(BLOOM_FILTER_NAME, "element001")); // true
    System.out.println(bloomFilterCommands.bfExists(BLOOM_FILTER_NAME, "element002")); // false
}

5.2.3 Custom Bitmap Implementation

Method table

Method Name

Function

Parameters

Return Value

add

Add element

add(String key, String element, int expireSec)

boolean

contains

Check element existence

contains(String key, String element)

boolean

getExpectedInsertions

Expected number of elements

getExpectedInsertions()

long

getFalseProbability

False‑positive probability

getFalseProbability()

double

getNumHashFunctions

Number of hash functions

getNumHashFunctions()

int

getBitmapLength

Length of bitmap

getBitmapLength()

long

BloomFilterUtils

Create Bloom filter object

BloomFilterUtils(long expectedInsertions, double falseProbability)

BloomFilterUtils

public class BloomFilterUtils {
    private static final String BF_KEY_PREFIX = "bf_";
    private long numApproxElements;
    private double falseProbability;
    private int numHashFunctions;
    private int bitmapLength;
    private JedisResourcePool jedisResourcePool;
    // Constructor and methods omitted for brevity – see original source for full implementation.
    public static void main(String[] args) {
        // Example usage – omitted for brevity.
    }
}

6. Business Use of Bloom Filter

6.1 Business Scenarios

In the "C1 video exposure" activity, we need to quickly determine whether a user has any on‑shelf products to decide whether to show an activity entry on the personal‑center page. With hundreds of millions of users and millions of on‑shelf product users, querying the product service for each user would cause massive overhead. Using a Bloom filter reduces memory consumption to dozens of MB compared with several GB in Redis, greatly improving efficiency.

6.2 Choosing an Implementation

Implementation

Storage Location

Applicable Scenario

Remarks

Guava

Machine memory

Single‑machine

Only needs local memory, no external resources.

Redisson

Redis

Distributed

Connect to Redis and use.

Redis plugin

Redis

Distributed

Requires Redis cluster support for Bloom filter module.

Redis bitmap

Redis

Distributed

Custom implementation needed.

For a distributed service environment, Guava is unsuitable, and the Redis plugin requires complex configuration and high cost. Redisson, which connects to Redis for insertion and query, is the most appropriate choice.

6.3 Effect of Using Bloom Filter

1. Memory consumption

At launch, storing qualified users in Codis increased memory from 1.98 GB to 9.52 GB (7.54 GB used). After optimization, user volume was reduced 25‑fold, bringing memory down to 308.8 MB. Switching to Redisson Bloom filter increased Redis memory from 2.7172 GB to 2.7566 GB, while the data occupied only 39.4 MB.

Memory usage before/after using Codis:

Before insertion: After insertion:

Memory usage before/after using Bloom filter:

Before insertion: After insertion:

2. Service performance

Using Bloom filter reduces the number of calls to the product service, decreasing overall latency by about 5 ms.

Author: Li Shuaqi, backend engineer at Zhuanzhuan, responsible for commercial C‑end systems (ad retrieval, billing, feature engineering, etc.).

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java Redis Guava bloom filter probabilistic data structure

Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.