Big Data 21 min read

Bloom Filter: Principles, False Positive Rate, and Implementations with Guava and Redis

Bloom filters are space‑efficient probabilistic structures that answer “definitely not” or “maybe” membership queries, with a controllable false‑positive rate derived from bit array size, element count, and hash functions, and can be implemented via Guava’s Java library, Redisson’s Redis wrapper, native Redis modules, or custom bitmap code, dramatically reducing memory usage and latency in large‑scale systems such as URL deduplication or user‑product checks.

Sohu Tech Products

Apr 10, 2024

Bloom Filter: Principles, False Positive Rate, and Implementations with Guava and Redis

1. Background

In many Internet scenarios we need to quickly determine whether a target datum exists among massive data sets (e.g., checking if a URL has been crawled). Storing such flags directly in disk leads to heavy I/O, while keeping them in memory with a simple key‑value store (e.g., Redis) becomes prohibitive when the volume reaches tens of millions. Bloom Filter provides a space‑efficient, high‑throughput solution.

2. Understanding Bloom Filter

2.1 What is a Bloom Filter?

A Bloom Filter, proposed by Bloom in 1970, is essentially a long bit array together with several independent hash functions.

It is a probabilistic data structure that can answer the question: an element is definitely not present or possibly present .

2.2 Advantages

Constant space and O(1) insertion/query time (the number of hash functions is fixed).

Hash functions are independent and can be parallelized in hardware.

No need to store the actual elements, which is useful for privacy‑sensitive scenarios.

Can represent a universal set that no other structure can.

2.3 Disadvantages

There is a non‑zero false‑positive rate.

Elements cannot be deleted.

2.4 Typical Use Cases

Preventing cache penetration.

Web crawler URL deduplication.

Email spam filtering.

Black‑/white‑list checks.

3. Bloom Filter Mechanics

3.1 Structure

A Bloom Filter consists of a large bit array of length m and k different hash functions. For example, a 16‑bit array with three hash functions.

3.2 Adding Elements

When adding an element, each of the k hash functions produces a position in the bit array, and those bits are set to 1. Example with two elements:

Hash1(data1)=1
Hash2(data1)=8
Hash3(data1)=13

Hash1(data2)=2
Hash2(data2)=5
Hash3(data2)=13

Both elements share the bit at position 13, illustrating why multiple hash functions are needed to minimise collisions.

3.3 Querying Elements

To test membership, compute the k hash positions. If any of those bits is 0, the element is definitely absent; if all are 1, the element may be present.

Hash1(data1)=1
Hash2(data1)=8
Hash3(data1)=13

Hash1(data3)=2
Hash2(data3)=8
Hash3(data3)=13

Hash1(data4)=1
Hash2(data4)=8
Hash3(data4)=12

Data4’s third hash maps to bit 12, which is 0, therefore it is definitely not in the filter.

4. False‑Positive Rate

The false‑positive probability depends on three parameters: m: number of bits in the array. n: number of inserted elements. k: number of hash functions.

Derivation (omitted for brevity) leads to the classic formula: p = (1 - e^{-kn/m})^k Given m and n, the optimal k that minimises p is k = (m/n) * ln2. Conversely, for a target false‑positive rate, the required bit length is m = -(n * ln p) / (ln 2)^2.

5. Implementation Methods

5.1 Guava (Java)

Google Guava provides a ready‑made Bloom Filter implementation.

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>23.0</version>
</dependency>

Typical usage:

private static void GuavaBloomFilter() {
    // Create a BloomFilter instance
    BloomFilter<String> bloomFilter = BloomFilter.create(
        Funnels.stringFunnel(Charset.defaultCharset()),
        EXPECTED_INSERTIONS,
        FALSE_PROBABILITY);
    // Add elements
    bloomFilter.put("element001");
    bloomFilter.put("element003");
    // Query
    System.out.println(bloomFilter.mightContain("element001")); // true
    System.out.println(bloomFilter.mightContain("element002")); // false
    System.out.println(bloomFilter.approximateElementCount()); // 2
    System.out.println(bloomFilter.expectedFpp());
}

5.2 Redis‑based Implementations

5.2.1 Redisson (Java client)

Redisson offers a Bloom Filter wrapper that works on a Redis cluster.

<dependency>
    <groupId>org.redisson</groupId>
    <artifactId>redisson</artifactId>
    <version>3.22.1</version>
</dependency>

private static void RedissonBloomFilter() {
    Config config = new Config();
    config.useSingleServer().setAddress("redis://" + REDIS_IP + ":" + REDIS_PORT);
    config.useSingleServer().setPassword(REDIS_PASSWORD);
    RedissonClient client = Redisson.create(config);
    RBloomFilter<String> filter = client.getBloomFilter("myFilter");
    filter.tryInit(EXPECTED_INSERTIONS, FALSE_PROBABILITY);
    filter.add("element001");
    filter.add("element003");
    System.out.println(filter.contains("element002")); // false
    System.out.println(filter.contains("element001")); // true
    System.out.println(filter.count()); // 2
    System.out.println(filter.getExpectedInsertions());
    System.out.println(filter.getFalseProbability());
    System.out.println(filter.getHashIterations());
    System.out.println(filter.getSize());
}

5.2.2 Redis 4.0 Bloom Filter Module

Redis 4.0 ships with native Bloom Filter commands (BF.RESERVE, BF.ADD, BF.EXISTS, etc.). Example command list is provided in the source.

5.2.3 Custom Bitmap Implementation

A DIY approach using Redis BITMAP and Java hashing (MurMurHash3) is also described. The full source code is included in the original article.

public class BloomFilterUtils {
    // ... (fields, constructor, hash calculation, add, contains, main)
}

6. Business Application

6.1 Scenario

In a video‑exposure campaign, the system needs to decide whether a user has any “on‑shelf” products. With billions of users and millions of product‑owner users, a Bloom Filter reduces memory from several gigabytes (when using plain Redis sets) to a few dozen megabytes.

6.2 Choice of Implementation

Implementation

Storage

Applicable Scenario

Remarks

Guava

Local machine

Single‑node

No external resources needed

Redisson

Redis

Distributed

Just connect to Redis

Redis plugin

Redis

Distributed

Requires cluster‑wide plugin support

Redis bitmap

Redis

Distributed

Custom implementation required

For a distributed service environment, Guava is unsuitable, the Redis plugin is costly to configure, while Redisson offers the best trade‑off.

6.3 Effectiveness

Memory consumption dropped dramatically after switching to Bloom Filters (e.g., from 9.52 GB to 308.8 MB for the user set, and from 2.717 GB to 2.756 GB for Redis with only 39.4 MB of actual filter data). Additionally, the end‑to‑end latency of the user‑center page decreased by about 5 ms due to fewer service‑to‑service calls.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java Big Data Redis Guava bloom filter probabilistic data structure false-positive-rate

Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.