Databases 13 min read

Scaling Bloom Filter for 800 Million OpenIDs in Redis

This article explains how to use a Bloom filter backed by Redis bitmap and Roaring Bitmap sharding to efficiently filter 800 million OpenID queries, covering memory planning, hash function selection, code implementation, and performance‑tuned batch write strategies.

dbaplus Community

Nov 15, 2023

Scaling Bloom Filter for 800 Million OpenIDs in Redis

Problem Description

In a high‑traffic mini‑program, OpenID is the primary query key. After caching, the system still receives about 10 million queries per day, and roughly 30% (≈3 million) of those OpenIDs do not exist in MySQL, causing wasted lookups.

Solution Idea

The straightforward remedy is to apply a Bloom filter before hitting MySQL, allowing a fast existence check for OpenIDs.

Bloom Filter Basics

Data structure that tests whether an element is in a set.

Multiple hash functions map an element to bits in a bit array; bits are set to 1.

If any mapped bit is 0, the element is definitely absent.

If all bits are 1, the element may be present (false‑positive possibility).

The false‑positive rate depends on the number of hash functions and the size of the bit vector.

Handling an 800 Million‑Element Dataset

Key challenges include memory consumption, hash function planning, and effectiveness validation.

Memory and Bitmap in Redis

Redis implements Bloom filters using its Bitmap type, which is a bit‑array stored as a string. A single string can hold over 2³² bits, giving a theoretical capacity of about 4.29 billion elements.

Memory usage can be inspected with MEMORY USAGE key. Setting a bit at a very high index (e.g., 1 billion) forces Redis to allocate a contiguous 1‑billion‑bit array (~128 MiB), even though most bits remain zero.

SETBIT key index 1

Sharding with Roaring Bitmap

Because the 800 million OpenIDs cannot fit comfortably in a single bitmap, the data is sharded using a Roaring Bitmap‑style approach:

Range partitioning: The 32‑bit key space is divided into 2¹⁶ buckets; each bucket stores the low 16 bits.

Storage: High 16 bits select the bucket (container); low 16 bits are stored as the position within the container.

Query: To test existence, check whether the low‑bits position is set in the corresponding container.

Example: OpenID 821697800 → hex 30FA1D08. High 16 bits 30FA select the container; low 16 bits 1D08 (decimal 7432) set the bit.

Implementation: Extracting High and Low Bits

$container = $hash >> 16;

$index = $hash & 0xFFFF;

These operations isolate the bucket and the bit position for Redis SETBIT commands.

Hash Function Planning

With 800 million elements, a target false‑positive rate of 0.1% suggests roughly 20 billion bits (≈2 GiB) and about 15 hash functions. Testing showed 15 hashes overload a container, causing >50% false positives. Reducing the container count by using 17 high bits (instead of 16) and limiting to 10 hash functions achieved the desired error rate.

Impact of Redis Key Count

Redis stores all keys in a hash table with O(1) lookup, so the additional ~60 k keys for the sharded Bloom filter have negligible read/write impact. The main effect is on RDB backup time, where a larger key space slightly slows snapshot creation.

Batch Write Modes: Multi vs Pipeline

Redis provides two batch execution modes:

Pipeline: Sends multiple commands in one network round‑trip, boosting throughput but without atomicity.

Multi (transaction): Executes commands atomically after EXEC, offering rollback and optimistic locking at the cost of slower performance.

Benchmarking 1 million SETBIT operations showed Pipeline to be ~50× faster than Multi. For the 800 million OpenIDs (≈10 hashes each), Pipeline estimates ~20 hours, whereas Multi would take about a month.

$redis->multi()
    ->set('key1','val1')
    ->get('key1')
    ->set('key2','val2')
    ->get('key2')
    ->exec();

Choosing Pipeline is essential for large‑scale data loading.

Conclusion

By combining a Redis bitmap‑based Bloom filter, Roaring‑style sharding, carefully chosen hash functions, and Pipeline batch writes, the system reduces unnecessary MySQL queries from 30% of 10 million daily calls to near zero, while keeping memory usage and false‑positive rates within acceptable bounds.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

bitmap bloom filter backend optimization hash functions Roaring Bitmap Large-Scale Data

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Problem Description

Solution Idea

Bloom Filter Basics

Handling an 800 Million‑Element Dataset

Memory and Bitmap in Redis

Sharding with Roaring Bitmap

Implementation: Extracting High and Low Bits

Hash Function Planning

Impact of Redis Key Count

Batch Write Modes: Multi vs Pipeline

Conclusion

dbaplus Community

How this landed with the community

Was this worth your time?

0 Comments

Handling an 800 Million‑Element Dataset