How to Build a High‑Performance Redis‑Backed Bloom Filter for Billion‑Scale De‑duplication
This article explains the design and implementation of a space‑efficient Bloom filter using Redis bit operations, covering theory, false‑positive calculations, PHP code, performance testing on billions of daily logs, and optimizations such as pipelining and sharding for large‑scale deduplication.
Background
We have a project that processes click logs at a scale of 1 billion events per day. Logs are collected from front‑end machines via Flume, sent to Kafka, and consumed by Spark, which stores results in Redis. Because Kafka guarantees no message loss, timeouts can cause message re‑delivery or duplicate consumption, so idempotence is required during consumption.
Given the massive deduplication data size (≈1 billion), we allow a controlled false‑positive rate and therefore consider using a Bloom filter, which offers extremely low memory usage and constant‑time operations.
Bloom‑Filter
Introduction
A Bloom filter, proposed by Burton Bloom in 1970, consists of a long bit vector and a set of hash functions. With k hash functions and an m -bit vector, insertion sets k bits to 1, and membership testing checks whether all k bits are 1.
False‑Positive Rate Calculation
The false‑positive probability can be approximated by: P \approx (1 - e^{-\frac{mk}{n}})^{k} When m/n is fixed, the optimal number of hash functions is: k = \frac{m}{n}\ln2 Choosing the nearest integer minimizes the false‑positive probability.
Application Scenarios
Web crawlers: checking whether a URL has already been crawled (large URL set, false‑positive acceptable).
Spam detection: billions of email addresses, occasional false‑positive is tolerable.
Parameters m and k must be selected based on data volume and acceptable false‑positive rate.
False‑Positive‑Ratio Table (including memory usage)
m m/n k FPR FPN Mem
300亿 30 8 9.01e-6 9011 3.49GB
300亿 30 16 7.26e-7 726 3.49GB
500亿 50 8 2.28e-7 228 5.82GB
500亿 50 16 1e-9 1 5.82GBFor a 10‑minute slice (≈7 million records), the table becomes:
m m/n k FPR FPN Mem
2100万 30 8 9.01e-6 63 25.03MB
2100万 30 16 7.26e-7 5 25.03MB
3500万 50 8 2.28e-7 1.6 41.72MB
3500万 50 16 1e-9 0 41.72MBChoosing m/n = 50 and k = 16 satisfies the business requirement with a false‑positive rate of 1e-9.
Redis String SETBIT Method
SETBIT key offset value Sets or clears the bit at the specified offset in the string value stored at key . The offset must be between 0 and 2 32 ‑1 (bit mapping limited to 512 MB). Time complexity O(1). Returns the original bit value.
redis> SETBIT bit 10086 1 (integer) 0
redis> GETBIT bit 10086 (integer) 1
redis> GETBIT bit 100 (integer) 0Since a Bloom filter operates on bits, we can implement it using SETBIT.
PHP Demo
BKDRHash
The BKDR hash function is simple and effective. C implementation:
// BKDR Hash Function
unsigned int BKDRHash(char *str) {
unsigned int seed = 131; // 31, 131, 1313, ...
unsigned int hash = 0;
while (*str) {
hash = hash * seed + (*str++);
}
return (hash & 0x7FFFFFFF);
}We generate k different hash functions by varying the seed.
PHP BKDRHash Implementation
function getBKDRHashSeed($n) {
if ($n === 0) return 31;
$j = $n + 2;
$r = 0;
for ($i = 0; $i < $j; $i++) {
$r = ($i % 2) ? $r * 10 + 3 : $r * 10 + 1;
}
return $r;
}
function BKDRHash($str, $seed) {
$hash = 0;
$len = strlen($str);
for ($i = 0; $i < $len; $i++) {
$hash = ((floatval($hash * $seed) & 0x7FFFFFFF) + ord($str[$i])) & 0x7FFFFFFF;
}
return $hash & 0x7FFFFFFF;
}Bloom Filter Class
class Bf {
public $redis;
public $key;
public $m;
public $k;
public function __construct($key, $m, $k) {
if ($m > 4294967296) {
error_log('ERROR: m over 4294967296');
return false;
}
$this->key = $key;
$this->m = $m;
$this->k = $k;
$this->redis = new Redis();
$this->redis->connect('127.0.0.1', 6379);
}
public function add($e) {
$e = (string)$e;
$c = 0;
for ($i = 0; $i < $this->k; $i++) {
$seed = self::getBKDRHashSeed($i);
$hash = self::BKDRHash($e, $seed);
$offset = $hash % $this->m;
$c += $this->redis->setbit($this->key, $offset, 1);
}
return $c === $this->k;
}
public function flushall() {
return $this->redis->delete($this->key);
}
// static helpers omitted for brevity
}Testing with 10 million random strings (4‑12 characters) produced 7 458 78 duplicates, which were correctly detected by the filter with zero false positives.
Test Results
Total time: 1 h 4 m 45 s
Bloom‑Filter Add QPS: 2 574/s
Redis QPS: 20 592/s (each add issues k Redis commands)
False positives: 0
Optimization
Each add call performs k separate Redis round‑trips, which dominates latency. Using Redis pipelining batches the commands, reducing RTT and improving throughput.
Redis Pipelining
Pipelining sends multiple commands in one network round‑trip and returns results in order, yielding roughly a ten‑fold speed increase for independent SETBIT operations.
Optimized Bloom Filter Class
class Bf {
// ... constructor unchanged ...
public function add($e) {
$e = (string)$e;
$this->redis->multi(Redis::PIPELINE);
for ($i = 0; $i < $this->k; $i++) {
$seed = self::getBKDRHashSeed($i);
$hash = self::BKDRHash($e, $seed);
$offset = $hash % $this->m;
$this->redis->setbit($this->key, $offset, 1);
}
$t1 = microtime(true);
$rt = $this->redis->exec();
$t2 = microtime(true);
$c = array_sum($rt);
return $c === $this->k;
}
}After pipelining, the total runtime dropped to 13 m 21 s, achieving an add QPS of 12 000/s and eliminating false positives.
Further Optimization – Distributed Bloom Filter
Redis strings are limited to 512 MB (2 32 bits). For larger m values (e.g., 500 billion bits ≈ 5.82 GB), we must shard the filter across multiple Redis instances and keys.
Two‑Level Sharding Design
First level: hash the input to select a Redis instance.
Second level: hash again to select a specific key (i.e., a partitioned Bloom filter) within that instance.
Distributed Bloom Filter Demo (Full Code)
class Bf {
public $key; public $m; public $k; public $nPartitions; public $redisCfg; public $nRedis; public $maxOffs = [];
const MAX_PARTITION_SIZE = 4294967296; // 512 MB
public function __construct($redisCfg, $key, $m, $k) {
$this->nRedis = count($redisCfg);
$this->nPartitions = ($m > self::MAX_PARTITION_SIZE) ? ceil(ceil($m / $this->nRedis) / self::MAX_PARTITION_SIZE) : 1;
$this->key = $key; $this->m = $m; $this->k = $k; $this->redisCfg = $redisCfg;
}
private function getPosition($e) {
$nRedis = count($this->redisCfg);
$hash = crc32($e);
$i = $hash % $nRedis;
$redis = SRedis::getSingeton($this->redisCfg[$i]);
$key = $this->key . '.' . ($hash % $this->nPartitions);
return [$i, $redis, $key];
}
public function add($e) {
$e = (string)$e;
list($n, $redis, $key) = $this->getPosition($e);
$redis->multi(Redis::PIPELINE);
for ($i = 0; $i < $this->k; $i++) {
$seed = self::getBKDRHashSeed($i);
$hash = self::BKDRHash($e, $seed);
$offset = $hash % $this->m;
$redis->setbit($key, $offset, 1);
}
$t1 = microtime(true);
$rt = $redis->exec();
$t2 = microtime(true);
$c = array_sum($rt);
return $c === $this->k;
}
// static helpers same as before
}
class SRedis {
public static function getSingeton($cfg) {
static $pool;
if (empty($cfg) || !is_array($cfg)) return false;
$k = serialize($cfg);
if (empty($pool[$k])) {
$redis = new Redis();
call_user_func_array([$redis, 'connect'], array_values($cfg));
$pool[$k] = $redis;
}
return $pool[$k];
}
}
// usage example omitted for brevityThe distributed implementation respects Redis's bit‑size limit while maintaining constant‑time membership checks.
Project Repository
GitHub: https://github.com/0x5446/bloomfilter-redis
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
