Databases 18 min read

How to Build a High‑Performance Redis‑Backed Bloom Filter for Billion‑Scale De‑duplication

This article explains the design and implementation of a space‑efficient Bloom filter using Redis bit operations, covering theory, false‑positive calculations, PHP code, performance testing on billions of daily logs, and optimizations such as pipelining and sharding for large‑scale deduplication.

21CTO
21CTO
21CTO
How to Build a High‑Performance Redis‑Backed Bloom Filter for Billion‑Scale De‑duplication

Background

We have a project that processes click logs at a scale of 1 billion events per day. Logs are collected from front‑end machines via Flume, sent to Kafka, and consumed by Spark, which stores results in Redis. Because Kafka guarantees no message loss, timeouts can cause message re‑delivery or duplicate consumption, so idempotence is required during consumption.

Given the massive deduplication data size (≈1 billion), we allow a controlled false‑positive rate and therefore consider using a Bloom filter, which offers extremely low memory usage and constant‑time operations.

Bloom‑Filter

Introduction

A Bloom filter, proposed by Burton Bloom in 1970, consists of a long bit vector and a set of hash functions. With k hash functions and an m -bit vector, insertion sets k bits to 1, and membership testing checks whether all k bits are 1.

False‑Positive Rate Calculation

The false‑positive probability can be approximated by: P \approx (1 - e^{-\frac{mk}{n}})^{k} When m/n is fixed, the optimal number of hash functions is: k = \frac{m}{n}\ln2 Choosing the nearest integer minimizes the false‑positive probability.

Application Scenarios

Web crawlers: checking whether a URL has already been crawled (large URL set, false‑positive acceptable).

Spam detection: billions of email addresses, occasional false‑positive is tolerable.

Parameters m and k must be selected based on data volume and acceptable false‑positive rate.

False‑Positive‑Ratio Table (including memory usage)

m      m/n  k   FPR        FPN   Mem
300亿  30   8   9.01e-6    9011  3.49GB
300亿  30   16  7.26e-7    726   3.49GB
500亿  50   8   2.28e-7    228   5.82GB
500亿  50   16  1e-9       1     5.82GB

For a 10‑minute slice (≈7 million records), the table becomes:

m       m/n  k   FPR        FPN   Mem
2100万  30   8   9.01e-6    63    25.03MB
2100万  30   16  7.26e-7    5     25.03MB
3500万  50   8   2.28e-7    1.6   41.72MB
3500万  50   16  1e-9       0     41.72MB

Choosing m/n = 50 and k = 16 satisfies the business requirement with a false‑positive rate of 1e-9.

Redis String SETBIT Method

SETBIT key offset value Sets or clears the bit at the specified offset in the string value stored at key . The offset must be between 0 and 2 32 ‑1 (bit mapping limited to 512 MB). Time complexity O(1). Returns the original bit value.
redis> SETBIT bit 10086 1 (integer) 0
redis> GETBIT bit 10086 (integer) 1
redis> GETBIT bit 100 (integer) 0

Since a Bloom filter operates on bits, we can implement it using SETBIT.

PHP Demo

BKDRHash

The BKDR hash function is simple and effective. C implementation:

// BKDR Hash Function
unsigned int BKDRHash(char *str) {
    unsigned int seed = 131; // 31, 131, 1313, ...
    unsigned int hash = 0;
    while (*str) {
        hash = hash * seed + (*str++);
    }
    return (hash & 0x7FFFFFFF);
}

We generate k different hash functions by varying the seed.

PHP BKDRHash Implementation

function getBKDRHashSeed($n) {
    if ($n === 0) return 31;
    $j = $n + 2;
    $r = 0;
    for ($i = 0; $i < $j; $i++) {
        $r = ($i % 2) ? $r * 10 + 3 : $r * 10 + 1;
    }
    return $r;
}

function BKDRHash($str, $seed) {
    $hash = 0;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $hash = ((floatval($hash * $seed) & 0x7FFFFFFF) + ord($str[$i])) & 0x7FFFFFFF;
    }
    return $hash & 0x7FFFFFFF;
}

Bloom Filter Class

class Bf {
    public $redis;
    public $key;
    public $m;
    public $k;

    public function __construct($key, $m, $k) {
        if ($m > 4294967296) {
            error_log('ERROR: m over 4294967296');
            return false;
        }
        $this->key = $key;
        $this->m   = $m;
        $this->k   = $k;
        $this->redis = new Redis();
        $this->redis->connect('127.0.0.1', 6379);
    }

    public function add($e) {
        $e = (string)$e;
        $c = 0;
        for ($i = 0; $i < $this->k; $i++) {
            $seed = self::getBKDRHashSeed($i);
            $hash = self::BKDRHash($e, $seed);
            $offset = $hash % $this->m;
            $c += $this->redis->setbit($this->key, $offset, 1);
        }
        return $c === $this->k;
    }

    public function flushall() {
        return $this->redis->delete($this->key);
    }

    // static helpers omitted for brevity
}

Testing with 10 million random strings (4‑12 characters) produced 7 458 78 duplicates, which were correctly detected by the filter with zero false positives.

Test Results

Total time: 1 h 4 m 45 s

Bloom‑Filter Add QPS: 2 574/s

Redis QPS: 20 592/s (each add issues k Redis commands)

False positives: 0

Optimization

Each add call performs k separate Redis round‑trips, which dominates latency. Using Redis pipelining batches the commands, reducing RTT and improving throughput.

Redis Pipelining

Pipelining sends multiple commands in one network round‑trip and returns results in order, yielding roughly a ten‑fold speed increase for independent SETBIT operations.

Optimized Bloom Filter Class

class Bf {
    // ... constructor unchanged ...
    public function add($e) {
        $e = (string)$e;
        $this->redis->multi(Redis::PIPELINE);
        for ($i = 0; $i < $this->k; $i++) {
            $seed = self::getBKDRHashSeed($i);
            $hash = self::BKDRHash($e, $seed);
            $offset = $hash % $this->m;
            $this->redis->setbit($this->key, $offset, 1);
        }
        $t1 = microtime(true);
        $rt = $this->redis->exec();
        $t2 = microtime(true);
        $c = array_sum($rt);
        return $c === $this->k;
    }
}

After pipelining, the total runtime dropped to 13 m 21 s, achieving an add QPS of 12 000/s and eliminating false positives.

Further Optimization – Distributed Bloom Filter

Redis strings are limited to 512 MB (2 32 bits). For larger m values (e.g., 500 billion bits ≈ 5.82 GB), we must shard the filter across multiple Redis instances and keys.

Two‑Level Sharding Design

First level: hash the input to select a Redis instance.

Second level: hash again to select a specific key (i.e., a partitioned Bloom filter) within that instance.

Distributed Bloom Filter Demo (Full Code)

class Bf {
    public $key; public $m; public $k; public $nPartitions; public $redisCfg; public $nRedis; public $maxOffs = [];
    const MAX_PARTITION_SIZE = 4294967296; // 512 MB

    public function __construct($redisCfg, $key, $m, $k) {
        $this->nRedis = count($redisCfg);
        $this->nPartitions = ($m > self::MAX_PARTITION_SIZE) ? ceil(ceil($m / $this->nRedis) / self::MAX_PARTITION_SIZE) : 1;
        $this->key = $key; $this->m = $m; $this->k = $k; $this->redisCfg = $redisCfg;
    }

    private function getPosition($e) {
        $nRedis = count($this->redisCfg);
        $hash = crc32($e);
        $i = $hash % $nRedis;
        $redis = SRedis::getSingeton($this->redisCfg[$i]);
        $key = $this->key . '.' . ($hash % $this->nPartitions);
        return [$i, $redis, $key];
    }

    public function add($e) {
        $e = (string)$e;
        list($n, $redis, $key) = $this->getPosition($e);
        $redis->multi(Redis::PIPELINE);
        for ($i = 0; $i < $this->k; $i++) {
            $seed = self::getBKDRHashSeed($i);
            $hash = self::BKDRHash($e, $seed);
            $offset = $hash % $this->m;
            $redis->setbit($key, $offset, 1);
        }
        $t1 = microtime(true);
        $rt = $redis->exec();
        $t2 = microtime(true);
        $c = array_sum($rt);
        return $c === $this->k;
    }

    // static helpers same as before
}

class SRedis {
    public static function getSingeton($cfg) {
        static $pool;
        if (empty($cfg) || !is_array($cfg)) return false;
        $k = serialize($cfg);
        if (empty($pool[$k])) {
            $redis = new Redis();
            call_user_func_array([$redis, 'connect'], array_values($cfg));
            $pool[$k] = $redis;
        }
        return $pool[$k];
    }
}

// usage example omitted for brevity

The distributed implementation respects Redis's bit‑size limit while maintaining constant‑time membership checks.

Project Repository

GitHub: https://github.com/0x5446/bloomfilter-redis

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

redisdeduplicationbloom-filter
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.