How Bloom Filters Eliminate Cache Penetration and Boost Backend Performance
This article explains why cache penetration occurs in high‑dimensional data scenarios, introduces Bloom Filters as a solution, outlines suitable use cases such as database lookups, cache outages, web request filtering, malicious URL detection and Bitcoin sync, and details practical Redis‑based implementations, advantages, drawbacks, and advanced counting filter techniques.
Why Introduce Bloom Filter
In our business we often encounter cache penetration; caching alone is insufficient when data dimensions are large and result sets are big.
To solve this problem we introduce Bloom Filter.
Suitable Scenarios
Database cache penetration: Google Bigtable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom Filters to reduce disk lookups for non‑existent rows or columns.
Cache outage: Bloom Filter may cause false positives because of its inherent false‑positive rate and because the cache may not cover all DB data before a crash.
Web interceptor: Store request parameters in a Bloom Filter to block repeated attacks and improve cache hit rate.
Malicious URL detection: Chrome checks URLs against a local Bloom Filter before performing full inspection.
Bitcoin acceleration: Bitcoin wallets use Bloom Filters to speed up synchronization.
Open‑source project address: https://github.com/luw2007/bloomfilter
Typical business cache flow: first query the cache; if it misses, query the database; then store the result in the cache even if the data does not exist, to prevent penetration.
This flow can cause a large number of empty cache entries, leading to a cache‑avalanche effect under heavy load.
Second version – Redis filter for cold users:
We store users that hit the database in a Redis set without expiration, using Redis as an index to quickly determine data existence.
When a key is not in Redis we return a negative result directly; otherwise we follow the normal cache flow.
Common questions:
Why not return data directly from Redis?
Will a single large set cause performance issues?
Is the memory cost of storing all keys in Redis justified?
Bloom Filter Principles
A Bloom Filter is a long binary vector combined with several random hash functions. It can test whether an element is in a set.
Using multiple hash functions reduces collision probability, but as the filter fills, false‑positive rates increase.
When more hash functions are added, collisions decrease, improving accuracy, as shown in the experimental tables.
However, adding more hash functions also consumes more space; once the filter reaches about 25% of its capacity, additional hashes provide diminishing returns.
Algorithm Advantages
Very small space usage; does not store the actual data.
Fast query time.
Algorithm Disadvantages
Elements cannot be deleted.
Only guarantees "definitely not in set"; a positive result may be a false positive.
False‑positive probability grows as the filter becomes full.
Space overhead: for a 1% false‑positive rate each element needs less than 10 bits, independent of element size.
How to Use Bloom Filter
Redis is a suitable storage container. Options include:
Native Python using SETBIT to build a Bloom Filter.
Lua scripts.
Rebloom – a Redis module (available from Redis 4.0).
Using hiredis with pyreBloom.
We recommend pyreBloom for its performance and ease of deployment.
pyreBloom:master λ ls
Makefile bloom.h bloom.pxd murmur.c pyreBloom.pyx
bloom.c bloom.o main.c pyreBloom.cThe bloom.h file implements core Bloom Filter logic, hash functions, and add/check/delete methods.
Advanced: Counting Filter
A counting filter extends each bucket to an n‑bit counter, allowing deletions at the cost of 3–4× more space.
Limitations include counter overflow and the need to know the maximum number of keys in advance; exceeding capacity sharply raises false‑positive rates.
Research improvements such as d‑left hashing, variable‑increment counters, and other variants aim to reduce space and improve deletability.
END
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Backend Technology
Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
