Understanding Bloom Filters: Fast, Space-Efficient Membership Tests
Bloom filters are highly space-efficient probabilistic data structures that quickly test set membership using multiple hash functions, guaranteeing no false negatives while allowing a small false positive rate, making them ideal for large-scale applications such as email blacklists and massive URL deduplication.
1. Introduction to Bloom Filter
Bloom Filter is a highly space-efficient probabilistic data structure, an extension of a bit‑map. When an element is added, K hash functions map it to K bits in a bit array, setting them to 1. To query, we check those bits: if any is 0 the element is definitely absent; if all are 1 the element may be present.
Example: storing one hundred million email addresses using a 1.6‑billion‑bit vector (200 MB). Eight independent hash functions generate eight fingerprints for each address, which are mapped to eight positions in the bit array and set to 1. After processing all addresses, the Bloom filter is built.
To test whether a suspicious email Y is on a blacklist, the same eight hash functions produce eight fingerprints, which are checked against the filter. If Y is in the blacklist, all corresponding bits will be 1, allowing detection.
The filter never produces false negatives, but it can yield false positives when bits set by other elements cause an absent element to appear present. In the example the false‑positive probability is below 0.01%.
Bloom filters are fast and space‑saving, though they have a small error rate. A common mitigation is to keep a small whitelist for elements that might be misidentified.
2. Use Cases
Main scenario: Detect whether an element belongs to a massive data set, accepting a possible error.
False‑positive scenario: an element that does not exist may be reported as present because previous elements set the same bits.
If an element truly belongs to the set, the Bloom filter will always confirm its presence.
3. Application Example
Given two files A and B, each containing 5 billion URLs (64 bytes each) with a 4 GB memory limit, find common URLs. Using a Bloom filter with ~3.4 billion bits (≈4 GB) to represent one file’s URLs, then scanning the other file and checking membership can identify common URLs, tolerating a small error rate.
Source: http://www.cnblogs.com/moonandstar08/p/5317658.html
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITFLY8 Architecture Home
ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
