How a Simple Bloom Filter Powers Fast Big Data Searches in Python
This article demonstrates how to build a basic big‑data search engine in Python using a Bloom filter, tokenisation, and an inverted index, showing step‑by‑step code examples that illustrate the core principles behind efficient data lookup.
Introduction
Search is a common requirement in big‑data environments. Splunk (closed‑source) and ELK (open‑source) dominate the market. The following Python example implements a minimal search engine to illustrate the fundamental concepts of big‑data search.
Bloom Filter
A Bloom filter is a probabilistic data structure that quickly determines whether an element is definitely not in a set or possibly in it.
```python
class Bloomfilter(object):
    """A Bloom filter is a probabilistic data structure that trades space for accuracy
    when determining if a value is in a set. It can tell you if a value was possibly
    added, or if it was definitely not added, but it can't tell you for certain that
    it was added."""

    def __init__(self, size):
        """Set up the BF with the appropriate size"""
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        """Hash the value provided and scale it to fit the BF size"""
        return hash(value) % self.size

    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True

    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]

    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)
```

Key points:
The underlying structure is a bitmap (array of 0/1).
Hash functions map elements to bitmap indices.
Adding an element sets the corresponding bit to True.
Checking membership reads the bit; False guarantees absence, True may be a false positive.
Example usage shows how adding 'dog', 'fish', and 'cat' updates the filter, and how checking 'bird' (which was never added) can still return True when its hash collides with a bit that is already set — a false positive.
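The behaviour above can be sketched with a short standalone demo (a compact copy of the Bloomfilter class is inlined so the snippet runs on its own; which slots collide depends on Python's built-in hash()):

```python
class Bloomfilter:
    """Compact copy of the filter above, for a self-contained demo."""
    def __init__(self, size):
        self.values = [False] * size
        self.size = size

    def add_value(self, value):
        # Set the bit this value hashes to
        self.values[hash(value) % self.size] = True

    def might_contain(self, value):
        # Read the bit; False is definitive, True only means "maybe"
        return self.values[hash(value) % self.size]

bf = Bloomfilter(8)
for word in ('dog', 'fish', 'cat'):
    bf.add_value(word)

# No false negatives: every added value reports as possibly present.
print(all(bf.might_contain(w) for w in ('dog', 'fish', 'cat')))  # True

# 'bird' was never added; with only 8 slots it may still collide with a
# set bit, in which case this prints a false positive (True).
print(bf.might_contain('bird'))
```

Note that a tiny 8-slot filter is used deliberately to make collisions likely; the engine later in the article uses 64 slots.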
Tokenisation
Tokenisation splits text into searchable units (terms). The example focuses on English, using spaces as major delimiters and underscores and periods ('_.') as minor delimiters.
```python
def major_segments(s):
    """Perform major segmenting on a string. Split the string by all of the major
    breaks, and return the set of everything found. The breaks in this implementation
    are single characters, but in Splunk proper they can be multiple characters.
    A set is used because ordering doesn't matter, and duplicates are bad."""
    major_breaks = ' '
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last + 1:idx]
            results.add(segment)
            last = idx
    # Capture the last segment
    segment = s[last + 1:]
    results.add(segment)
    return results
```
```python
def minor_segments(s):
    """Perform minor segmenting on a string. This is like major segmenting, except it
    also captures from the start of the input to each break."""
    minor_breaks = '_.'
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last + 1:idx]
            results.add(segment)
            # Also capture from the start of the input to this break
            segment = s[:idx]
            results.add(segment)
            last = idx
    # Capture the last segment
    segment = s[last + 1:]
    results.add(segment)
    # The whole string is always a valid minor segment
    results.add(s)
    return results
```
```python
def segments(event):
    """Simple wrapper around major_segments / minor_segments"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results
```

The overall tokenisation first applies major_segments, then minor_segments to each major token, returning all unique terms.
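To see what the pipeline produces, here is a sketch that runs compact copies of the segmenting functions (inlined so the snippet is standalone, with the prefix capture the minor-segmenting docstring describes) on one sample event:

```python
def major_segments(s):
    """Split on spaces; return the set of segments found."""
    last, results = -1, set()
    for idx, ch in enumerate(s):
        if ch == ' ':
            results.add(s[last + 1:idx])
            last = idx
    results.add(s[last + 1:])
    return results

def minor_segments(s):
    """Split on '_' and '.'; also capture each prefix up to a break."""
    last, results = -1, set()
    for idx, ch in enumerate(s):
        if ch in '_.':
            results.add(s[last + 1:idx])
            results.add(s[:idx])   # prefix from the start of the input
            last = idx
    results.add(s[last + 1:])
    results.add(s)                 # the whole string is also a term
    return results

def segments(event):
    return {minor for major in major_segments(event)
                  for minor in minor_segments(major)}

tokens = segments('src_ip = 1.2.3.4')
print(sorted(tokens))
# Prefix tokens such as '1.2' and '1.2.3' let a later search match
# partial IP addresses as well as the full value.
```

The prefix tokens are the point of minor segmenting: a user can search for 'src' or '1.2.3' without the engine having to scan raw event text.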
Search Engine Implementation
```python
class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        if not self.bf.might_contain(term):
            return
        if term not in self.terms:
            return
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]
```

When an event is added, each token is inserted into the Bloom filter, and the inverted index (self.terms) maps the token to a set of event IDs. A search first checks the Bloom filter (an O(1) fast rejection) and then looks up the term in the inverted index.
Complex Search (AND / OR)
```python
class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}
        self.events = []

    def add_event(self, event):
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""
        results = set(range(len(self.events)))
        for term in terms:
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return
            results = results.intersection(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]

    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()
        for term in terms:
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue
            results = results.union(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]
```

search_all starts with the universe of event IDs and intersects it with each term's ID set (AND), bailing out as soon as the Bloom filter rules a term out; search_any starts with an empty set and unions in each term's IDs (OR), simply skipping terms the Bloom filter rejects.
Demonstration
Sample usage adds three events ('src_ip = 1.2.3.4', 'src_ip = 5.6.7.8', 'dst_ip = 1.2.3.4') and performs searches for specific IPs or field names, showing how the engine returns matching events.
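The demonstration can be sketched end to end. Compact copies of the pieces above are inlined so the script runs standalone, and SplunkM is expressed as a subclass of Splunk purely to keep the sketch short:

```python
class Bloomfilter:
    def __init__(self, size):
        self.values = [False] * size
        self.size = size
    def add_value(self, value):
        self.values[hash(value) % self.size] = True
    def might_contain(self, value):
        return self.values[hash(value) % self.size]

def major_segments(s):
    last, results = -1, set()
    for idx, ch in enumerate(s):
        if ch == ' ':
            results.add(s[last + 1:idx])
            last = idx
    results.add(s[last + 1:])
    return results

def minor_segments(s):
    last, results = -1, set()
    for idx, ch in enumerate(s):
        if ch in '_.':
            results.add(s[last + 1:idx])
            results.add(s[:idx])
            last = idx
    results.add(s[last + 1:])
    results.add(s)
    return results

def segments(event):
    return {m for major in major_segments(event) for m in minor_segments(major)}

class Splunk:
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}
        self.events = []
    def add_event(self, event):
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            self.terms.setdefault(term, set()).add(event_id)
    def search(self, term):
        if not self.bf.might_contain(term) or term not in self.terms:
            return
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]

class SplunkM(Splunk):
    def search_all(self, terms):
        """AND: intersect each term's event IDs"""
        results = set(range(len(self.events)))
        for term in terms:
            if not self.bf.might_contain(term) or term not in self.terms:
                return
            results &= self.terms[term]
        for event_id in sorted(results):
            yield self.events[event_id]
    def search_any(self, terms):
        """OR: union each term's event IDs"""
        results = set()
        for term in terms:
            if self.bf.might_contain(term) and term in self.terms:
                results |= self.terms[term]
        for event_id in sorted(results):
            yield self.events[event_id]

s = SplunkM()
for event in ('src_ip = 1.2.3.4', 'src_ip = 5.6.7.8', 'dst_ip = 1.2.3.4'):
    s.add_event(event)

print(list(s.search('1.2.3.4')))                  # events 0 and 2
print(list(s.search_all(['src_ip', '1.2.3.4'])))  # AND: event 0 only
print(list(s.search_any(['5.6.7.8', 'dst_ip'])))  # OR: events 1 and 2
```

Searching for '1.2.3.4' matches both the src_ip and dst_ip events, while the AND search narrows to the single event containing both terms.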
Conclusion
The code illustrates the basic principles of big‑data search: a Bloom filter for fast negative checks, tokenisation to build an inverted index, and set operations to support AND/OR queries. A production‑grade system would require many more features; this example is derived from Splunk Conf 2017.