Build a Simple Big Data Search Engine with Bloom Filters and Tokenization in Python
This article walks through implementing a basic big‑data search system in Python, covering Bloom filter basics, tokenization of text, inverted index construction, and how to combine these techniques to support fast AND/OR queries.
Bloom Filter
A Bloom filter is a probabilistic data structure that quickly tells whether an element is definitely not in a set or possibly in it, trading a small false‑positive rate for speed and space efficiency.
class Bloomfilter(object):
"""A Bloom filter is a probabilistic data‑structure that trades space for accuracy"""
def __init__(self, size):
self.values = [False] * size
self.size = size
def hash_value(self, value):
"""Hash the value and scale it to fit the BF size"""
return hash(value) % self.size
def add_value(self, value):
"""Add a value to the BF"""
h = self.hash_value(value)
self.values[h] = True
def might_contain(self, value):
"""Check if the value might be in the BF"""
h = self.hash_value(value)
return self.values[h]
def print_contents(self):
"""Dump the contents of the BF for debugging purposes"""
print(self.values)Using this filter, a false result means the element is definitely absent, while a true result only indicates a possible presence.
Segmentation (Tokenization)
Tokenization splits a string into the smallest searchable units (terms). The example focuses on English, using spaces as major breaks and a set of punctuation characters as minor breaks.
def major_segments(s):
"""Split the string by major breaks (spaces) and return a set of segments"""
major_breaks = ' '
last = -1
results = set()
for idx, ch in enumerate(s):
if ch in major_breaks:
segment = s[last+1:idx]
results.add(segment)
last = idx
# Capture the final segment
segment = s[last+1:]
results.add(segment)
return results
def minor_segments(s):
"""Further split each major segment by minor breaks (punctuation)"""
minor_breaks = '_.'
last = -1
results = set()
for idx, ch in enumerate(s):
if ch in minor_breaks:
segment = s[last+1:idx]
results.add(segment)
last = idx
segment = s[last+1:]
results.add(segment)
return results
def segments(event):
"""Combine major and minor segmentation"""
results = set()
for major in major_segments(event):
for minor in minor_segments(major):
results.add(minor)
return resultsSearch Engine Implementation
The Splunk class ties together a Bloom filter, an inverted index (a dictionary mapping terms to sets of event IDs), and a list of raw events.
class Splunk(object):
def __init__(self):
self.bf = Bloomfilter(64) # Bloom filter for fast existence checks
self.terms = {} # term -> set(event_id)
self.events = [] # list of raw events
def add_event(self, event):
"""Add an event and index its terms"""
event_id = len(self.events)
self.events.append(event)
for term in segments(event):
if term not in self.terms:
self.terms[term] = set()
self.terms[term].add(event_id)
self.bf.add_value(term)
def search(self, term):
"""Return events that contain a single term"""
if not self.bf.might_contain(term):
return []
if term not in self.terms:
return []
return [self.events[eid] for eid in sorted(self.terms[term])]
def search_all(self, terms):
"""AND query – events that contain *all* terms"""
# Start with the universe of event IDs
results = set(range(len(self.events)))
for term in terms:
if not self.bf.might_contain(term):
return []
if term not in self.terms:
return []
results = results.intersection(self.terms[term])
return [self.events[eid] for eid in sorted(results)]
def search_any(self, terms):
"""OR query – events that contain *any* of the terms"""
results = set()
for term in terms:
if not self.bf.might_contain(term):
continue
if term not in self.terms:
continue
results = results.union(self.terms[term])
return [self.events[eid] for eid in sorted(results)]Example usage adds three events, indexes them, and demonstrates single‑term, AND, and OR searches, showing how Bloom filter shortcuts eliminate unnecessary look‑ups.
Conclusion
The code illustrates the core principles behind big‑data search: using a Bloom filter to prune non‑existent terms, tokenizing text into searchable terms, and maintaining an inverted index to retrieve matching events. While functional for demonstration, a production system would need a much larger filter, more sophisticated tokenization (especially for non‑English languages), and persistent storage.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
