Building a Simple Search Engine with Bloom Filter, Tokenization, and Inverted Index in Python
This article demonstrates how to implement a basic big‑data search engine in Python by creating a Bloom filter for fast existence checks, designing tokenization functions for major and minor segmentation, building an inverted index, and supporting AND/OR queries with example code and execution results.
The article introduces the need for efficient search in big‑data environments and explains that Bloom filters (a probabilistic data structure) can quickly rule out non‑existent terms.
Bloom Filter Implementation
<code>class Bloomfilter(object):
def __init__(self, size):
self.values = [False] * size
self.size = size
def hash_value(self, value):
return hash(value) % self.size
def add_value(self, value):
h = self.hash_value(value)
self.values[h] = True
def might_contain(self, value):
h = self.hash_value(value)
return self.values[h]
def print_contents(self):
print(self.values)
</code>Sample usage shows adding items like 'dog', 'fish', and 'cat' and observing hash collisions.
Tokenization (Segmentation)
<code>def major_segments(s):
"""Split by major breaks (space) and return a set of segments."""
major_breaks = ' '
results = set()
for idx, ch in enumerate(s):
if ch in major_breaks:
segment = s[last+1:idx]
results.add(segment)
last = idx
# capture the final segment
results.add(s[last+1:])
return results
def minor_segments(s):
"""Further split each major segment by minor breaks ("_.")."""
minor_breaks = '_. '
results = set()
for idx, ch in enumerate(s):
if ch in minor_breaks:
segment = s[last+1:idx]
results.add(segment)
last = idx
results.add(s[last+1:])
return results
def segments(event):
"""Combine major and minor segmentation for an event string."""
results = set()
for major in major_segments(event):
for minor in minor_segments(major):
results.add(minor)
return results
</code>These functions turn a log line such as src_ip = 1.2.3.4 into searchable tokens.
Search Engine Class
<code>class Splunk(object):
def __init__(self):
self.bf = Bloomfilter(64)
self.terms = {}
self.events = []
def add_event(self, event):
event_id = len(self.events)
self.events.append(event)
for term in segments(event):
self.bf.add_value(term)
if term not in self.terms:
self.terms[term] = set()
self.terms[term].add(event_id)
def search(self, term):
if not self.bf.might_contain(term) or term not in self.terms:
return []
return [self.events[i] for i in sorted(self.terms[term])]
def search_all(self, terms):
# AND query – intersection of all term result sets
results = set(range(len(self.events)))
for term in terms:
if not self.bf.might_contain(term) or term not in self.terms:
return []
results = results.intersection(self.terms[term])
return [self.events[i] for i in sorted(results)]
def search_any(self, terms):
# OR query – union of all term result sets
results = set()
for term in terms:
if not self.bf.might_contain(term) or term not in self.terms:
continue
results = results.union(self.terms[term])
return [self.events[i] for i in sorted(results)]
</code>Example usage adds three events, then demonstrates single‑term search, AND search (e.g., ['src_ip', '5.6'] ), and OR search (e.g., ['src_ip', 'dst_ip'] ), showing how Bloom filters prune unnecessary look‑ups and how Python set operations provide fast AND/OR semantics.
The article concludes that while the code illustrates core concepts—Bloom filter, tokenization, and inverted index—it is a simplified prototype far from a production‑grade search engine, and the ideas originate from Splunk Conf 2017.
Note: The final section of the source contains promotional QR‑code material for a free Python course, which is not part of the technical explanation.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.