How a Simple Python Bloom Filter Powers Fast Big Data Search
This article demonstrates how to implement a basic Bloom filter, tokenization, and inverted index in Python to illustrate the core principles of big‑data search, including fast negative lookups, term segmentation, and support for AND/OR queries.
Search is a common requirement in the big data domain. Splunk and ELK are the leading proprietary and open-source solutions, respectively. This tutorial shows how to build a simple search engine using a Bloom filter, tokenization, and an inverted index with minimal Python code.
Bloom Filter
A Bloom filter is a probabilistic data structure that quickly determines whether an element is definitely not in a set or possibly in it. The implementation includes initialization, hashing, adding values, checking membership, and printing the filter contents.
class Bloomfilter(object):
    def __init__(self, size):
        # One boolean slot per position; all start unset.
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        # Map the value to a slot index with Python's built-in hash.
        return hash(value) % self.size

    def add_value(self, value):
        h = self.hash_value(value)
        self.values[h] = True

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        h = self.hash_value(value)
        return self.values[h]

    def print_contents(self):
        print(self.values)

After creating a filter of size 10, adding items like 'dog', 'fish', and 'cat' sets the corresponding bits. Adding 'bird' may not change the filter if it hashes to an already-set position, which is also why might_contain can return a false positive for a value that was never added.
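A short sketch can make the false-positive behavior concrete. It repeats the Bloomfilter class from above (minus print_contents) so that it runs on its own. It uses small integers rather than strings, relying on the CPython detail that hash(n) == n for small non-negative integers, so the slot positions are predictable. String hashes are randomized per process, so the same demo with 'dog' or 'bird' would land in different slots from run to run.

```python
class Bloomfilter(object):
    """Single-hash Bloom filter, repeated from the article so this runs alone."""

    def __init__(self, size):
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        return hash(value) % self.size

    def add_value(self, value):
        self.values[self.hash_value(value)] = True

    def might_contain(self, value):
        return self.values[self.hash_value(value)]


bf = Bloomfilter(10)
bf.add_value(3)   # sets slot 3 (assumes CPython, where hash(3) == 3)

added = bf.might_contain(3)        # an added value is always reported
collision = bf.might_contain(13)   # 13 % 10 == 3: never added, same slot
absent = bf.might_contain(4)       # slot 4 is unset: definitely not present

print(added, collision, absent)
```

On CPython this prints True True False. The middle value is the false positive: 13 was never added, but it shares a slot with 3. A False answer, by contrast, can always be trusted.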
Tokenization
Tokenization splits text into searchable units. Major segmentation splits an event on spaces, and minor segmentation further splits each major segment on '_' and '.'. Both functions return a set of unique tokens.
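As a rough illustration of what the two levels produce, the sketch below uses str.split and re.split in place of character-by-character loops. The function name segments_sketch and the sample event are made up for this example, and it assumes major segments are separated by single spaces and minor segments by '_' or '.'.

```python
import re


def segments_sketch(event):
    """Return the set of minor tokens in an event (hypothetical helper)."""
    tokens = set()
    for major in event.split(' '):              # major split on spaces
        for minor in re.split(r'[_.]', major):  # minor split on '_' and '.'
            tokens.add(minor)
    return tokens


event = 'src_ip = 1.2.3.4'   # made-up sample event
tokens = segments_sketch(event)
print(sorted(tokens))
```

Because the result is a set, a token that occurs several times in an event is stored only once.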
def major_segments(s):
    """Split the string by spaces and return a set of tokens."""
    results = set()
    last = -1  # index of the previous separator; -1 means "before the start"
    for idx, ch in enumerate(s):
        if ch in ' ':
            segment = s[last+1:idx]
            results.add(segment)
            last = idx
    # capture the last segment
    results.add(s[last+1:])
    return results

def minor_segments(s):
    """Further split each major segment by '_' and '.' characters."""
    results = set()
    last = -1
    for idx, ch in enumerate(s):
        if ch in '_.':
            segment = s[last+1:idx]
            results.add(segment)
            last = idx
    results.add(s[last+1:])
    return results

def segments(event):
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results

Search Engine (Splunk)
The Splunk class maintains a Bloom filter, a dictionary mapping terms to event IDs, and a list of events. Adding an event tokenizes the event, updates the Bloom filter, and records the mapping.
class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}   # term -> set of event IDs (the inverted index)
        self.events = []  # event ID -> original event text

    def add_event(self, event):
        """Store the event and index each of its terms."""
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Yield every event containing the term, in insertion order."""
        # The Bloom filter rejects most absent terms cheaply; the dictionary
        # check catches its false positives.
        if not self.bf.might_contain(term) or term not in self.terms:
            return
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]

The additional methods search_all (AND query) and search_any (OR query) use set intersection and union to support complex queries.
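The term-to-event mapping is an inverted index. The sketch below isolates that one idea with a plain dict of sets. The names build_index and lookup, the whitespace-only tokenization, and the sample events are assumptions for illustration, and the Bloom filter check is left out.

```python
def build_index(events):
    """Map each whitespace-separated term to the set of event IDs containing it."""
    index = {}
    for event_id, event in enumerate(events):
        for term in event.split():
            index.setdefault(term, set()).add(event_id)
    return index


def lookup(index, events, term):
    """Return matching events in the order they were added."""
    return [events[i] for i in sorted(index.get(term, set()))]


events = ['GET /home 200', 'POST /login 500', 'GET /login 200']  # made-up logs
index = build_index(events)

hits = lookup(index, events, 'GET')
misses = lookup(index, events, 'DELETE')
print(hits)
print(misses)
```

A lookup touches only the entry for the requested term, so its cost depends on the number of matching events rather than on the total number of events stored.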
Complex Queries
Using Python set operations, the engine can efficiently handle AND (intersection) and OR (union) searches across multiple terms.
    # Both methods belong inside the Splunk class.
    def search_all(self, terms):
        """AND query: yield events that contain every term."""
        results = set(range(len(self.events)))
        for term in terms:
            if not self.bf.might_contain(term) or term not in self.terms:
                return  # one missing term empties an AND query
            results = results.intersection(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]

    def search_any(self, terms):
        """OR query: yield events that contain at least one term."""
        results = set()
        for term in terms:
            if not self.bf.might_contain(term) or term not in self.terms:
                continue
            results = results.union(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]

The provided examples show adding events, performing simple and complex searches, and printing results. The code illustrates the fundamental mechanisms behind big-data search systems like Splunk, though a production system would require many additional features.
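Since the examples themselves are not reproduced here, the following sketch shows how intersection and union answer AND and OR queries. It works on a hand-written index rather than the Splunk class; the index contents and the helper names match_all and match_any are hypothetical.

```python
# Hypothetical inverted index: term -> set of event IDs.
index = {
    'error': {0, 2, 3},
    'disk': {2, 3, 5},
    'network': {1, 3},
}


def match_all(index, terms):
    """AND query: IDs of events that contain every term."""
    results = None
    for term in terms:
        ids = index.get(term, set())
        results = set(ids) if results is None else results & ids
        if not results:
            return set()   # no event can match once the set is empty
    return results or set()


def match_any(index, terms):
    """OR query: IDs of events that contain at least one term."""
    results = set()
    for term in terms:
        results |= index.get(term, set())
    return results


both = match_all(index, ['error', 'disk'])
either = match_any(index, ['disk', 'network'])
nothing = match_all(index, ['error', 'missing'])
print(sorted(both), sorted(either), sorted(nothing))
```

An unknown term empties an AND query but is simply skipped by an OR query, which mirrors the return and continue branches in search_all and search_any.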
All content originates from Splunk Conf2017.
MaGe Linux Operations
