Big Data 13 min read

Build a Simple Big Data Search Engine with Bloom Filters and Tokenization in Python

This article walks through implementing a basic big‑data search system in Python, covering Bloom filter basics, tokenization of text, inverted index construction, and how to combine these techniques to support fast AND/OR queries.

MaGe Linux Operations

Dec 3, 2017

Bloom Filter

A Bloom filter is a probabilistic data structure that quickly tells whether an element is definitely not in a set or possibly in it, trading a small false‑positive rate for speed and space efficiency.

class Bloomfilter(object):
    """A Bloom filter is a probabilistic data‑structure that trades space for accuracy"""
    def __init__(self, size):
        self.values = [False] * size
        self.size = size
    def hash_value(self, value):
        """Hash the value and scale it to fit the BF size"""
        return hash(value) % self.size
    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True
    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]
    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)

Using this filter, a false result means the element is definitely absent, while a true result only indicates a possible presence.

Segmentation (Tokenization)

Tokenization splits a string into the smallest searchable units (terms). The example focuses on English, using spaces as major breaks and a set of punctuation characters as minor breaks.

def major_segments(s):
    """Split the string by major breaks (spaces) and return a set of segments"""
    major_breaks = ' '
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last+1:idx]
            results.add(segment)
            last = idx
    # Capture the final segment
    segment = s[last+1:]
    results.add(segment)
    return results

def minor_segments(s):
    """Further split each major segment by minor breaks (punctuation)"""
    minor_breaks = '_.'
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last+1:idx]
            results.add(segment)
            last = idx
    segment = s[last+1:]
    results.add(segment)
    return results

def segments(event):
    """Combine major and minor segmentation"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results

Search Engine Implementation

The Splunk class ties together a Bloom filter, an inverted index (a dictionary mapping terms to sets of event IDs), and a list of raw events.

class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)          # Bloom filter for fast existence checks
        self.terms = {}                    # term -> set(event_id)
        self.events = []                   # list of raw events

    def add_event(self, event):
        """Add an event and index its terms"""
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)
            self.bf.add_value(term)

    def search(self, term):
        """Return events that contain a single term"""
        if not self.bf.might_contain(term):
            return []
        if term not in self.terms:
            return []
        return [self.events[eid] for eid in sorted(self.terms[term])]

    def search_all(self, terms):
        """AND query – events that contain *all* terms"""
        # Start with the universe of event IDs
        results = set(range(len(self.events)))
        for term in terms:
            if not self.bf.might_contain(term):
                return []
            if term not in self.terms:
                return []
            results = results.intersection(self.terms[term])
        return [self.events[eid] for eid in sorted(results)]

    def search_any(self, terms):
        """OR query – events that contain *any* of the terms"""
        results = set()
        for term in terms:
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue
            results = results.union(self.terms[term])
        return [self.events[eid] for eid in sorted(results)]

Example usage adds three events, indexes them, and demonstrates single‑term, AND, and OR searches, showing how Bloom filter shortcuts eliminate unnecessary look‑ups.

Conclusion

The code illustrates the core principles behind big‑data search: using a Bloom filter to prune non‑existent terms, tokenizing text into searchable terms, and maintaining an inverted index to retrieve matching events. While functional for demonstration, a production system would need a much larger filter, more sophisticated tokenization (especially for non‑English languages), and persistent storage.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Python inverted index Tokenization bloom filter Search

Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.