How a Simple Bloom Filter Powers Fast Big Data Searches in Python
This article demonstrates how to build a basic big‑data search engine in Python using a Bloom filter, tokenisation, and an inverted index, showing step‑by‑step code examples that illustrate the core principles behind efficient data lookup.
Introduction
Search is a common requirement in big‑data environments. Splunk (closed‑source) and ELK (open‑source) dominate the market. The following Python example implements a minimal search engine to illustrate the fundamental concepts of big‑data search.
Bloom Filter
A Bloom filter is a probabilistic data structure that quickly determines whether an element is definitely not in a set or possibly in it.
```python
class Bloomfilter(object):
    """A Bloom filter is a probabilistic data structure that trades space for accuracy
    when determining if a value is in a set. It can tell you if a value was possibly
    added, or if it was definitely not added, but it can't tell you for certain that
    it was added."""

    def __init__(self, size):
        """Set up the BF with the appropriate size"""
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        """Hash the value provided and scale it to fit the BF size"""
        return hash(value) % self.size

    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True

    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]

    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)
```

Key points:
The underlying structure is a bitmap (array of 0/1).
Hash functions map elements to bitmap indices.
Adding an element sets the corresponding bit to True.
Checking membership reads the bit; False guarantees absence, True may be a false positive.
Example usage shows how adding 'dog', 'fish', and 'cat' updates the filter, and how checking 'bird' (which was never added) can still return True when its hash collides with a bit that is already set — a false positive.
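The behaviour above can be sketched with a short standalone demo (a compact copy of the Bloomfilter class is inlined so the snippet runs on its own; which slots collide depends on Python's built-in hash()):

```python
class Bloomfilter:
    """Compact copy of the filter above, for a self-contained demo."""
    def __init__(self, size):
        self.values = [False] * size
        self.size = size

    def add_value(self, value):
        # Set the bit this value hashes to
        self.values[hash(value) % self.size] = True

    def might_contain(self, value):
        # Read the bit; False is definitive, True only means "maybe"
        return self.values[hash(value) % self.size]

bf = Bloomfilter(8)
for word in ('dog', 'fish', 'cat'):
    bf.add_value(word)

# No false negatives: every added value reports as possibly present.
print(all(bf.might_contain(w) for w in ('dog', 'fish', 'cat')))  # True

# 'bird' was never added; with only 8 slots it may still collide with a
# set bit, in which case this prints a false positive (True).
print(bf.might_contain('bird'))
```

Note that a tiny 8-slot filter is used deliberately to make collisions likely; the engine later in the article uses 64 slots.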
Tokenisation
Tokenisation splits text into searchable units (terms). The example focuses on English, using spaces as major delimiters and underscores and periods ('_.') as minor delimiters.
```python
def major_segments(s):
    """Perform major segmenting on a string. Split the string by all of the major
    breaks, and return the set of everything found. The breaks in this implementation
    are single characters, but in Splunk proper they can be multiple characters.
    A set is used because ordering doesn't matter, and duplicates are bad."""
    major_breaks = ' '
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last + 1:idx]
            results.add(segment)
            last = idx
    # Capture the last segment
    segment = s[last + 1:]
    results.add(segment)
    return results
```
```python
def minor_segments(s):
    """Perform minor segmenting on a string. This is like major segmenting, except it
    also captures from the start of the input to each break."""
    minor_breaks = '_.'
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last + 1:idx]
            results.add(segment)
            # Also capture from the start of the input to this break
            segment = s[:idx]
            results.add(segment)
            last = idx
    # Capture the last segment
    segment = s[last + 1:]
    results.add(segment)
    # The whole string is always a valid minor segment
    results.add(s)
    return results
```
```python
def segments(event):
    """Simple wrapper around major_segments / minor_segments"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results
```

The overall tokenisation first applies major_segments, then minor_segments to each major token, returning all unique terms.
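To see what the pipeline produces, here is a sketch that runs compact copies of the segmenting functions (inlined so the snippet is standalone, with the prefix capture the minor-segmenting docstring describes) on one sample event:

```python
def major_segments(s):
    """Split on spaces; return the set of segments found."""
    last, results = -1, set()
    for idx, ch in enumerate(s):
        if ch == ' ':
            results.add(s[last + 1:idx])
            last = idx
    results.add(s[last + 1:])
    return results

def minor_segments(s):
    """Split on '_' and '.'; also capture each prefix up to a break."""
    last, results = -1, set()
    for idx, ch in enumerate(s):
        if ch in '_.':
            results.add(s[last + 1:idx])
            results.add(s[:idx])   # prefix from the start of the input
            last = idx
    results.add(s[last + 1:])
    results.add(s)                 # the whole string is also a term
    return results

def segments(event):
    return {minor for major in major_segments(event)
                  for minor in minor_segments(major)}

tokens = segments('src_ip = 1.2.3.4')
print(sorted(tokens))
# Prefix tokens such as '1.2' and '1.2.3' let a later search match
# partial IP addresses as well as the full value.
```

The prefix tokens are the point of minor segmenting: a user can search for 'src' or '1.2.3' without the engine having to scan raw event text.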
Search Engine Implementation
```python
class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        if not self.bf.might_contain(term):
            return
        if term not in self.terms:
            return
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]
```

When an event is added, each token is inserted into the Bloom filter, and the inverted index (self.terms) maps the token to a set of event IDs. A search first checks the Bloom filter (an O(1) fast rejection) and then looks up the term in the inverted index.
Complex Search (AND / OR)
```python
class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}
        self.events = []

    def add_event(self, event):
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""
        results = set(range(len(self.events)))
        for term in terms:
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return
            results = results.intersection(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]

    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()
        for term in terms:
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue
            results = results.union(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]
```

search_all starts with the universe of event IDs and intersects it with each term's ID set (AND), bailing out as soon as the Bloom filter rules a term out; search_any starts with an empty set and unions in each term's IDs (OR), simply skipping terms the Bloom filter rejects.
Demonstration
Sample usage adds three events ('src_ip = 1.2.3.4', 'src_ip = 5.6.7.8', 'dst_ip = 1.2.3.4') and performs searches for specific IPs or field names, showing how the engine returns matching events.
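The demonstration can be sketched end to end. Compact copies of the pieces above are inlined so the script runs standalone, and SplunkM is expressed as a subclass of Splunk purely to keep the sketch short:

```python
class Bloomfilter:
    def __init__(self, size):
        self.values = [False] * size
        self.size = size
    def add_value(self, value):
        self.values[hash(value) % self.size] = True
    def might_contain(self, value):
        return self.values[hash(value) % self.size]

def major_segments(s):
    last, results = -1, set()
    for idx, ch in enumerate(s):
        if ch == ' ':
            results.add(s[last + 1:idx])
            last = idx
    results.add(s[last + 1:])
    return results

def minor_segments(s):
    last, results = -1, set()
    for idx, ch in enumerate(s):
        if ch in '_.':
            results.add(s[last + 1:idx])
            results.add(s[:idx])
            last = idx
    results.add(s[last + 1:])
    results.add(s)
    return results

def segments(event):
    return {m for major in major_segments(event) for m in minor_segments(major)}

class Splunk:
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}
        self.events = []
    def add_event(self, event):
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            self.terms.setdefault(term, set()).add(event_id)
    def search(self, term):
        if not self.bf.might_contain(term) or term not in self.terms:
            return
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]

class SplunkM(Splunk):
    def search_all(self, terms):
        """AND: intersect each term's event IDs"""
        results = set(range(len(self.events)))
        for term in terms:
            if not self.bf.might_contain(term) or term not in self.terms:
                return
            results &= self.terms[term]
        for event_id in sorted(results):
            yield self.events[event_id]
    def search_any(self, terms):
        """OR: union each term's event IDs"""
        results = set()
        for term in terms:
            if self.bf.might_contain(term) and term in self.terms:
                results |= self.terms[term]
        for event_id in sorted(results):
            yield self.events[event_id]

s = SplunkM()
for event in ('src_ip = 1.2.3.4', 'src_ip = 5.6.7.8', 'dst_ip = 1.2.3.4'):
    s.add_event(event)

print(list(s.search('1.2.3.4')))                  # events 0 and 2
print(list(s.search_all(['src_ip', '1.2.3.4'])))  # AND: event 0 only
print(list(s.search_any(['5.6.7.8', 'dst_ip'])))  # OR: events 1 and 2
```

Searching for '1.2.3.4' matches both the src_ip and dst_ip events, while the AND search narrows to the single event containing both terms.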
Conclusion
The code illustrates the basic principles of big‑data search: a Bloom filter for fast negative checks, tokenisation to build an inverted index, and set operations to support AND/OR queries. A production‑grade system would require many more features; this example is derived from Splunk Conf 2017.