
How Bloom Filters Power Fast Big Data Searches with Python

This tutorial walks through building a simple Python search engine for big data, covering Bloom filter basics, tokenization with major and minor segmentation, inverted index creation, and implementing both simple and complex (AND/OR) queries, complete with code examples and visual illustrations.

MaGe Linux Operations

Introduction

Search is a common requirement in big data, with Splunk and ELK leading the market. This article demonstrates a minimal Python implementation of a data search system to help readers understand the fundamental principles behind big data search.

Bloom Filter

A Bloom filter is a probabilistic data structure that quickly determines whether an element is possibly in a set or definitely not. It uses a bitmap and multiple hash functions to set bits for added items and to test membership.

class Bloomfilter(object):
    """A Bloom filter is a probabilistic data-structure that trades space for accuracy
    when determining if a value is in a set. It can tell you if a value was possibly
    added, or if it was definitely not added, but it can't tell you for certain that
    it was added."""
    def __init__(self, size):
        """Setup the BF with the appropriate size"""
        self.values = [False] * size
        self.size = size
    def hash_value(self, value):
        """Hash the value provided and scale it to fit the BF size"""
        return hash(value) % self.size
    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True
    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]
    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)

The filter uses an array of bits (initially all False) and a hash function to map each element to an index. Adding an element sets the corresponding bit to True; a membership check returns True if that bit is set and False otherwise. Note that this toy version uses a single hash function, whereas a production Bloom filter uses several to lower the false-positive rate. Adding "dog", "fish", "cat", and "bird" and then probing for strings that were never added demonstrates how hash collisions produce false positives.

Bloom filter after adding elements
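A quick sanity check of the class might look like the following (the class is restated so the snippet runs on its own; exact bit positions vary between runs because Python randomizes string hashes):

```python
class Bloomfilter(object):
    """Minimal restatement of the Bloom filter above."""
    def __init__(self, size):
        self.values = [False] * size
        self.size = size
    def hash_value(self, value):
        return hash(value) % self.size
    def add_value(self, value):
        self.values[self.hash_value(value)] = True
    def might_contain(self, value):
        return self.values[self.hash_value(value)]

bf = Bloomfilter(10)
for animal in ("dog", "fish", "cat", "bird"):
    bf.add_value(animal)

# No false negatives: every added value must test True.
print(all(bf.might_contain(a) for a in ("dog", "fish", "cat", "bird")))  # True

# A value that was never added usually tests False, but with only 10 bits it
# may collide with an existing entry -- that is the "might" in might_contain.
print(bf.might_contain("elephant"))
```

The small size of 10 bits makes collisions easy to provoke, which is exactly what the original demonstration relies on.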

Segmentation (Tokenization)

Tokenization splits text into searchable units (terms). The article implements a simple major segmentation based on spaces and a minor segmentation that captures additional sub‑segments.

def major_segments(s):
    """Perform major segmenting on a string. Split the string by all of the major
    breaks, and return the set of everything found. The breaks in this implementation
    are single characters, but in Splunk proper they can be multiple characters.
    A set is used because ordering doesn't matter, and duplicates are bad."""
    major_breaks = ' '
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last+1:idx]
            results.add(segment)
            last = idx  # advance past this break so segments don't overlap
    # capture the last segment
    segment = s[last+1:]
    results.add(segment)
    return results

def minor_segments(s):
    """Perform minor segmenting on a string. This is like major segmenting, except
    it also captures the segment from the start of the input to each break."""
    minor_breaks = '_. '
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last+1:idx]
            results.add(segment)
            # also capture the prefix from the start of the input to this break
            results.add(s[:idx])
            last = idx
    # capture the last segment and the full string itself
    segment = s[last+1:]
    results.add(segment)
    results.add(s)
    return results

def segments(event):
    """Simple wrapper around major_segments / minor_segments"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results

Using these functions, a string like 'src_ip = 1.2.3.4' is broken into tokens such as src_ip, src, ip, 1.2.3.4, 1.2.3, and 1.2, which is what later lets a search match either a whole field or any of its parts.

Tokenization example
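As a standalone sketch of how these tokens come out, the snippet below restates the segmenters compactly (advancing `last` at each break, with minor segmentation also emitting the prefix up to each break and the whole string, which is what produces tokens like `1.2` and `src_ip`):

```python
def major_segments(s):
    # Split on spaces; a set drops duplicates and ignores order.
    major_breaks = ' '
    last, results = -1, set()
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            results.add(s[last + 1:idx])
            last = idx
    results.add(s[last + 1:])
    return results

def minor_segments(s):
    # Split on '_', '.', and ' '; also keep each prefix and the full string.
    minor_breaks = '_. '
    last, results = -1, set()
    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            results.add(s[last + 1:idx])
            results.add(s[:idx])
            last = idx
    results.add(s[last + 1:])
    results.add(s)
    return results

def segments(event):
    # Every minor segment of every major segment becomes a searchable term.
    return {minor for major in major_segments(event)
                  for minor in minor_segments(major)}

print(sorted(segments('src_ip = 1.2.3.4')))
# ['1', '1.2', '1.2.3', '1.2.3.4', '2', '3', '4', '=', 'ip', 'src', 'src_ip']
```

Keeping the prefixes (`1.2`, `1.2.3`) alongside the atoms (`1`, `2`, ...) is what allows searches at several granularities over the same event.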

Search Engine Implementation

The Splunk class combines a Bloom filter, a term‑to‑event inverted index, and an event list. Adding an event tokenizes the event, updates the Bloom filter, and records the event ID under each term.

class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}
        self.events = []
    def add_event(self, event):
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)
    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        if not self.bf.might_contain(term):
            return
        if term not in self.terms:
            return
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]

Example usage adds three events and performs searches for IP addresses and field names, printing matching events.
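A plausible reconstruction of that usage is sketched below. The three sample events are assumptions based on the description, the classes are condensed restatements so the snippet runs standalone, and for brevity the tokenizer is simplified to whole space-separated words:

```python
class Bloomfilter:
    def __init__(self, size):
        self.values = [False] * size
        self.size = size
    def add_value(self, value):
        self.values[hash(value) % self.size] = True
    def might_contain(self, value):
        return self.values[hash(value) % self.size]

def segments(event):
    # Simplified tokenizer: whole space-separated words only.
    return set(event.split(' '))

class Splunk:
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}   # inverted index: term -> set of event IDs
        self.events = []  # event ID -> raw event text
    def add_event(self, event):
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            self.terms.setdefault(term, set()).add(event_id)
    def search(self, term):
        # Bloom filter first: a definite "no" skips the index lookup entirely.
        if not self.bf.might_contain(term) or term not in self.terms:
            return
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]

s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search('1.2.3.4'):
    print(event)
# src_ip = 1.2.3.4
# dst_ip = 1.2.3.4
```

Searching for a field name like 'src_ip' works the same way, yielding the first two events.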

Complex Search (AND/OR)

The SplunkM class extends the basic engine with methods for conjunctive (AND) and disjunctive (OR) queries using Python set operations.

class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}
        self.events = []
    def add_event(self, event):
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)
    def search_all(self, terms):
        """Search for an AND of all terms"""
        results = set(range(len(self.events)))
        for term in terms:
            if not self.bf.might_contain(term) or term not in self.terms:
                return
            results = results.intersection(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]
    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()
        for term in terms:
            if not self.bf.might_contain(term) or term not in self.terms:
                continue
            results = results.union(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]

Using set intersection and union enables efficient AND/OR query processing. Sample runs show how events matching multiple fields are retrieved.
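To see the set operations in action, here is a condensed, self-contained sketch; the sample events and the simplified space-only tokenizer are assumptions for illustration:

```python
class Bloomfilter:
    def __init__(self, size):
        self.values = [False] * size
        self.size = size
    def add_value(self, value):
        self.values[hash(value) % self.size] = True
    def might_contain(self, value):
        return self.values[hash(value) % self.size]

def segments(event):
    # Simplified tokenizer: whole space-separated words only.
    return set(event.split(' '))

class SplunkM:
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}
        self.events = []
    def add_event(self, event):
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            self.terms.setdefault(term, set()).add(event_id)
    def search_all(self, terms):
        """AND: start from all events and intersect each term's posting set."""
        results = set(range(len(self.events)))
        for term in terms:
            if not self.bf.might_contain(term) or term not in self.terms:
                return  # one missing term means no event can match them all
            results = results.intersection(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]
    def search_any(self, terms):
        """OR: union the posting sets of every known term."""
        results = set()
        for term in terms:
            if not self.bf.might_contain(term) or term not in self.terms:
                continue  # an unknown term just contributes nothing
            results = results.union(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]

s = SplunkM()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

print(list(s.search_all(['src_ip', '1.2.3.4'])))  # ['src_ip = 1.2.3.4']
print(list(s.search_any(['src_ip', 'dst_ip'])))   # all three events
```

The asymmetry between `return` in `search_all` and `continue` in `search_any` mirrors the logic: an absent term empties an AND result but merely contributes nothing to an OR.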

Conclusion

The provided code illustrates the core ideas behind big‑data search: Bloom filters for fast negative checks, tokenization to build an inverted index, and set operations for complex queries. A production‑grade system would require many additional features. The material originates from Splunk Conf2017.

Splunk architecture illustration
Big Data · Python · Inverted Index · Tokenization · Bloom Filter · Search · AND/OR Queries
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
