How a Simple Python Bloom Filter Powers Fast Big Data Search

This article demonstrates how to implement a basic Bloom filter, tokenization, and inverted index in Python to illustrate the core principles of big‑data search, including fast negative lookups, term segmentation, and support for AND/OR queries.

MaGe Linux Operations

Nov 27, 2018

Search is a common requirement in big data work. Splunk and the ELK stack are the leading proprietary and open-source solutions, respectively. This tutorial builds a simple search engine from a Bloom filter, tokenization, and an inverted index in a small amount of Python.

Bloom Filter

A Bloom filter is a probabilistic data structure that quickly determines whether an element is definitely not in a set or possibly in it. The implementation includes initialization, hashing, adding values, checking membership, and printing the filter contents.

class Bloomfilter(object):
    def __init__(self, size):
        # One boolean slot per possible hash position.
        self.values = [False] * size
        self.size = size
    def hash_value(self, value):
        # Map the value to a slot index in [0, size).
        return hash(value) % self.size
    def add_value(self, value):
        h = self.hash_value(value)
        self.values[h] = True
    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        h = self.hash_value(value)
        return self.values[h]
    def print_contents(self):
        print(self.values)

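After creating a filter of size 10, adding items like 'dog', 'fish', and 'cat' sets the corresponding bits. Adding 'bird' may leave the filter unchanged if it hashes to an already-set position. In that case might_contain('bird') would have returned True even before 'bird' was added, which is a false positive. Note that Python salts string hashes per process by default, so the exact bit positions can differ from run to run.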
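The behaviour described above can be reproduced with a small, self-contained sketch. It mirrors the article's single-hash filter, with two additions that are not in the original: the class name DemoBloomFilter is hypothetical, and zlib.crc32 stands in for the built-in hash() so that results are repeatable. The false_positive_rate helper is also an addition; it assumes the hash spreads values uniformly across the slots.

```python
import zlib


class DemoBloomFilter:
    """Single-hash Bloom filter, mirroring the article's Bloomfilter class."""

    def __init__(self, size):
        self.size = size
        self.values = [False] * size

    def hash_value(self, value):
        # Assumption: crc32 replaces the built-in hash() so that positions
        # are identical on every run (str hashes are salted per process).
        return zlib.crc32(value.encode("utf-8")) % self.size

    def add_value(self, value):
        self.values[self.hash_value(value)] = True

    def might_contain(self, value):
        return self.values[self.hash_value(value)]


def false_positive_rate(size, n_items):
    """Chance that an absent value lands on a set bit (one hash function)."""
    return 1 - (1 - 1 / size) ** n_items


bf = DemoBloomFilter(10)
for animal in ("dog", "fish", "cat"):
    bf.add_value(animal)

# Every added value is reported as possibly present (no false negatives).
print(all(bf.might_contain(a) for a in ("dog", "fish", "cat")))  # True
# At most three bits are set; fewer if two animals collided.
print(sum(bf.values))
# Estimated false-positive rate for 3 items in 10 bits.
print(round(false_positive_rate(10, 3), 3))  # 0.271
```

With only 10 slots, roughly one in four lookups for an absent value would be expected to hit a set bit, which is why a real filter needs far more slots than items.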

Tokenization

Tokenization splits text into searchable units. The major segmentation splits on spaces, while the minor segmentation further splits each major segment on '_' and '.' characters. Both functions return a set of unique tokens.

def major_segments(s):
    """Split the string by spaces and return a set of tokens."""
    last = -1  # index of the previous break; -1 means "before the start"
    results = set()
    for idx, ch in enumerate(s):
        if ch in ' ':
            segment = s[last+1:idx]
            results.add(segment)
            last = idx
    # capture the last segment
    results.add(s[last+1:])
    return results

def minor_segments(s):
    """Further split each major segment by '_' and '.' characters."""
    last = -1  # index of the previous break; -1 means "before the start"
    results = set()
    for idx, ch in enumerate(s):
        if ch in '_.':
            segment = s[last+1:idx]
            results.add(segment)
            last = idx
    # capture the last segment
    results.add(s[last+1:])
    return results

def segments(event):
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results
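To see what the tokenizer produces, here is a compact sketch that should behave the same as the loop-based functions above, written with str.split and re.split instead of manual index tracking. The sample event string is hypothetical and used only for illustration.

```python
import re


def major_segments(s):
    # str.split(' ') breaks on every single space, like the loop version.
    return set(s.split(' '))


def minor_segments(s):
    # Split on '_' or '.'; the character class matches either one.
    return set(re.split(r'[_.]', s))


def segments(event):
    results = set()
    for major in major_segments(event):
        results.update(minor_segments(major))
    return results


# Hypothetical log line, used only for illustration.
tokens = segments('src_ip = 1.2.3.4')
print(sorted(tokens))  # ['1', '2', '3', '4', '=', 'ip', 'src']
```

Note that the whole strings 'src_ip' and '1.2.3.4' are not among the tokens, so with this tokenizer a search for 'src_ip' would find nothing while a search for 'src' or 'ip' would match.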

Search Engine (Splunk)

The Splunk class maintains a Bloom filter, a dictionary mapping terms to event IDs, and a list of events. Adding an event tokenizes the event, updates the Bloom filter, and records the mapping.

class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}
        self.events = []
    def add_event(self, event):
        event_id = len(self.events)
        self.events.append(event)
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)
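    def search(self, term):
        # Cheap Bloom filter check first, then the authoritative dictionary lookup.
        if not self.bf.might_contain(term) or term not in self.terms:
            return  # generator: ends with no results
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]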
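The inverted index at the heart of the class can be illustrated on its own. The following is a minimal sketch, not the article's class: TinyIndex is a hypothetical name, it omits the Bloom filter, and it assumes plain whitespace tokenization instead of segments(). The sample events are invented.

```python
from collections import defaultdict


class TinyIndex:
    """Hypothetical stand-in for the Splunk class: inverted index only."""

    def __init__(self):
        self.terms = defaultdict(set)   # term -> set of event IDs
        self.events = []                # event ID -> raw event text

    def add_event(self, event):
        event_id = len(self.events)
        self.events.append(event)
        # Assumption: plain whitespace tokenization instead of segments().
        for term in event.split():
            self.terms[term].add(event_id)

    def search(self, term):
        # .get avoids creating empty entries for unknown terms.
        for event_id in sorted(self.terms.get(term, ())):
            yield self.events[event_id]


index = TinyIndex()
index.add_event('alice logged in')
index.add_event('bob logged out')
index.add_event('alice logged out')

print(list(index.search('alice')))   # ['alice logged in', 'alice logged out']
print(list(index.search('carol')))   # []
```

In an in-memory sketch like this the dictionary lookup is already fast, so a Bloom filter adds little. The pre-check tends to pay off when the index lives on disk or on another machine and a negative answer avoids an expensive read.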

Two additional methods, search_all (AND query) and search_any (OR query), use set intersection and union to support complex queries.

Complex Queries

Using Python set operations, the engine can efficiently handle AND (intersection) and OR (union) searches across multiple terms.

# Both functions below are methods of the Splunk class.
def search_all(self, terms):
    # AND query: start from every event ID and narrow down term by term.
    results = set(range(len(self.events)))
    for term in terms:
        if not self.bf.might_contain(term) or term not in self.terms:
            return  # one missing term means no event can match
        results = results.intersection(self.terms[term])
    for event_id in sorted(results):
        yield self.events[event_id]

def search_any(self, terms):
    # OR query: collect the event IDs of every term that is present.
    results = set()
    for term in terms:
        if not self.bf.might_contain(term) or term not in self.terms:
            continue
        results = results.union(self.terms[term])
    for event_id in sorted(results):
        yield self.events[event_id]
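The set logic behind the two queries can be tried in isolation. This is a minimal sketch that works on a hand-built index rather than the Splunk class: the terms dictionary and its contents are hypothetical, and the functions return event IDs instead of yielding event text.

```python
# Hypothetical inverted index: term -> set of event IDs.
terms = {
    'alice': {0, 2},
    'bob': {1},
    'out': {1, 2},
}


def search_all(terms, query):
    """AND: IDs of events that contain every query term."""
    results = None
    for term in query:
        if term not in terms:
            return set()          # one missing term empties an AND query
        ids = terms[term]
        results = set(ids) if results is None else results & ids
    return results or set()


def search_any(terms, query):
    """OR: IDs of events that contain at least one query term."""
    results = set()
    for term in query:
        results |= terms.get(term, set())
    return results


print(sorted(search_all(terms, ['alice', 'out'])))    # [2]
print(sorted(search_any(terms, ['alice', 'bob'])))    # [0, 1, 2]
print(sorted(search_all(terms, ['alice', 'carol'])))  # []
```

One difference from the methods above: this sketch returns an empty set for an empty query, whereas the article's search_all starts from all event IDs and would therefore return every event.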

Together, these pieces cover adding events and running both simple and complex searches. The code illustrates the fundamental mechanisms behind big-data search systems like Splunk, though a production system would require many additional features.

All content originates from Splunk Conf2017.
