Boost Your Python Projects with Whoosh: A Lightweight Search Engine Tutorial

This article introduces Whoosh, a lightweight pure‑Python search library. It outlines the key features, then walks through defining a schema, building an index from a CSV of Chinese poems, and running full‑text queries with example code. Whoosh is a good fit for small search projects.

21CTO

Whoosh Overview

Whoosh, created by Matt Chaput, started as a simple search tool for Houdini 3D documentation and has grown into a mature, pure-Python search engine supporting Python 2 and 3.

Key Features

Pure Python implementation, no compiler needed.

Uses Okapi BM25F ranking by default, with other algorithms available.

Creates fairly small index files compared with many other search libraries.

All indexed text must be Unicode.

Can store arbitrary Python objects.

Official documentation: https://whoosh.readthedocs.io/en/latest/intro.html. Whoosh is lighter and simpler than Elasticsearch or Solr, which makes it suitable for small search projects.

Index & Query Concepts

As in Elasticsearch, working with Whoosh has two stages: defining a schema and building the index (comparable to an ES mapping), then running queries against that index. The API is straightforward for anyone already familiar with ES.

Example Code

Data

The example uses a CSV file poem.csv containing four columns: title, dynasty, poet, content.
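A file with this layout can also be read with the standard-library `csv` module, which copes with quoted fields that contain commas. The sketch below is an illustration rather than the article's method: it uses an in-memory stand-in for poem.csv, the second row is invented to show a quoted comma, and `read_poems` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical stand-in for poem.csv: a header row plus two data rows.
# The second row is made up and has quoted fields containing an ASCII comma.
sample = io.StringIO(
    'title,dynasty,poet,content\n'
    '静夜思,唐,李白,床前明月光，疑是地上霜。\n'
    '"Untitled, draft",唐,佚名,"明月, 清风"\n'
)


def read_poems(handle):
    """Yield one dict per data row, skipping rows with the wrong column count."""
    for row in csv.DictReader(handle):
        # DictReader files surplus cells under the key None and fills
        # missing cells with None; either case means a malformed row.
        if None in row or None in row.values():
            continue
        yield row


poems = list(read_poems(sample))
for poem in poems:
    print(poem["title"], poem["poet"])
```

For a real file, the handle would come from `open('poem.csv', newline='', encoding='utf-8')`; the header row is consumed by `DictReader`, so no manual skipping is needed.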

Schema Definition

# -*- coding: utf-8 -*-
import os
from whoosh.index import create_in
from whoosh.fields import *
from jieba.analyse import ChineseAnalyzer
import json

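# Create the schema. stored=True keeps the original value in the index so it
# is returned with search results; TEXT and ID fields are searchable either way.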
schema = Schema(
    title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
    dynasty=ID(stored=True),
    poet=ID(stored=True),
    content=TEXT(stored=True, analyzer=ChineseAnalyzer())
)

Creating the Index

# Parse poem.csv, keeping only rows that split into exactly four fields
with open('poem.csv', 'r', encoding='utf-8') as f:
    rows = (line.strip().split(',') for line in f)
    texts = [row for row in rows if len(row) == 4]

# Create the index directory if it is missing, then create a new index in it
indexdir = 'indexdir/'
os.makedirs(indexdir, exist_ok=True)
ix = create_in(indexdir, schema)

# Add one document per row, starting at index 1 to skip the header row
writer = ix.writer()
for title, dynasty, poet, content in texts[1:]:
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()

After committing, the indexdir directory contains the index files for all fields.

Searching

# Create a searcher
searcher = ix.searcher()

# Find documents where content contains "明月"
results = searcher.find("content", "明月")
print('Found %d documents.' % len(results))
for i in range(min(10, len(results))):
    print(json.dumps(results[i].fields(), ensure_ascii=False))

The script prints the number of matching documents and displays up to ten results, showing fields such as title, dynasty, poet, and content.
