Boost Your Python Projects with Whoosh: A Lightweight Search Engine Tutorial
This article introduces the lightweight pure‑Python search library Whoosh, outlines its key features, demonstrates how to define a schema, build an index from a CSV of Chinese poems, and perform full‑text queries with example code, making it ideal for small search projects.
Whoosh Overview
Whoosh, created by Matt Chaput, started as a simple search tool for Houdini 3D documentation and has grown into a mature, pure-Python search engine supporting Python 2 and 3.
Key Features
Pure Python implementation, no compiler needed.
Uses Okapi BM25F ranking by default, with other algorithms available.
Produces fairly small index files compared with many other search libraries.
All indexed text must be Unicode.
Can store arbitrary Python objects alongside indexed documents.
Official site: https://whoosh.readthedocs.io/en/latest/intro.html. Whoosh is lighter and simpler than Elasticsearch or Solr, suitable for small search projects.
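To give a feel for what BM25-style ranking rewards, here is a minimal single-field BM25 sketch in plain Python. It is not Whoosh's implementation: BM25F additionally combines several weighted fields, and the parameter values k1=1.2 and b=0.75 are commonly used defaults assumed here for illustration. The tiny tokenized documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Rarer terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Term frequency saturates; longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up, pre-tokenized documents.
docs = [
    ["bright", "moon", "over", "the", "mountain"],
    ["moon", "moon", "river"],
    ["spring", "rain", "at", "night"],
]
scores = bm25_scores(["moon"], docs)
```

In this sample the second document mentions the query term twice and is shorter than average, so it scores above the first, while the third does not contain the term and scores zero.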
Index & Query Concepts
As with Elasticsearch, working with Whoosh involves two steps: defining a schema and building an index (comparable to an ES mapping), then running queries against that index. The library's API is straightforward for anyone familiar with ES.
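Conceptually, indexing builds a mapping from terms to the documents that contain them, and a query looks terms up in that mapping. The toy inverted index below illustrates the idea in plain Python. It is a deliberately simplified sketch with made-up documents, not how Whoosh stores or searches data.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


# Made-up documents keyed by id.
docs = {
    1: "bright moon over the river",
    2: "spring rain at night",
    3: "moon rise at night",
}
index = build_index(docs)
```

Here search(index, "moon") returns {1, 3}, and search(index, "moon night") narrows that to {3}. A real engine adds analysis (tokenization, stop words), scoring, and on-disk storage on top of this basic structure.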
Example Code
Data
The example uses a CSV file poem.csv containing four columns: title, dynasty, poet, content.
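The four-column layout might look like the sample below. The two rows are illustrative stand-ins rather than rows taken from the original poem.csv, and the header row is an assumption. Python's csv module is used in this sketch because it handles quoted fields, which a plain split(',') does not.

```python
import csv
import io

# Hypothetical sample in the described four-column layout.
sample = (
    "title,dynasty,poet,content\n"
    "静夜思,唐,李白,床前明月光，疑是地上霜。举头望明月，低头思故乡。\n"
    "春晓,唐,孟浩然,春眠不觉晓，处处闻啼鸟。夜来风雨声，花落知多少。\n"
)


def read_poems(handle):
    """Yield one dict per well-formed row, keyed by the header names."""
    reader = csv.DictReader(handle)
    for row in reader:
        # DictReader fills missing columns with None and collects extra
        # columns under the key None; skip rows in either situation.
        if None in row or None in row.values():
            continue
        yield row


poems = list(read_poems(io.StringIO(sample)))
```

For a real file, the same function could be called with open('poem.csv', encoding='utf-8', newline='') in place of the io.StringIO object.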
Schema Definition
# -*- coding: utf-8 -*-
import json
import os
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in
from jieba.analyse import ChineseAnalyzer
# Define the schema. stored=True keeps the original value in the index so it
# is returned with search results. TEXT fields are tokenized for full-text
# search; ID fields are indexed as a single exact value.
schema = Schema(
    title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
    dynasty=ID(stored=True),
    poet=ID(stored=True),
    content=TEXT(stored=True, analyzer=ChineseAnalyzer()),
)
Creating the Index
# Parse poem.csv, keeping only rows that have exactly four columns
with open('poem.csv', 'r', encoding='utf-8') as f:
    rows = (line.strip().split(',') for line in f)
    texts = [row for row in rows if len(row) == 4]
# Create the index directory if it does not exist yet
indexdir = 'indexdir/'
if not os.path.exists(indexdir):
    os.mkdir(indexdir)
# create_in() starts a new, empty index in the directory
ix = create_in(indexdir, schema)
writer = ix.writer()
# Skip the first row (the header) and add one document per poem
for title, dynasty, poet, content in texts[1:]:
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()
After committing, the indexdir directory contains the index files for all fields.
Searching
# Create a searcher
searcher = ix.searcher()
# Find documents whose content contains "明月"
results = searcher.find("content", "明月")
print('Found %d documents.' % len(results))
# Print at most the first ten hits as JSON
for i in range(min(10, len(results))):
    print(json.dumps(results[i].fields(), ensure_ascii=False))
searcher.close()
The script prints the number of matching documents and displays up to ten results, showing the stored fields title, dynasty, poet, and content.
21CTO
