Master PostgreSQL Full-Text Search: From Basics to Advanced Chinese Tokenization
This article explains PostgreSQL's native full‑text search, its core concepts of tsvector and tsquery, demonstrates how to use built‑in functions and operators, compares built‑in, zhparser, and pg_search extensions for Chinese tokenization, and provides best‑practice tips for indexing, triggers, and performance optimization.
Why PostgreSQL Full‑Text Search Matters
PostgreSQL provides a lightweight, production‑ready full‑text search (FTS) engine that can be combined with data storage, making it suitable for AI‑driven large‑scale text retrieval.
Full‑Text Search Basics
FTS uses an inverted index, relevance ranking, and stemming to locate relevant documents quickly.
Built‑in PostgreSQL FTS
Core Types: tsvector and tsquery
tsvectorstores a document as a list of lexemes with position information.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector; to_tsvector()performs language‑aware tokenisation, lower‑casing, stemming and stop‑word removal.
Lower‑casing: The Fat Rats → the fat rats Stemming: Rats → rat Stop‑word removal: a, on, it are discarded
Queries are expressed with tsquery using Boolean operators ( &, |, !) and matched with the @@ operator.
SELECT to_tsvector('fat cats ate rats') @@ to_tsquery('cat & rat'); -- returns trueWhen a column is of type text , PostgreSQL implicitly calls to_tsvector() for comparison.
Limitations of the Built‑in Parser
The default parser splits on whitespace, which works for English but treats an entire Chinese sentence as a single token, preventing partial matches.
Chinese Tokenisation with zhparser
zhparseruses the SCWS library to split Chinese sentences into individual words. The extension must be compiled and installed.
CREATE EXTENSION zhparser; CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser); ALTER TEXT SEARCH CONFIGURATION chinese
ADD MAPPING FOR n, v, a, i, e, l WITH simple;The simple dictionary stores tokens without further processing, which is appropriate for Chinese where stemming is unnecessary.
Smart Search with pg_search (ParadeDB)
Install the pg_search extension, add it to shared_preload_libraries in postgresql.conf, and restart PostgreSQL.
shared_preload_libraries = 'pg_search' CREATE EXTENSION pg_search; pg_searchprovides BM25 ranking and supports multiple tokenisers, including Chinese tokenisers such as chinese_compatible and jieba.
Creating a BM25 Index
CREATE INDEX ch_idx
ON public.search_wenzhang
USING bm25 (id, content)
WITH (
key_field = 'id',
text_fields = '{"content": {"tokenizer": {"type": "chinese_compatible"}}}'
);Query syntax uses the @@@ operator together with paradedb.parse():
SELECT *
FROM search_wenzhang
WHERE id @@@ paradedb.parse('content:问题')
LIMIT 10;Comparison of zhparser vs pg_search
For the keyword “问题”, both extensions return the same rows. For the single character “问”, only pg_search matches because its tokenizer supports finer granularity.
Selection Advice and Best Practices
Three Solution Options
Built‑in FTS – sufficient for English, no extra installation. zhparser – adds Chinese boolean search, requires compilation. pg_search – best for Chinese with BM25 ranking, easy to install.
Trigger to Auto‑Maintain tsvector
CREATE OR REPLACE FUNCTION update_wenzhang_tsvector()
RETURNS TRIGGER AS $$
BEGIN
NEW.content_tsvector := to_tsvector('english',
COALESCE(NEW.title,'') || ' ' || COALESCE(NEW.content,''));
NEW.content_zh_tsvector := to_tsvector('chinese',
COALESCE(NEW.title,'') || ' ' || COALESCE(NEW.content,''));
RETURN NEW;
END;
$$ LANGUAGE plpgsql; CREATE TRIGGER wenzhang_tsvector_update
BEFORE INSERT OR UPDATE ON search_wenzhang
FOR EACH ROW EXECUTE FUNCTION update_wenzhang_tsvector();Performance Optimisation Checklist
Create GIN (built‑in/zhparser) or BM25 (pg_search) indexes.
Pre‑generate tsvector columns; avoid calling to_tsvector() at query time.
Partition large tables by time range to reduce scan volume.
Run VACUUM ANALYZE regularly and REINDEX when indexes become stale.
Replace LIKE '%keyword%' patterns with full‑text search.
Benchmark Results
On a 2‑million‑row table, a GIN index turned a full‑table scan into an index lookup, delivering roughly a 7× performance boost.
Common Issues Quick‑Reference
to_tsvector('chinese', …)returns empty → missing mapping; add
ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING … WITH simple.
Chinese queries only match whole words → use chinese_compatible or jieba tokenisers for finer granularity. pg_search errors → ensure shared_preload_libraries includes pg_search and restart.
Slow queries → check index health, consider partitioning, run REINDEX.
Unexpected tokenisation results → verify SCWS dictionary version or switch to jieba via pg_search.
Conclusion
PostgreSQL’s full‑text search ecosystem has matured:
Built‑in FTS – robust for English without extra setup. zhparser – enables Chinese boolean search, requires compilation. pg_search – combines Chinese tokenisation with BM25 ranking, easy to install and recommended for production.
In practice, combine the strengths of multiple solutions: use zhparser or pg_search for Chinese, while retaining built‑in FTS for English.
Illustrations
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
