
Master PostgreSQL Full-Text Search: From Basics to Advanced Chinese Tokenization

This article explains PostgreSQL's native full‑text search and its core types, tsvector and tsquery; demonstrates the built‑in functions and operators; compares the built‑in parser with the zhparser and pg_search extensions for Chinese tokenization; and offers best‑practice tips for indexing, triggers, and performance optimization.

360 Zhihui Cloud Developer

Why PostgreSQL Full‑Text Search Matters

PostgreSQL ships a lightweight, production‑ready full‑text search (FTS) engine that lives in the same database as the data it indexes, eliminating the need for a separate search service and making it suitable for large‑scale text retrieval in AI‑driven applications.

Full‑Text Search Basics

FTS uses an inverted index, relevance ranking, and stemming to locate relevant documents quickly.

Built‑in PostgreSQL FTS

Core Types: tsvector and tsquery

A tsvector stores a document as a sorted list of lexemes together with their positions.

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;

The to_tsvector() function goes further: it performs language‑aware tokenisation, lower‑casing, stemming, and stop‑word removal.

Lower‑casing: The Fat Rats → the fat rats
Stemming: Rats → rat
Stop‑word removal: a, on, it are discarded
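A quick check of this normalisation pipeline using the built‑in english configuration (this is the behaviour PostgreSQL documents for its default dictionaries):

```sql
SELECT to_tsvector('english', 'The Fat Rats');
-- Result: 'fat':2 'rat':3
-- "The" is dropped as a stop word; "Fat" and "Rats" are
-- lower‑cased and stemmed, keeping their positions (2 and 3).
```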

Queries are expressed with tsquery using the Boolean operators &, |, and !, and matched against a tsvector with the @@ operator.

SELECT to_tsvector('fat cats ate rats') @@ to_tsquery('cat & rat');  -- returns true

When the left‑hand operand of @@ is plain text, PostgreSQL implicitly applies to_tsvector() before matching.
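The Boolean operators compose as expected; a small sketch:

```sql
-- & (AND), | (OR), ! (NOT) combine inside a tsquery
SELECT to_tsvector('english', 'fat cats ate rats')
       @@ to_tsquery('english', 'cat & !dog');   -- true: has "cat", lacks "dog"

SELECT to_tsvector('english', 'fat cats ate rats')
       @@ to_tsquery('english', 'dog | rat');    -- true: "rat" matches
```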

Limitations of the Built‑in Parser

The default parser splits on whitespace, which works for English but treats an entire Chinese sentence as a single token, preventing partial matches.

Chinese Tokenisation with zhparser

zhparser uses the SCWS library to split Chinese sentences into individual words. The extension must be compiled from source and installed.

CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION chinese
  ADD MAPPING FOR n, v, a, i, e, l WITH simple;

The simple dictionary stores tokens without further processing, which is appropriate for Chinese where stemming is unnecessary.
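With the chinese configuration in place, a match query looks just like the English case. A sketch, assuming zhparser is installed and SCWS segments 问题 ("problem") as a single word:

```sql
-- SCWS splits the sentence into words, so the two-character
-- word 问题 becomes a searchable token
SELECT to_tsvector('chinese', '我遇到了一个问题')
       @@ to_tsquery('chinese', '问题');  -- true if SCWS emits 问题 as a token
```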

Smart Search with pg_search (ParadeDB)

Install the pg_search extension, add it to shared_preload_libraries in postgresql.conf, and restart PostgreSQL.

# postgresql.conf
shared_preload_libraries = 'pg_search'

-- then, after restarting, in psql:
CREATE EXTENSION pg_search;
pg_search provides BM25 relevance ranking and supports multiple tokenisers, including Chinese tokenisers such as chinese_compatible and jieba.

Creating a BM25 Index

CREATE INDEX ch_idx
  ON public.search_wenzhang
  USING bm25 (id, content)
  WITH (
    key_field = 'id',
    text_fields = '{"content": {"tokenizer": {"type": "chinese_compatible"}}}'
  );

Query syntax uses the @@@ operator together with paradedb.parse():

SELECT *
FROM search_wenzhang
WHERE id @@@ paradedb.parse('content:问题')
LIMIT 10;
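To order results by relevance rather than just filter, recent pg_search releases expose a BM25 score function. A sketch, assuming paradedb.score() is available in your installed version:

```sql
-- paradedb.score(key_field) returns the BM25 score of each matching row
SELECT id, paradedb.score(id) AS score
FROM search_wenzhang
WHERE id @@@ paradedb.parse('content:问题')
ORDER BY score DESC
LIMIT 10;
```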

Comparison of zhparser vs pg_search

For the keyword “问题”, both extensions return the same rows. For the single character “问”, only pg_search matches because its tokenizer supports finer granularity.

Selection Advice and Best Practices

Three Solution Options

Built‑in FTS – sufficient for English, no extra installation.
zhparser – adds Chinese boolean search, requires compilation.
pg_search – best for Chinese with BM25 ranking, easy to install.

Trigger to Auto‑Maintain tsvector

CREATE OR REPLACE FUNCTION update_wenzhang_tsvector()
RETURNS TRIGGER AS $$
BEGIN
  NEW.content_tsvector := to_tsvector('english',
    COALESCE(NEW.title,'') || ' ' || COALESCE(NEW.content,''));
  NEW.content_zh_tsvector := to_tsvector('chinese',
    COALESCE(NEW.title,'') || ' ' || COALESCE(NEW.content,''));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER wenzhang_tsvector_update
BEFORE INSERT OR UPDATE ON search_wenzhang
FOR EACH ROW EXECUTE FUNCTION update_wenzhang_tsvector();
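With the trigger keeping both columns current, GIN indexes on them let queries hit the index instead of recomputing vectors per row. Table and column names follow the trigger above; ts_rank provides the built‑in relevance score:

```sql
CREATE INDEX idx_wenzhang_tsv
  ON search_wenzhang USING GIN (content_tsvector);
CREATE INDEX idx_wenzhang_zh_tsv
  ON search_wenzhang USING GIN (content_zh_tsvector);

-- Query the pre-computed column and rank matches by relevance
SELECT id, title,
       ts_rank(content_zh_tsvector, to_tsquery('chinese', '问题')) AS rank
FROM search_wenzhang
WHERE content_zh_tsvector @@ to_tsquery('chinese', '问题')
ORDER BY rank DESC
LIMIT 10;
```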

Performance Optimisation Checklist

Create GIN (built‑in/zhparser) or BM25 (pg_search) indexes.

Pre‑generate tsvector columns; avoid calling to_tsvector() at query time.

Partition large tables by time range to reduce scan volume.

Run VACUUM ANALYZE regularly and REINDEX when indexes become stale.

Replace LIKE '%keyword%' patterns with full‑text search.
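To verify that a query actually uses the index rather than scanning the table, inspect the plan:

```sql
EXPLAIN ANALYZE
SELECT id FROM search_wenzhang
WHERE content_tsvector @@ to_tsquery('english', 'database');
-- A healthy plan shows a "Bitmap Index Scan" on the GIN index;
-- a "Seq Scan" over the whole table means the index is not being used.
```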

Benchmark Results

On a 2‑million‑row table, a GIN index turned a full‑table scan into an index lookup, delivering roughly a 7× performance boost.

Common Issues Quick‑Reference

to_tsvector('chinese', …) returns empty → the token‑type mapping is missing; run ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING … WITH simple.

Chinese queries only match whole words → switch to the chinese_compatible or jieba tokenisers for finer granularity.

pg_search errors → ensure shared_preload_libraries includes pg_search and restart PostgreSQL.

Slow queries → check index health, consider partitioning, run REINDEX.

Unexpected tokenisation results → verify SCWS dictionary version or switch to jieba via pg_search.

Conclusion

PostgreSQL’s full‑text search ecosystem has matured:

Built‑in FTS – robust for English without extra setup.
zhparser – enables Chinese boolean search, requires compilation.
pg_search – combines Chinese tokenisation with BM25 ranking, easy to install and recommended for production.

In practice, combine the strengths of multiple solutions: use zhparser or pg_search for Chinese, while retaining built‑in FTS for English.

Illustrations

Full‑text search architecture
Chinese tokenisation with zhparser
pg_search BM25 index
Performance benchmark
BM25, PostgreSQL, Full-text search, database indexing, Chinese Tokenization, pg_search, zhparser
Written by 360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
