Boost PostgreSQL Full‑Text Search 3× Faster with VectorChord‑BM25
VectorChord‑BM25 is a PostgreSQL extension that adds native BM25 ranking and tokenization, delivering up to three‑fold query‑per‑second improvements over ElasticSearch while maintaining comparable relevance scores, and includes detailed installation, usage examples, and performance analysis.
VectorChord‑BM25 Overview
VectorChord‑BM25 is a PostgreSQL extension that adds BM25 scoring and ranking to full‑text search, using a custom bm25vector type and a BlockMax WeakAnd index for efficient score‑based filtering.
Key Features
BM25 scoring with tunable parameters k1 and b.
BlockMax WeakAnd index for fast queries on large collections.
Tokenizer support (BERT, Tocken, Unicode) with stemming and stop‑word handling.
BM25 Formula
score(Q, D) = Σ_{q∈Q} IDF(q) * (f(q, D) * (k1 + 1)) / (f(q, D) + k1 * (1 - b + b * |D|/avgdl))where f(q,D) is term frequency, IDF(q) is inverse document frequency, |D| is document length, avgdl is average document length, and k1, b control term‑frequency impact and length normalization.
Existing PostgreSQL Full‑Text Search
PostgreSQL uses the tsvector type with GIN indexes and the ts_rank function. It lacks a native BM25 ranking, requiring post‑processing that can be costly for large result sets.
-- Create table
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
content_vector tsvector
);
-- Insert data
INSERT INTO documents (content, content_vector) VALUES
('PostgreSQL is a powerful, open-source database system.', to_tsvector('english', 'PostgreSQL is a powerful, open-source database system.')),
('Full‑text search in PostgreSQL is efficient and scalable.', to_tsvector('english', 'Full‑text search in PostgreSQL is efficient and scalable.')),
('BM25 is a ranking function used by search engines.', to_tsvector('english', 'BM25 is a ranking function used by search engines.'));
-- GIN index
CREATE INDEX idx_content_vector ON documents USING GIN (content_vector);
-- Query with ts_rank
SELECT id, content,
ts_rank(content_vector, to_tsquery('english', 'PostgreSQL & search')) AS rank
FROM documents
WHERE content_vector @@ to_tsquery('english', 'PostgreSQL & search')
ORDER BY rank DESC;VectorChord‑BM25 Implementation
The extension provides three core components:
Tokenizer : Converts text into a bm25vector sparse representation of term IDs and frequencies.
bm25vector : Custom data type storing tokenized text suitable for BM25 scoring.
bm25vector index : Index type ( USING bm25) that accelerates search and ranking.
Example workflow:
-- Tokenize a sample sentence (BERT tokenizer)
SELECT tokenize('A quick brown fox jumps over the lazy dog.', 'Bert');
-- Table with bm25vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
passage TEXT,
embedding bm25vector
);
INSERT INTO documents (passage) VALUES
('PostgreSQL is a powerful, open-source object‑relational database system.'),
('Full‑text search is a technique for searching plain‑text documents.'),
('Effective search ranking algorithms, such as BM25, improve results.');
-- Populate bm25vector column
UPDATE documents SET embedding = tokenize(passage, 'Bert');
-- Create BM25 index
CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);
-- BM25 query (scores are negative; higher relevance = less negative)
SELECT id, passage,
embedding <&> to_bm25query('documents_embedding_bm25', 'PostgreSQL', 'Bert') AS rank
FROM documents
ORDER BY rank
LIMIT 10;Supported Tokenizers
Bert: Pre‑trained BERT tokenizer. Tocken: Trained on the wiki-103-raw dataset with minimum frequency 10. Unicode: Builds a vocabulary from the target table; created with create_unicode_tokenizer_and_trigger(...).
Performance Evaluation
Benchmarks on the BEIR suite measured queries per second (QPS) and NDCG@10. Initial results showed 3–5× higher QPS than Elasticsearch. After aligning stop‑word lists and stemming, the QPS advantage reduced to ~40 % (e.g., 112 QPS vs 49 QPS on Top‑1000 queries, a 2.26× speed increase) while maintaining comparable NDCG@10 scores on most datasets.
Installation & Quick Start
Repository: https://github.com/tensorchord/VectorChord-bm25
Typical steps:
Build and install the extension in PostgreSQL.
Create a table with a bm25vector column.
Populate the column using tokenize(text, 'TokenizerName').
Create a BM25 index: CREATE INDEX … USING bm25 (column bm25_ops); Run queries with
column <&> to_bm25query(index_name, query_text, 'TokenizerName').
Future Work
Decouple tokenization into a standalone, extensible component.
Add multilingual support and customizable tokenizers.
Introduce synonym handling and further performance optimizations.
References
https://github.com/tensorchord/VectorChord-bm25
https://deepwiki.com/tensorchord/pg_tokenizer.rs
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
