Databases 17 min read

Boost PostgreSQL Full‑Text Search 3× Faster with VectorChord‑BM25

VectorChord‑BM25 is a PostgreSQL extension that adds native BM25 ranking and tokenization, delivering up to three‑fold query‑per‑second improvements over ElasticSearch while maintaining comparable relevance scores, and includes detailed installation, usage examples, and performance analysis.

ITPUB
ITPUB
ITPUB
Boost PostgreSQL Full‑Text Search 3× Faster with VectorChord‑BM25

VectorChord‑BM25 Overview

VectorChord‑BM25 is a PostgreSQL extension that adds BM25 scoring and ranking to full‑text search, using a custom bm25vector type and a BlockMax WeakAnd index for efficient score‑based filtering.

Key Features

BM25 scoring with tunable parameters k1 and b.

BlockMax WeakAnd index for fast queries on large collections.

Tokenizer support (BERT, Tocken, Unicode) with stemming and stop‑word handling.

BM25 Formula

score(Q, D) = Σ_{q∈Q} IDF(q) * (f(q, D) * (k1 + 1)) / (f(q, D) + k1 * (1 - b + b * |D|/avgdl))

where f(q,D) is term frequency, IDF(q) is inverse document frequency, |D| is document length, avgdl is average document length, and k1, b control term‑frequency impact and length normalization.

Existing PostgreSQL Full‑Text Search

PostgreSQL uses the tsvector type with GIN indexes and the ts_rank function. It lacks a native BM25 ranking, requiring post‑processing that can be costly for large result sets.

-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    content_vector tsvector
);

-- Insert data
INSERT INTO documents (content, content_vector) VALUES
('PostgreSQL is a powerful, open-source database system.', to_tsvector('english', 'PostgreSQL is a powerful, open-source database system.')),
('Full‑text search in PostgreSQL is efficient and scalable.', to_tsvector('english', 'Full‑text search in PostgreSQL is efficient and scalable.')),
('BM25 is a ranking function used by search engines.', to_tsvector('english', 'BM25 is a ranking function used by search engines.'));

-- GIN index
CREATE INDEX idx_content_vector ON documents USING GIN (content_vector);

-- Query with ts_rank
SELECT id, content,
       ts_rank(content_vector, to_tsquery('english', 'PostgreSQL & search')) AS rank
FROM documents
WHERE content_vector @@ to_tsquery('english', 'PostgreSQL & search')
ORDER BY rank DESC;

VectorChord‑BM25 Implementation

The extension provides three core components:

Tokenizer : Converts text into a bm25vector sparse representation of term IDs and frequencies.

bm25vector : Custom data type storing tokenized text suitable for BM25 scoring.

bm25vector index : Index type ( USING bm25) that accelerates search and ranking.

Example workflow:

-- Tokenize a sample sentence (BERT tokenizer)
SELECT tokenize('A quick brown fox jumps over the lazy dog.', 'Bert');

-- Table with bm25vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    passage TEXT,
    embedding bm25vector
);

INSERT INTO documents (passage) VALUES
('PostgreSQL is a powerful, open-source object‑relational database system.'),
('Full‑text search is a technique for searching plain‑text documents.'),
('Effective search ranking algorithms, such as BM25, improve results.');

-- Populate bm25vector column
UPDATE documents SET embedding = tokenize(passage, 'Bert');

-- Create BM25 index
CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);

-- BM25 query (scores are negative; higher relevance = less negative)
SELECT id, passage,
       embedding <&> to_bm25query('documents_embedding_bm25', 'PostgreSQL', 'Bert') AS rank
FROM documents
ORDER BY rank
LIMIT 10;

Supported Tokenizers

Bert

: Pre‑trained BERT tokenizer. Tocken: Trained on the wiki-103-raw dataset with minimum frequency 10. Unicode: Builds a vocabulary from the target table; created with create_unicode_tokenizer_and_trigger(...).

Performance Evaluation

Benchmarks on the BEIR suite measured queries per second (QPS) and NDCG@10. Initial results showed 3–5× higher QPS than Elasticsearch. After aligning stop‑word lists and stemming, the QPS advantage reduced to ~40 % (e.g., 112 QPS vs 49 QPS on Top‑1000 queries, a 2.26× speed increase) while maintaining comparable NDCG@10 scores on most datasets.

Installation & Quick Start

Repository: https://github.com/tensorchord/VectorChord-bm25

Typical steps:

Build and install the extension in PostgreSQL.

Create a table with a bm25vector column.

Populate the column using tokenize(text, 'TokenizerName').

Create a BM25 index: CREATE INDEX … USING bm25 (column bm25_ops); Run queries with

column <&> to_bm25query(index_name, query_text, 'TokenizerName')

.

Future Work

Decouple tokenization into a standalone, extensible component.

Add multilingual support and customizable tokenizers.

Introduce synonym handling and further performance optimizations.

References

https://github.com/tensorchord/VectorChord-bm25

https://deepwiki.com/tensorchord/pg_tokenizer.rs

RankingBM25PostgreSQLFull-text searchDatabase Extension
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.