How to Build Hybrid Vector and Full‑Text Search with PHPVector in PHP 8.2
This guide introduces PHPVector, a pure‑PHP vector database that combines HNSW‑based approximate nearest‑neighbor search with BM25 full‑text ranking, showing installation, document insertion, vector and text queries, hybrid ranking modes, configuration options, distance metrics, tuning tips, and persistence mechanisms.
Overview
PHPVector is a pure‑PHP vector database that implements Hierarchical Navigable Small World (HNSW) for approximate nearest‑neighbor search and BM25 for full‑text ranking. The two engines can be combined into a hybrid search pipeline.
Requirements
PHP 8.2 or newer
No external PHP extensions required for core functionality
Optional ext‑pcntl to enable asynchronous document writes and reduce insertion latency
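A quick way to confirm these requirements on a given machine is a small check script. The sketch below uses only PHP built-ins (`PHP_VERSION`, `version_compare()`, `extension_loaded()`). The wording of the messages is illustrative, and the claim that writes stay synchronous without ext-pcntl is an inference from the requirement above, not confirmed library behavior.

```php
<?php
declare(strict_types=1);

// Both checks rely on PHP built-ins only; nothing here touches PHPVector itself.
$phpOk    = version_compare(PHP_VERSION, '8.2.0', '>=');
$hasPcntl = extension_loaded('pcntl');

printf("PHP %s: %s\n", PHP_VERSION, $phpOk ? 'supported' : 'too old, 8.2 or newer required');
printf(
    "ext-pcntl: %s\n",
    // Assumption: without pcntl the library presumably falls back to synchronous writes.
    $hasPcntl ? 'loaded (asynchronous document writes possible)' : 'not loaded (writes presumably synchronous)'
);
```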
Installation
composer require ezimuel/phpvector
Quick Start
1. Insert Documents
A Document holds a dense embedding vector, optional raw text (used by BM25), and arbitrary metadata. The id field is optional; if omitted, a random UUID v4 is generated.
use PHPVector\Document;
use PHPVector\VectorDatabase;
$db = new VectorDatabase();
$db->addDocuments([
    new Document(
        id: 1,
        vector: [0.12, 0.85, 0.44, 0.67],
        text: 'PHP vector database with HNSW index',
        metadata: ['url' => 'https://example.com/1', 'lang' => 'en']
    ),
    new Document(
        id: 2,
        vector: [0.91, 0.23, 0.78, 0.05],
        text: 'Approximate nearest neighbour search in PHP',
        metadata: ['url' => 'https://example.com/2', 'lang' => 'en']
    ),
    new Document(
        id: 3,
        vector: [0.33, 0.61, 0.19, 0.88],
        text: 'BM25 full-text ranking algorithm explained',
        metadata: ['url' => 'https://example.com/3', 'lang' => 'en']
    ),
    // No id → UUID v4 is generated automatically
    new Document(
        vector: [0.55, 0.42, 0.71, 0.30],
        text: 'Hybrid search with Reciprocal Rank Fusion'
    ),
]);
2. Vector Search
Find the k most similar documents to a query vector using the HNSW index.
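For intuition about what the approximate index is estimating, an exact search can be sketched in plain PHP: score every stored vector against the query and keep the top k. The function names `cosineSimilarity` and `bruteForceSearch` are illustrative and not part of PHPVector. The sketch assumes cosine similarity as the score, so the library's reported scores may be scaled differently.

```php
<?php
declare(strict_types=1);

/** Cosine similarity of two equal-length, non-zero vectors. */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Exact k-nearest-neighbour search: compares the query with every vector.
 *
 * @param array<int|string, float[]> $vectors id => vector
 * @return array<int|string, float>  id => similarity, best first
 */
function bruteForceSearch(array $vectors, array $query, int $k): array
{
    $scores = [];
    foreach ($vectors as $id => $vector) {
        $scores[$id] = cosineSimilarity($query, $vector);
    }
    arsort($scores);                          // highest similarity first, keys kept
    return array_slice($scores, 0, $k, true); // true = preserve the document ids
}

// The first three vectors from the insertion example above.
$vectors = [
    1 => [0.12, 0.85, 0.44, 0.67],
    2 => [0.91, 0.23, 0.78, 0.05],
    3 => [0.33, 0.61, 0.19, 0.88],
];
$top = bruteForceSearch($vectors, [0.10, 0.80, 0.50, 0.60], 2);
foreach ($top as $id => $score) {
    printf("id=%d similarity=%.4f\n", $id, $score);
}
```

On this data the exact search ranks document 1 first and document 3 second. A baseline like this costs O(n) per query, which is the cost HNSW avoids, but it is useful for checking the recall of approximate results on a small sample. The PHPVector call for the same query follows.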
$queryVector = [0.10, 0.80, 0.50, 0.60];
$results = $db->vectorSearch(vector: $queryVector, k: 2);
foreach ($results as $result) {
    printf("[%d] score=%.4f %s\n", $result->rank, $result->score, $result->document->metadata['url']);
}
// Example output:
// [1] score=0.9987 https://example.com/1
// [2] score=0.8341 https://example.com/3
3. Full‑Text Search
Rank documents by BM25 relevance to a textual query.
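To make the ranking less of a black box, here is a minimal sketch of BM25 scoring in plain PHP. It assumes a common variant of the formula (IDF computed as ln(1 + (N - n + 0.5) / (n + 0.5))) and a deliberately naive tokenizer with no stop-word removal. The functions `naiveTokenize` and `bm25Scores` are hypothetical names, and PHPVector's own tokenizer and exact formula may differ, so absolute scores will not match the library's.

```php
<?php
declare(strict_types=1);

/** Lowercase the text and split on anything that is not a-z or 0-9. */
function naiveTokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

/**
 * @param array<int|string, string> $docs id => text
 * @return array<int|string, float> id => BM25 score, best first (non-matching docs omitted)
 */
function bm25Scores(array $docs, string $query, float $k1 = 1.5, float $b = 0.75): array
{
    $tokens    = array_map('naiveTokenize', $docs); // keys (ids) are preserved
    $docCount  = count($docs);
    $avgLength = array_sum(array_map('count', $tokens)) / $docCount;

    // Document frequency: number of documents containing each term.
    $df = [];
    foreach ($tokens as $docTokens) {
        foreach (array_unique($docTokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $queryTerms = array_unique(naiveTokenize($query));
    $scores = [];
    foreach ($tokens as $id => $docTokens) {
        $tf = array_count_values($docTokens);
        // b controls how strongly long documents are penalised.
        $lengthNorm = 1 - $b + $b * count($docTokens) / $avgLength;
        $score = 0.0;
        foreach ($queryTerms as $term) {
            if (!isset($tf[$term])) {
                continue;
            }
            $idf = log(1 + ($docCount - $df[$term] + 0.5) / ($df[$term] + 0.5));
            // k1 controls how quickly repeated terms stop adding to the score.
            $score += $idf * $tf[$term] * ($k1 + 1) / ($tf[$term] + $k1 * $lengthNorm);
        }
        if ($score > 0) {
            $scores[$id] = $score;
        }
    }
    arsort($scores);
    return $scores;
}

$scores = bm25Scores([
    1 => 'PHP vector database with HNSW index',
    2 => 'Approximate nearest neighbour search in PHP',
    3 => 'BM25 full-text ranking algorithm explained',
], 'nearest neighbour PHP');
print_r($scores);
```

Document 2 matches all three query terms and ranks first; document 1 matches only "php", which is a common term in this tiny corpus and therefore carries a low IDF; document 3 matches nothing and is omitted.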
$results = $db->textSearch(query: 'nearest neighbour PHP', k: 2);
foreach ($results as $result) {
    printf("[%d] score=%.4f %s\n", $result->rank, $result->score, $result->document->metadata['url']);
}
// Example output:
// [1] score=1.2430 https://example.com/2
// [2] score=0.8761 https://example.com/1
4. Hybrid Search
Combine vector similarity scores and BM25 scores into a single ranked list.
Reciprocal Rank Fusion (RRF)
RRF is rank‑based, independent of score scales, and requires no parameter tuning.
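The idea behind RRF fits in a few lines: each document receives 1 / (c + rank) from every ranking it appears in, and the contributions are summed. The sketch below is a plain-PHP illustration, not PHPVector's implementation. The function name `reciprocalRankFusion` is hypothetical, and the constant 60 is the value conventionally used for RRF; whether the library uses the same constant is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * @param array<int, array<int, int|string>> $rankings each list holds document ids, best first
 * @return array<int|string, float>          id => fused score, best first
 */
function reciprocalRankFusion(array $rankings, int $constant = 60): array
{
    $fused = [];
    foreach ($rankings as $ranking) {
        foreach (array_values($ranking) as $position => $id) {
            $rank = $position + 1; // ranks are 1-based
            $fused[$id] = ($fused[$id] ?? 0.0) + 1.0 / ($constant + $rank);
        }
    }
    arsort($fused);
    return $fused;
}

// Hypothetical rankings: ids ordered by vector similarity and by BM25 relevance.
$vectorRanking = [1, 3, 2];
$textRanking   = [2, 1];

$fused = reciprocalRankFusion([$vectorRanking, $textRanking]);
foreach ($fused as $id => $score) {
    printf("id=%d rrf=%.5f\n", $id, $score);
}
```

Only positions matter here, so a cosine similarity near 1 and a BM25 score of 12 can be fused without any rescaling. Document 1 wins because it ranks high in both lists, while document 3 appears in only one list and ends last. The library call for RRF follows.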
use PHPVector\HybridMode;
$results = $db->hybridSearch(
    vector: $queryVector,
    text: 'vector database PHP',
    k: 3,
    mode: HybridMode::RRF,
);
foreach ($results as $result) {
    printf("[%d] score=%.4f %s\n", $result->rank, $result->score, $result->document->metadata['url']);
}
Weighted Combination
Normalize both scores to the [0, 1] range and apply explicit weights.
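As a rough illustration of this mode, the sketch below rescales each score list with min-max normalization and then takes a weighted sum, treating a document missing from one list as scoring 0 there. Min-max is an assumption made for the example: the normalization PHPVector actually applies is not specified here, and `minMaxNormalize` and `weightedFusion` are hypothetical names.

```php
<?php
declare(strict_types=1);

/** Rescale scores to [0, 1]; if all scores are equal, every hit maps to 1.0. */
function minMaxNormalize(array $scores): array
{
    if ($scores === []) {
        return [];
    }
    $min   = min($scores);
    $range = max($scores) - $min;
    $normalized = [];
    foreach ($scores as $id => $score) {
        $normalized[$id] = $range > 0 ? ($score - $min) / $range : 1.0;
    }
    return $normalized;
}

/**
 * @param array<int|string, float> $vectorScores id => vector similarity
 * @param array<int|string, float> $textScores   id => BM25 score
 * @return array<int|string, float>              id => combined score, best first
 */
function weightedFusion(
    array $vectorScores,
    array $textScores,
    float $vectorWeight = 0.7,
    float $textWeight = 0.3
): array {
    $v = minMaxNormalize($vectorScores);
    $t = minMaxNormalize($textScores);
    $combined = [];
    foreach (array_keys($v + $t) as $id) { // union of ids from both lists
        $combined[$id] = $vectorWeight * ($v[$id] ?? 0.0) + $textWeight * ($t[$id] ?? 0.0);
    }
    arsort($combined);
    return $combined;
}

// Hypothetical raw scores on very different scales.
$combined = weightedFusion(
    [1 => 0.99, 3 => 0.90, 2 => 0.50],
    [2 => 2.4, 1 => 0.5]
);
foreach ($combined as $id => $score) {
    printf("id=%d combined=%.4f\n", $id, $score);
}
```

Unlike RRF, this approach keeps the size of the score gaps, so one very strong match can dominate. The trade-off is that min-max maps the weakest hit in each list to 0, which makes the result sensitive to how many candidates each engine returns. The library call for weighted mode follows.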
$results = $db->hybridSearch(
    vector: $queryVector,
    text: 'vector database PHP',
    k: 3,
    mode: HybridMode::Weighted,
    vectorWeight: 0.7,
    textWeight: 0.3,
);
Configuration
Both the HNSW and BM25 engines are configured through objects passed to the VectorDatabase constructor.
use PHPVector\BM25\Config as BM25Config;
use PHPVector\BM25\SimpleTokenizer;
use PHPVector\Distance;
use PHPVector\HNSW\Config as HNSWConfig;
use PHPVector\VectorDatabase;
$db = new VectorDatabase(
    hnswConfig: new HNSWConfig(
        M: 16, // max connections per node per layer (higher → better recall, more memory)
        efConstruction: 200, // construction beam width (higher → higher graph quality, slower inserts)
        efSearch: 50, // search beam width (higher → better recall, slower queries)
        distance: Distance::Cosine, // Cosine | Euclidean | DotProduct | Manhattan
        useHeuristic: true // diversified neighbor selection (recommended)
    ),
    bm25Config: new BM25Config(
        k1: 1.5, // term‑frequency saturation (recommended 1.2–2.0)
        b: 0.75 // length normalization (0 = none, 1 = full)
    ),
    tokenizer: new SimpleTokenizer(
        stopWords: SimpleTokenizer::DEFAULT_STOP_WORDS,
        minTokenLength: 2,
    ),
);
Distance Metrics
Distance::Cosine – best for text embeddings and normalized vectors
Distance::Euclidean – suitable for raw, unnormalized vectors
Distance::DotProduct – ranks identically to cosine on unit‑normalized vectors and is cheaper to compute because it skips the norm calculation
Distance::Manhattan – robust to outliers, ideal for sparse vectors
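The four metrics can be written out in plain PHP to see how they differ. The functions below are illustrative helpers, not PHPVector's internals, and how the library converts each metric into a reported score is not shown here.

```php
<?php
declare(strict_types=1);

function dotProduct(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += $x * $b[$i];
    }
    return $sum;
}

/** 0 for vectors pointing the same way, 1 for orthogonal ones; vector length is ignored. */
function cosineDistance(array $a, array $b): float
{
    return 1.0 - dotProduct($a, $b) / (sqrt(dotProduct($a, $a)) * sqrt(dotProduct($b, $b)));
}

/** Straight-line (L2) distance; sensitive to vector length. */
function euclideanDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += ($x - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

/** Sum of absolute coordinate differences (L1). */
function manhattanDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += abs($x - $b[$i]);
    }
    return $sum;
}

$short = [1.0, 0.0];
$long  = [2.0, 0.0]; // same direction as $short, twice the length
$other = [0.0, 1.0]; // orthogonal to $short

printf("cosine:    same direction=%.2f orthogonal=%.2f\n", cosineDistance($short, $long), cosineDistance($short, $other));
printf("euclidean: same direction=%.2f orthogonal=%.2f\n", euclideanDistance($short, $long), euclideanDistance($short, $other));
printf("manhattan: same direction=%.2f orthogonal=%.2f\n", manhattanDistance($short, $long), manhattanDistance($short, $other));
printf("dot:       same direction=%.2f orthogonal=%.2f\n", dotProduct($short, $long), dotProduct($short, $other));
```

Cosine treats `$short` and `$long` as identical because only direction counts, while Euclidean and Manhattan see them as one unit apart. When every vector has length 1, the dot product equals the cosine similarity, which is why it can stand in for cosine on unit-normalized embeddings.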
HNSW Tuning Cheat Sheet
Increase recall – raise efSearch or efConstruction
Speed up queries – lower efSearch
Reduce memory usage – lower M
Better graph for clustered data – keep useHeuristic: true
Persistence
PHPVector stores each database in a dedicated directory containing the HNSW graph, BM25 index, and per‑document files. This layout provides low memory usage on load and low insertion latency.
Folder Structure
/var/data/mydb/
    meta.json // distance metric, dimension, document‑ID map
    hnsw.bin // HNSW graph (vectors + connections)
    bm25.bin // BM25 inverted index
    docs/
        0.bin // document 0 (id, text, metadata)
        1.bin // document 1
        …
Saving
Pass a path argument to the constructor to enable persistence. Each call to addDocument() writes a document file (asynchronously if ext‑pcntl is available). Call save() once to flush the HNSW graph and BM25 index to disk.
use PHPVector\Document;
use PHPVector\VectorDatabase;
$db = new VectorDatabase(path: '/var/data/mydb');
$db->addDocuments([
    new Document(id: 1, vector: [0.12, 0.85, 0.44], text: 'PHP vector search', metadata: ['source' => 'blog']),
    new Document(id: 2, vector: [0.91, 0.23, 0.78], text: 'Approximate nearest neighbour'),
    // ... insert thousands of documents
]);
$db->save();
Loading
Use VectorDatabase::open() to load a previously saved database.
$db = VectorDatabase::open('/var/data/mydb');