Boost PHP AI Performance with MemVector: In‑Process Vector Search & Reranking

MemVector is a high‑performance PHP extension that brings native AI capabilities—embedding, vector similarity search, and cross‑encoder reranking—directly into the PHP process, eliminating external services and delivering sub‑10 ms latency for full RAG pipelines on commodity CPUs.


Overview

MemVector is a high‑performance PHP extension that provides native AI infrastructure for PHP. It implements vector similarity search, text embedding, and cross‑encoder reranking directly inside the PHP process using C++17 and AVX2 SIMD acceleration. No external vector database, embedding API, or network round‑trip is required; typical operations finish within 10 ms.

Quick Start

Local GGUF embedding model (requires --with-llama):

// Using local embedding (requires --with-llama)
$emb = new MemVectorEmbedding('/models/all-MiniLM-L6-v2.Q8_0.gguf');
$store = new MemVectorStore(null, ['dimensions' => $emb->dimensions()]);

$store->set('php', $emb->embed('PHP is a server‑side scripting language'), 'lang');
$store->set('python', $emb->embed('Python is used for machine learning'), 'lang');
$store->set('gravity', $emb->embed('Gravity pulls objects toward the earth'), 'science');
$store->set('dna', $emb->embed('DNA encodes genetic information'), 'science');

$results = $store->search($emb->embed('programming languages'), 2);
// $results => [['key' => 'php', 'score' => 0.82, 'metadata' => 'lang'],
//               ['key' => 'python', 'score' => 0.79, 'metadata' => 'lang']]

Two‑stage retrieval (vector search + cross‑encoder reranking):

// Two‑stage retrieval (requires --with-llama)
$rr = new MemVectorReranker('/models/bge-reranker-v2-m3-Q8_0.gguf');
$candidates = $store->search($emb->embed('programming languages'), 50);
$results = $rr->rerank('programming languages', $candidates, 5);

Custom vectors without llama.cpp:

$store = new MemVectorStore('/data/vectors', ['dimensions' => 1536]);
$store->set('doc_1', $openai_embedding, '{"title": "Introduction"}');
$results = $store->search($query_embedding, 10);
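For instance, vectors produced by a remote embedding service can be stored directly. A hedged sketch using PHP's cURL extension against the OpenAI embeddings endpoint (the endpoint URL, payload shape, and `fetchOpenAiEmbedding` helper are illustrative assumptions, not part of MemVector):

```php
<?php
// Fetch an embedding from the OpenAI API, then store it in MemVector.
// Assumes OPENAI_API_KEY is set in the environment.
function fetchOpenAiEmbedding(string $text): array {
    $ch = curl_init('https://api.openai.com/v1/embeddings');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            'Authorization: Bearer ' . getenv('OPENAI_API_KEY'),
        ],
        CURLOPT_POSTFIELDS     => json_encode([
            'model' => 'text-embedding-3-small',   // produces 1536-dim vectors
            'input' => $text,
        ]),
    ]);
    $response = json_decode(curl_exec($ch), true);
    curl_close($ch);
    return $response['data'][0]['embedding'];
}

$store = new MemVectorStore('/data/vectors', ['dimensions' => 1536]);
$store->set('doc_1', fetchOpenAiEmbedding('Introduction text'), '{"title": "Introduction"}');
```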

Features

Key‑Value API : set(key, vector), get(key), delete(key), batchSet() (upsert supported)

Text Embedding : optional llama.cpp integration for local GGUF models

Cross‑Encoder Reranking : fast vector search followed by high‑precision reranking

Storage Modes : in‑memory, disk (mmap‑backed), shared memory (cross‑process)

HNSW Index : built on‑the‑fly for approximate nearest‑neighbor search

Vector Quantization : F16, Int8, binary, product quantization (PQ)

Distance Metrics : cosine, dot product, Euclidean, Manhattan

Lock‑Free Concurrency : uses std::atomic for thread‑safe reads/writes

AVX2 SIMD : accelerates cosine and dot‑product calculations

Multi‑Segment Architecture : storage grows automatically by adding fixed‑capacity segments

JSONL Import/Export : dump() and load() for backup and migration

Memory Limits : caps total mmap usage across segments
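To make the distance metrics concrete, here is a plain‑PHP reference implementation of cosine similarity, the same quantity MemVector's AVX2 kernels compute natively; it can be used to sanity‑check scores returned by search(). The function itself is illustrative and not part of the extension:

```php
<?php
// Reference implementation of cosine similarity.
// Returns a value in [-1, 1]; 1.0 means identical direction.
function cosineSimilarity(array $a, array $b): float {
    $dot = 0.0; $normA = 0.0; $normB = 0.0;
    foreach ($a as $i => $x) {
        $dot   += $x * $b[$i];
        $normA += $x * $x;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

var_dump(cosineSimilarity([1.0, 0.0], [1.0, 0.0])); // float(1) — identical vectors
var_dump(cosineSimilarity([1.0, 0.0], [0.0, 1.0])); // float(0) — orthogonal vectors
```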

Performance

Embedding Generation

Method                                  Single‑call latency    Cost
OpenAI API (text‑embedding‑3‑small)     50‑200 ms (network)    $0.02 / 1M tokens
Cohere API (embed‑english‑v3.0)         50‑200 ms (network)    $0.10 / 1M tokens
MemVector + local GGUF model            5‑15 ms (in‑process)   Free

Based on the latencies above, local embedding is roughly 3‑40× faster than cloud APIs and incurs no per‑token charge.

Vector Search

Method                                       Query latency               Remarks
Pinecone / Qdrant / Weaviate (cloud)         10‑50 ms (network)          Managed service
Pinecone / Qdrant / Weaviate (self‑hosted)   5‑20 ms (network)           Separate process, TCP/gRPC overhead
PostgreSQL + pgvector                        5‑50 ms (query + network)   Shared DB, connection‑pool overhead
MemVector (in‑process)                       0.1‑5 ms                    No network, no serialization

Full RAG Pipeline (embed + search + rerank)

Method                                           Total latency   Network round‑trips
OpenAI embed + Pinecone search                   100‑400 ms      2
OpenAI embed + Pinecone search + Cohere rerank   200‑600 ms      3
MemVector (all in‑process)                       10‑30 ms        0
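A pipeline like the last row can be timed end‑to‑end with a simple harness. A hedged sketch (model paths are placeholders; hrtime() is standard PHP and returns nanoseconds):

```php
<?php
// Time a full in-process RAG retrieval: embed -> ANN search -> rerank.
// Requires MemVector built --with-llama; model paths are placeholders.
$emb   = new MemVectorEmbedding('/models/all-MiniLM-L6-v2.Q8_0.gguf');
$rr    = new MemVectorReranker('/models/bge-reranker-v2-m3-Q8_0.gguf');
$store = new MemVectorStore('/data/vectors', ['dimensions' => $emb->dimensions()]);

$query = 'programming languages';
$t0 = hrtime(true);

$qvec       = $emb->embed($query);                  // stage 1: embed the query
$candidates = $store->search($qvec, 50);            // stage 2: ANN candidate recall
$results    = $rr->rerank($query, $candidates, 5);  // stage 3: precision rerank

printf("pipeline: %.1f ms\n", (hrtime(true) - $t0) / 1e6);
```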

Memory Consumption

Component                                 RSS
MemVector extension (no model)            ~1 MB
+ embedding model (all‑MiniLM‑L6‑v2, 24 MB)   ~33 MB
+ rerank model (bge‑reranker‑v2‑m3, 636 MB)   ~200 MB
+ 100 k vectors (384‑dim, f32)               ~150 MB
+ 100 k vectors (384‑dim, int8 quant)       ~40 MB

Model weights are loaded via mmap() and shared across PHP‑FPM or OpenSwoole workers through the OS page cache.

Requirements

PHP 8.1+

C++17 compiler (GCC 7+ or Clang 5+)

Optional: llama.cpp for embedding and reranking

Build

phpize
./configure --enable-memvector
make
make test
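After `make`, the extension still has to be installed and enabled. A typical finish, assuming a Debian/Ubuntu-style PHP layout (the ini path varies by distribution):

```shell
# Install the compiled extension into PHP's extension directory
sudo make install

# Enable it (conf.d path is distribution-specific)
echo "extension=memvector.so" | sudo tee /etc/php/8.1/cli/conf.d/20-memvector.ini

# Verify it loaded
php -m | grep memvector
php --ri memvector
```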

Install llama.cpp (optional)

Linux – build from source:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=OFF
cmake --build build --config Release -j$(nproc)
sudo cmake --install build --prefix /usr/local
sudo ldconfig

To enable GPU acceleration, replace -DGGML_CUDA=OFF with -DGGML_CUDA=ON (requires the CUDA toolkit).

Build with llama.cpp support

phpize
./configure --enable-memvector --with-llama=/usr/local
make
make test

If llama.cpp is installed in a non‑standard location, specify --with-llama=DIR.

Model Downloads

Embedding model (recommended 24 MB, 384‑dim):

curl -L -o all-MiniLM-L6-v2-Q8_0.gguf \
    https://huggingface.co/leliuga/all-MiniLM-L6-v2-GGUF/resolve/main/all-MiniLM-L6-v2.Q8_0.gguf

Reranking model (GGUF cross‑encoder):

curl -L -o bge-reranker-v2-m3-Q8_0.gguf \
    https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf

Configuration Options

--enable-memvector : enable the extension (required)

--enable-memvector-avx2 : AVX2 SIMD (auto‑detected by default)

--with-llama[=DIR] : enable llama.cpp support for embedding and reranking

Runtime feature detection via compile‑time constants:

if (defined('MEMVECTOR_SHM')) {
    // shared‑memory mode available
    $store = new MemVectorStore('mystore', ['storage' => 'shm', 'dimensions' => 128]);
}

if (defined('MEMVECTOR_LLAMA')) {
    // llama.cpp support compiled in
    $emb = new MemVectorEmbedding('/path/to/embedding-model.gguf');
    $rr  = new MemVectorReranker('/path/to/reranker-model.gguf');
}

PHP API

Class: MemVectorStore

Constructor __construct(?string $dir = null, ?array $options = null) – creates or opens a vector store. Storage mode is chosen by the storage option: memory, disk, or shm.
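The three storage modes map onto the constructor as follows (paths and the store name are illustrative; the shm form with a name follows the feature-detection example earlier):

```php
<?php
// In-memory: fastest, contents lost when the process exits.
$mem  = new MemVectorStore(null, ['storage' => 'memory', 'dimensions' => 384]);

// Disk (mmap-backed): persists across restarts.
$disk = new MemVectorStore('/data/vectors', ['storage' => 'disk', 'dimensions' => 384]);

// Shared memory: one store visible to all PHP-FPM / OpenSwoole workers.
$shm  = new MemVectorStore('mystore', ['storage' => 'shm', 'dimensions' => 384]);
```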

Methods

set(string $key, array $vector, ?string $metadata = null): bool – insert or replace a vector (upsert semantics)

batchSet(array $batch): int – insert or replace multiple vectors; returns the number of successfully stored items

get(string $key): ?array – retrieve a stored vector and its metadata

delete(string $key): bool – remove a vector

search(array $queryVector, int $k): array – return the top‑k nearest keys with similarity scores

Key constraints: unique string up to 63 characters. Vector must match the configured dimensionality. Optional metadata is a JSON string up to 4096 bytes.
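Because violations of these constraints surface only at call time, a caller‑side guard can be useful. A minimal sketch that restates the limits above (the validateEntry function is illustrative, not part of MemVector):

```php
<?php
// Caller-side guard restating MemVector's documented entry constraints.
function validateEntry(string $key, array $vector, ?string $metadata, int $dimensions): bool {
    if ($key === '' || strlen($key) > 63) {
        return false;                       // key: 1-63 bytes
    }
    if (count($vector) !== $dimensions) {
        return false;                       // vector must match store dimensionality
    }
    if ($metadata !== null) {
        if (strlen($metadata) > 4096) {
            return false;                   // metadata capped at 4096 bytes
        }
        json_decode($metadata);
        if (json_last_error() !== JSON_ERROR_NONE) {
            return false;                   // metadata must be a JSON string
        }
    }
    return true;
}

var_dump(validateEntry('doc_1', [0.1, 0.2, 0.3], '{"title":"Intro"}', 3)); // bool(true)
var_dump(validateEntry(str_repeat('k', 64), [0.1, 0.2, 0.3], null, 3));   // bool(false)
```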

Class: MemVectorEmbedding

Provides embed(string $text): array to obtain a vector from a local GGUF model (requires --with-llama).

Class: MemVectorReranker

Provides rerank(string $query, array $candidates, int $topK): array to perform cross‑encoder reranking of candidate vectors.

Written by

Open Source Tech Hub

Sharing cutting-edge internet technologies and practical AI resources.
