Vector Database Basics: Embeddings, Similarity Search, and Index Structures
This article explains how embeddings turn text into high‑dimensional vectors, compares commercial and open‑source embedding models, details cosine, Euclidean and inner‑product similarity metrics, reviews common index structures such as Flat, IVF, HNSW and PQ, and shows how to choose and use a vector database with LangChain.js while avoiding typical pitfalls.
01 What Is an Embedding: The Bridge from Text to Vectors
An embedding maps human language into a high‑dimensional mathematical space so that semantically similar texts end up close together. Each piece of text is converted by an embedding model into a fixed‑length floating‑point array; for example, OpenAI’s text-embedding-3-small produces a 1536‑dimensional vector.
In LangChain.js the process is only a few lines of code:
import { OpenAIEmbeddings } from "@langchain/openai";
const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
  dimensions: 1536 // optional, controls output dimension
});
// Single‑text embedding
const vector = await embeddings.embedQuery("LangChain is an AI application framework");
console.log(vector.length); // 1536
console.log(vector.slice(0, 5)); // [0.0123, -0.0456, 0.0789, ...]
// Batch embedding for storage
const vectors = await embeddings.embedDocuments([
  "LangChain is an AI application framework",
  "Vector databases store and retrieve vectors",
  "RAG stands for Retrieval‑Augmented Generation"
]);
console.log(vectors.length); // 3
Key distinction: embedQuery is for a single query vector, while embedDocuments is for bulk document embedding. Mixing them can degrade retrieval quality because some models treat queries and documents differently.
02 Embedding Model Selection: Balancing Cost, Quality, and Speed
There is no universally "best" embedding model; the optimal choice depends on language, budget, and performance requirements.
Commercial models (OpenAI) text-embedding-3-small: 1536 dimensions (adjustable), 8191 max tokens, best price‑performance for most cases. text-embedding-3-large: 3072 dimensions, double the cost, higher accuracy. text-embedding-ada-002: older 1536‑dim model, not recommended for new projects.
Open‑source models BGE-large-zh-v1.5: 1024 dimensions, strongest Chinese performance. BGE-m3: 1024 dimensions, multilingual, multi‑granularity. E5-large-v2: 1024 dimensions, excellent English results. GTE-large: 1024 dimensions, from the Tongyi Qianwen team.
Selection flow (simplified):
if (data is primarily Chinese) {
  if (budget is sufficient) {
    use text-embedding-3-small; // simple and cheap
  } else {
    deploy BGE-large-zh-v1.5 or BGE-m3 locally;
  }
} else {
  if (multilingual) {
    use BGE-m3 or text-embedding-3-small;
  } else {
    use E5-large-v2 or text-embedding-3-small;
  }
}
In LangChain.js you can load an open‑source model via HuggingFace or a local Ollama server:
import { OllamaEmbeddings } from "@langchain/ollama";
const embeddings = new OllamaEmbeddings({
  model: "bge-large-zh-v1.5",
  baseUrl: "http://localhost:11434"
});
const vector = await embeddings.embedQuery("Vector database principles");
console.log(vector.length); // 1024
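As mentioned above, HuggingFace is the other route for open‑source models. A minimal sketch using the hosted Inference API wrapper (the option names and model id are assumptions, not from the original article; check the @langchain/community version you have installed):
import { HuggingFaceInferenceEmbeddings } from "@langchain/community/embeddings/hf";
// Hosted HuggingFace Inference API; BAAI/bge-m3 is one multilingual choice.
// apiKey and model here are illustrative values.
const hfEmbeddings = new HuggingFaceInferenceEmbeddings({
  apiKey: process.env.HUGGINGFACEHUB_API_KEY,
  model: "BAAI/bge-m3"
});
const hfVector = await hfEmbeddings.embedQuery("Vector database principles");
console.log(hfVector.length); // 1024 for bge-m3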
03 Similarity Computation: Three Distance Metrics
After converting text to vectors, finding the most relevant content becomes a nearest‑neighbor search. The choice of distance metric dramatically affects results.
Cosine Similarity: measures the angle between vectors and ignores magnitude. Value range [-1, 1]; 1 means identical direction. Most embeddings are already L2‑normalized, so cosine similarity equals inner product.
Euclidean Distance (L2): straight‑line distance; smaller values indicate higher similarity. Sensitive to vector length, so un‑normalized embeddings can give misleading results.
Inner Product (Dot Product): sum of element‑wise products. When vectors are normalized, inner product equals cosine similarity and is the fastest to compute.
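Because the inner‑product/cosine equivalence only holds for unit‑length vectors, it is worth normalizing explicitly when you are unsure about your model's output. A minimal helper, not from the original article:
// L2-normalize a vector so that inner product equals cosine similarity.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v : v.map((x) => x / norm);
}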
Choosing the metric:
if (embedding output is normalized) {
  // Use inner product for speed; equivalent to cosine
  metric = "IP";
} else if (uncertain) {
  // Safe default
  metric = "COSINE";
} else {
  // Non-normalized vectors – either normalize first or use cosine
  metric = "COSINE";
}
Example implementations in TypeScript:
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}
function innerProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}
04 Vector Index Structures: From Brute‑Force to Millisecond Retrieval
Scanning every vector (Flat / Brute‑Force) becomes infeasible at scale—searching a million 1536‑dim vectors can take seconds. Index structures accelerate retrieval.
Flat (Brute‑Force): 100% recall, O(n) time, suitable for < 100k vectors or when exact results are required.
IVF (Inverted File Index): clusters vectors with K‑Means and searches only the nearest clusters (parameter nprobe). Approximate recall, O(K + nprobe × n/K) time, works for millions of vectors.
HNSW (Hierarchical Navigable Small World): multi‑layer graph; the top layer quickly narrows the region, lower layers refine the search. Typically >95% recall, O(log n) time, memory‑heavy, best for 10k–50M vectors.
PQ (Product Quantization): compresses high‑dim vectors into low‑precision codes (e.g., 8 bytes per vector), drastically reducing storage and speeding up distance calculations at the cost of accuracy. Ideal for >10M vectors when exact recall is not critical.
Overall comparison (simplified):
Index | Accuracy | Speed | Memory | Suitable Data Size
-------|----------|---------|--------|-------------------
Flat | ★★★★★ | ★☆☆☆☆ | ★★★☆☆ | < 100k
IVF | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | 100k‑5M
HNSW | ★★★★★ | ★★★★★ | ★★☆☆☆ | 100k‑50M
PQ | ★★★☆☆ | ★★★★☆ | ★★★★★ | >10M
IVF+PQ| ★★★★☆ | ★★★★☆ | ★★★★☆ | >10M
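To make the Flat baseline concrete, here is a brute‑force top‑K search that reuses the cosineSimilarity function from section 03. This is a sketch of the O(n) behavior the other indexes exist to avoid, not a production implementation:
// Score every stored vector against the query, sort, take the best K.
function flatSearch(
  query: number[],
  vectors: number[][],
  k: number
): { index: number; score: number }[] {
  return vectors
    .map((v, index) => ({ index, score: cosineSimilarity(query, v) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}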
05 Vector Database Selection: One‑Page Decision Matrix
A vector database combines an index engine, metadata storage, filtering, CRUD, and optional distributed capabilities. Pure index libraries like FAISS are insufficient for production.
Milvus: open‑source, self‑hosted, distributed, strong metadata filtering, supports billions of vectors, medium learning curve.
Pinecone: fully managed SaaS, extremely low entry barrier, strong metadata filtering, supports up to a billion vectors.
Chroma: open‑source, lightweight, ideal for prototypes or small projects (< 1M vectors).
FAISS: pure index library, best for offline experiments or when you need full control.
Weaviate: open‑source, supports hybrid (vector + keyword) search, good for production at moderate scale.
Decision flow (textual):
if (project is prototype or < 1M vectors) {
  use Chroma;
} else if (no ops team, want a managed service) {
  use Pinecone;
} else if (large-scale production, want full control) {
  use Milvus;
} else if (offline experiment only) {
  use FAISS;
} else if (need hybrid search) {
  use Weaviate or Milvus;
}
06 Practical Vector Store with LangChain.js
LangChain.js provides a unified VectorStore interface; swapping the underlying database only requires changing the initialization code.
Using Chroma for a Quick Start
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { Document } from "@langchain/core/documents";
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const docs = [
  new Document({ pageContent: "LangChain is a framework for building AI applications", metadata: { source: "docs", category: "framework" } }),
  new Document({ pageContent: "Vector databases store and retrieve high‑dimensional vectors", metadata: { source: "docs", category: "database" } }),
  new Document({ pageContent: "RAG enhances LLM answers by retrieving external knowledge", metadata: { source: "blog", category: "rag" } }),
  new Document({ pageContent: "Embedding models convert text into dense vector representations", metadata: { source: "docs", category: "embedding" } })
];
const vectorStore = await Chroma.fromDocuments(docs, embeddings, {
  collectionName: "my-collection",
  url: "http://localhost:8000"
});
// Similarity search (top-2)
const results = await vectorStore.similaritySearch("What is a vector database?", 2);
console.log(results);
Metadata‑Filtered Search
// Search only documents where source = "docs"
const filtered = await vectorStore.similaritySearch(
  "What is a vector database?",
  2,
  { source: "docs" } // metadata filter
);
// Search with similarity scores
const withScores = await vectorStore.similaritySearchWithScore(
  "What is a vector database?",
  3
);
for (const [doc, score] of withScores) {
  console.log(`[${score.toFixed(4)}] ${doc.pageContent}`);
}
Using FAISS for Local Experiments
import { OpenAIEmbeddings } from "@langchain/openai";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { Document } from "@langchain/core/documents";
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const docs = [
  new Document({ pageContent: "TypeScript is a superset of JavaScript" }),
  new Document({ pageContent: "Python is a dynamically typed language" }),
  new Document({ pageContent: "Rust is known for memory safety" })
];
const vectorStore = await FaissStore.fromDocuments(docs, embeddings);
const results = await vectorStore.similaritySearch("static typed language", 2);
console.log(results[0].pageContent); // TypeScript is a superset of JavaScript
await vectorStore.save("./faiss-index");
const loaded = await FaissStore.load("./faiss-index", embeddings);
Connecting Milvus for Production
import { OpenAIEmbeddings } from "@langchain/openai";
import { Milvus } from "@langchain/community/vectorstores/milvus";
import { Document } from "@langchain/core/documents";
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const docs = [
  new Document({ pageContent: "HNSW is the most popular vector index algorithm", metadata: { topic: "index", level: "advanced" } }),
  new Document({ pageContent: "IVF reduces search range by clustering", metadata: { topic: "index", level: "intermediate" } })
];
const vectorStore = await Milvus.fromDocuments(docs, embeddings, {
  collectionName: "langchain_demo",
  url: "http://localhost:19530",
  indexCreateParams: {
    index_type: "HNSW",
    metric_type: "IP", // inner product, vectors are normalized
    params: JSON.stringify({ M: 16, efConstruction: 256 })
  },
  searchParams: { ef: 128 } // search-time accuracy parameter
});
const results = await vectorStore.similaritySearch("What are the pros and cons of HNSW?", 2);
console.log(results);
07 Full RAG Retrieval Pipeline
The end‑to‑end RAG flow consists of offline indexing and online querying. Each stage can introduce failures: poor splitting, wrong embedding model, unsuitable index, or an ill‑chosen top‑K value.
Offline stage:
Raw documents → TextSplitter → chunks → embedDocuments → vectors → write to VectorStore (index + metadata)
Online stage:
User query → embedQuery → ANN search → top‑K chunks → combine into prompt → LLM generates answer
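Putting both stages together, here is a hedged end‑to‑end sketch using LangChain.js with the in‑memory vector store for brevity (package paths, model names, and chunk sizes are assumptions; swap in Chroma or Milvus as shown in section 06):
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
// Offline stage: split raw text into chunks and index them.
const rawText = "LangChain is an AI application framework. Vector databases store embeddings...";
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 500, chunkOverlap: 50 });
const chunks = await splitter.createDocuments([rawText]);
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const store = await MemoryVectorStore.fromDocuments(chunks, embeddings);
// Online stage: embed the query, retrieve top-K chunks, build a prompt, generate.
const question = "What is a vector database?";
const topK = await store.similaritySearch(question, 3);
const context = topK.map((d) => d.pageContent).join("\n---\n");
const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const answer = await llm.invoke(
  `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`
);
console.log(answer.content);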
08 Common Pitfalls and Remedies
Pitfall 1 – Dimension Mismatch
Switching embedding models without re‑embedding existing data leads to errors (e.g., stored vectors are 1536‑dim while queries are 1024‑dim). The fix is to re‑embed all documents and rebuild the index.
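A cheap guard that fails fast instead of producing confusing search errors (a sketch; the expected dimension would come from your collection's schema):
const EXPECTED_DIM = 1536; // the dimension the collection was built with
const queryVector = await embeddings.embedQuery("some query");
if (queryVector.length !== EXPECTED_DIM) {
  throw new Error(
    `Dimension mismatch: got ${queryVector.length}, expected ${EXPECTED_DIM}. ` +
    "Re-embed all documents and rebuild the index after switching models."
  );
}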
Pitfall 2 – Normalization Issues
Using Euclidean distance on non‑normalized vectors biases results toward longer texts. Either normalize vectors manually or use cosine similarity (which is robust to length).
Pitfall 3 – Stale IVF Index
Continuously inserting new data without rebuilding IVF centroids degrades recall. Periodically run compact or createIndex in Milvus to refresh clusters.
Pitfall 4 – Improper Top‑K
Too small a top‑K (e.g., 1) may miss essential context; too large (e.g., 20) floods the LLM with noise and wastes tokens. A common practice is to retrieve 10 candidates, filter by a similarity threshold (e.g., > 0.7), then pass the top 3‑5 to the LLM.
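That retrieve‑then‑filter pattern can be written with similaritySearchWithScore. A sketch, assuming higher scores mean more similar; some backends return distances where lower is better, so check your store's score semantics:
// Retrieve a generous candidate set, keep confident hits, cap at 5.
const candidates = await vectorStore.similaritySearchWithScore("user question", 10);
const SCORE_THRESHOLD = 0.7;
const contextDocs = candidates
  .filter(([, score]) => score > SCORE_THRESHOLD)
  .slice(0, 5)
  .map(([doc]) => doc);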
Pitfall 5 – Missing Metadata Filtering
Searching the entire vector store can return irrelevant domains (e.g., internal reports when the user asks a product question). Apply metadata filters first to narrow the candidate set.
Summary
Embedding is the foundation of RAG: it converts text into vectors that enable mathematical similarity measurement.
Model choice depends on scenario: OpenAI models are easy and cost‑effective; open‑source BGE/E5 are better for local deployment or multilingual needs.
Distance metric matters: use inner product for normalized vectors, otherwise fall back to cosine similarity.
Index structure determines speed: Flat for tiny datasets, HNSW for millions, IVF+PQ for billions.
Vector database ≠ vector index: production systems need metadata filtering, persistence, CRUD, and optional distribution.
LangChain.js abstracts the VectorStore: swapping databases requires only changing the initialization code.
Next up: a hands‑on Milvus tutorial covering installation, million‑scale indexing, and performance tuning.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines the core theories and practices of agents, and “Claude Code Design Philosophy,” which analyzes the design thinking behind top AI tools, to help you build a solid foundation in the AI era.