Vector Database Fundamentals: Embedding, Similarity Search, and Index Structures Explained in One Go

This article walks through the complete workflow of turning split text into high‑dimensional vectors, choosing the right embedding model, selecting an appropriate similarity metric, comparing index structures such as Flat, IVF, HNSW and PQ, and finally picking a vector database and integrating it with LangChain.js for production‑grade RAG pipelines.

James' Growth Diary

01 Embedding: The Bridge from Text to Vectors

Embedding converts a piece of text into a fixed‑length dense vector so that semantically similar sentences end up close together in the vector space. For example, the OpenAI text-embedding-3-small model outputs a 1536‑dimensional vector for the sentence "LangChain 是一个 AI 应用开发框架" ("LangChain is an AI application development framework").

"LangChain 是一个 AI 应用开发框架"
    ↓ Embedding Model
[0.0123, -0.0456, 0.0789, ..., 0.0321] // 1536 floats

In LangChain.js, an embeddings class exposes two methods: embedQuery generates a single vector for a query, and embedDocuments generates a batch of vectors for documents to be stored.

Mixing them up can degrade retrieval quality because some models encode queries and documents differently.

import { OpenAIEmbeddings } from "@langchain/openai";
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small", dimensions: 1536 });
// Query embedding
const qVec = await embeddings.embedQuery("LangChain 是一个 AI 应用开发框架");
// Batch embedding for storage
const docVecs = await embeddings.embedDocuments([
  "LangChain 是一个 AI 应用开发框架",
  "向量数据库用于存储和检索向量",
  "RAG 是检索增强生成的缩写"
]);
Schematic: Embedding maps text into vector space

02 Embedding Model Selection: Cost, Quality, and Speed Trade‑offs

There is no universally best model; the choice depends on language, budget, and performance requirements.

Commercial models (OpenAI) text-embedding-3-small – 1536 dimensions (adjustable), 8191 max tokens, best cost‑performance for most cases. text-embedding-3-large – 3072 dimensions, double cost, higher accuracy. text-embedding-ada-002 – legacy, not recommended for new projects.

Open‑source models BGE-large-zh-v1.5 – 1024 dimensions, top Chinese open‑source model. BGE-m3 – 1024 dimensions, multilingual, multi‑granularity. E5-large-v2 – 1024 dimensions, excellent English performance. GTE-large – 1024 dimensions, from Tongyi Qianwen team.

Decision flow (textual representation):

Is your data primarily Chinese?
├─ Yes → Budget sufficient?
│   ├─ Yes → text-embedding-3-small (simple & cheap)
│   └─ No  → BGE-large-zh-v1.5 / BGE-m3 (self‑hosted)
└─ No → Multilingual?
    ├─ Yes → BGE-m3 / text-embedding-3-small
    └─ No (pure English) → E5-large-v2 / text-embedding-3-small

Example using Ollama with a local BGE model:

import { OllamaEmbeddings } from "@langchain/ollama";
const embeddings = new OllamaEmbeddings({ model: "bge-large-zh-v1.5", baseUrl: "http://localhost:11434" });
const vec = await embeddings.embedQuery("向量数据库原理");
console.log(vec.length); // 1024
Comparison of different Embedding models on Chinese retrieval tasks

03 Similarity Calculation: Cosine, Euclidean, and Inner Product

After vectors are generated, retrieval reduces to finding the nearest vectors. The three common distance measures differ:

Cosine Similarity (range [-1,1]): measures angle, ignores magnitude. When vectors are already normalized, cosine similarity equals inner product.

Euclidean Distance (L2) : measures straight‑line distance; sensitive to vector length.

Inner Product (Dot Product) : fastest to compute; on normalized vectors it equals cosine similarity, otherwise it weights the result by vector magnitudes.

// Cosine similarity example (TypeScript)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

Guideline for choosing a metric:

Is the embedding output normalized?
├─ Yes → Use Inner Product (fast, same result as cosine)
├─ Not sure → Use Cosine Similarity (safe default)
└─ No (raw vectors) → Either normalize first or use Euclidean Distance
Visual comparison of the three distance metrics in 2D space
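The equivalence in the guideline above can be verified directly. This is a small illustrative sketch (the `normalize`, `dotProduct`, and `cosine` helpers are hypothetical names, not from any library): once both vectors are normalized to unit length, their inner product and their cosine similarity are the same number.

```typescript
// Normalize a vector to unit length (L2 norm = 1).
function normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((s, v) => s + v * v, 0));
  return vec.map(v => v / norm);
}

// Plain dot product (inner product).
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

// Cosine similarity = dot product of the normalized vectors.
function cosine(a: number[], b: number[]): number {
  return dotProduct(normalize(a), normalize(b));
}

const a = [3, 4]; // length 5
const b = [6, 8]; // same direction, length 10
console.log(cosine(a, b));                           // ≈ 1 (same direction)
console.log(dotProduct(normalize(a), normalize(b))); // same value
```

Because `b` is just `a` scaled by 2, cosine similarity is unaffected by the magnitude difference, while a raw dot product of `a` and `b` would be 50.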

04 Vector Index Structures: From Brute‑Force to Millisecond Retrieval

Scanning every vector (Flat) is O(n) and becomes impractical beyond a few hundred thousand vectors. Indexes accelerate search.

Flat (Brute‑Force)

Accuracy: 100 % (exact nearest neighbor)

Speed: O(n)

Suitable for <10 k records or when exact results are required.
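A Flat index is nothing more than an exhaustive scan. The following is a minimal sketch (the `Entry` type and `flatSearch` helper are illustrative, and the inner-product scoring assumes pre-normalized vectors):

```typescript
type Entry = { id: string; vector: number[] };

// Inner-product similarity (assumes vectors are already normalized).
function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

// Exhaustive O(n) scan: score every stored vector, sort, take top-k.
function flatSearch(index: Entry[], query: number[], k: number): Entry[] {
  return [...index]
    .map(e => ({ e, score: dot(e.vector, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(x => x.e);
}

const index: Entry[] = [
  { id: "a", vector: [1, 0] },
  { id: "b", vector: [0, 1] },
  { id: "c", vector: [0.707, 0.707] },
];
console.log(flatSearch(index, [1, 0], 2).map(e => e.id)); // ["a", "c"]
```

Every query touches every vector, which is why Flat is exact but scales linearly with collection size.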

IVF (Inverted File Index)

Clusters vectors into K centroids; at query time only the nearest clusters are scanned.

Accuracy: Approximate (may miss some good results)

Speed: O(K + nprobe × n/K)

Suitable for millions of vectors with moderate recall requirements.
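The cluster-then-probe idea can be sketched in a few lines. This is a toy illustration (function names are hypothetical; the centroids are assumed to be pre-trained by k-means, which is omitted here), showing why IVF is approximate: if the true nearest neighbor sits in a cluster outside the `nprobe` probed lists, it is simply never scanned.

```typescript
type Vec = number[];

function sqDist(a: Vec, b: Vec): number {
  return a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0);
}

// Build: assign each vector to the inverted list of its nearest centroid.
function buildIVF(vectors: Vec[], centroids: Vec[]): Vec[][] {
  const lists: Vec[][] = centroids.map(() => []);
  for (const v of vectors) {
    let best = 0;
    for (let c = 1; c < centroids.length; c++) {
      if (sqDist(v, centroids[c]) < sqDist(v, centroids[best])) best = c;
    }
    lists[best].push(v);
  }
  return lists;
}

// Query: scan only the nprobe clusters whose centroids are nearest.
function ivfSearch(lists: Vec[][], centroids: Vec[], q: Vec, nprobe: number): Vec | undefined {
  const probed = centroids
    .map((c, i) => ({ i, d: sqDist(q, c) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, nprobe)
    .map(x => x.i);
  let best: Vec | undefined;
  for (const ci of probed) {
    for (const v of lists[ci]) {
      if (!best || sqDist(q, v) < sqDist(q, best)) best = v;
    }
  }
  return best;
}

const centroids: Vec[] = [[0, 0], [10, 10]];
const lists = buildIVF([[1, 1], [0, 1], [9, 9], [10, 9]], centroids);
console.log(ivfSearch(lists, centroids, [9.5, 9.5], 1)); // [9, 9]
```

With `nprobe = 1` only the cluster around (10, 10) is scanned, so the query never pays for the vectors near the origin.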

HNSW (Hierarchical Navigable Small World)

Graph‑based index providing logarithmic search time with high recall.

Accuracy: ★★★★★ (≈95 %+ recall)

Speed: ★★★★★ (millisecond latency)

Memory: higher due to graph storage.

Suitable for 10 k – 50 M vectors where quality matters.
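The core navigation step of HNSW is a greedy walk on a proximity graph. The sketch below shows only that single-layer greedy step under simplifying assumptions (names are hypothetical); real HNSW stacks several layers of such graphs and keeps a candidate beam of size efSearch instead of a single current node, which is what pushes recall above 95%.

```typescript
type Graph = Map<number, number[]>; // node id -> neighbor ids

function sqDist(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0);
}

// Greedy search on one graph layer: hop to whichever neighbor is closer
// to the query, and stop when no neighbor improves (a local minimum).
function greedySearch(vectors: number[][], graph: Graph, entry: number, q: number[]): number {
  let cur = entry;
  while (true) {
    let next = cur;
    for (const n of graph.get(cur) ?? []) {
      if (sqDist(vectors[n], q) < sqDist(vectors[next], q)) next = n;
    }
    if (next === cur) return cur; // local minimum reached
    cur = next;
  }
}

// Four 1-D points connected in a chain: 0 - 1 - 2 - 3.
const vectors = [[0], [1], [2], [3]];
const graph: Graph = new Map([[0, [1]], [1, [0, 2]], [2, [1, 3]], [3, [2]]]);
console.log(greedySearch(vectors, graph, 0, [3])); // 3
```

Each hop roughly halves the remaining distance in a well-built graph, which is where the logarithmic search time comes from.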

PQ (Product Quantization)

Compresses high‑dimensional vectors into low‑precision codes, trading accuracy for storage and speed.

Compression: up to 768× (e.g., 1536‑dim float → 8 bytes)

Accuracy: lowest among the four

Speed: very fast

Memory: minimal

Suitable for >10 M vectors when recall can be relaxed.
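To make the compression figure concrete, here is a toy encoding sketch (hypothetical names; real PQ trains 256-entry codebooks per subvector with k-means, which is omitted): each subvector is replaced by the index of its nearest codebook entry, so a whole subvector collapses to one byte.

```typescript
function sqDist(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0);
}

// Encode one vector: split into m subvectors, replace each with the index
// of its nearest codebook entry (1 byte if the codebook has <= 256 entries).
function pqEncode(vec: number[], codebooks: number[][][]): number[] {
  const m = codebooks.length;
  const subLen = vec.length / m;
  return codebooks.map((book, j) => {
    const sub = vec.slice(j * subLen, (j + 1) * subLen);
    let best = 0;
    for (let c = 1; c < book.length; c++) {
      if (sqDist(sub, book[c]) < sqDist(sub, book[best])) best = c;
    }
    return best;
  });
}

// Toy codebooks: m = 2 subvectors, 2 centroids each (real PQ uses 256).
const codebooks = [[[0], [10]], [[0], [10]]];
console.log(pqEncode([9, 1], codebooks)); // [1, 0]

// Compression math behind the 768x figure above:
// 1536 floats x 4 bytes = 6144 bytes; 8 one-byte codes = 8 bytes.
console.log((1536 * 4) / 8); // 768
```

The accuracy loss comes from the same step: every subvector in a cell is mapped to the same centroid, so fine-grained differences within a cell are gone.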

Summary of index trade‑offs:

Flat – exact, slow, good for tiny datasets.

IVF – approximate, moderate speed, works for millions.

HNSW – high accuracy & speed, higher memory, suitable up to tens of millions.

PQ – lowest accuracy, highest compression, best for very large collections.

Schematic of the four index structures

05 Vector Database Selection: One‑Page Comparison

A vector database combines an index engine, metadata store, filtering, CRUD, and optional distributed capabilities.

Milvus – open‑source, self‑hosted, supports distributed deployment, strong metadata filtering, handles billions of vectors, medium learning curve.

Pinecone – fully managed SaaS, very low operational overhead, strong metadata filtering, handles billions, extremely easy to start.

Chroma – open‑source, lightweight, in‑memory, basic metadata filtering, suitable for millions of vectors, very low entry barrier.

FAISS – index library only, no metadata filtering, suitable for offline experiments, low learning curve.

Weaviate – open‑source, self‑hosted, strong metadata filtering, supports hybrid vector + keyword search, medium learning curve.

Selection flow (textual):

You are building what?
├─ Quick prototype / personal project / < 1M records → Chroma (in‑memory, instant start)
├─ Production, no ops wanted → Pinecone (managed, pay‑as‑you‑go)
├─ Production, large data, self‑control → Milvus (full feature set, active community)
├─ Pure experiment, no persistence needed → FAISS (flexible index library)
└─ Need hybrid vector + keyword search → Weaviate / Milvus
Vector database selection decision tree

06 LangChain.js VectorStore Practical Guide

Using Chroma for Fast Prototyping

import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { Document } from "@langchain/core/documents";

const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const docs = [
  new Document({ pageContent: "LangChain 是一个用于构建 AI 应用的开发框架", metadata: { source: "docs", category: "framework" } }),
  new Document({ pageContent: "向量数据库可以存储和检索高维向量数据", metadata: { source: "docs", category: "database" } }),
  new Document({ pageContent: "RAG 通过检索外部知识来增强大模型的回答质量", metadata: { source: "blog", category: "rag" } }),
  new Document({ pageContent: "Embedding 模型将文本转换为稠密向量表示", metadata: { source: "docs", category: "embedding" } })
];

const vectorStore = await Chroma.fromDocuments(docs, embeddings, {
  collectionName: "my-collection",
  url: "http://localhost:8000"
});

const results = await vectorStore.similaritySearch("什么是向量数据库?", 2);
console.log(results);
Complete data flow of the Chroma vector store

Metadata‑filtered Search

// Search only documents where source="docs"
const filtered = await vectorStore.similaritySearch("什么是向量数据库?", 2, { source: "docs" });
// Search with similarity scores
const withScores = await vectorStore.similaritySearchWithScore("什么是向量数据库?", 3);
for (const [doc, score] of withScores) {
  console.log(`[${score.toFixed(4)}] ${doc.pageContent}`);
}

FAISS for Local Experiments

import { OpenAIEmbeddings } from "@langchain/openai";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { Document } from "@langchain/core/documents";

const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const docs = [
  new Document({ pageContent: "TypeScript 是 JavaScript 的超集" }),
  new Document({ pageContent: "Python 是一门动态类型语言" }),
  new Document({ pageContent: "Rust 以内存安全著称" })
];

const vectorStore = await FaissStore.fromDocuments(docs, embeddings);
const results = await vectorStore.similaritySearch("静态类型语言", 2);
console.log(results[0].pageContent); // TypeScript 是 JavaScript 的超集
await vectorStore.save("./faiss-index");
const loaded = await FaissStore.load("./faiss-index", embeddings);

Milvus for Production

import { OpenAIEmbeddings } from "@langchain/openai";
import { Milvus } from "@langchain/community/vectorstores/milvus";
import { Document } from "@langchain/core/documents";

const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const docs = [
  new Document({ pageContent: "HNSW 是目前最流行的向量索引算法", metadata: { topic: "index", level: "advanced" } }),
  new Document({ pageContent: "IVF 通过聚类减少搜索范围", metadata: { topic: "index", level: "intermediate" } })
];

const vectorStore = await Milvus.fromDocuments(docs, embeddings, {
  collectionName: "langchain_demo",
  url: "http://localhost:19530",
  indexCreateParams: {
    index_type: "HNSW",
    metric_type: "IP",
    params: JSON.stringify({ M: 16, efConstruction: 256 })
  },
  searchParams: { ef: 128 }
});

const results = await vectorStore.similaritySearch("向量索引算法", 2);
console.log(results);
The unified VectorStore interface architecture in LangChain.js

07 Full RAG Retrieval Pipeline

The end‑to‑end process consists of an offline indexing stage (split → embedDocuments → store) and an online query stage (embedQuery → ANN search → top‑K retrieval → prompt construction → LLM generation). Each stage can fail: poor splitting, a mismatched model, an unsuitable index, or too small a top‑K.

┌─────────────────── Offline Indexing ────────────────────┐
│ Original documents → TextSplitter → chunks → embedDocuments →
│ vectors → write to VectorStore (index + metadata)          │
└───────────────────────────────────────────────────────────┘

┌───────────────────── Online Query ──────────────────────┐
│ User question → embedQuery → query vector → ANN search →
│ retrieve top‑K chunks → assemble prompt → LLM generates answer │
└───────────────────────────────────────────────────────────┘
End‑to‑end RAG retrieval architecture
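The two stages above can be sketched end to end with an in-memory store. This is a hypothetical, offline-runnable illustration: `fakeEmbed` is a stand-in for a real embedding model (it just counts keyword hits, where embedQuery/embedDocuments would be used in practice), and `retrieve` plays the role of the ANN search.

```typescript
// Stand-in for a real embedding model: a keyword-hit vector over a tiny vocab.
const VOCAB = ["vector", "database", "rag", "llm"];
function fakeEmbed(text: string): number[] {
  const t = text.toLowerCase();
  return VOCAB.map(w => (t.includes(w) ? 1 : 0));
}

function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

// Offline indexing: embed every chunk and keep (vector, text) pairs.
const chunks = [
  "A vector database stores and retrieves high-dimensional vectors.",
  "RAG augments an LLM with retrieved context.",
  "Bananas are yellow.",
];
const store = chunks.map(text => ({ text, vector: fakeEmbed(text) }));

// Online query: embed the question, rank chunks, take top-K.
function retrieve(question: string, k: number): string[] {
  const qv = fakeEmbed(question);
  return [...store]
    .sort((a, b) => dot(b.vector, qv) - dot(a.vector, qv))
    .slice(0, k)
    .map(e => e.text);
}

// Assemble the prompt that would be sent to the LLM.
const context = retrieve("What is a vector database?", 2).join("\n");
const prompt = `Answer using only this context:\n${context}\n\nQ: What is a vector database?`;
console.log(prompt);
```

Swapping `fakeEmbed` for a real embeddings class and `store` for a VectorStore gives the production shape of the same pipeline.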

08 Common Pitfalls and Remedies

Pitfall 1: Dimension Mismatch

Switching embedding models without re‑embedding existing data leads to a mismatch (e.g., stored vectors 1536‑dim, query vectors 1024‑dim) and runtime errors.

// ❌ Stored with text-embedding-3-small (1536)
// ❌ Query using BGE (1024) → error
// ✅ Fix: Re‑embed all data and rebuild the index.

Pitfall 2: Normalization Issues

Using Euclidean distance on non‑normalized vectors biases results toward longer texts.

// Manual normalization helper
function normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((s, v) => s + v * v, 0));
  return vec.map(v => v / norm);
}
// ✅ Either use cosine similarity (no need to normalize) or normalize then use inner product.

Pitfall 3: Stale IVF Index

Continuously inserting new vectors without rebuilding IVF centroids degrades recall. Periodically run compact or createIndex in Milvus.

Pitfall 4: Improper Top‑K

Top‑K too small may miss essential information; too large floods the LLM with noise. Recommended: retrieve Top‑10, filter by score (e.g., cosine > 0.7), then pass the best 3‑5 to the LLM.
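The retrieve-then-filter recommendation can be sketched as follows (the `Scored` type and `selectContext` helper are illustrative; the pairs mimic the `[doc, score]` output of similaritySearchWithScore, and this assumes the score is a similarity where higher is better — some stores return a distance instead, which would flip the comparison):

```typescript
type Scored = [string, number]; // [document text, similarity score]

// Take a generous candidate set, drop anything below the score threshold,
// and pass at most maxChunks survivors to the LLM.
function selectContext(results: Scored[], minScore = 0.7, maxChunks = 5): string[] {
  return results
    .filter(([, score]) => score > minScore)
    .slice(0, maxChunks)
    .map(([doc]) => doc);
}

const results: Scored[] = [
  ["chunk A", 0.91],
  ["chunk B", 0.82],
  ["chunk C", 0.55], // filtered out: below the 0.7 threshold
];
console.log(selectContext(results)); // ["chunk A", "chunk B"]
```

Retrieving Top‑10 but forwarding only the high-scoring 3‑5 chunks keeps recall high without flooding the prompt with noise.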

Pitfall 5: Missing Metadata Filtering

Searching the entire corpus can return irrelevant domains (e.g., product docs vs. internal reports). Apply metadata filters before vector search.

const results = await vectorStore.similaritySearch(
  "产品定价策略",
  5,
  { category: "product-docs" } // filter only product documents
);
Quick‑reference table of common pitfalls and fixes

Summary

Embedding is the foundation of RAG : it converts text into vectors that enable mathematical similarity measurement.

Model choice depends on scenario : OpenAI embeddings are easy and cost‑effective; open‑source BGE/E5 suit local deployment and custom needs.

Distance metric drives retrieval quality : use inner product for normalized vectors, otherwise cosine similarity; avoid Euclidean on raw vectors.

Index structure determines speed : Flat for tiny datasets, HNSW for millions with high recall, IVF+PQ for billions where storage matters.

Vector database ≠ just an index : production systems need metadata filtering, CRUD, persistence, and possibly distributed capabilities.

LangChain.js abstracts the VectorStore : swapping databases requires only a single initialization change, keeping business logic stable.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: indexing, LangChain, RAG, vector databases, similarity search, embeddings
Written by James' Growth Diary

I am James, focusing on AI Agent learning and growth. I continuously update two series: "AI Agent Mastery Path," which systematically outlines core theories and practices of agents, and "Claude Code Design Philosophy," which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.