How to Evaluate and Choose Embedding Models for RAG Systems
This article explains why embedding models are the foundation of RAG pipelines, outlines concrete evaluation metrics such as MTEB v2 scores, latency, throughput and cost, compares a range of commercial and open‑source models, and discusses emerging trends like multimodal and long‑context embeddings.
Evaluation Metrics for Embedding Models
Retrieval‑augmented generation (RAG) relies on the embedding model to match user intent with relevant documents. The primary public benchmark is MTEB v2 (Massive Text Embedding Benchmark), which covers >100 multilingual and cross‑language tasks such as retrieval, re‑ranking, classification, clustering and semantic similarity. Scores from MTEB v2 are not directly comparable with v1.
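For hands-on scoring, the MTEB maintainers publish a Python package. A minimal sketch follows; the task and model names are illustrative, and the exact API may vary across mteb versions:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Any embedding model with an .encode() method works here; this one is illustrative.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# Pick the task types that matter for RAG: retrieval and re-ranking.
tasks = mteb.get_tasks(tasks=["NFCorpus"])  # a small public retrieval task
evaluation = mteb.MTEB(tasks=tasks)

# Writes per-task scores (e.g., nDCG@10 for retrieval) to the output folder.
results = evaluation.run(model, output_folder="mteb_results")
```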
Retrieval‑specific metrics
Detailed retrieval quality measures are described in the companion articles “RAG: Retrieval Quality Evaluation Metrics” (https://mp.weixin.qq.com/s?__biz=MzIzNTExNzMwNg==&mid=2647833150&idx=1&sn=056ab0ad8558df92fb1fd2b303cb1ae3) and “RAG: Evaluation Framework” (https://mp.weixin.qq.com/s?__biz=MzIzNTExNzMwNg==&mid=2647833303&idx=1&sn=136e15c311e3e382cdfb839ff5e71773).
System performance
Latency consists of query‑embedding latency (vectorising the user query) and retrieval latency (vector‑database lookup). Throughput is the number of embedding requests processed per unit time; it becomes critical when batch‑indexing large document collections.
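A rough way to separate the two components on your own hardware is to time single-query calls and batched calls independently; `embed_fn` below is a stand-in for whatever client or local model you use:

```python
import time

def measure_embedding_perf(embed_fn, queries, batch_size=32):
    """Measure per-query latency and batched throughput for an embedding callable.

    embed_fn is assumed to accept a list of strings and return their vectors.
    """
    # Query-embedding latency: one query at a time, as a live user would send it.
    start = time.perf_counter()
    for q in queries:
        embed_fn([q])
    latency_ms = (time.perf_counter() - start) / len(queries) * 1000

    # Throughput: batched requests, as during bulk indexing.
    start = time.perf_counter()
    for i in range(0, len(queries), batch_size):
        embed_fn(queries[i:i + batch_size])
    throughput = len(queries) / (time.perf_counter() - start)

    return latency_ms, throughput
```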
Cost metrics
Indexing cost is a one‑time expense for building the vector index. Query cost is expressed per 1 M tokens for API‑based services (prices referenced from OpenRouter and vendor pricing pages).
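Both costs reduce to simple token arithmetic. A back-of-the-envelope sketch (the price constant is illustrative; always check the vendor's current pricing page):

```python
# Hypothetical price; verify against the vendor's pricing page.
PRICE_PER_1M_TOKENS = 0.075  # USD

def indexing_cost(num_documents: int, avg_tokens_per_doc: int) -> float:
    """One-time cost of embedding an entire corpus."""
    total_tokens = num_documents * avg_tokens_per_doc
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS

def monthly_query_cost(queries_per_day: int, avg_tokens_per_query: int) -> float:
    """Recurring cost of embedding user queries."""
    total_tokens = queries_per_day * 30 * avg_tokens_per_query
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS

# 1M docs at ~500 tokens each: 500M tokens -> $37.50 at $0.075 per 1M tokens.
print(indexing_cost(1_000_000, 500))    # 37.5
print(monthly_query_cost(10_000, 50))   # 1.125
```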
Model capability metrics
Context window length determines the maximum text length per embedding:
8192 tokens ≈ 6,000 words – suitable for medium‑size paragraphs.
32768 tokens ≈ 24,000 words – can embed whole chapters.
128000 tokens ≈ 96,000 words – can embed full contracts or research papers.
Longer windows reduce chunk‑boundary loss but may dilute relevance signals; the optimal length depends on document structure.
Multilingual vs. cross‑language retrieval:
Multilingual retrieval – the model retrieves within the same language (e.g., Chinese query → Chinese docs).
Cross‑language retrieval – the model aligns vector spaces across languages (e.g., Chinese query → English docs).
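A quick way to sanity-check cross-language alignment is to embed a query in one language and candidate documents in another, then compare cosine similarities. A sketch with sentence-transformers (the model name is illustrative; any multilingual embedding model follows the same pattern):

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; substitute the model you are evaluating.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query_zh = "如何评估嵌入模型"  # Chinese query: "how to evaluate embedding models"
docs_en = [
    "A guide to evaluating embedding models for retrieval.",
    "Recipes for baking sourdough bread at home.",
]

q_vec = model.encode(query_zh, normalize_embeddings=True)
d_vecs = model.encode(docs_en, normalize_embeddings=True)

# A clearly higher score for the on-topic English document indicates the
# model aligns Chinese and English text in a shared vector space.
print(util.cos_sim(q_vec, d_vecs))
```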
Multimodal support maps text, images, audio and video into a unified vector space, enabling cross‑modal retrieval such as image‑to‑text or audio‑to‑document search.
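As an illustration of the unified-space idea, CLIP-style models already let you score an image against a text description with plain cosine similarity; newer multimodal embedding APIs work analogously. The file name below is hypothetical:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps images and text into one vector space.
model = SentenceTransformer("clip-ViT-B-32")

img_vec = model.encode(Image.open("diagram.png"))  # hypothetical local file
txt_vec = model.encode("an architecture diagram of a RAG pipeline")

print(util.cos_sim(img_vec, txt_vec))
```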
Evaluation process
Public benchmark scores provide a reference, but final model selection should be based on evaluations run on the target dataset.
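A target-dataset evaluation can be as small as a few hundred labeled query–document pairs. A minimal Recall@k implementation over normalized vectors:

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, relevant_ids, k=10):
    """Fraction of queries whose gold document appears in the top-k results.

    query_vecs: (Q, d) array, doc_vecs: (D, d) array, both L2-normalized;
    relevant_ids[i] is the index of the gold document for query i.
    """
    scores = query_vecs @ doc_vecs.T                 # cosine similarity
    top_k = np.argsort(-scores, axis=1)[:, :k]       # top-k doc indices per query
    hits = [rel in row for rel, row in zip(relevant_ids, top_k)]
    return float(np.mean(hits))
```

MRR and nDCG follow the same pattern; the point is to score candidate models on your own queries rather than leaderboard tasks alone.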
Embedding Model Selection
Main models
Gemini Embedding 001 – commercial API with the highest Chinese‑English retrieval accuracy; supports 100+ languages, can be reduced to 768 dimensions via Matryoshka, costs ≈ $0.075 per 1 M tokens, API‑only, tightly integrated with Google Cloud.
Gemini Embedding 2 – Google’s first native multimodal embedding model; maps text, image, audio and video into a 3072‑dimensional space, 8192‑token context, adjustable output dimension. Recommended for GCP users needing top‑tier API accuracy or multimodal capability.
Qwen3‑Embedding‑8B – open‑source multilingual model with a decoder‑only architecture and bidirectional attention; 32 K token context, supports 100+ languages and code, output dimension 32–7168, Apache‑2.0 license. Suitable when GPU resources are available and full infrastructure control is required.
Microsoft Harrier‑OSS‑v1 – three MIT‑licensed decoder models (27 B, 0.6 B, 270 M), all with 32768‑token context. The smaller variants achieve higher quality than similarly sized competitors via knowledge distillation. Fits multilingual retrieval with ample compute, or lightweight deployments.
Voyage‑3.1‑large – cost‑effective at $0.05 per 1 M tokens.
Voyage 4 – first family‑compatible vector space; the query endpoint voyage-4-lite costs $0.02 per 1 M tokens, and specialised models cover law, finance, code and multilingual use‑cases.
BGE‑M3 – MIT‑licensed hybrid model that outputs dense and sparse vectors in a single inference pass, eliminating a separate BM25 index (a usage sketch follows this list). Its 568 M parameters run on a single GPU, it supports quantisation, and it requires a vector database with native multi‑vector support (e.g., Qdrant, Weaviate).
Cohere Embed v4 – commercial API with 128 K token context, robust to noisy data (OCR, handwritten text). Provides VPC and on‑prem deployment for compliance; retrieval capability is weaker and may need pairing with Cohere Rerank.
text‑embedding‑3‑large – widely deployed commercial model; supported by most vector databases and RAG frameworks, SLA‑backed, 8192‑token context, up to 3072 dimensions. A “small” variant reduces cost for budget‑constrained scenarios.
Nomic Embed v1.5 – fully open weights (Apache‑2.0), 137 M parameters, max 768 dimensions, lightweight; multilingual ability is limited and retrieval accuracy trails larger models. Suitable for transparent, low‑resource, English‑only use‑cases.
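To make the BGE‑M3 hybrid output concrete, here is a sketch following the FlagEmbedding project's documented usage; verify the exact dictionary keys against the version you install:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

out = model.encode(
    ["How do I evaluate embedding models for RAG?"],
    return_dense=True,   # dense vector for semantic search
    return_sparse=True,  # learned token weights, a BM25-style lexical signal
)

dense_vec = out["dense_vecs"][0]            # store in Qdrant/Weaviate, etc.
sparse_weights = out["lexical_weights"][0]  # {token: weight} for hybrid scoring
```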
Scenario‑based guidance
Model choice depends on language coverage, multimodal requirements, cost constraints, infrastructure control and domain‑specific performance. The article provides a decision matrix (image omitted) that maps typical scenarios to the models listed above.
Future Trends
Multimodal embedding
Gemini Embedding 2 marks the shift toward multimodal embeddings, unifying text, image, audio and video in a single vector space and reducing system complexity.
Long‑context embedding
Models with 128 K token context (Cohere Embed v4) and 32 K token context (Qwen3, Voyage, Harrier) enable larger chunks but can suffer signal dilution; short‑chunk embeddings combined with re‑ranking may outperform single long‑chunk embeddings for fine‑grained queries. Hybrid parent‑child retrieval combines fine‑grained 256–512 token embeddings for precise matching with parent‑level chunks that provide richer context to downstream LLMs, as sketched below.
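A minimal parent‑child index can be built without a framework: embed small child chunks for matching, then return their parents for generation. Everything below (chunk sizes in characters rather than tokens, the `embed_fn` callable) is illustrative:

```python
import numpy as np

def build_parent_child_index(parents, embed_fn, child_size=400, overlap=50):
    """Split parent chunks into small child chunks and record the mapping.

    embed_fn is assumed to map a list of strings to an (N, d) normalized array.
    Children are what we match against; parents are what the LLM receives.
    """
    children, parent_of = [], []
    for pid, text in enumerate(parents):
        for start in range(0, len(text), child_size - overlap):
            children.append(text[start:start + child_size])
            parent_of.append(pid)
    return embed_fn(children), parent_of

def retrieve_parents(query_vec, child_vecs, parent_of, parents, k=3):
    """Match on fine-grained children, return deduplicated parent chunks."""
    order = np.argsort(-(child_vecs @ query_vec))
    seen, out = set(), []
    for idx in order:
        pid = parent_of[idx]
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
        if len(out) == k:
            break
    return out
```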
Domain‑specific embeddings
Legal, medical and financial domains benefit from fine‑tuned models. Parameter‑efficient methods such as LoRA lower the cost of adapting general models to specialized vocabularies.
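With Hugging Face's peft library, attaching LoRA adapters to an open embedding model takes a few lines; the base model and target module names below are assumptions that depend on the architecture being adapted:

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Illustrative base model; target_modules must match its attention layer names.
base = AutoModel.from_pretrained("BAAI/bge-m3")

config = LoraConfig(
    r=16,                                       # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["query", "key", "value"],   # attention projections
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the full model
```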
Vector compression
Advances in vector compression reduce storage costs for billion‑scale corpora, addressing a major bottleneck in large‑scale RAG deployments.
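The simplest form is scalar quantization: storing int8 instead of float32 gives a ~4x reduction with modest recall loss. A toy sketch of the idea (production systems use library implementations in faiss or the vector database itself):

```python
import numpy as np

def int8_quantize(vecs: np.ndarray):
    """Per-dimension symmetric int8 quantization."""
    scale = np.abs(vecs).max(axis=0) / 127.0 + 1e-12
    q = np.clip(np.round(vecs / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

vecs = np.random.randn(1000, 768).astype(np.float32)
q, scale = int8_quantize(vecs)
print(vecs.nbytes / q.nbytes)  # 4.0 -- storage reduction factor
```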
Conclusion
Embedding models form the foundation of any RAG system; retrieval quality determines the effectiveness of downstream prompt engineering, re‑ranking and agent orchestration. No model is universally optimal—selection must match current constraints. Open‑source models have reached parity with commercial APIs on benchmark scores, multimodal embeddings are entering production, and vector compression dramatically lowers storage costs. Leaderboard numbers are based on external data; performance must be validated on the target workload.
Practical checklist before selecting a model
Verify the latest MTEB official leaderboard data (https://huggingface.co/spaces/mteb/leaderboard).
Check current pricing in each model’s official documentation.
Run evaluations on your own dataset.