Embedding Explained: How Vectorization Turns Text into Numbers for RAG

This article walks through why traditional keyword matching fails for RAG, explains the evolution from one‑hot encoding to Word2Vec and BERT, details sentence‑level embeddings and similarity metrics, compares leading Chinese and multilingual embedding models using the C‑MTEB benchmark, and provides practical LangChain code, deployment tips, and common pitfalls.


Why AI Can "Read" Your Question

Traditional keyword search (e.g., Elasticsearch) cannot match a policy sentence like "本产品自签收之日起7日内支持无理由退货" ("this product can be returned without reason within 7 days of receipt") against a user query like "买完不想要了能退吗" ("I bought it but don't want it anymore; can I return it?") because the two share no overlapping characters. An embedding model converts both texts into vectors and computes their similarity (e.g., cosine 0.92), enabling an accurate match.

Core Concept: What Embedding Actually Is

2.1 From One‑Hot to Embedding

One‑hot encoding represents each word as a vocabulary‑sized vector (e.g., 10,000 dimensions for a 10,000‑word vocabulary) containing a single 1, so every pair of words is equally distant. Word2Vec learns static vectors from co‑occurrence statistics, placing semantically similar words close together (e.g., "king" – "man" + "woman" ≈ "queen"). BERT adds a self‑attention Transformer that produces context‑dependent vectors, so the same word gets different vectors in different sentences.

One‑hot: like giving every person an arbitrary ID number; the number says nothing about who they are. Word2Vec: like assigning everyone a zodiac sign; similar signs cluster together, but the label never changes. BERT: like giving each person GPS plus their social network, locating them in real time according to context.
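To see the Word2Vec analogy in action, here is a minimal sketch using gensim's downloadable GloVe vectors (the specific vector set is an illustrative choice; any pretrained static word embeddings behave the same way):

import gensim.downloader as api

# Downloads pretrained 100-dim GloVe vectors (~130 MB) on first run
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically prints something like [('queen', 0.77)]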

2.2 Sentence‑Level Embedding

RAG uses sentence or paragraph embeddings rather than word vectors. Two common approaches:

Use the [CLS] token output as the sentence vector.

Mean‑pool or max‑pool all token vectors; mean‑pooling is generally more stable (both strategies are sketched below).
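A minimal sketch of both pooling strategies with Hugging Face transformers, using bert-base-chinese purely as an illustrative model:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("如何申请产品退货?", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)

cls_vector = hidden[:, 0]                      # strategy 1: [CLS] token output
mask = inputs["attention_mask"].unsqueeze(-1)  # strategy 2: masked mean-pool
mean_vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)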

2.3 Vector Similarity

After vectors are obtained, similarity can be measured by:

Cosine similarity: cos(θ) = A·B/(|A||B|) – most common for RAG because it cares about direction, not magnitude.

Dot product: A·B – sensitive to vector length.

Euclidean distance: ||A‑B|| – smaller distance means more similar.
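The three metrics in NumPy; note that on L2-normalized vectors (the normalize_embeddings=True setting used in the code later), cosine similarity and dot product produce identical rankings:

import numpy as np

a = np.array([0.1, 0.9, 0.3])
b = np.array([0.2, 0.8, 0.4])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # direction only
dot_product = a @ b                                        # sensitive to length
euclidean = np.linalg.norm(a - b)                          # smaller = more similar
print(f"cosine={cosine:.3f}, dot={dot_product:.3f}, euclidean={euclidean:.3f}")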

Model Comparison: Who Is Strongest in 2025?

Benchmark source: C‑MTEB (Chinese Massive Text Embedding Benchmark) covering retrieval, classification, clustering, etc.

BGE‑large‑zh – 335M parameters, 1024‑dim, Chinese MTEB 63.2, latency 95 ms, open‑source local deployment.

BGE‑M3 – 567M parameters, 1024‑dim, Chinese MTEB 64.8, latency 50 ms, supports 100+ languages.

m3e‑base – 220M parameters, 768‑dim, Chinese MTEB 60.5, latency 35 ms, very low cost.

text‑embedding‑3‑large (OpenAI) – 3072‑dim, Chinese MTEB 62.5, latency 220 ms, API‑only, highest price.

text‑embedding‑3‑small (OpenAI) – 1536‑dim, Chinese MTEB 58.5, latency 80 ms, cheaper than large.

jina‑embeddings‑v3 – 520M parameters, 1024‑dim, Chinese MTEB 59.8, latency 130 ms, open‑source/API.

Key takeaways:

For pure Chinese use, BGE‑large‑zh gives the best accuracy‑latency trade‑off (78 % top‑5, 95 ms).

BGE‑M3 excels in multilingual, long‑text, and hybrid retrieval scenarios.

OpenAI’s Chinese performance is comparable to BGE‑large‑zh but much more expensive.

m3e‑base punches above its weight: 71 % accuracy at only 35 ms latency, ideal for latency‑sensitive workloads.

Why Chinese Models Beat OpenAI

The decisive factor is training data: BGE and m3e are trained on massive Chinese‑language internet corpora, while Chinese text makes up a relatively small share of OpenAI's training data. Like a native speaker sitting a Chinese exam, the models trained on more Chinese data hold a decisive advantage.

Pitfall Guide: Selection and Real‑World Lessons

4.1 Higher Dimensions Are Not Always Better

Higher dimensionality increases storage and compute cost. In Chinese‑language tests, the 1024‑dim BGE‑large‑zh outperforms the 3072‑dim text‑embedding‑3‑large. OpenAI's text‑embedding‑3 models support Matryoshka‑style truncation: dimensions can be cut with only a 1‑2 % accuracy loss, reducing storage cost by about two‑thirds (sketched below).
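A minimal sketch of Matryoshka truncation through LangChain's OpenAIEmbeddings; the dimensions parameter asks the API to return a shortened, renormalized vector:

from langchain_openai import OpenAIEmbeddings

# Request 1024-dim vectors instead of the default 3072 for text-embedding-3-large
truncated = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)
vector = truncated.embed_query("如何申请退货?")
print(len(vector))  # 1024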

4.2 Vector Dimension Must Match the Database

ChromaDB locks a collection's dimensionality to that of the first vectors inserted; writing 1024‑dim BGE vectors into a collection that already holds 1536‑dim OpenAI vectors raises a dimension‑mismatch error. Keep one embedding model per collection, and delete and recreate the collection when you switch models:

collection = client.create_collection(
    name="my_collection",
    metadata={"hnsw:space": "cosine"}  # cosine distance for normalized embeddings
)
# The collection's dimensionality is fixed by the first embeddings added,
# so every vector stored here must come from the same (e.g., 1024-dim) model.

4.3 Chinese Tokenization Matters

BERT's Chinese tokenizer splits text into single characters, so "深度学习" ("deep learning") becomes four separate tokens. Solutions:

Use Chinese‑optimized models such as BGE‑M3 or m3e‑base.

Pre‑process with a Chinese word‑segmenter such as jieba to keep phrases intact (sketched below).
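A minimal sketch with jieba, space-joining the segments so downstream tokenizers see phrase boundaries:

import jieba

jieba.add_word("深度学习")  # ensure the phrase stays intact in jieba's dictionary
text = "深度学习模型如何处理中文"
print(" ".join(jieba.cut(text)))  # "深度学习 模型 如何 处理 中文"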

4.4 Multilingual Model Choice

Pure Chinese knowledge bases: choose BGE or m3e. Mixed Chinese‑English or cross‑border e‑commerce: prefer BGE‑M3 (100+ language support) or text‑embedding‑3‑large (strong multilingual generalisation).

4.5 Balancing batch_size and max_length

max_length controls the maximum number of tokens per request:

Too short → truncation, loss of information.

Too long → high memory usage, slower inference.

Practical settings:

Document chunks: max_length=512.

User query: max_length=128.

Batch size: 16‑32 (larger may cause OOM).
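These settings map directly onto sentence-transformers, which the HuggingFace embedding wrappers use under the hood; a minimal sketch, assuming a CUDA device is available:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh-v1.5", device="cuda")
model.max_seq_length = 512  # document chunks: truncate anything beyond 512 tokens

docs = ["本产品自签收之日起7日内支持无理由退货"] * 100
doc_vectors = model.encode(docs, batch_size=32, normalize_embeddings=True)

model.max_seq_length = 128  # queries are short; a lower cap speeds up inference
query_vector = model.encode("能退货吗?", normalize_embeddings=True)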

Code Walk‑Through: LangChain with Different Embedding Models

5.1 Environment Setup

pip install langchain langchain-community langchain-huggingface langchain-openai
pip install sentence-transformers torch
pip install chromadb

5.2 Load Open‑Source BGE Model

from langchain_huggingface import HuggingFaceEmbeddings

bge_embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh-v1.5",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True}
)

bge_m3_embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True}
)

text = "如何申请产品退货退款?"
query_vector = bge_embeddings.embed_query(text)
print(f"向量维度: {len(query_vector)}")
print(f"向量前5个值: {query_vector[:5]}")

5.3 Load the Lightweight m3e‑base

from langchain_huggingface import HuggingFaceEmbeddings

m3e_embeddings = HuggingFaceEmbeddings(
    model_name="moka-ai/m3e-base",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True}
)

docs = [
    "本产品自签收之日起7日内支持无理由退货",
    "退货时请保持商品完好,附带发票",
    "运费由消费者承担"
]

doc_vectors = m3e_embeddings.embed_documents(docs)
print(f"处理了 {len(doc_vectors)} 个文档")

5.4 Use OpenAI API

from langchain_openai import OpenAIEmbeddings

small_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key="your-api-key"
)

large_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    api_key="your-api-key"
)
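Both wrappers expose the same embed_query / embed_documents interface as the local models; a quick dimension check (assumes a valid API key):

vec_small = small_embeddings.embed_query("如何申请退货?")
vec_large = large_embeddings.embed_query("如何申请退货?")
print(len(vec_small), len(vec_large))  # 1536 3072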

5.5 Full RAG Pipeline: Embedding + ChromaDB

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh-v1.5",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True}
)

texts = [
    "本产品自签收之日起7日内支持无理由退货",
    "退货时请保持商品完好,附带发票",
    "运费由消费者承担"
]

db = Chroma.from_texts(
    texts=texts,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="product_policy",
    collection_metadata={"hnsw:space": "cosine"}  # use cosine distance
)

query = "买完不想要了能退吗?"
# similarity_search does not expose scores; similarity_search_with_score returns
# (document, distance) pairs, and with cosine space, similarity = 1 - distance
results = db.similarity_search_with_score(query, k=2)
print("Retrieval results:")
for i, (doc, distance) in enumerate(results):
    print(f"{i+1}. {doc.page_content} (similarity: {1 - distance:.3f})")

5.6 Model Speed Test

import time
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings

models = {
    "BGE-large-zh": HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5"),
    "m3e-base": HuggingFaceEmbeddings(model_name="moka-ai/m3e-base"),
    "text-embedding-3-small": OpenAIEmbeddings(model="text-embedding-3-small")
}

test_queries = ["如何申请退货?"] * 10
for name, embed_model in models.items():
    start = time.time()
    for q in test_queries:
        embed_model.embed_query(q)
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.3f}s total ({elapsed/10*1000:.1f}ms per query)")

Best‑Practice Summary

6.1 Model Selection Decision Tree

What is your scenario?
├── Pure Chinese + production environment → recommended: BGE‑M3 or BGE‑large‑zh
├── Chinese + multilingual → recommended: BGE‑M3
├── Latency‑sensitive + moderate accuracy needs → recommended: m3e‑base
├── Mostly English + budget no object → recommended: text‑embedding‑3‑large
└── Privacy requirements (data must not leave the country) → recommended: BGE‑M3 / m3e‑base (local deployment)

6.2 Performance Optimization Tips

GPU acceleration: HuggingFace embeddings run 10‑50× faster on GPU than CPU.

Batch processing: use embed_documents instead of looping embed_query for a 5‑10× speed boost.

Vector compression: apply OpenAI’s Matryoshka technique to truncate dimensions and cut storage cost by >60 %.

Cache vectors: index static documents once and reuse them for subsequent queries (see the sketch below).
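For the caching tip, LangChain provides a ready-made wrapper; a minimal sketch using CacheBackedEmbeddings with a local file store (the cache path is illustrative):

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_huggingface import HuggingFaceEmbeddings

underlying = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")
store = LocalFileStore("./embedding_cache")  # vectors persisted to disk

cached = CacheBackedEmbeddings.from_bytes_store(
    underlying, store, namespace=underlying.model_name
)
# First call computes and stores vectors; repeated calls hit the cache
doc_vectors = cached.embed_documents(["本产品自签收之日起7日内支持无理由退货"])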

6.3 Common Issues Checklist

Retrieval results all wrong → likely dimension mismatch; verify collection dimension.

Slow speed → probably running on CPU; switch to GPU or a smaller model.

Similarity scores always >0.9 → missing normalization; set normalize_embeddings=True.

Poor performance on long documents → max_length too small; increase it or use a model with larger context.

Thought Questions

If 99 % of your knowledge base is Chinese and 1 % English, would you choose BGE‑M3 or BGE‑large‑zh? Explain.

Two vectors have cosine similarity 0.95 but a large Euclidean distance. What could cause this?

When would you prefer a lightweight embedding model like m3e‑base over a stronger model like BGE‑M3?

Next Episode Preview

In the next episode we will dive into the retrieval stage of RAG: how to quickly find the most relevant vectors in a massive index.

Preview title: "Which Vector Search Engine Is Best? ANN Algorithms Explained + Hands‑On Comparison".
