From Bag‑of‑Words to Semantics: How Embeddings Turn Meaning into Numbers (Part 2)

The article explains how embedding techniques encode semantic information into numeric vectors, covering Word2Vec and GloVe fundamentals, BERT anisotropy, SimCSE contrastive learning, alignment and uniformity metrics, ANN index structures such as HNSW, IVF and PQ, Matryoshka representation learning, practical deployment challenges, and evaluation best practices.


Embedding Goal

Embedding aims to encode semantic information into numeric vectors so that semantically similar content is close in vector space.

Word2Vec

CBOW vs. Skip‑gram

CBOW predicts the center word from its context (fast, good for high‑frequency words); Skip‑gram predicts surrounding words from the center word (better for low‑frequency words and more widely used).
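A minimal gensim sketch contrasting the two modes (the toy corpus and hyperparameters here are illustrative, not from the original article):

from gensim.models import Word2Vec

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["dogs", "and", "cats", "are", "popular", "pets"]]  # toy stand-in corpus

# sg=0 -> CBOW (predict the center word from context); sg=1 -> Skip-gram
cbow = Word2Vec(corpus, vector_size=50, window=2, sg=0, min_count=1, epochs=10)
skipgram = Word2Vec(corpus, vector_size=50, window=2, sg=1, min_count=1, epochs=10)

print(skipgram.wv.most_similar("cat", topn=3))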

The training objective maximises the corpus‑wide log‑likelihood, with each conditional probability computed by Softmax. Because the vocabulary size W can be hundreds of thousands, exact Softmax is expensive, leading to two approximations:

Hierarchical Softmax : replaces the full Softmax with a binary Huffman tree over the vocabulary, reducing per-word complexity from O(W) to O(log W), but it is less GPU‑friendly.

Negative Sampling : samples k negative words (k≈5–20) per update, turning the multi‑class problem into k + 1 binary classifications. Negative words are sampled proportionally to the 3/4 power of their frequency, which flattens the distribution and gives rare words more probability mass than raw frequency sampling would.
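As a rough sketch of the sampling distribution (the word counts below are made up for illustration):

import numpy as np

word_freq = {"the": 5000, "cat": 50, "serendipity": 2}   # hypothetical counts
words = list(word_freq)
counts = np.array([word_freq[w] for w in words], dtype=float)

probs = counts ** 0.75            # 3/4 power of raw frequency
probs /= probs.sum()              # raw frequency would give "the" ~99% of draws;
                                  # the 3/4 power shifts probability toward rare words
negatives = np.random.choice(words, size=5, p=probs)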

GloVe

GloVe builds a global co‑occurrence matrix counting how often word i appears near word j, then fits the log‑co‑occurrence probability with separate target and context vectors and a weighting function that down‑weights very frequent pairs, avoiding domination by stop words. Like Word2Vec, GloVe yields static vectors that cannot distinguish polysemy.
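For reference, a small sketch of the per-pair objective with the weighting function from the GloVe paper (x_max = 100, α = 0.75):

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Down-weights very frequent co-occurrence counts; capped at 1.
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    # Weighted least-squares fit of the log co-occurrence count for one (i, j) pair.
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2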

Anisotropy in BERT

Pre‑trained BERT vectors suffer from anisotropy: vectors occupy a narrow cone on the hypersphere, causing cosine similarities between random sentence embeddings to be >0.9 (sometimes >0.99). This stems from the pre‑training objectives (MLM, NSP) not enforcing that similar sentences be close and dissimilar sentences be far.
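A quick way to see this in practice is to measure the mean pairwise cosine similarity of a batch of sentence embeddings; the sketch below assumes an (n, d) array of mean-pooled BERT outputs:

import numpy as np

def mean_pairwise_cosine(embs):
    # embs: (n, d) sentence embeddings; values close to 1 indicate anisotropy
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embs)
    return (sims.sum() - n) / (n * (n - 1))   # average over off-diagonal pairs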

SimCSE

SimCSE proposes a simple unsupervised contrastive method: feed the same sentence through the encoder twice with different Dropout masks, treat the two outputs as a positive pair, and treat all other sentences in the batch as negatives. An ablation shows that removing Dropout (so the two views become identical) causes the representations to collapse.

The supervised variant uses NLI entailment pairs as positives and contradiction pairs as hard negatives.

The InfoNCE loss is:

loss = -log( exp(sim(z_i, z_i^+) / τ) / Σ_j exp(sim(z_i, z_j) / τ) )

where τ (typically 0.05) controls distribution sharpness; larger batch sizes provide more negatives, which is crucial for learning meaningful semantic boundaries.
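A minimal PyTorch sketch of the unsupervised objective, assuming z1 and z2 are the two Dropout-perturbed encodings of the same batch of sentences (the encoder call itself is omitted):

import torch
import torch.nn.functional as F

def simcse_loss(z1, z2, tau=0.05):
    # z1, z2: (batch, dim) embeddings of the same sentences under two Dropout masks
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / tau                  # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0))      # positives sit on the diagonal
    return F.cross_entropy(sim, labels)    # InfoNCE as softmax cross-entropy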

Alignment & Uniformity

Wang & Isola (2020) define two geometric metrics for contrastive learning:

Alignment : average distance of positive pairs (smaller is better).

Uniformity : how uniformly vectors spread on the hypersphere (smaller is better).
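Both metrics can be computed directly from L2-normalised embeddings; the hyperparameters α = 2 and t = 2 below follow the original paper:

import torch

def alignment(x, y, alpha=2):
    # x, y: (n, dim) normalised embeddings of positive pairs
    return (x - y).norm(dim=1).pow(alpha).mean()

def uniformity(x, t=2):
    # x: (n, dim) normalised embeddings; lower = more uniform spread on the hypersphere
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()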

ANN Index Algorithms

HNSW (Hierarchical Navigable Small World)

Inspired by skip‑lists and small‑world networks, HNSW builds a multi‑layer graph in which each node appears in a random subset of layers (with exponentially decaying probability). Insertion greedily searches from the top layer down and connects each new node to its nearest neighbours in every layer it occupies.

Key parameters:

M : number of bi‑directional edges per node (typically 16–64); larger M improves recall but uses more memory.

ef_construction : candidate set size during graph building; higher values improve graph quality at the cost of longer build time.

ef_search : candidate set size at query time; balances recall against latency.

HNSW supports incremental insertion; deletions are handled by logical marking and periodic rebuilding.
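A small hnswlib sketch showing where M, ef_construction and ef_search enter (random vectors stand in for real embeddings; parameter values are illustrative):

import numpy as np
import hnswlib

dim, n = 768, 100_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=32, ef_construction=200)
index.add_items(data, np.arange(n))       # incremental inserts are also supported

index.set_ef(128)                          # ef_search: recall vs. latency trade-off
labels, distances = index.knn_query(data[:5], k=10)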

IVF (Inverted File)

IVF clusters all vectors with k‑means, builds an inverted list from centroids to member vectors, and at query time first finds the nearest centroids before performing exact search inside those clusters, dramatically reducing comparisons.

Key parameter nprobe controls how many centroids are scanned; larger nprobe raises recall but also latency. IVF requires a training phase for k‑means, which can be costly on very large datasets.
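A FAISS sketch of the same idea, with nprobe controlling how many clusters are scanned (sizes and parameters are illustrative):

import numpy as np
import faiss

dim, nlist = 768, 1024
xb = np.random.rand(100_000, dim).astype(np.float32)   # database vectors (stand-in)

quantizer = faiss.IndexFlatL2(dim)                      # coarse quantizer over centroids
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(xb)                                         # k-means training phase
index.add(xb)

index.nprobe = 16                                       # clusters scanned per query
D, I = index.search(xb[:5], 10)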

Product Quantization (PQ)

PQ splits a high‑dimensional vector into m sub‑vectors, quantises each sub‑vector independently with its own k‑means codebook, and stores only the centroid IDs. At query time, a pre‑computed lookup table estimates distances quickly.
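The mechanics can be sketched with plain k-means per subspace (m = 8 sub-vectors, 256 centroids each, so every vector compresses to 8 bytes of codes):

import numpy as np
from sklearn.cluster import KMeans

def pq_train_encode(X, m=8, k=256):
    # Split each vector into m sub-vectors and learn a k-centroid codebook per subspace.
    n, d = X.shape
    sub = d // m
    codebooks, codes = [], np.empty((n, m), dtype=np.uint8)
    for i in range(m):
        part = X[:, i * sub:(i + 1) * sub]
        km = KMeans(n_clusters=k, n_init=4).fit(part)
        codebooks.append(km.cluster_centers_)
        codes[:, i] = km.labels_           # store only the centroid ID per sub-vector
    return codebooks, codes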

IVF‑PQ Combination

In production, IVF is often used for coarse filtering and PQ for compressing candidate vectors. This hybrid (the standard FAISS IVF‑PQ recipe) can achieve sub‑100 ms responses on billion‑scale datasets.
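In FAISS this corresponds to IndexIVFPQ, roughly as follows (sizes and parameters are illustrative):

import numpy as np
import faiss

dim, nlist, m, nbits = 768, 4096, 64, 8                    # 64 sub-vectors, 1 byte each
xb = np.random.rand(500_000, dim).astype(np.float32)

quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)  # IVF coarse filter + PQ codes
index.train(xb)
index.add(xb)

index.nprobe = 32
D, I = index.search(xb[:5], 10)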

Matryoshka Representation Learning (MRL)

MRL trains a single model whose prefix dimensions already carry semantic meaning, allowing vectors to be truncated (e.g., 768 → 256 → 128 → 64) to trade latency, storage and bandwidth without retraining separate models.

The loss sums the losses computed at multiple truncation points, typically with equal weighting, though practitioners may weight low‑dimensional losses higher to strengthen coarse representations.

OpenAI’s text‑embedding‑3‑small and text‑embedding‑3‑large (Jan 2024) adopt MRL and expose a dimensions parameter for truncation; the original MRL paper reported up to 14× smaller representations with no loss in ImageNet classification accuracy and up to 14× real‑world speed‑ups in large‑scale retrieval.
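Using a truncated MRL embedding amounts to slicing and re-normalising; this only works if the model was trained with MRL (the function below is a generic sketch, not a specific library API):

import numpy as np

def truncate_and_renormalize(emb, dims=256):
    # Keep the first `dims` coordinates of an MRL-trained embedding, then re-normalise
    # so cosine similarity stays meaningful at the reduced dimension.
    short = emb[..., :dims]
    return short / np.linalg.norm(short, axis=-1, keepdims=True)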

Practical Deployment Issues

Query‑Document Asymmetry

User queries are short while documents are long, so their embeddings occupy systematically different regions of the space, and a symmetric Bi‑Encoder similarity tends to under‑score genuine query‑document matches.

Solution: instruction‑tuned embeddings add role‑specific prefixes, e.g.:

Query side:  "query: How to improve RAG recall?"
Document side: "passage: Methods to improve RAG recall include..."

Models such as GTE‑Qwen2 and BGE‑M3 support this asymmetric encoding.
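A sentence-transformers sketch of asymmetric encoding; the exact prefix convention varies by model, so the model name and prefixes below are an assumption to be checked against the model card:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")   # prefix-aware model (assumed)
q_emb = model.encode(["query: How to improve RAG recall?"], normalize_embeddings=True)
d_emb = model.encode(["passage: Methods to improve RAG recall include..."],
                     normalize_embeddings=True)
score = float(q_emb @ d_emb.T)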

Domain Mismatch

General‑purpose embeddings excel on benchmarks like MTEB but degrade on vertical domains (finance, medical, legal) because domain‑specific terminology is scarce in pre‑training data.

Common remedies:

LLM‑generated synthetic queries (3–5 per document chunk), forming <query, positive_passage> pairs.

Hard negative mining with BM25 top‑20 results that are not truly relevant.

Fine‑tuning on domain data using MultipleNegativesRankingLoss (5 k–20 k pairs usually suffice).
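A hedged sentence-transformers sketch of such a fine-tune; the base model and hyperparameters are assumptions, and MultipleNegativesRankingLoss treats every other in-batch passage as a negative:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# <query, positive_passage> pairs, e.g. LLM-generated synthetic queries per chunk
train_examples = [
    InputExample(texts=["What is the margin requirement?", "Margin requirements are ..."]),
    # ... 5k-20k such pairs
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")            # base model is an assumption
loader = DataLoader(train_examples, shuffle=True, batch_size=64)
loss = losses.MultipleNegativesRankingLoss(model)               # in-batch negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)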

Filtering vs. ANN

Metadata filters (e.g., “year = 2026”) can be applied before or after ANN:

Pre‑filter : reduces candidate set but may leave too few vectors for ANN to be effective.

Post‑filter : ANN retrieves ef × k candidates, then filters; strict filters can leave far fewer than k results.
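The post-filter pattern can be sketched as over-fetching and then filtering; the index.search call and hit fields below are hypothetical placeholders for whatever vector store is in use:

def post_filter_search(index, query_vec, k, predicate, overfetch=4):
    # Over-fetch candidates from ANN, then apply the metadata predicate;
    # strict filters may still return fewer than k results.
    candidates = index.search(query_vec, k * overfetch)   # hypothetical search API
    hits = [c for c in candidates if predicate(c.metadata)]
    return hits[:k]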

Best practice: use low‑cardinality filter fields (< 1 000 distinct values) for partitioned indexes; high‑cardinality fields (user_id, document_id) are better handled by post‑filtering or by payload‑based pruning in HNSW‑like systems (Qdrant, Weaviate).

Real‑time Index Updates

HNSW allows incremental inserts but deletions are logical only; frequent updates degrade graph quality.

Typical solution: maintain a fresh flat index for recent inserts and a full HNSW for the bulk; merge them during low‑traffic windows (blue‑green switch) to keep latency stable.
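A rough sketch of the two-tier query path (both index objects and their search methods are hypothetical placeholders):

def two_tier_search(bulk_hnsw, fresh_flat, query_vec, k):
    # Query the large HNSW index and the small flat index of recent inserts,
    # then merge by distance; both are assumed to return (id, distance) pairs.
    hits = bulk_hnsw.search(query_vec, k) + fresh_flat.search(query_vec, k)
    return sorted(hits, key=lambda h: h[1])[:k]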

Memory Cost

On‑disk mode stores full‑precision vectors on SSD while keeping only quantised indexes (PQ codebooks, inverted lists) in RAM. Queries first use the RAM index to shortlist candidates, then fetch raw vectors from disk for exact re‑ranking. DiskANN follows this pattern; Google’s ScaNN adds anisotropic quantisation for higher accuracy.

Embedding Evaluation

Standard academic metrics: recall@K and NDCG@K on benchmarks such as BEIR and MTEB.
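For reference, a minimal sketch of both metrics under binary relevance:

import numpy as np

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of relevant documents that appear in the top-k results.
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    # Discounted gain of hits in the top-k versus the ideal ranking.
    gains = [1.0 if doc in relevant_ids else 0.0 for doc in retrieved_ids[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / idcg if idcg > 0 else 0.0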

Production‑grade monitoring adds:

Latency distribution (p50, p95, p99) – p99 spikes often come from worst‑case HNSW paths or PQ precision loss.

Recall‑to‑business‑metric gap – a 0.90→0.95 recall lift may only improve downstream LLM answer accuracy by 1–2%.

Distribution drift – periodic (e.g., monthly) re‑evaluation to detect embedding quality decay as query/document vocabularies evolve.

Golden set regression – a curated set of 50–200 challenging queries (synonyms, multi‑hop reasoning, negation) that must pass after any model or index change.

Continuous iteration – integrate the golden set and an end‑to‑end LLM‑as‑judge pipeline so embedding quality becomes a quantifiable, monitorable KPI.

Multi‑Vector Models

Single‑vector embeddings have a theoretical expressiveness ceiling for complex logical queries (e.g., “dynamic programming AND graph algorithms”).

Late‑interaction models like ColBERT score a document by matching each query token against its most similar document token (MaxSim) and summing these per‑token scores. Document token vectors can be pre‑computed offline, which makes deployment feasible, but storage cost is high (one vector per token) and query latency increases.
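The MaxSim scoring step itself is small; the sketch below assumes pre-normalised token embeddings for one query and one document:

import torch

def maxsim_score(query_tok, doc_tok):
    # query_tok: (q, dim), doc_tok: (d, dim) L2-normalised token embeddings
    sim = query_tok @ doc_tok.T             # (q, d) token-level cosine similarities
    return sim.max(dim=1).values.sum()      # best document token per query token, summed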

Decoder‑based embeddings (GritLM, E5‑Mistral, Qwen3‑Embedding) adapt LLM backbones, for example by enabling bidirectional attention or using last‑token or mean pooling, to produce strong representations.

Recommendations

Model selection: use BGE‑M3 or text‑embedding‑3‑large for general use; fine‑tune for vertical domains when the gap is large.

Index selection: HNSW for tens of millions of vectors when memory permits; IVF‑PQ or DiskANN for billions; hybrid IVF + HNSW is the engineering standard.

Dimension optimisation: leverage MRL‑enabled models with 256–512 dimensions for ANN coarse recall and full dimensions for final re‑ranking.

Retrieval strategy: adopt hybrid search (dense + BM25) with a reranker to cover each method’s blind spots.

Monitoring: maintain a golden set and an LLM‑judge pipeline to turn embedding quality into a measurable, continuously improving metric.
