From Bag‑of‑Words to Semantic Vectors: Understanding Embeddings and Similarity Search (Part 1)
The article explains how diverse data can be represented as high‑dimensional vectors, describes exact and approximate nearest‑neighbor search, explores vector quantization, product quantization, locality‑sensitive hashing, and HNSW graphs, and analyzes their speed, accuracy, and memory trade‑offs for large‑scale similarity retrieval.
How Computers Recognize Similarity
By representing objects as high‑dimensional vectors, spatial distance directly reflects feature similarity. Two Golden Retrievers map to nearby points, while a Golden Retriever and a Chihuahua map far apart. This principle underlies image‑based search, music recommendation, and semantic understanding in large language models.
1 Vectorization and Nearest‑Neighbor Search Basics
1.1 Everything Can Be Vectorized
Images, product attributes, document semantics, and user interests can be compressed by models (e.g., CNN for images, Transformer for text) into fixed‑length floating‑point arrays called embeddings. Mapping heterogeneous data to a common vector space makes them comparable.
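As a minimal sketch of what this looks like in code (assuming the open‑source sentence‑transformers package and its all‑MiniLM‑L6‑v2 model, neither of which the article names), two sentences become fixed‑length vectors like this:

```python
# Sketch: turning text into fixed-length embedding vectors.
# Assumes the sentence-transformers package and the all-MiniLM-L6-v2 model
# are available; any other embedding model works the same way.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "a golden retriever catching a frisbee",
    "a chihuahua barking at the mail carrier",
])
print(embeddings.shape)  # (2, 384): two sentences, 384 floats each
```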
1.2 Nearest‑Neighbor Problem: Geometric Similarity Search
Given a query vector, the task is to find the closest (or top‑K closest) vectors among massive candidates; this is Nearest Neighbor Search (NNS). Two common distance measures are:
Euclidean distance: straight‑line distance between points, suitable for dense image features.
Cosine similarity: based on the angle between vectors; a smaller angle means higher similarity. It is insensitive to magnitude and commonly used for text.
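A minimal numpy sketch of both measures, with made‑up three‑dimensional vectors:

```python
import numpy as np

a = np.array([0.9, 0.1, 0.3])   # e.g., features of one Golden Retriever photo
b = np.array([0.8, 0.2, 0.4])   # features of another Golden Retriever photo

# Euclidean distance: straight-line distance between the two points.
euclidean = np.linalg.norm(a - b)

# Cosine similarity: cosine of the angle between the vectors, magnitude-free.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, cosine)
```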
1.3 Brute‑Force Search: Exact but Unscalable
Brute‑force computes the distance between the query and every stored vector, guaranteeing 100 % precision but with linear time complexity. For 100 million 128‑dimensional vectors, each query requires about 12.8 billion floating‑point operations, which is infeasible in production.
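A brute‑force top‑K query is just a full scan; a sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(100_000, 128)).astype(np.float32)  # stored vectors
query = rng.normal(size=128).astype(np.float32)

# Compute the distance to every stored vector, then keep the K smallest.
distances = np.linalg.norm(database - query, axis=1)
top_k = np.argsort(distances)[:10]   # indices of the 10 nearest vectors
```

Exact, but the cost grows linearly with the number of stored vectors.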
1.4 Approximate Nearest Neighbor (ANN): Trade Accuracy for Speed
ANN shrinks the search space in advance, accepting a slight loss of accuracy for orders‑of‑magnitude speed gains. The simplest optimization is clustering (e.g., K‑Means) to create centroids; at query time the query is compared only against the centroids, the nearest centroid is selected, and an exact search is performed inside that cluster.
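A sketch of this clustering shortcut, assuming scikit‑learn's KMeans as the clustering step (any K‑Means implementation would do; sizes are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
database = rng.normal(size=(50_000, 128)).astype(np.float32)

# Offline: partition the database into 256 cells around K-Means centroids.
kmeans = KMeans(n_clusters=256, n_init=1, random_state=0).fit(database)
cell_of = kmeans.labels_

def ann_search(query, k=10):
    # Online: compare the query with the 256 centroids only...
    nearest_cell = np.argmin(np.linalg.norm(kmeans.cluster_centers_ - query, axis=1))
    members = np.where(cell_of == nearest_cell)[0]
    # ...then run an exact search inside that single cell.
    d = np.linalg.norm(database[members] - query, axis=1)
    return members[np.argsort(d)[:k]]
```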
Speed‑Accuracy‑Memory Triangle: approximate search speeds up retrieval, but for high‑dimensional data the structures it relies on can cause memory to explode.
2 Vector Quantization and the Curse of Dimensionality
2.1 Basic Model of Vector Quantization
Vector Quantization (VQ) approximates each original vector by the nearest codeword from a finite codebook. Only the codeword index (a few bits) is stored, discarding the full floating‑point values and achieving high compression.
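A minimal VQ sketch: replace each vector by the index of its nearest codeword (the codebook here is random for illustration; in practice it would be trained, e.g., with K‑Means):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 16)).astype(np.float32)

# Codebook of 256 codewords; random here for illustration, trained in practice.
codebook = rng.normal(size=(256, 16)).astype(np.float32)

# Encode: each 16-dim float32 vector (64 bytes) becomes one 8-bit index.
codes = np.argmin(
    np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2), axis=1
).astype(np.uint8)

# Decode (lossy): replace each index by its codeword.
reconstructed = codebook[codes]
```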
2.2 Dimensionality Curse: Exponential Growth of Codebook Size
Rate‑distortion theory states that for uniformly distributed vectors, achieving a fixed average distortion requires a codebook size that grows exponentially with dimensionality. Controlling error per dimension quickly leads to a codebook size far exceeding the number of atoms in the universe.
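The arithmetic behind this claim, assuming a single flat codebook that spends a fixed number of bits per dimension:

```python
# With a single flat codebook, spending b bits of precision per dimension on a
# d-dimensional vector requires roughly 2 ** (b * d) codewords.
b = 2  # a very modest 2 bits per dimension
for d in (8, 32, 128):
    print(d, 2 ** (b * d))
# At d = 128 this is 2**256, about 1.2e77 codewords, already in the same range
# as the estimated number of atoms in the observable universe (~1e80).
```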
3 Product Quantization (PQ) – A Divide‑and‑Conquer Strategy
3.1 Core Idea: Subspace Decomposition
Product Quantization splits the original M‑dimensional space into m disjoint subspaces (each of size M/m) and trains an independent small codebook of size k for each subspace. The effective codebook size becomes k^m while storage per vector remains m × log₂k bits.
Example: For 128‑dimensional vectors, PQ with 8 subspaces and 256 codewords per subspace stores 8 × log₂256 = 64 bits, i.e. 8 bytes per vector (about 8 MB for 1 M vectors), while the eight codebooks together occupy only 8 × 256 × 16 × 4 bytes ≈ 128 KB. Storing the raw float32 vector would require 128 × 4 = 512 bytes.
3.2 Asymmetric Distance Computation (ADC)
During a query, PQ builds a distance lookup table between each query subvector and every codeword of the corresponding subspace. The approximate distance to any encoded vector is obtained by summing the corresponding table entries: only m table look‑ups and additions per vector, dramatically faster than a full Euclidean computation.
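A compact sketch combining PQ encoding and ADC, again assuming scikit‑learn's KMeans for the per‑subspace codebooks; sizes are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

M, m, k = 128, 8, 256           # total dims, subspaces, codewords per subspace
sub = M // m                    # 16 dims per subspace

rng = np.random.default_rng(0)
database = rng.normal(size=(20_000, M)).astype(np.float32)

# Train one small codebook per subspace; encode each vector as m uint8 codes.
codebooks = []
codes = np.empty((len(database), m), dtype=np.uint8)
for j in range(m):
    block = database[:, j * sub:(j + 1) * sub]
    km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(block)
    codebooks.append(km.cluster_centers_.astype(np.float32))
    codes[:, j] = km.labels_

def adc_search(query, topk=10):
    # Lookup table: squared distance from each query subvector to every codeword.
    table = np.stack([
        ((codebooks[j] - query[j * sub:(j + 1) * sub]) ** 2).sum(axis=1)
        for j in range(m)
    ])                                              # shape (m, k)
    # Approximate distance to a database vector = sum of m table entries.
    approx = table[np.arange(m), codes].sum(axis=1)
    return np.argsort(approx)[:topk]
```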
3.3 Statistical Properties of Quantization Error
Assuming zero‑mean, independent quantization errors across subspaces, the total expected squared error is the sum of the per‑subspace distortions. In practice PQ's approximate distances retain a high monotonic correlation with the true distances. For 128‑dimensional vectors, PQ reduces memory usage by about 98 % while preserving retrieval quality.
4 Locality‑Sensitive Hashing (LSH)
4.1 Design Goal and Collision Probability
LSH constructs a family of hash functions such that the collision probability is a monotonic decreasing function of distance – more similar points are more likely to hash to the same bucket. This converts costly floating‑point geometry into fast bit‑wise operations.
4.2 Random Hyperplane LSH for Cosine Similarity
For cosine similarity, a random unit vector drawn from a standard normal distribution defines a hash function that outputs the sign of the dot product with the input vector. For two unit vectors with angle θ, the collision probability is 1 – θ/π; smaller angles yield higher probability.
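A sketch of a single random‑hyperplane hash function:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128
hyperplane = rng.normal(size=dim)   # random direction drawn from N(0, I)

def lsh_bit(x):
    # The hash is the sign of the dot product: which side of the random
    # hyperplane the vector falls on.
    return int(np.dot(hyperplane, x) >= 0)

# Two unit vectors at angle theta land on the same side, i.e. collide,
# with probability 1 - theta / pi.
```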
4.3 Composite AND‑OR Hashing to Boost Discrimination
Single hash functions have limited discriminative power, so LSH uses composite hashing:
AND combination: concatenate L independent hash functions; the collision probability becomes the product of the individual probabilities, sharpening sensitivity to similarity changes.
OR combination: generate T independent composite hash tables; a candidate is accepted if it collides in any table, improving recall.
By tuning L and T, one can balance recall and filtering efficiency, achieving sub‑linear query time.
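A sketch of the AND‑OR construction with L hyperplanes per table and T tables; all names and sizes are illustrative:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, L, T = 128, 16, 8                    # L bits per signature, T hash tables
planes = rng.normal(size=(T, L, dim))     # T independent sets of L hyperplanes

def signature(x, t):
    # AND step: concatenate L sign bits into one bucket key for table t.
    return tuple((planes[t] @ x >= 0).astype(int).tolist())

# Index a toy database.
database = rng.normal(size=(10_000, dim))
tables = [defaultdict(list) for _ in range(T)]
for i, v in enumerate(database):
    for t in range(T):
        tables[t][signature(v, t)].append(i)

def candidates(query):
    # OR step: a point is a candidate if it collides in any of the T tables;
    # the few survivors are then re-ranked with exact distances.
    found = set()
    for t in range(T):
        found.update(tables[t].get(signature(query, t), []))
    return found
```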
5 Hierarchical Navigable Small World Graph (HNSW)
5.1 Inspiration from the Small‑World Phenomenon
Treat each vector as a node that records the addresses of a few neighbors. Traversing these links allows rapid jumps across the space, analogous to the “six degrees of separation” phenomenon.
5.2 Construction Mechanism of NSW Graph
When a new node is inserted:
Select a random entry node from the current graph.
Run a greedy search from it to find the ef_construction nearest nodes (a code sketch of this greedy search follows the link types below).
Create bidirectional connections between the new node and those nearest neighbors.
Optionally prune connections to keep the local graph sparse.
This yields two connection types:
Short intra‑cluster links: connect very close points, ensuring local precision.
Long inter‑cluster links: act as “highways” that let the search quickly jump across distant regions.
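Assuming such a graph has already been built by the insertion steps above, the greedy (best‑first) search over it can be sketched as follows:

```python
import numpy as np

def greedy_search(vectors, graph, query, entry, ef=10):
    """Best-first search over a neighbor graph: keep the ef closest nodes seen
    so far and expand candidates until none can improve the result."""
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]   # nodes waiting to be expanded
    best = [(dist(entry), entry)]         # current ef best results
    while candidates:
        candidates.sort()
        d, node = candidates.pop(0)
        if d > max(b[0] for b in best):
            break                          # no remaining candidate can help
        for nb in graph[node]:             # follow short and long links alike
            if nb not in visited:
                visited.add(nb)
                candidates.append((dist(nb), nb))
                best.append((dist(nb), nb))
        best = sorted(best)[:ef]
    return [node for _, node in best]
```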
5.3 Hierarchical Structure of HNSW
Each node is assigned a maximum level following a geometric distribution; higher levels contain only a few nodes, providing coarse navigation, while lower levels contain all nodes for fine‑grained comparison.
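A sketch of the usual level‑assignment rule (an exponentially decaying distribution; the constant here is illustrative):

```python
import math, random

def random_level(m_l=1 / math.log(16)):
    # Exponentially decaying level assignment: with this constant roughly
    # 1/16 of nodes reach level 1, 1/256 reach level 2, and so on.
    return int(-math.log(1.0 - random.random()) * m_l)
```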
Layer characteristics:
Top layer: very few nodes (single digits), very large connection span, used to quickly locate the target region.
Middle layers: increasing node count and gradually shorter connection span, guiding the search into the correct sub‑region.
Bottom layer (level 0): all nodes with dense local connections, performing precise comparison and returning the final results.
5.4 Query Complexity Analysis
Within each layer, greedy search averages about c·log N steps, where N is the number of nodes in that layer. Overall query complexity is sub‑linear; experiments show millisecond‑level latency on datasets ranging from millions to billions of vectors, with recall above 95 %.
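A sketch of the layered descent, reusing the greedy_search routine sketched in Section 5.2; layers is assumed to be a list of per‑layer adjacency maps from the top layer down to layer 0:

```python
def hnsw_query(vectors, layers, query, entry, ef=50, k=10):
    """layers: list of per-layer adjacency maps, top layer first, layer 0 last.
    Reuses the greedy_search routine sketched in Section 5.2."""
    node = entry
    # Upper layers: coarse navigation, track only the single closest node.
    for graph in layers[:-1]:
        node = greedy_search(vectors, graph, query, node, ef=1)[0]
    # Bottom layer: widen the beam to ef and return the k nearest nodes.
    return greedy_search(vectors, layers[-1], query, node, ef=ef)[:k]
```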
5.5 Memory Cost and Optimizations
Each node maintains up to maxM neighbors per layer, so total edge count is roughly maxM × N. For 100 million 128‑dimensional vectors (raw data ≈ 48 GB), an HNSW index may add another 30–50 GB of graph storage.
Common combined indexing strategies:
HNSW index: stores only the graph and the original vectors; highest memory usage but best recall and speed, suitable for tens of millions of vectors with ample memory.
IVF_PQ index: combines IVF clustering with PQ compression; minimal memory, suitable for billions of vectors with moderate accuracy requirements.
HNSW_PQ index: builds HNSW on top of PQ‑compressed vectors; reduces memory by an order of magnitude while retaining HNSW's high recall and fast navigation, ideal for industrial‑scale (> 1 billion) datasets.