Introduction to Vector Retrieval, Distance Metrics, and Fundamental Algorithms
This article introduces the concept of vector retrieval, outlines its diverse application scenarios, explains common distance metrics for both floating‑point and binary vectors, and surveys fundamental approximate nearest‑neighbor algorithms including tree‑based, graph‑based, quantization, and hashing methods.
1. Vector Retrieval Introduction
With the rapid growth of the Internet, massive amounts of unstructured data such as images, text, video, and audio are generated. After feature vectors are extracted from this data with AI techniques (subject to privacy constraints), vector retrieval aims to search those vectors efficiently, enabling analysis and retrieval of the original data.
1.1 Concept Overview
Vector retrieval focuses on computing similarity between feature vectors and returning the top‑K most similar vectors.
1.2 Application Scenarios
Recommendation systems (ads, "you may like" suggestions)
Image search (e.g., product image search, vehicle retrieval)
Natural language processing (semantic text search and recommendation)
Voice matching and audio retrieval
File deduplication via fingerprinting
Drug discovery searches
Examples include app splash‑screen ads, Taobao's "search by image", and search‑engine query suggestions.
2. Distance Computation
Vector similarity can be measured with various metrics, each suited to different retrieval scenarios. Floating‑point vectors and binary vectors use different distance formulas.
2.1 Floating‑Point Vector Metrics
Inner Product (IP)
Euclidean (L2)
Cosine similarity
2.2 Binary Vector Metrics
Hamming distance
Jaccard distance
Tanimoto distance
Before computing distances, vectors are often normalized so that their magnitude equals 1.
2.3 Inner Product Distance
The inner product measures the alignment of two vectors; a larger value indicates smaller angle and higher similarity, making it suitable for recommendation scenarios where direction matters more than magnitude.
2.4 Euclidean Distance
Euclidean distance computes the straight‑line distance between two points; smaller values indicate higher similarity. After normalization, Euclidean distance is monotonically related to cosine similarity.
2.5 Cosine Distance
Cosine distance evaluates the angle between two vectors; a larger cosine similarity (equivalently, a smaller cosine distance) indicates higher similarity. Because only the angle matters, the measure is insensitive to differences in vector magnitude.
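The three floating-point metrics above can be sketched with NumPy. The helper names (`inner_product`, `l2_distance`, `cosine_similarity`) are illustrative, not from any particular library; the final assertion checks the identity behind the monotonic relationship between L2 distance and cosine similarity for unit-normalized vectors.

```python
import numpy as np

def inner_product(a, b):
    # Larger value => smaller angle => higher similarity.
    return float(np.dot(a, b))

def l2_distance(a, b):
    # Straight-line distance; smaller => more similar.
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    # Angle-based; insensitive to vector magnitude.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 3.0])

# For unit-normalized vectors, ||a - b||^2 = 2 - 2*cos(a, b),
# so L2 distance and cosine similarity rank neighbors identically.
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert abs(l2_distance(an, bn) ** 2 - (2 - 2 * cosine_similarity(a, b))) < 1e-9
```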
2.6 Hamming Distance
Hamming distance counts the number of differing bits between two equal‑length binary strings; it equals the minimum number of substitutions required to transform one string into the other.
2.7 Jaccard Distance
Jaccard similarity is the ratio of the size of the intersection to the size of the union of two sets; Jaccard distance = 1 – Jaccard similarity. For binary data it is equivalent to Tanimoto distance.
2.8 Tanimoto Distance
For binary variables, the Tanimoto coefficient is equivalent to the Jaccard similarity, ranging from 0 to 1 with 1 indicating maximum similarity; the corresponding Tanimoto distance is 1 minus the coefficient, matching the Jaccard distance above.
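The binary metrics can be sketched directly on bit lists; the function names are illustrative. For binary data, the Jaccard computation below doubles as the Tanimoto coefficient.

```python
def hamming(a, b):
    # Count of positions where the two equal-length bit strings differ.
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(a, b):
    # 1 - |intersection| / |union| over set bits; for binary data this
    # is also the Tanimoto distance.
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return 1.0 - inter / union

a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 1, 0, 1, 0, 0, 1, 1]
```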
3. Fundamental Algorithms
Vector retrieval is essentially an Approximate Nearest Neighbor Search (ANNS) problem. The main families of ANNS algorithms are tree‑based, graph‑based, quantization‑based, and hashing‑based.
3.1 Tree‑Based Methods
Tree structures recursively partition the K‑dimensional space, allowing search to focus on a few sub‑spaces. They are simple to implement but scale poorly with high dimensionality.
KD‑Tree
Annoy (Approximate Nearest Neighbors Oh Yeah)
KD‑Tree
A KD‑Tree splits the space along the dimension with the largest variance, using the median as the split point. Construction continues recursively until leaf nodes contain few points.
Search proceeds by descending the tree according to the query point, backtracking when necessary to explore other branches that could contain closer neighbors.
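The build-and-backtrack procedure above can be sketched in pure Python with NumPy. This is a minimal illustration, not production code: real libraries (e.g. SciPy's `cKDTree`) use flat arrays and iterative search. The split dimension is chosen by largest variance and the median point is the pivot, as described above.

```python
import numpy as np

def build_kdtree(points):
    # Recursively split on the dimension with the largest variance,
    # using the median point along that dimension as the pivot.
    if len(points) == 0:
        return None
    points = np.asarray(points)
    axis = int(np.argmax(points.var(axis=0)))
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid]),
        "right": build_kdtree(points[mid + 1:]),
    }

def nearest(node, query, best=None):
    # Descend toward the query first, then backtrack into the far
    # branch only if the splitting plane is closer than the best hit.
    if node is None:
        return best
    d = float(np.linalg.norm(node["point"] - query))
    if best is None or d < best[1]:
        best = (node["point"], d)
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if abs(diff) < best[1]:
        best = nearest(far, query, best)
    return best

points = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0], [9.0, 9.0]])
tree = build_kdtree(points)
point, d = nearest(tree, np.array([2.1, 2.1]))
```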
Annoy
Annoy builds multiple random hyperplane trees to partition the space, reducing the chance that a query vector lies near a partition boundary. The index is stored as static read‑only files, enabling low memory usage and fast sharing across processes.
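The core split Annoy repeats can be sketched as follows: pick two random points, and use the hyperplane equidistant between them to divide the current set. This is a simplified illustration (the real library builds many trees, stores them in memory-mapped files, and unions candidate leaves at query time); all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_rp_tree(indices, vectors, leaf_size=4):
    # Split by the hyperplane midway between two random points;
    # Annoy builds many such trees and merges their candidate leaves.
    if len(indices) <= leaf_size:
        return indices                           # leaf: a candidate bucket
    a, b = vectors[rng.choice(indices, size=2, replace=False)]
    normal, midpoint = a - b, (a + b) / 2
    side = (vectors[indices] - midpoint) @ normal > 0
    if side.all() or (~side).all():              # degenerate split: stop here
        return indices
    return (build_rp_tree(indices[side], vectors, leaf_size),
            build_rp_tree(indices[~side], vectors, leaf_size))

def leaf_sizes(node):
    # Flatten the tree into the sizes of its leaf buckets.
    if isinstance(node, tuple):
        return leaf_sizes(node[0]) + leaf_sizes(node[1])
    return [len(node)]

vectors = rng.normal(size=(64, 16))
tree = build_rp_tree(np.arange(64), vectors)
```

Because the hyperplanes are random, a query near one tree's boundary is unlikely to be near the boundary in every tree, which is why querying several trees improves recall.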
3.2 Graph‑Based Methods
Graph‑based structures connect each vector to a set of nearest neighbors, forming a navigable small‑world graph. They offer fast query speed at the cost of higher indexing time and memory consumption.
NSW (Navigable Small World)
HNSW (Hierarchical Navigable Small World)
NSW
NSW inserts points one by one, linking each new point to its m nearest existing points, creating “highways” that accelerate search.
HNSW
HNSW extends NSW with multiple hierarchical layers; upper layers are sparser, providing long‑range shortcuts, while lower layers are denser for fine‑grained search. Search starts from the top layer and descends, quickly converging to the nearest neighbors.
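The greedy best-first search that both NSW and HNSW run (HNSW runs it once per layer) can be sketched as follows. This is a minimal, illustrative version with a toy graph; real implementations add neighbor-selection heuristics during insertion. The parameter `ef` bounds the result set, as in HNSW's `ef_search`.

```python
import heapq
import numpy as np

def greedy_search(graph, vectors, query, entry, ef=3):
    # Best-first search over a neighbor graph: a min-heap of candidates
    # to expand and a bounded max-heap of the best ef results so far.
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]      # min-heap by distance
    results = [(-dist(entry), entry)]        # max-heap keeping best ef
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break                            # no candidate can improve results
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(candidates, (dist(nb), nb))
                heapq.heappush(results, (-dist(nb), nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst
    return sorted((-d, i) for d, i in results)

# Toy graph: five points on a line, each linked to its neighbors.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = greedy_search(graph, vectors, np.array([3.9, 0.0]), entry=0)
```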
3.3 Quantization‑Based Methods
Quantization reduces high‑precision vectors to compact codes, trading a small amount of accuracy for large speed gains.
Scalar Quantization (SQ)
Product Quantization (PQ)
Product Quantization (PQ)
PQ splits a D‑dimensional vector into M sub‑vectors, quantizes each sub‑vector independently, and stores the indices of the nearest centroids. During search, distances are approximated using pre‑computed lookup tables, enabling fast asymmetric or symmetric distance computation.
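The encode-then-lookup scheme above can be sketched with NumPy. For brevity the codebooks here are random rather than trained with k-means, so this only illustrates the mechanics of encoding and asymmetric distance computation (ADC); all names and the toy sizes (D=8, M=4, K=16) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 8, 4, 16                # dimension, sub-vectors, centroids per codebook
d = D // M                        # sub-vector dimension

# Toy codebooks: K centroids per sub-space (real PQ trains them with k-means).
codebooks = rng.normal(size=(M, K, d))

def encode(x):
    # For each sub-vector, store the index of its nearest centroid.
    x = x.reshape(M, d)
    return np.array([np.argmin(((codebooks[m] - x[m]) ** 2).sum(axis=1))
                     for m in range(M)])

def adc_distance(query, code):
    # Asymmetric distance: precompute per-sub-space lookup tables for the
    # raw query, then sum the table entries picked out by the stored code.
    q = query.reshape(M, d)
    tables = ((codebooks - q[:, None, :]) ** 2).sum(axis=2)   # shape (M, K)
    return float(sum(tables[m, code[m]] for m in range(M)))

x = rng.normal(size=D)
code = encode(x)                  # M small integers instead of D floats
```

The memory saving comes from storing `M` centroid indices per vector instead of `D` floats; the lookup tables make each approximate distance a handful of table reads and adds.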
3.4 Hashing‑Based Methods
Locality‑Sensitive Hashing (LSH) maps similar vectors to the same hash bucket with high probability, allowing fast retrieval by examining only the bucket’s contents.
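A classic instance for angular similarity is random-hyperplane LSH, sketched below: each bit of the signature records which side of a random hyperplane the vector falls on, so vectors with a small angle between them tend to share most bits and land in the same bucket. The helper names are illustrative; real deployments use several hash tables to boost recall.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
planes = rng.normal(size=(16, 64))   # 16 random hyperplanes in 64-d space

def signature(v):
    # One bit per hyperplane: which side of the plane v falls on.
    # The sign pattern is invariant to positive scaling of v.
    return tuple((planes @ v > 0).astype(int))

# Index: bucket every vector by its signature.
data = rng.normal(size=(100, 64))
buckets = defaultdict(list)
for i, v in enumerate(data):
    buckets[signature(v)].append(i)

def query(q):
    # Only the query's own bucket is scanned, not the whole dataset.
    return buckets.get(signature(q), [])
```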
4. Summary
This article presented the concept, application scenarios, distance metrics, and core algorithms of vector retrieval. While the overview covers the main ideas, readers are encouraged to explore each algorithm in depth for practical deployment and performance tuning.
IEG Growth Platform Technology Team
Official account of Tencent IEG Growth Platform Technology Team, showcasing cutting‑edge achievements across front‑end, back‑end, client, algorithm, testing and other domains.