Introduction to Vector Retrieval, Distance Metrics, and Fundamental Algorithms
This article introduces the concept of vector retrieval, outlines its diverse application scenarios, explains common distance metrics for both floating‑point and binary vectors, and surveys fundamental approximate nearest‑neighbor algorithms including tree‑based, graph‑based, quantization, and hashing methods.
1. Vector Retrieval Introduction
With the rapid growth of the Internet, massive amounts of unstructured data such as images, text, video, and audio are generated. After feature vectors are extracted from this data with AI techniques (subject to privacy constraints), vector retrieval aims to search those vectors efficiently, enabling analysis and retrieval of the original data.
1.1 Concept Overview
Vector retrieval focuses on computing similarity between feature vectors and returning the top‑K most similar vectors.
1.2 Application Scenarios
Recommendation systems (ads, "you may like" suggestions)
Image search (e.g., product image search, vehicle retrieval)
Natural language processing (semantic text search and recommendation)
Voice matching and audio retrieval
File deduplication via fingerprinting
Drug discovery searches
Examples include app splash‑screen ads, Taobao's "search by image", and search‑engine query suggestions.
2. Distance Computation
Vector similarity can be measured with various metrics, each suited to different retrieval scenarios. Floating‑point vectors and binary vectors use different distance formulas.
2.1 Floating‑Point Vector Metrics
Inner Product (IP)
Euclidean (L2)
Cosine similarity
2.2 Binary Vector Metrics
Hamming distance
Jaccard distance
Tanimoto distance
Before computing distances, vectors are often normalized so that their magnitude equals 1.
2.3 Inner Product Distance
The inner product measures the alignment of two vectors; a larger value indicates smaller angle and higher similarity, making it suitable for recommendation scenarios where direction matters more than magnitude.
2.4 Euclidean Distance
Euclidean distance computes the straight‑line distance between two points; smaller values indicate higher similarity. After normalization, Euclidean distance is monotonically related to cosine similarity.
2.5 Cosine Distance
Cosine distance evaluates the angle between two vectors; a larger cosine similarity (equivalently, a smaller cosine distance) indicates higher similarity. Because only the angle matters, the measure is insensitive to differences in vector magnitude.
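The three floating-point metrics above can be sketched with NumPy. The helper names (`inner_product`, `l2_distance`, `cosine_similarity`) are illustrative, not from any particular library; the final assertion checks the identity behind the monotonic relationship between L2 distance and cosine similarity for unit-normalized vectors.

```python
import numpy as np

def inner_product(a, b):
    # Larger value => smaller angle => higher similarity.
    return float(np.dot(a, b))

def l2_distance(a, b):
    # Straight-line distance; smaller => more similar.
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    # Angle-based; insensitive to vector magnitude.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 3.0])

# For unit-normalized vectors, ||a - b||^2 = 2 - 2*cos(a, b),
# so L2 distance and cosine similarity rank neighbors identically.
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert abs(l2_distance(an, bn) ** 2 - (2 - 2 * cosine_similarity(a, b))) < 1e-9
```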
2.6 Hamming Distance
Hamming distance counts the number of differing bits between two equal‑length binary strings; it equals the minimum number of substitutions required to transform one string into the other.
2.7 Jaccard Distance
Jaccard similarity is the ratio of the size of the intersection to the size of the union of two sets; Jaccard distance = 1 – Jaccard similarity. For binary data it is equivalent to Tanimoto distance.
2.8 Tanimoto Distance
For binary variables, the Tanimoto coefficient is equivalent to the Jaccard similarity, ranging from 0 to 1 with 1 indicating maximum similarity; the corresponding Tanimoto distance is 1 minus the coefficient, matching the Jaccard distance above.
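The binary metrics can be sketched directly on bit lists; the function names are illustrative. For binary data, the Jaccard computation below doubles as the Tanimoto coefficient.

```python
def hamming(a, b):
    # Count of positions where the two equal-length bit strings differ.
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(a, b):
    # 1 - |intersection| / |union| over set bits; for binary data this
    # is also the Tanimoto distance.
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return 1.0 - inter / union

a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 1, 0, 1, 0, 0, 1, 1]
```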
3. Fundamental Algorithms
Vector retrieval is essentially an Approximate Nearest Neighbor Search (ANNS) problem. The main families of ANNS algorithms are tree‑based, graph‑based, quantization‑based, and hashing‑based.
3.1 Tree‑Based Methods
Tree structures recursively partition the K‑dimensional space, allowing search to focus on a few sub‑spaces. They are simple to implement but scale poorly with high dimensionality.
KD‑Tree
Annoy (Approximate Nearest Neighbors Oh Yeah)
KD‑Tree
A KD‑Tree splits the space along the dimension with the largest variance, using the median as the split point. Construction continues recursively until leaf nodes contain few points.
Search proceeds by descending the tree according to the query point, backtracking when necessary to explore other branches that could contain closer neighbors.
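The build-and-backtrack procedure above can be sketched in pure Python with NumPy. This is a minimal illustration, not production code: real libraries (e.g. SciPy's `cKDTree`) use flat arrays and iterative search. The split dimension is chosen by largest variance and the median point is the pivot, as described above.

```python
import numpy as np

def build_kdtree(points):
    # Recursively split on the dimension with the largest variance,
    # using the median point along that dimension as the pivot.
    if len(points) == 0:
        return None
    points = np.asarray(points)
    axis = int(np.argmax(points.var(axis=0)))
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid]),
        "right": build_kdtree(points[mid + 1:]),
    }

def nearest(node, query, best=None):
    # Descend toward the query first, then backtrack into the far
    # branch only if the splitting plane is closer than the best hit.
    if node is None:
        return best
    d = float(np.linalg.norm(node["point"] - query))
    if best is None or d < best[1]:
        best = (node["point"], d)
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if abs(diff) < best[1]:
        best = nearest(far, query, best)
    return best

points = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0], [9.0, 9.0]])
tree = build_kdtree(points)
point, d = nearest(tree, np.array([2.1, 2.1]))
```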
Annoy
Annoy builds multiple random hyperplane trees to partition the space, reducing the chance that a query vector lies near a partition boundary. The index is stored as static read‑only files, enabling low memory usage and fast sharing across processes.
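The core split Annoy repeats can be sketched as follows: pick two random points, and use the hyperplane equidistant between them to divide the current set. This is a simplified illustration (the real library builds many trees, stores them in memory-mapped files, and unions candidate leaves at query time); all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_rp_tree(indices, vectors, leaf_size=4):
    # Split by the hyperplane midway between two random points;
    # Annoy builds many such trees and merges their candidate leaves.
    if len(indices) <= leaf_size:
        return indices                           # leaf: a candidate bucket
    a, b = vectors[rng.choice(indices, size=2, replace=False)]
    normal, midpoint = a - b, (a + b) / 2
    side = (vectors[indices] - midpoint) @ normal > 0
    if side.all() or (~side).all():              # degenerate split: stop here
        return indices
    return (build_rp_tree(indices[side], vectors, leaf_size),
            build_rp_tree(indices[~side], vectors, leaf_size))

def leaf_sizes(node):
    # Flatten the tree into the sizes of its leaf buckets.
    if isinstance(node, tuple):
        return leaf_sizes(node[0]) + leaf_sizes(node[1])
    return [len(node)]

vectors = rng.normal(size=(64, 16))
tree = build_rp_tree(np.arange(64), vectors)
```

Because the hyperplanes are random, a query near one tree's boundary is unlikely to be near the boundary in every tree, which is why querying several trees improves recall.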
3.2 Graph‑Based Methods
Graph‑based structures connect each vector to a set of nearest neighbors, forming a navigable small‑world graph. They offer fast query speed at the cost of higher indexing time and memory consumption.
NSW (Navigable Small World)
HNSW (Hierarchical Navigable Small World)
NSW
NSW inserts points one by one, linking each new point to its m nearest existing points, creating “highways” that accelerate search.
HNSW
HNSW extends NSW with multiple hierarchical layers; upper layers are sparser, providing long‑range shortcuts, while lower layers are denser for fine‑grained search. Search starts from the top layer and descends, quickly converging to the nearest neighbors.
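The greedy best-first search that both NSW and HNSW run (HNSW runs it once per layer) can be sketched as follows. This is a minimal, illustrative version with a toy graph; real implementations add neighbor-selection heuristics during insertion. The parameter `ef` bounds the result set, as in HNSW's `ef_search`.

```python
import heapq
import numpy as np

def greedy_search(graph, vectors, query, entry, ef=3):
    # Best-first search over a neighbor graph: a min-heap of candidates
    # to expand and a bounded max-heap of the best ef results so far.
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]      # min-heap by distance
    results = [(-dist(entry), entry)]        # max-heap keeping best ef
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break                            # no candidate can improve results
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(candidates, (dist(nb), nb))
                heapq.heappush(results, (-dist(nb), nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst
    return sorted((-d, i) for d, i in results)

# Toy graph: five points on a line, each linked to its neighbors.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = greedy_search(graph, vectors, np.array([3.9, 0.0]), entry=0)
```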
3.3 Quantization‑Based Methods
Quantization reduces high‑precision vectors to compact codes, trading a small amount of accuracy for large speed gains.
Scalar Quantization (SQ)
Product Quantization (PQ)
Product Quantization (PQ)
PQ splits a D‑dimensional vector into M sub‑vectors, quantizes each sub‑vector independently, and stores the indices of the nearest centroids. During search, distances are approximated using pre‑computed lookup tables, enabling fast asymmetric or symmetric distance computation.
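The encode-then-lookup scheme above can be sketched with NumPy. For brevity the codebooks here are random rather than trained with k-means, so this only illustrates the mechanics of encoding and asymmetric distance computation (ADC); all names and the toy sizes (D=8, M=4, K=16) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 8, 4, 16                # dimension, sub-vectors, centroids per codebook
d = D // M                        # sub-vector dimension

# Toy codebooks: K centroids per sub-space (real PQ trains them with k-means).
codebooks = rng.normal(size=(M, K, d))

def encode(x):
    # For each sub-vector, store the index of its nearest centroid.
    x = x.reshape(M, d)
    return np.array([np.argmin(((codebooks[m] - x[m]) ** 2).sum(axis=1))
                     for m in range(M)])

def adc_distance(query, code):
    # Asymmetric distance: precompute per-sub-space lookup tables for the
    # raw query, then sum the table entries picked out by the stored code.
    q = query.reshape(M, d)
    tables = ((codebooks - q[:, None, :]) ** 2).sum(axis=2)   # shape (M, K)
    return float(sum(tables[m, code[m]] for m in range(M)))

x = rng.normal(size=D)
code = encode(x)                  # M small integers instead of D floats
```

The memory saving comes from storing `M` centroid indices per vector instead of `D` floats; the lookup tables make each approximate distance a handful of table reads and adds.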
3.4 Hashing‑Based Methods
Locality‑Sensitive Hashing (LSH) maps similar vectors to the same hash bucket with high probability, allowing fast retrieval by examining only the bucket’s contents.
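A classic instance for angular similarity is random-hyperplane LSH, sketched below: each bit of the signature records which side of a random hyperplane the vector falls on, so vectors with a small angle between them tend to share most bits and land in the same bucket. The helper names are illustrative; real deployments use several hash tables to boost recall.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
planes = rng.normal(size=(16, 64))   # 16 random hyperplanes in 64-d space

def signature(v):
    # One bit per hyperplane: which side of the plane v falls on.
    # The sign pattern is invariant to positive scaling of v.
    return tuple((planes @ v > 0).astype(int))

# Index: bucket every vector by its signature.
data = rng.normal(size=(100, 64))
buckets = defaultdict(list)
for i, v in enumerate(data):
    buckets[signature(v)].append(i)

def query(q):
    # Only the query's own bucket is scanned, not the whole dataset.
    return buckets.get(signature(q), [])
```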
4. Summary
This article presented the concept, application scenarios, distance metrics, and core algorithms of vector retrieval. While the overview covers the main ideas, readers are encouraged to explore each algorithm in depth for practical deployment and performance tuning.
IEG Growth Platform Technology Team
Official account of Tencent IEG Growth Platform Technology Team, showcasing cutting‑edge achievements across front‑end, back‑end, client, algorithm, testing and other domains.