
Why Vector Retrieval Is the Backbone of Modern LLM Applications

The article explains how vectors represent data in high‑dimensional space, describes the embedding process, outlines the evolution and challenges of vector search, compares exact and approximate algorithms such as IVF, product quantization and HNSW, and details Baidu’s cloud‑native engineering solutions for scalable, filtered vector retrieval.

Baidu Geek Talk

Aug 9, 2023

1. Introduction to Vector Retrieval Applications

A vector is a point in a multi-dimensional mathematical space; its coordinates are a sequence of numbers obtained by digitizing a real-world object. The distance between two vectors quantifies how similar the underlying objects are, which enables semantic matching. Converting unstructured data into vectors is called embedding: a deep-learning model extracts features from the data and projects them into the vector space so that semantically similar inputs end up close together.
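To make "distance quantifies similarity" concrete, here is a minimal sketch of two common measures, cosine similarity and Euclidean distance. The three-dimensional vectors and their labels are hypothetical stand-ins for real embeddings, which usually have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (0.0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy embeddings, invented for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))
print(euclidean_distance(cat, kitten) < euclidean_distance(cat, car))
```

Both comparisons hold for these toy values: the two related objects are closer under either measure.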

Before large language models (LLMs) became popular, vector retrieval was already a mature technology, widely used in image, audio, and video search as well as in face and speech recognition. The rise of LLMs has renewed interest in vector search as a way to augment a model's knowledge and improve prompt engineering.

2. Overview of Vector Retrieval Technologies

The core problem is finding the top‑K most similar vectors among billions of candidates. The naive brute‑force method computes distances to every vector, which becomes infeasible as data grows. Distributed computation can speed up brute force but incurs high hardware costs.
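The brute-force baseline can be sketched in a few lines. This is an illustrative version that assumes Euclidean distance and an in-memory list of vectors; the function name is hypothetical.

```python
import heapq
import math

def brute_force_top_k(query, vectors, k):
    """Return the k (distance, index) pairs closest to the query, nearest first.

    Every stored vector is compared with the query, so one query costs
    O(N * D) distance work for N vectors of dimension D.
    """
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)
```

Because the cost grows linearly with the number of stored vectors, this approach serves mainly as the ground truth against which approximate methods are measured.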

Approximate nearest neighbor (ANN) algorithms trade exactness for speed. Four main families exist: hashing, tree-based search, inverted indexing, and graph-based search. Performance is measured by query latency, throughput (queries per second, QPS), and recall (the fraction of the exact brute-force top-K results that the approximate search also returns).
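Recall as described here can be computed directly from two result lists. A minimal sketch, with a hypothetical function name:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-K ids that the approximate result also contains."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)

# The approximate search found 3 of the 4 true neighbors.
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4]))  # 0.75
```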

Inverted File (IVF): Vectors are clustered with k-means, and each cluster center acts as a keyword in an inverted index. A query first locates the clusters whose centers are nearest to it, then runs a brute-force scan only within those clusters, which dramatically reduces computation.
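The IVF idea can be sketched with a toy k-means and a list of inverted lists. This is a simplified illustration, not the implementation used by any particular library; the function names and the `nprobe` parameter (how many clusters to scan) are assumptions chosen for the example.

```python
import math
import random

def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))

def kmeans(vectors, n_clusters, n_iters=10, seed=0):
    """Toy k-means: fixed iteration count, random initial centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(n_iters):
        buckets = [[] for _ in range(n_clusters)]
        for v in vectors:
            buckets[nearest_centroid(v, centroids)].append(v)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster emptied out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(vectors, centroids):
    """Inverted lists: centroid id -> ids of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are nearest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]
```

Raising `nprobe` scans more clusters, which increases recall at the cost of latency; setting it to the number of clusters degenerates to brute force.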

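Product Quantization (PQ): A D-dimensional float vector is split into M sub-vectors, and each sub-vector is replaced by the index of its nearest centroid in a per-subspace codebook, compressing the vector into a short integer code. For example, a 128-dimensional float32 vector (512 bytes) split into 8 sub-vectors with 256 centroids each becomes 8 one-byte codes, roughly 64× compression.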
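The encoding step can be sketched as follows. To stay short, this toy version builds its codebooks by sampling sub-vectors from the data, which is an assumption made for the example; production systems typically train each codebook with k-means. All function names are hypothetical.

```python
import math
import random

def split(vector, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vector) // m
    return [vector[i * step:(i + 1) * step] for i in range(m)]

def train_codebooks(vectors, m, n_centroids, seed=0):
    """Toy training: sampled data sub-vectors serve as centroids (not k-means)."""
    rng = random.Random(seed)
    sample = rng.sample(vectors, n_centroids)
    return [[split(v, m)[j] for v in sample] for j in range(m)]

def encode(vector, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    return [
        min(range(len(book)), key=lambda c: math.dist(sub, book[c]))
        for sub, book in zip(split(vector, len(codebooks)), codebooks)
    ]

def decode(code, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    return [x for idx, book in zip(code, codebooks) for x in book[idx]]

# Storage arithmetic for the example in the text.
original_bytes = 128 * 4   # 128 float32 values
code_bytes = 8 * 1         # 8 sub-vectors, one byte each (256 centroids)
print(original_bytes // code_bytes)  # 64
```

Decoding is lossy: it returns the centroids, not the original values, which is the price paid for the compression.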

Graph-based Search (HNSW): Builds a multi-layer small-world graph in which nodes are vectors and edges connect nearby points. A query enters at the sparse top layer and greedily moves toward the query point, descending layer by layer until it converges on the approximate nearest neighbors in the dense bottom layer. HNSW offers high recall with sub-millisecond latency but consumes more memory.
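The greedy navigation step can be illustrated on a single layer. The sketch below is a deliberate simplification, not HNSW itself: it builds a plain k-nearest-neighbor graph by brute force and walks it greedily, whereas HNSW repeats a similar walk on each layer and keeps a candidate list rather than a single node. Function names are hypothetical.

```python
import math

def build_knn_graph(vectors, n_neighbors):
    """Toy graph: link every node to its n_neighbors nearest nodes (brute force)."""
    graph = {}
    for i, v in enumerate(vectors):
        others = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: math.dist(v, vectors[j]),
        )
        graph[i] = others[:n_neighbors]
    return graph

def greedy_search(query, vectors, graph, entry):
    """Move to whichever neighbor is closest to the query until none improves."""
    current = entry
    current_dist = math.dist(query, vectors[current])
    while True:
        best = min(graph[current], key=lambda j: math.dist(query, vectors[j]))
        best_dist = math.dist(query, vectors[best])
        if best_dist >= current_dist:
            return current  # local minimum: no neighbor is closer to the query
        current, current_dist = best, best_dist

# Points on a line, each linked to its two nearest neighbors.
points = [[float(i)] for i in range(20)]
graph = build_knn_graph(points, n_neighbors=2)
print(greedy_search([13.2], points, graph, entry=0))  # 13
```

A single-layer greedy walk can stop at a local minimum, which is one reason HNSW adds sparse upper layers for long-range jumps.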

All these algorithms require a distributed system for storage and compute. Vector data can be stored in columnar formats and combined with scalar fields for tag‑based filtering.

3. Engineering Practices for Vector Retrieval

Baidu’s cloud platform integrated vector search into Elasticsearch as early as 2020, providing a managed service that supports both vector and scalar queries. The architecture consists of a control plane for cluster management and BES (Elasticsearch) instances running on cloud servers with block storage, load-balanced through a layer-4 proxy.

To achieve higher performance, Baidu re‑implemented the core vector engine in C++, selecting community libraries (faiss, nmslib) as a base and then optimizing them. Benchmarks showed HNSW consumes the most memory, while nmslib offered better baseline performance.

Key engineering enhancements include:

Asynchronous index construction: data is persisted first, then a background thread builds the HNSW graph to avoid blocking front‑end queries.

Optimized segment merging: the graph is built once, during the final merge, instead of being rebuilt at every intermediate merge, which reduces redundant computation.

Filtered HNSW traversal: the graph search is modified to respect scalar filters so that only nodes satisfying the filter are considered as results. Recall drops sharply, however, when the filter excludes more than 90% of the vectors.

Hybrid execution plan: combine filtered brute‑force with HNSW to balance recall and latency.
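The hybrid plan in the last item can be sketched as a simple dispatch: when the filter leaves few candidates, scan them exactly; otherwise hand the filter to the graph search. This is an illustration under stated assumptions, not Baidu's implementation. The 10% threshold (`max_scan_fraction`), the tag-equality filter, and the `ann_search` callable are all hypothetical.

```python
import math

def filtered_brute_force(query, vectors, allowed_ids, k):
    """Exact search restricted to the ids that passed the scalar filter."""
    return sorted(allowed_ids, key=lambda i: math.dist(query, vectors[i]))[:k]

def hybrid_search(query, vectors, tags, wanted_tag, k, ann_search,
                  max_scan_fraction=0.1):
    """Pick a plan from how many vectors survive the filter.

    ann_search is a hypothetical callable (query, allowed_ids, k) -> list of ids
    standing in for a filter-aware HNSW traversal.
    Returns (ids, plan_name).
    """
    allowed = [i for i, tag in enumerate(tags) if tag == wanted_tag]
    if len(allowed) <= max_scan_fraction * len(vectors):
        # Few survivors: an exact scan is cheap and avoids the recall loss
        # of graph traversal under a highly restrictive filter.
        return filtered_brute_force(query, vectors, allowed, k), "brute_force"
    return ann_search(query, set(allowed), k), "ann"
```

The threshold would need tuning in practice, since the break-even point depends on data size, dimensionality, and graph parameters.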

These improvements enable large‑scale, low‑latency vector retrieval for LLM‑augmented applications.

4. Summary and Outlook

Vector databases are becoming as essential to LLM workloads as relational databases are to traditional web applications. The ecosystem now includes both commercial and open‑source solutions, and many classic storage systems are adding vector capabilities.

Baidu has developed its own Puck/Tinker vector retrieval algorithms, which have won the BigANN competition, and plans to launch a dedicated vector database service on its cloud platform to further support LLM use cases.
