How Vector Retrieval Powers Large Language Models: Techniques and Practices
This article explains the fundamentals of vector retrieval, its role in enhancing large language models through embedding and prompt engineering, and details the algorithms, system architecture, and Baidu's engineering practices for building high‑performance vector databases.
1. Introduction to Vector Retrieval Applications
Vectors are points in a multi‑dimensional mathematical space, represented by a series of numbers; the distance between points reflects similarity between the underlying real‑world objects.
Transforming unstructured data into vectors is called embedding. Deep‑learning models map digitized real‑world data into a mathematical space in a way that preserves semantics: objects that are similar in the real world land close together, so vector distance serves as a similarity measure.
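As an illustration of the distance-as-similarity idea, here is a minimal sketch with hand-made toy vectors standing in for real model embeddings (the three vectors and their values are invented for the example; a real system would get them from an embedding model):

```python
import numpy as np

# Toy "embeddings": in practice a deep model produces these vectors.
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.82, 0.15])
apple = np.array([0.1, 0.2, 0.95])

def cosine_similarity(a, b):
    """Similarity from the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically close objects should map to nearby vectors.
print(cosine_similarity(king, queen))  # close to 1.0
print(cosine_similarity(king, apple))  # noticeably smaller
```

Cosine similarity is one common choice; Euclidean (L2) distance and inner product are equally standard, and which one applies depends on how the embedding model was trained.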
Before large language models (LLMs) emerged, vector retrieval was already a mature technology, widely used in image, audio, and video search, as well as face recognition and speech recognition.
The rise of LLMs has set off a new AI revolution, but challenges remain: limited and quickly outdated knowledge, high training and inference costs, hallucinations, and privacy risks.
Prompt engineering, which supplies relevant external data alongside the user's query, extends an LLM's knowledge and reduces hallucination; retrieving that context by vector similarity makes the overall pipeline resemble a search engine.
2. Overview of Vector Retrieval Technology
The core of vector retrieval is the search algorithm, which finds the top‑K vectors most similar to a query vector.
Brute‑force search computes the distance to every vector in the collection; its latency grows linearly with the dataset size, which becomes unacceptable at scale.
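A minimal sketch of brute-force top-K search over random data (the dataset, dimensions, and helper name are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 64))   # 10k database vectors, 64 dimensions
query = rng.normal(size=64)

def brute_force_topk(query, db, k=5):
    # Distance to every vector: O(N * d) work per query.
    dists = np.linalg.norm(db - query, axis=1)
    idx = np.argpartition(dists, k)[:k]   # unordered top-k candidates
    return idx[np.argsort(dists[idx])]    # sorted by distance

top5 = brute_force_topk(query, db, k=5)
```

The `argpartition` step avoids fully sorting all N distances, but the distance computation itself still touches every vector, which is exactly the cost ANN methods try to avoid.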
Distributed computing can parallelize the workload across multiple machines, but at high compute cost.
Approximate nearest neighbor (ANN) algorithms reduce computation by sacrificing exactness; common approaches include hashing, tree search, inverted indexes, and graph search.
Performance is measured by query latency/QPS and by recall, i.e. the fraction of the exact brute‑force top‑K that the approximate search returns.
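The recall metric itself is a one-liner; the ID lists below are made up for illustration:

```python
def recall_at_k(approx_ids, exact_ids):
    # Fraction of the true (brute-force) top-K that the ANN search recovered.
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# The ANN search found 4 of the 5 true nearest neighbours -> recall@5 = 0.8
r = recall_at_k([3, 7, 12, 9, 40], [3, 7, 12, 9, 21])
```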
Inverted File (IVF) uses k‑means clustering to create “keywords” for vectors, enabling a two‑stage search: first locate relevant clusters, then perform fine‑grained search within them.
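The two-stage IVF idea can be sketched as follows. This is a toy from-scratch version (a few Lloyd iterations of k-means, hand-rolled in numpy; cluster counts, probe counts, and all names are illustrative, not any production implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.normal(size=(2000, 16))
n_clusters, n_probe = 16, 4

# Build: a few Lloyd iterations of k-means give the coarse "keywords" (centroids).
centroids = db[rng.choice(len(db), n_clusters, replace=False)]
for _ in range(10):
    labels = np.argmin(np.linalg.norm(db[:, None] - centroids[None], axis=2), axis=1)
    for c in range(n_clusters):
        members = db[labels == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Inverted lists: cluster id -> row ids of the vectors assigned to it.
lists = {c: np.where(labels == c)[0] for c in range(n_clusters)}

def ivf_search(query, k=5):
    # Stage 1: locate the n_probe clusters closest to the query.
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probe]
    cand = np.concatenate([lists[c] for c in probe])
    # Stage 2: exact fine-grained search only inside those clusters.
    d = np.linalg.norm(db[cand] - query, axis=1)
    return cand[np.argsort(d)[:k]]

result = ivf_search(rng.normal(size=16))
```

Probing only a few clusters is what cuts the work; raising `n_probe` trades speed back for recall, since the true neighbour may sit in a cluster that was never probed.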
Product Quantization (PQ) splits each high‑dimensional vector into sub‑vectors and quantizes each sub‑vector against a small learned codebook, storing only the centroid indices; these short codes sharply reduce storage and distance‑computation cost.
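A toy PQ sketch under simplifying assumptions (tiny codebooks trained with a few hand-rolled k-means iterations, and asymmetric distance computation via lookup tables; all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
db = rng.normal(size=(1000, 16))
m, ks = 4, 8                      # 4 sub-vectors, 8 centroids per codebook
d_sub = db.shape[1] // m

codebooks, code_cols = [], []
for j in range(m):
    sub = db[:, j * d_sub:(j + 1) * d_sub]
    # Tiny k-means per sub-space to learn this codebook.
    cb = sub[rng.choice(len(sub), ks, replace=False)]
    for _ in range(10):
        lab = np.argmin(np.linalg.norm(sub[:, None] - cb[None], axis=2), axis=1)
        for c in range(ks):
            members = sub[lab == c]
            if len(members):
                cb[c] = members.mean(axis=0)
    codebooks.append(cb)
    code_cols.append(lab)

# Each vector is now m small integers instead of 16 floats.
codes = np.stack(code_cols, axis=1)          # shape (N, m)

def adc_search(query, k=5):
    # Asymmetric distance: precompute one lookup table per sub-space,
    # then a vector's distance is the sum of table entries its codes select.
    tables = np.stack([
        np.linalg.norm(codebooks[j] - query[j * d_sub:(j + 1) * d_sub], axis=1) ** 2
        for j in range(m)
    ])                                       # shape (m, ks)
    dists = tables[np.arange(m), codes].sum(axis=1)
    return np.argsort(dists)[:k]

query = rng.normal(size=16)
out = adc_search(query)
```

The distances are approximate, since each sub-vector is replaced by its nearest centroid, but the per-query cost drops to table lookups instead of full-dimensional arithmetic.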
Graph‑based ANN, such as HNSW, builds a small‑world network where greedy search traverses edges to approximate nearest neighbors.
HNSW constructs a multi‑layer graph in which each higher layer contains exponentially fewer nodes; the sparse upper layers provide long‑range shortcuts that let a search reach the query's neighborhood quickly before descending for fine‑grained traversal.
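The greedy-traversal step at the heart of graph-based ANN can be sketched on a single flat layer (real HNSW builds its graph incrementally with neighbor-selection heuristics and multiple layers; the brute-force graph construction and all names here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
db = rng.normal(size=(500, 8))

# Toy neighbour graph: each node keeps edges to its M closest nodes.
M = 8
dmat = np.linalg.norm(db[:, None] - db[None], axis=2)
np.fill_diagonal(dmat, np.inf)
graph = np.argsort(dmat, axis=1)[:, :M]

def greedy_search(query, entry=0):
    # Hop to whichever neighbour is closer to the query; stop at a local minimum.
    cur = entry
    cur_d = np.linalg.norm(db[cur] - query)
    while True:
        neigh = graph[cur]
        d = np.linalg.norm(db[neigh] - query, axis=1)
        best = int(np.argmin(d))
        if d[best] >= cur_d:
            return cur          # no neighbour improves: local minimum
        cur, cur_d = int(neigh[best]), d[best]

query = rng.normal(size=8)
hit = greedy_search(query)
```

Production implementations keep a candidate beam (the `ef` parameter) rather than a single current node, precisely because pure greedy descent can get stuck in a local minimum that is not the true nearest neighbor.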
Vector retrieval systems combine a vector storage layer with search capabilities and optional scalar filtering.
3. Vector Retrieval Engineering Practice
Baidu integrates vector search into Elasticsearch, offering cloud‑native services and custom plugins written in C++ for performance.
Benchmarks of open‑source libraries (nmslib vs. Faiss) show HNSW's high memory usage; Baidu builds on nmslib and optimizes it further.
To keep index construction from blocking foreground reads and writes, Baidu builds HNSW indexes asynchronously, using background threads and subsequent merges.
For scenarios requiring scalar filtering before vector search, Baidu modifies HNSW to consider filter conditions during graph traversal, and combines filtered brute‑force search with HNSW to maintain recall.
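The pre-filter-then-search side of that hybrid strategy can be sketched as follows (the threshold heuristic, data, and function names are illustrative assumptions, not Baidu's implementation; the modified graph traversal that skips filtered-out nodes is only indicated by a comment):

```python
import numpy as np

rng = np.random.default_rng(4)
db = rng.normal(size=(5000, 32))
category = rng.integers(0, 10, size=len(db))   # one scalar attribute per vector

def filtered_topk(query, allowed_category, k=5, bruteforce_threshold=2000):
    # Apply the scalar filter first.
    cand = np.where(category == allowed_category)[0]
    # Heuristic: when few vectors survive the filter, exact search over them
    # is cheap and keeps recall at 100%. When many survive, a graph index
    # whose traversal skips filtered-out nodes would take over (omitted here).
    if len(cand) > bruteforce_threshold:
        raise NotImplementedError("large candidate sets would go to the ANN index")
    d = np.linalg.norm(db[cand] - query, axis=1)
    return cand[np.argsort(d)[:k]]

hits = filtered_topk(rng.normal(size=32), allowed_category=3)
```

The threshold matters because a highly selective filter can starve a plain graph traversal of reachable candidates and collapse its recall, which is exactly when filtered brute force is both cheap and exact.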
4. Summary and Outlook
Vector databases are becoming essential components for LLM applications, similar to relational databases for web apps.
Industry offers many specialized vector DB products, both commercial and open‑source, and traditional storage systems are adding vector capabilities.
Baidu’s self‑developed Puck/Tinker vector search algorithms have won the BigANN competition, and Baidu Cloud plans to launch a dedicated vector database to support large‑scale AI workloads.
Baidu Intelligent Cloud Tech Hub