Artificial Intelligence · 17 min read

Understanding Vector Databases, ANN Algorithms, and Their Integration with Large Language Models

This article explains the fundamentals of vector databases and how high‑dimensional vector data is generated and stored, reviews common ANN search algorithms such as Flat, k‑means, and LSH, discusses benchmarking and product selection, and demonstrates practical integration of vector stores with LLMs using LangChain and Python.


Introduction

In the previous article we discussed the limitations of large language models (LLMs), especially token limits, which create many concerns when building LLM applications.

Vector databases are one way to address these concerns.

What Is a Vector Database?

Mathematically, a vector is an ordered sequence of numbers. In computer science, vectors can represent the features or attributes of an entity, and a vector database stores these feature vectors.

Sources of Vector Data

Vector data originates from the features of the objects we want to represent. For example, a dog can be described by size, hair length, and nose length, and these three features can be recorded as a 3‑dimensional vector.

Higher‑dimensional vectors can capture more characteristics such as eye size, obedience, aggressiveness, etc., by appending additional numbers to the vector.
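The idea above can be sketched in a few lines of plain Python. The feature names follow the article; the numeric values are made up for illustration, and similarity is measured here with Euclidean distance (one common choice among several):

```python
import math

# A dog described by three features: size, hair length, nose length
# (illustrative values on a 0-1 scale)
dog_a = [0.9, 0.2, 0.7]   # large, short hair, long nose
dog_b = [0.8, 0.3, 0.6]   # a similar dog
dog_c = [0.1, 0.9, 0.2]   # a very different dog

def euclidean(u, v):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Similar entities end up close together in feature space.
print(euclidean(dog_a, dog_b) < euclidean(dog_a, dog_c))  # True
```

Appending more numbers (eye size, obedience, and so on) simply extends the lists; the distance function works unchanged in any number of dimensions.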

OpenAI’s text‑embedding‑ada‑002 model outputs 1536‑dimensional vectors; in production systems, vectors with thousands of dimensions and collections of billions of vectors are common.

Vector Data Retrieval Algorithms

ANN Algorithms

Approximate Nearest Neighbor (ANN) algorithms quickly find one or more near neighbors of a query in large datasets, trading a little accuracy for speed. Common approaches include Flat, k‑means‑based clustering, LSH, etc.

Flat

Flat performs exhaustive linear (brute‑force) search: it compares the query against every stored vector. While exact, it is slow on massive datasets.
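A minimal sketch of Flat search, not tied to any particular database (the function name and test data are invented for illustration):

```python
import math

def flat_search(query, vectors, k=1):
    """Exhaustive (Flat) search: score every vector against the query,
    then return the indices of the k closest ones."""
    order = sorted(
        range(len(vectors)),
        key=lambda i: math.dist(query, vectors[i]),  # Euclidean distance
    )
    return order[:k]

vectors = [[0, 0], [1, 1], [5, 5]]
print(flat_search([0.9, 1.1], vectors, k=2))  # [1, 0]
```

Every query costs O(n) distance computations, which is exactly why the approximate methods below exist.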


k‑means

k‑means clusters the dataset into groups; during search, the query vector first finds the nearest centroid, then searches within that cluster, reducing the search space.
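The cluster‑then‑search idea can be sketched as follows (a simplified illustration assuming the clustering has already been done; the function name and data are invented):

```python
import math

def kmeans_search(query, centroids, clusters, k=1):
    """k-means-style search: find the nearest centroid, then scan only
    the vectors assigned to that centroid (clusters[i])."""
    nearest = min(range(len(centroids)),
                  key=lambda i: math.dist(query, centroids[i]))
    return sorted(clusters[nearest],
                  key=lambda v: math.dist(query, v))[:k]

# Two clusters, roughly around (0.5, 0.5) and (9.5, 9.5).
centroids = [[0.5, 0.5], [9.5, 9.5]]
clusters = [
    [[0, 1], [1, 0]],      # vectors assigned to centroid 0
    [[9, 10], [10, 9]],    # vectors assigned to centroid 1
]
print(kmeans_search([8.5, 9.0], centroids, clusters))  # [[9, 10]]
```

Only one cluster is scanned per query, which is where the speedup, and the risk of missing a neighbor in an adjacent cluster, both come from.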

k‑means has drawbacks: a query near a cluster boundary may actually be closer to a point in a neighboring cluster, leading to missed results.


Solutions include k‑means++, spectral clustering, hierarchical clustering, DBSCAN, etc.

LSH

Locality‑Sensitive Hashing maps similar vectors to the same bucket using hash functions, enabling fast similarity checks.

LSH can be costly in large datasets because generating high‑quality random projection matrices is expensive.
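One classic LSH family is random hyperplane projection: each hash bit records which side of a random hyperplane the vector falls on, so similar vectors tend to share a signature and land in the same bucket. A minimal sketch (the function name and dimensions are invented for illustration):

```python
import random

def lsh_signature(vector, planes):
    """Hash a vector to a bit string: one bit per random hyperplane,
    set when the vector lies on the positive side of that plane."""
    bits = ""
    for plane in planes:
        dot = sum(a * b for a, b in zip(vector, plane))
        bits += "1" if dot >= 0 else "0"
    return bits

random.seed(42)
dim, n_planes = 4, 8
# The random projection matrix: one Gaussian hyperplane per hash bit.
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

v = [1.0, 0.5, -0.2, 0.3]
sig = lsh_signature(v, planes)  # e.g. an 8-bit bucket key like "10110010"
```

Note the signature depends only on the direction of the vector: scaling a vector by a positive factor never changes its bucket.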

Other Algorithms

Additional algorithms include HNSW (Hierarchical Navigable Small World) and various k‑means variants.

HNSW – https://arxiv.org/ftp/arxiv/papers/1603/1603.09320.pdf

k‑means – https://zh.wikipedia.org/wiki/K-%E5%B9%B3%E5%9D%87%E7%AE%97%E6%B3%95

ANN Benchmark

The ANN benchmark evaluates vector databases and ANN algorithms by measuring recall (how closely the approximate results match the exact nearest neighbors) and queries per second (QPS).

It allows fair comparison of different products under identical conditions.
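Recall is simple to compute once you have exact ground truth from a brute‑force search. A minimal sketch (the function name is invented for illustration):

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors that the ANN result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# The ANN index returned 8 of the 10 true nearest neighbors -> recall 0.8
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 11, 12],
                  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 0.8
```

Plotting recall against QPS while sweeping an index's tuning parameters produces the trade‑off curves the benchmark is known for.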

Summary

Algorithms are not silver bullets; each has strengths and weaknesses. ANN algorithms are core to vector databases, and the ANN benchmark helps select the most suitable solution.

Vector Database Products

Vector database products have proliferated. They can be classified by deployment (local vs. cloud), implementation openness, and supported search algorithms.

When choosing a product, consider distributed capabilities, supported data types and dimensions, scalability, API/integration, security, community support, and cost.

Professional vector DBs: Chroma (chromadb), Milvus, Pinecone

Databases with vector capabilities: PostgreSQL with the pgvector extension, Elasticsearch 8.0+

Vector Databases and LLMs Integration

Market Outlook

Companies like Zilliz have raised significant funding (a US$60M Series B) for the open‑source vector database Milvus, indicating strong market confidence.

Connecting Vector DBs with LLMs

Using LangChain, we can build a local document knowledge base. The steps are:

1. Configure the environment (Python, LangChain, ChatGLM2, chromadb).

2. Split documents into chunks and embed them.

3. Connect the embeddings and the LLM via LangChain.

Tokenization

Tokenization splits raw text into tokens; tiktoken is a fast BPE tokenizer released by OpenAI.

# Split the text into chunks by token count
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

Embeddings

Embeddings convert natural language into vectors. Common models include Word2Vec, GloVe, FastText, and OpenAI’s text‑embedding‑ada‑002.
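Once text is embedded, semantic similarity reduces to comparing vectors, most often by cosine similarity. A minimal sketch with hand‑made toy vectors (real embedding vectors would have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors;
    values near 1 mean the underlying texts are semantically similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # 1.0 (identical)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
```

Vector stores like Chroma perform exactly this kind of comparison internally when matching a query against stored documents.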

In LangChain we can use OpenAIEmbeddings to generate vectors and store them in a vector store.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Initialize the OpenAI embeddings object
embeddings = OpenAIEmbeddings()
# Compute embedding vectors for the documents via the OpenAI embeddings object
# and store them temporarily in the Chroma vector database for later matching
docsearch = Chroma.from_documents(split_docs, embeddings)

Connecting the LLM

Finally, a conversational retrieval chain enables user interaction.

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.vectorstores import Chroma

# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings()

# Store the documents in the vector store
vector_store = Chroma.from_documents(documents, embeddings)
# Create a retriever from the vector store
retriever = vector_store.as_retriever()

system_template = """
Use the following context to answer the user's question.
If you don't know the answer, say you don't; don't try to make it up. And answer in Chinese.
-----------
{question}
-----------
{chat_history}
"""

messages = [
  SystemMessagePromptTemplate.from_template(system_template),
  HumanMessagePromptTemplate.from_template('{question}')
]

prompt = ChatPromptTemplate.from_messages(messages)

qa = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(temperature=0.1, max_tokens=2048),
    retriever,
    condense_question_prompt=prompt
)

chat_history = []
while True:
    question = input('Question: ')
    # chat_history is a required argument; it stores the conversation history
    result = qa({'question': question, 'chat_history': chat_history})
    chat_history.append((question, result['answer']))
    print(result['answer'])

More Embeddings

Beyond text, image vectors can be generated with clip‑vit‑base‑patch32, audio vectors with wav2vec2‑base‑960h, and Chinese‑optimized text vectors with models such as shibing624/text2vec‑base‑chinese.

Conclusion and Outlook

This article covered the basics of vector databases, their role in overcoming LLM token limits, common ANN search algorithms, product landscape, market prospects, and practical integration steps using LangChain.

Future developments will bring more innovative solutions; developers should choose the appropriate vector DB and ANN algorithm based on specific problem characteristics to achieve efficient and accurate data processing.

Tags: Python, vector database, LLM integration, ANN, Embeddings, search algorithms
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
