Why Vector Databases Are Essential for Building Industry‑Specific LLM Applications
Vector databases enable efficient storage and similarity search of high‑dimensional embeddings, allowing enterprises to combine large language models with proprietary knowledge assets to build domain‑specific, accurate, and up‑to‑date AI services, as illustrated here with the open‑source solutions Chroma and Milvus.
Why Vector Databases Are Needed for Industry‑Specific LLM Applications
Large language models (LLMs) answer general questions well but often lack depth, accuracy, and timeliness for vertical domains such as medicine or law. Storing enterprise knowledge as vector embeddings in a vector database lets companies augment LLMs with proprietary, up‑to‑date information, enabling precise, domain‑specific AI services.
What Is a Vector?
A vector is a numerical representation of text, images, audio, or other unstructured data. Converting content into vectors enables similarity calculations, semantic search, and reasoning over the data.
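To make "numerical representation" concrete, here is a deliberately simple sketch that maps short texts to count vectors over a tiny fixed vocabulary. The vocabulary and sentences are invented for illustration; real systems use learned embedding models with hundreds of dimensions rather than word counts.

```python
# Toy bag-of-words vectorization over a hand-picked vocabulary.
vocabulary = ["cat", "dog", "sat", "mat", "ran"]

def to_vector(text):
    """Map a text to a count vector over the fixed vocabulary."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

v1 = to_vector("the cat sat on the mat")
v2 = to_vector("the dog ran")
print(v1)  # [1, 0, 1, 1, 0]
print(v2)  # [0, 1, 0, 0, 1]
```

Once texts are numbers, "how similar are these two sentences?" becomes a geometric question about the distance between their vectors.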
💡 A vector is the bridge between a model and a knowledge base. Vector embeddings are a native AI data format that can represent text, images, audio, and video.
Vector Embeddings
Roy Keyes defines embeddings as "a learned transformation that makes data more useful." Neural networks map text into a vector space where semantic relationships become geometric, enabling operations such as finding synonyms or analogies (e.g., Queen ≈ King – Man + Woman).
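The analogy can be worked through with hand-crafted 2‑D vectors. The coordinates below (dimension 0 = royalty, dimension 1 = maleness) are invented purely to make the arithmetic visible; learned embeddings have many more dimensions and the analogy only holds approximately.

```python
# Hand-crafted 2-D "embeddings": dim 0 = royalty, dim 1 = maleness.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

# King - Man + Woman lands at [1.0, 0.0]: royal, not male.
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])

# The closest vocabulary word (squared Euclidean distance) is "queen".
closest = min(vectors,
              key=lambda w: sum((x - y) ** 2
                                for x, y in zip(vectors[w], result)))
print(closest)  # queen
```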
Functions of a Vector Database
Vector databases store and process high‑dimensional vectors, providing fast similarity search. The core operation is computing distances between a query vector and stored vectors to retrieve the most similar items.
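The core operation can be sketched in a few lines: score every stored vector against the query by cosine similarity and return the top‑k ids. The vectors and document ids below are made up; a real vector database replaces the exhaustive scan with an index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

stored = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.2],
    "doc3": [0.8, 0.2, 0.1],
}

def search(query, k=2):
    """Brute-force nearest-neighbor search: rank all vectors, keep top k."""
    ranked = sorted(stored,
                    key=lambda doc_id: cosine_similarity(stored[doc_id], query),
                    reverse=True)
    return ranked[:k]

print(search([1.0, 0.0, 0.0]))  # ['doc1', 'doc3']
```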
To improve performance, approximate nearest neighbor (ANN) algorithms such as Locality Sensitive Hashing (LSH), Hierarchical Navigable Small Worlds (HNSW), or Inverted File Index (IVF) are used, trading a small amount of accuracy for speed.
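The speed/accuracy trade-off is easiest to see with IVF, the simplest of the three. This toy sketch hard-codes two centroids and assigns each vector to its nearest one at indexing time; at query time only the list under the closest centroid is scanned. Real IVF learns centroids with k-means and probes several lists, and HNSW/LSH use entirely different structures.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {"a": [0.5, 0.2], "b": [9.5, 10.1],
           "c": [0.1, 0.9], "d": [10.2, 9.8]}

# Indexing: build inverted lists mapping centroid index -> vector ids.
lists = {0: [], 1: []}
for vid, v in vectors.items():
    nearest = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
    lists[nearest].append(vid)

def ivf_search(query):
    """Probe only the single nearest list (nprobe=1): fast, possibly inexact."""
    probe = min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    return min(lists[probe], key=lambda vid: dist(vectors[vid], query))

print(ivf_search([0.2, 0.3]))  # 'a'
```

Because only one list is scanned, a true nearest neighbor sitting just across a cluster boundary can be missed; that is the "small amount of accuracy" traded for speed.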
The workflow consists of three steps:
Use an embedding model to convert raw content (text, images, video, etc.) into vectors.
Insert the vectors, together with the original content, into the vector database.
At query time, embed the query with the same model and search for similar vectors, retrieving the associated original documents.
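The three steps above can be strung together in one self-contained sketch. The `embed()` function here is a fake character-frequency "model" so the example runs without any ML dependency; in a real pipeline it would call an embedding model, and the store would be a vector database rather than a Python list.

```python
import math

def embed(text):
    """Stand-in embedding: letter-frequency vector (NOT a real model)."""
    text = text.lower()
    return [text.count(c) / max(len(text), 1)
            for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1 + 2: embed documents and store vectors with the originals.
documents = ["Vector databases store embeddings",
             "LLMs answer general questions",
             "Milvus indexes large-scale vectors"]
store = [(doc, embed(doc)) for doc in documents]

# Step 3: embed the query with the same model, return the best match.
def query(text):
    qv = embed(text)
    return max(store, key=lambda item: cosine(item[1], qv))[0]

print(query("vector database"))  # Vector databases store embeddings
```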
Open‑Source Vector DB: Chroma
Chroma is an open‑source embedding database designed for storing and retrieving vector embeddings. It supports efficient similarity search, scalable storage, and flexible architecture.
GitHub: https://github.com/chroma-core/chroma
import chromadb

# Set up Chroma in-memory for quick prototyping
client = chromadb.Client()
collection = client.create_collection("all-my-documents")

# Add documents with metadata and explicit ids
collection.add(
    documents=["This is document1", "This is document2"],
    metadatas=[{"source": "notion"}, {"source": "google-docs"}],
    ids=["doc1", "doc2"]
)

# Query for the two most similar documents
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2
)

Supported embedding functions include:
All‑MiniLM‑L6‑v2 (Sentence‑Transformers)
OpenAI embeddings (e.g., text‑embedding‑ada‑002)
Instructor models (e.g., hkunlp/instructor‑xl)
Google PaLM API models
Open‑Source Vector DB: Milvus
Milvus is the most‑starred open‑source vector database on GitHub. It offers high‑performance, scalable storage and a variety of indexing algorithms for large‑scale vector data, suitable for recommendation systems, image search, NLP, and more.
GitHub: https://github.com/milvus-io/milvus
Milvus also provides a managed cloud service (Zilliz Cloud) for easier experimentation.
Connecting to Milvus with Python (pymilvus)
import pandas as pd
from pymilvus import connections, utility, FieldSchema, CollectionSchema, DataType, Collection
conn = connections.connect(
    "default",
    host="in01-70ff1fe5d9bc5a0.aws-us-west-2.vectordb.zillizcloud.com",
    port="19537",
    secure=True,
    user='db_admin',
    password=snbGetValue("milvus_pw")  # password fetched via a secrets helper
)

has = utility.has_collection("medium_articles")
print(f"Does collection medium_articles exist in Milvus: {has}")

Retrieve an existing collection and load it into memory:
collection = Collection("medium_articles")  # Get an existing collection.
collection.load()

(Screenshots of an example query and its vector search results are omitted from this summary.)
These examples demonstrate how vector databases, combined with LLMs, enable enterprises to build private, domain‑specific AI assistants that deliver accurate and timely responses.
This article has been distilled and summarized from source material and republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
