How to Build a Secure Local LLM Chatbot with Ollama, Python, and ChromaDB

This tutorial walks you through creating a privacy‑preserving, locally hosted large language model chatbot using Ollama, Python 3, and ChromaDB, covering RAG fundamentals, GPU selection, environment setup, and full source code for a Flask‑based application.

21CTO

Jul 7, 2024

Why Build a Local LLM?

In an era where data privacy is critical, hosting your own large language model (LLM) on‑premises gives you full control over customization, privacy, security, and data processing, while eliminating reliance on internet connectivity.

Key Benefits

Full Customization: Tailor the retrieval‑augmented generation (RAG) pipeline to your exact needs.

Enhanced Privacy: Sensitive data never leaves your network.

Data Security: Reduce the risk of leaks by keeping source documents (e.g., PDFs) in a secure environment.

Control Over Processing: Store private embeddings in ChromaDB and manage data handling yourself.

Offline Operation: Once the models are downloaded, the chatbot works without an internet connection.

Retrieval‑Augmented Generation (RAG)

RAG combines information retrieval with text generation to produce more accurate, context‑aware responses.

How RAG Works

Retrieval: The application queries an external knowledge base or vector store to fetch the documents most relevant to the user's question.

Generation: The language model generates a response using the question together with the retrieved information.
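
The two steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not the application built later in this article: word overlap stands in for vector similarity, and `tokenize`, `retrieve`, `build_prompt`, and the sample documents are hypothetical names introduced here.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=2):
    """Retrieval step: return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_docs):
    """Generation step input: the text that would be sent to the language model."""
    context = "\n".join(context_docs)
    return (
        "Answer the question based ONLY on the following context:\n"
        f"{context}\nQuestion: {question}"
    )


# Hypothetical sample knowledge base
docs = [
    "Ollama runs large language models on a local machine.",
    "ChromaDB stores embeddings for similarity search.",
    "Flask is a lightweight Python web framework.",
]
question = "Where does ChromaDB store embeddings?"
top = retrieve(question, docs, k=1)
print(build_prompt(question, top))
```

A real pipeline replaces the word-overlap ranking with embedding similarity and sends the assembled prompt to the model.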

Advantages of RAG

Improved accuracy through external data.

Better contextual relevance.

Scalable to large datasets.

Flexible – update the knowledge base without retraining the model.

Why Run RAG Locally?

Privacy and security of sensitive data.

Full customization of retrieval and generation pipelines.

Independence from internet connectivity.

GPU Considerations for Local LLMs

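Running LLMs at interactive speed generally calls for a capable GPU: inference is dominated by parallel matrix arithmetic, and the model weights need to sit in high‑bandwidth memory. The same hardware also speeds up document embedding. Ollama can run on a CPU alone, though more slowly.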

Choosing the Right GPU

Memory capacity (VRAM) to fit model parameters.

Number of compute cores (CUDA cores on NVIDIA, stream processors on AMD) for parallel compute.

Memory bandwidth for rapid data transfer.
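
As a rough way to reason about the VRAM criterion, the memory needed for the weights alone is approximately the parameter count times the bytes per parameter. The sketch below applies that rule of thumb; the 7‑billion‑parameter figure and the function name `estimate_weight_memory_gb` are illustrative assumptions, and the estimate ignores the context cache and runtime overhead.

```python
def estimate_weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9


# Assumed example: a 7-billion-parameter model at common precisions
for bits in (16, 8, 4):
    gb = estimate_weight_memory_gb(7e9, bits)
    print(f"7B parameters at {bits}-bit: about {gb:.1f} GB")
```

By this estimate, 16‑bit weights for such a model need about 14 GB and a 4‑bit quantized copy about 3.5 GB; actual usage will be higher once the context cache and runtime buffers are included.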

High‑Performance GPU Examples

NVIDIA RTX 3090 (24 GB VRAM).

NVIDIA A100 (data‑center GPU, 40 GB or 80 GB VRAM).

AMD Radeon Pro VII (16 GB HBM2, high bandwidth).

Prerequisites

Python 3

ChromaDB (vector database)

Ollama (local LLM runtime)

Setup Steps

1. Install Python 3 and Create a Virtual Environment

$ python3 --version
# Python 3.11.7

$ mkdir local-rag
$ cd local-rag

$ python3 -m venv venv

$ source venv/bin/activate
# Windows: venv\Scripts\activate
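
To confirm that the virtual environment is actually active before installing anything, a small check like the following may help. It relies on the standard‑library convention that `sys.prefix` differs from `sys.base_prefix` inside a venv; the helper name `in_virtualenv` is made up for this sketch.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from base prefix)."""
    return sys.prefix != sys.base_prefix


print("Python", sys.version.split()[0])
print("Virtual environment active:", in_virtualenv())
```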

2. Install Dependencies

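$ pip install --quiet chromadb

# langchain-community and python-dotenv are imported by the modules in this tutorial
$ pip install --quiet unstructured langchain langchain-community langchain-text-splitters
$ pip install --quiet "unstructured[all-docs]"

$ pip install --quiet flask python-dotenv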

3. Install Ollama

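Download the installer for your OS from the Ollama website, then verify the installation:

$ ollama --version
# ollama version is 0.1.47

Pull the chat model and the embedding model used by the application:

$ ollama pull mistral

$ ollama pull nomic-embed-text

Start the Ollama server if it is not already running (the desktop installers typically launch it in the background):

$ ollama serve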
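
Before moving on, it can be useful to check from Python that the server is reachable and that both models were pulled. The sketch below assumes Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, which lists local models; the helper names are hypothetical, and the response shape should be checked against the Ollama API documentation for your version.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def parse_model_names(payload):
    """Extract model names from a /api/tags style response dict."""
    return [m.get("name", "") for m in payload.get("models", [])]


def list_local_models(base_url=OLLAMA_URL, timeout=5):
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama server not reachable; is `ollama serve` running?")
    else:
        print("Local models:", ", ".join(models) or "(none)")
```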

Building the RAG Application

The application consists of four Python modules.

app.py (Flask entry point)

import os
from dotenv import load_dotenv

# Load .env first: the modules imported below read os.getenv at import time
load_dotenv()

from flask import Flask, request, jsonify
from embed import embed
from query import query

TEMP_FOLDER = os.getenv('TEMP_FOLDER', './_temp')
os.makedirs(TEMP_FOLDER, exist_ok=True)

app = Flask(__name__)

@app.route('/embed', methods=['POST'])
def route_embed():
    # Expect a multipart upload with the PDF in the "file" field
    if 'file' not in request.files:
        return jsonify({"error": "No file part"}), 400
    file = request.files['file']
    if file.filename == '':
        return jsonify({"error": "No selected file"}), 400
    embedded = embed(file)
    if embedded:
        return jsonify({"message": "File embedded successfully"}), 200
    return jsonify({"error": "File embedded unsuccessfully"}), 400

@app.route('/query', methods=['POST'])
def route_query():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True) or {}
    response = query(data.get('query'))
    if response:
        return jsonify({"message": response}), 200
    return jsonify({"error": "Something went wrong"}), 400

if __name__ == '__main__':
    # debug=True would expose the interactive debugger on every network interface
    app.run(host="0.0.0.0", port=8080, debug=False)

embed.py (Document embedding)

import os
from datetime import datetime
from werkzeug.utils import secure_filename
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from get_vector_db import get_vector_db

TEMP_FOLDER = os.getenv('TEMP_FOLDER', './_temp')

def allowed_file(filename):
    # Accept only files with a .pdf extension
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in {'pdf'}

def save_file(file):
    # Prefix a timestamp so repeated uploads of the same name do not collide
    ct = datetime.now()
    ts = ct.timestamp()
    filename = f"{ts}_{secure_filename(file.filename)}"
    file_path = os.path.join(TEMP_FOLDER, filename)
    file.save(file_path)
    return file_path

def load_and_split_data(file_path):
    # Extract the PDF text, then split it into overlapping chunks for embedding
    loader = UnstructuredPDFLoader(file_path=file_path)
    data = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
    return splitter.split_documents(data)

def embed(file):
    if file.filename and allowed_file(file.filename):
        path = save_file(file)
        try:
            chunks = load_and_split_data(path)
            db = get_vector_db()
            db.add_documents(chunks)
            db.persist()  # Chroma 0.4+ persists automatically; kept for older versions
        finally:
            os.remove(path)  # always delete the temporary upload
        return True
    return False
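
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sliding‑window splitter. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally tries to break at natural separators such as paragraphs and spaces, which this hypothetical `chunk_text` helper does not attempt.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for piece in chunk_text(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
```

Each printed piece begins with the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.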

query.py (Answer generation)

import os
from langchain_community.chat_models import ChatOllama
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever
from get_vector_db import get_vector_db

LLM_MODEL = os.getenv('LLM_MODEL', 'mistral')

def get_prompt():
    # Prompt asking the LLM to rephrase the question for multi-query retrieval
    QUERY_PROMPT = PromptTemplate(
        input_variables=["question"],
        template="""You are an AI assistant. Generate five different versions of the given user question to retrieve relevant documents from a vector database. Provide each alternative on a new line. Original question: {question}"""
    )
    # Prompt constraining the final answer to the retrieved context
    template = """Answer the question based ONLY on the following context:
{context}
Question: {question}"""
    prompt = ChatPromptTemplate.from_template(template)
    return QUERY_PROMPT, prompt

def query(input):
    if not input:
        return None
    llm = ChatOllama(model=LLM_MODEL)
    db = get_vector_db()
    QUERY_PROMPT, prompt = get_prompt()
    # Retrieve with several rephrasings of the question, then answer from the results
    retriever = MultiQueryRetriever.from_llm(db.as_retriever(), llm, prompt=QUERY_PROMPT)
    chain = ({"context": retriever, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser())
    return chain.invoke(input)
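
Conceptually, multi‑query retrieval runs one search per rephrased question and keeps the unique union of the results. The sketch below shows only that merging idea with a fake in‑memory index; `merge_unique`, `multi_query_retrieve`, and `fake_index` are hypothetical names, and the real `MultiQueryRetriever` obtains the rephrasings from the LLM.

```python
def merge_unique(result_lists):
    """Flatten per-query result lists, keeping first occurrences in order."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def multi_query_retrieve(question_variants, search):
    """Run search(query) for each variant and return the de-duplicated union."""
    return merge_unique(search(q) for q in question_variants)


# Hypothetical index: each question variant maps to the chunk ids it would retrieve
fake_index = {
    "What is RAG?": ["chunk-1", "chunk-2"],
    "Define retrieval-augmented generation": ["chunk-2", "chunk-3"],
}
print(multi_query_retrieve(list(fake_index), lambda q: fake_index[q]))
```

Asking the same question in several ways widens recall, since a chunk missed by one phrasing may be found by another.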

get_vector_db.py (Vector store initialization)

import os
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores.chroma import Chroma

CHROMA_PATH = os.getenv('CHROMA_PATH', 'chroma')
COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'local-rag')
TEXT_EMBEDDING_MODEL = os.getenv('TEXT_EMBEDDING_MODEL', 'nomic-embed-text')

def get_vector_db():
    embedding = OllamaEmbeddings(model=TEXT_EMBEDDING_MODEL, show_progress=True)
    return Chroma(collection_name=COLLECTION_NAME, persist_directory=CHROMA_PATH, embedding_function=embedding)
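
For intuition about what the vector store does with these embeddings, the sketch below ranks stored vectors by cosine similarity to a query vector. It is an illustration only: the three‑dimensional vectors and the `nearest` helper are invented for the example, and the distance metric a real store uses is configurable and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec, stored, k=1):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda doc_id: cosine_similarity(query_vec, stored[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings; real ones have hundreds of dimensions
stored = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.7, 0.7, 0.0],
}
print(nearest([0.9, 0.1, 0.0], stored, k=2))
```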

Running the Application

Create a .env file in the project directory. Each variable is optional; the modules fall back to the values shown here when it is not set:

TEMP_FOLDER=./_temp
CHROMA_PATH=chroma
COLLECTION_NAME=local-rag
LLM_MODEL=mistral
TEXT_EMBEDDING_MODEL=nomic-embed-text

Start the Flask server:

$ python3 app.py

Use curl to embed a PDF and query the model:

$ curl -X POST http://localhost:8080/embed -F file=@/path/to/document.pdf

$ curl -X POST http://localhost:8080/query -H "Content-Type: application/json" -d '{"query": "What is RAG?"}'
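
The query call can also be made from Python with only the standard library. This is a minimal sketch that assumes the server from app.py is listening on `http://localhost:8080` and returns its answer under the `message` key, as in the route above; `build_query_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request


def build_query_request(question, base_url="http://localhost:8080"):
    """Build the POST request that mirrors the curl /query call."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:8080", timeout=120):
    """Send the question and return the answer text from the JSON response."""
    request = build_query_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["message"]


if __name__ == "__main__":
    try:
        print(ask("What is RAG?"))
    except urllib.error.URLError as exc:
        # Covers both an unreachable server and HTTP error responses
        print("Request failed:", exc)
```

The generous timeout reflects that a local model may take a while to answer, especially on the first request after it is loaded.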

Conclusion

By following these steps you can deploy a private, high‑performance RAG chatbot using Ollama, Python, and ChromaDB, giving you full control over data privacy, customization, and scalability.
