How to Build a Secure Local LLM Chatbot with Ollama, Python, and ChromaDB
This tutorial walks you through creating a privacy‑preserving, locally hosted large language model chatbot using Ollama, Python 3, and ChromaDB, covering RAG fundamentals, GPU selection, environment setup, and full source code for a Flask‑based application.
Why Build a Local LLM?
In an era where data privacy is critical, hosting your own large language model (LLM) on‑premises gives you full control over customization, privacy, security, and data processing, while eliminating reliance on internet connectivity.
Key Benefits
Full Customization: Tailor the retrieval‑augmented generation (RAG) pipeline to your exact needs.
Enhanced Privacy: Sensitive data never leaves your network.
Data Security: Reduce the risk of leaks by keeping source documents (e.g., PDFs) in a secure environment.
Control Over Processing: Store private embeddings in ChromaDB and manage data handling yourself.
Offline Operation: Once the models are downloaded, the chatbot works without an internet connection.
Retrieval‑Augmented Generation (RAG)
RAG combines information retrieval with text generation to produce more accurate, context‑aware responses.
How RAG Works
Retrieval: The application queries a knowledge base or vector store to fetch the documents most relevant to the user's question.
Generation: The language model generates a response using the retrieved documents as context.
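The two steps above can be sketched without any external libraries. The example below is only an illustration under stated assumptions: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, `DOCS` is made-up sample data, and the generation step is reduced to building the prompt that would be sent to the model.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, documents, k=1):
    """Retrieval step: return the k documents most similar to the question."""
    q = toy_embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Generation step, input side: place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return (
        "Answer the question based ONLY on the following context:\n"
        f"{context}\n"
        f"Question: {question}"
    )

# Made-up sample knowledge base.
DOCS = [
    "ChromaDB stores embeddings on local disk",
    "Ollama runs language models on your own machine",
    "Flask exposes the chatbot over HTTP",
]

question = "where does chromadb keep embeddings"
top = retrieve(question, DOCS, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the full application, a real embedding model and a vector store take the place of `toy_embed` and the sorted list, and the prompt is passed to the LLM rather than printed.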
Advantages of RAG
Improved accuracy through external data.
Better contextual relevance.
Scalable to large datasets.
Flexible – update the knowledge base without retraining the model.
Why Run RAG Locally?
Privacy and security of sensitive data.
Full customization of retrieval and generation pipelines.
Independence from internet connectivity.
GPU Considerations for Local LLMs
Ollama can run models on a CPU, but responsive generation generally calls for a GPU that offers parallel processing, high‑bandwidth memory, and fast embedding and retrieval.
Choosing the Right GPU
Memory capacity (VRAM) to fit model parameters.
Number of compute cores for parallel work (CUDA cores on NVIDIA, stream processors on AMD).
Memory bandwidth for rapid data transfer.
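The VRAM criterion can be turned into a rough back-of-the-envelope check. The sketch below estimates the memory needed for model weights alone (parameter count times bytes per parameter). It ignores the KV cache, activations, and runtime overhead. The 20% headroom and the round 7-billion-parameter figure are illustrative assumptions, not measurements.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def fits(num_params, bits_per_param, vram_gb, headroom=0.2):
    """True if the weights fit while leaving a `headroom` fraction of VRAM free.

    The headroom value is an assumed safety margin for cache and overhead.
    """
    return weight_memory_gb(num_params, bits_per_param) <= vram_gb * (1 - headroom)

SEVEN_B = 7_000_000_000  # illustrative round figure for a "7B" model

# Compare 16-bit, 8-bit, and 4-bit weights against a 24 GB card.
for bits in (16, 8, 4):
    size = weight_memory_gb(SEVEN_B, bits)
    print(f"{bits}-bit: {size:.1f} GB, fits in 24 GB: {fits(SEVEN_B, bits, vram_gb=24)}")
```

Under these assumptions, quantization matters as much as the card itself: halving the bits per parameter halves the weight footprint.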
High‑Performance GPU Examples
NVIDIA RTX 3090 (24 GB VRAM).
NVIDIA A100 (AI‑optimized, large memory).
AMD Radeon Pro VII (high bandwidth).
Prerequisites
Python 3
ChromaDB (vector database)
Ollama (local LLM runtime)
Setup Steps
1. Install Python 3 and Create a Virtual Environment
$ python3 --version
# Python 3.11.7
$ mkdir local-rag
$ cd local-rag
$ python3 -m venv venv
$ source venv/bin/activate
# Windows: venv\Scripts\activate
2. Install Dependencies
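$ pip install --quiet chromadb
$ pip install --quiet unstructured langchain langchain-text-splitters
$ pip install --quiet "unstructured[all-docs]"
$ pip install --quiet flask
# The modules below also import langchain_community and dotenv
$ pip install --quiet langchain-community python-dotenv
3. Install Ollama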
Download the installer for your OS from the Ollama website, then verify the installation:
$ ollama --version
# ollama version is 0.1.47
$ ollama pull mistral
$ ollama pull nomic-embed-text
$ ollama serve
Building the RAG Application
The application consists of four Python modules.
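All four modules depend on a running Ollama server with both models pulled, so it can be worth confirming that from Python first. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`, and uses its `/api/tags` endpoint, which lists locally installed models. The helper names are hypothetical and not part of the application.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address
REQUIRED = ("mistral", "nomic-embed-text")  # models pulled during setup

def parse_model_names(payload):
    """Extract base model names (without the ':tag' suffix) from an /api/tags response."""
    return {m["name"].split(":")[0] for m in payload.get("models", [])}

def missing_models(payload, required=REQUIRED):
    """Return the required models that are not installed, in the order given."""
    installed = parse_model_names(payload)
    return [name for name in required if name not in installed]

def fetch_tags(base_url=OLLAMA_URL, timeout=5):
    """Ask the local Ollama server which models it has."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)

if __name__ == "__main__":
    try:
        missing = missing_models(fetch_tags())
    except OSError as exc:  # connection refused, timeout, HTTP errors
        print(f"Ollama not reachable at {OLLAMA_URL}: {exc}")
    else:
        print("All models present" if not missing else f"Missing: {missing}")
```

If the check reports a missing model, running the corresponding `ollama pull` command from the setup step should resolve it.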
app.py (Flask entry point)
import os
from dotenv import load_dotenv

# Load .env before importing the modules that read environment variables at import time.
load_dotenv()

from flask import Flask, request, jsonify
from embed import embed
from query import query

TEMP_FOLDER = os.getenv('TEMP_FOLDER', './_temp')
os.makedirs(TEMP_FOLDER, exist_ok=True)

app = Flask(__name__)

@app.route('/embed', methods=['POST'])
def route_embed():
    if 'file' not in request.files:
        return jsonify({"error": "No file part"}), 400
    file = request.files['file']
    if file.filename == '':
        return jsonify({"error": "No selected file"}), 400
    embedded = embed(file)
    if embedded:
        return jsonify({"message": "File embedded successfully"}), 200
    return jsonify({"error": "File embedded unsuccessfully"}), 400

@app.route('/query', methods=['POST'])
def route_query():
    # silent=True yields None instead of raising on a missing or malformed JSON body.
    data = request.get_json(silent=True) or {}
    response = query(data.get('query'))
    if response:
        return jsonify({"message": response}), 200
    return jsonify({"error": "Something went wrong"}), 400

if __name__ == '__main__':
    # debug=True enables Werkzeug's interactive debugger, which allows code execution;
    # keep it off when listening on all interfaces. Use host="127.0.0.1" to accept
    # connections from this machine only.
    app.run(host="0.0.0.0", port=8080, debug=False)
embed.py (Document embedding)
import os
from datetime import datetime
from werkzeug.utils import secure_filename
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from get_vector_db import get_vector_db

TEMP_FOLDER = os.getenv('TEMP_FOLDER', './_temp')

def allowed_file(filename):
    # Accept only files with a .pdf extension.
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in {'pdf'}

def save_file(file):
    # Prefix a timestamp so repeated uploads of the same name do not collide.
    ts = datetime.now().timestamp()
    filename = f"{ts}_{secure_filename(file.filename)}"
    file_path = os.path.join(TEMP_FOLDER, filename)
    file.save(file_path)
    return file_path

def load_and_split_data(file_path):
    # Extract the PDF text, then split it into overlapping chunks for embedding.
    loader = UnstructuredPDFLoader(file_path=file_path)
    data = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
    return splitter.split_documents(data)

def embed(file):
    if file.filename and allowed_file(file.filename):
        path = save_file(file)
        try:
            chunks = load_and_split_data(path)
            db = get_vector_db()
            db.add_documents(chunks)
            # Chroma 0.4 and later persist automatically, so this call may be a
            # no-op that only logs a deprecation warning.
            db.persist()
        finally:
            # Remove the temporary upload even if loading or embedding fails.
            os.remove(path)
        return True
    return False
query.py (Answer generation)
import os
from langchain_community.chat_models import ChatOllama
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever
from get_vector_db import get_vector_db

LLM_MODEL = os.getenv('LLM_MODEL', 'mistral')

def get_prompt():
    # Prompt used to rephrase the user's question into several search queries.
    QUERY_PROMPT = PromptTemplate(
        input_variables=["question"],
        template="""You are an AI assistant. Generate five different versions of the given user question to retrieve relevant documents from a vector database. Provide each alternative on a new line. Original question: {question}"""
    )
    # Prompt used to answer the question from the retrieved context only.
    template = """Answer the question based ONLY on the following context:
{context}
Question: {question}"""
    prompt = ChatPromptTemplate.from_template(template)
    return QUERY_PROMPT, prompt

def query(input):
    if not input:
        return None
    llm = ChatOllama(model=LLM_MODEL)
    db = get_vector_db()
    QUERY_PROMPT, prompt = get_prompt()
    # Retrieve with several rephrasings of the question to widen recall.
    retriever = MultiQueryRetriever.from_llm(db.as_retriever(), llm, prompt=QUERY_PROMPT)
    # Chain: fetch context, fill the prompt, call the model, return plain text.
    chain = ({"context": retriever, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser())
    return chain.invoke(input)
get_vector_db.py (Vector store initialization)
import os
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores.chroma import Chroma

CHROMA_PATH = os.getenv('CHROMA_PATH', 'chroma')
COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'local-rag')
TEXT_EMBEDDING_MODEL = os.getenv('TEXT_EMBEDDING_MODEL', 'nomic-embed-text')

def get_vector_db():
    embedding = OllamaEmbeddings(model=TEXT_EMBEDDING_MODEL, show_progress=True)
    return Chroma(collection_name=COLLECTION_NAME, persist_directory=CHROMA_PATH, embedding_function=embedding)
Running the Application
Create a .env file with the required variables:
TEMP_FOLDER=./_temp
CHROMA_PATH=chroma
COLLECTION_NAME=local-rag
LLM_MODEL=mistral
TEXT_EMBEDDING_MODEL=nomic-embed-text
Start the Flask server:
$ python3 app.py
Use curl to embed a PDF and query the model:
$ curl -X POST http://localhost:8080/embed -F file=@/path/to/document.pdf
$ curl -X POST http://localhost:8080/query -H "Content-Type: application/json" -d '{"query": "What is RAG?"}'
Conclusion
By following these steps you can deploy a private, high‑performance RAG chatbot using Ollama, Python, and ChromaDB, giving you full control over data privacy, customization, and scalability.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
