Running Local LLMs: Ollama vs Hugging Face – A Hands‑On Comparison
This guide compares Ollama and Hugging Face for running large language models locally, detailing API and local execution methods, installation steps, model selection, resource requirements, integration with AnythingLLM, container deployment, embedding and vector store setup, and practical observations on performance and limitations.
Running LLMs locally: Ollama vs Hugging Face
Both Ollama and Hugging Face can be used to run large language models on a local machine. Hugging Face generally requires a Python environment (and an API token when using its hosted Inference API), while Ollama provides a single-command workflow that is approachable even for non-programmers.
Hugging Face local usage
Inference API (serverless)
import requests
API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-hf"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "Can you please let us know more details about your "})
Local execution with Transformers pipeline
# Use a pipeline as a high‑level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
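Once created, the pipeline can be called directly on a prompt. A minimal sketch of such a call (the prompt and max_new_tokens value are illustrative, not from the original source):

# Generate a short completion; prompt and parameters are illustrative
result = pipe("Explain Domain-Driven Design in one sentence.", max_new_tokens=50)
print(result[0]["generated_text"])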
Ollama installation and basic commands
Download URL: https://ollama.com/download
ollama run llama2:7b
Typical RAM requirements: 7B model ≥ 8 GB, 13B model ≥ 16 GB, 70B model ≥ 64 GB.
Sample interaction (no RAG, no fine‑tuning)
>>> what's the weather in Chengdu, China?
Currently, the weather in Chengdu, China is:
* Temperature: 24°C (75°F)
* Humidity: 60%
* Wind speed: 17 km/h (11 mph)
* Visibility: 10 km (6.2 miles)
* Sunrise: 6:30 AM
* Sunset: 7:00 PM
>>> what's today's date?
Today's date is March 14, 2023.
>>> what is DDD?
DDD (Domain‑Driven Design) is an approach to software development that emphasizes modeling the core business domain...
Switch to another model:
ollama run codellama
Available Ollama models
ollama run llama2 – Llama 2, 7B, ~3.8 GB
ollama run mistral – Mistral, 7B, ~4.1 GB
ollama run dolphin-phi – Dolphin Phi, 2.7B, ~1.6 GB
ollama run phi – Phi‑2, 2.7B, ~1.7 GB
ollama run neural-chat – Neural Chat, 7B, ~4.1 GB
ollama run starling-lm – Starling, 7B, ~4.1 GB
ollama run codellama – Code Llama, 7B, ~3.8 GB
ollama run llama2-uncensored – Llama 2 Uncensored, 7B, ~3.8 GB
ollama run llama2:13b – Llama 2, 13B, ~7.3 GB
ollama run llama2:70b – Llama 2, 70B, ~39 GB
ollama run orca-mini – Orca Mini, 3B, ~1.9 GB
ollama run vicuna – Vicuna, 7B, ~3.8 GB
ollama run llava – LLaVA, 7B, ~4.5 GB
ollama run gemma:2b – Gemma, 2B, ~1.4 GB
ollama run gemma:7b – Gemma, 7B, ~4.8 GB
AnythingLLM integration
Ollama can operate in two modes: chat mode (direct terminal interaction) and server mode (model runs as a backend service).
Start server mode
ollama serve
curl http://localhost:11434
Expected response: Ollama is running
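In server mode the model can also be queried over Ollama's HTTP API rather than the interactive terminal. A minimal sketch against the /api/generate endpoint, assuming llama2:7b has already been pulled (the prompt is illustrative):

import requests

# Ask the local Ollama server for a non-streaming completion
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2:7b", "prompt": "What is DDD?", "stream": False},
)
print(resp.json()["response"])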
Docker deployment of AnythingLLM
Reference URL: https://github.com/Mintplex-Labs/anything-llm/blob/master/docker/HOW_TO_USE_DOCKER.md
export STORAGE_LOCATION=$HOME/anythingllm && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
mintplexlabs/anythingllm
Access the UI at http://localhost:3001. Because AnythingLLM runs inside Docker, configure its Ollama base URL with host.docker.internal or 172.17.0.1 instead of localhost so the container can reach the Ollama service on the host.
Embedding model selection
Example embedding model: nomic‑embed‑text (https://ollama.com/library/nomic-embed-text) or the default model provided by AnythingLLM.
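A minimal sketch of generating a vector through Ollama's /api/embeddings endpoint, assuming nomic-embed-text has already been pulled with ollama pull nomic-embed-text (the sample text is illustrative):

import requests

# Request an embedding for one piece of text from the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Domain-Driven Design basics"},
)
vector = resp.json()["embedding"]
print(len(vector))  # dimensionality of the returned embedding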
Vector store setup
A local vector store can be built with LangChain, or a managed service such as Pinecone can be used.
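A minimal sketch of the LangChain route, assuming the langchain-community and faiss-cpu packages are installed (module paths differ between LangChain releases, and the sample texts are illustrative):

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

# Embed a few document chunks with the local Ollama embedding model
embeddings = OllamaEmbeddings(model="nomic-embed-text")
docs = ["Ollama runs models locally.", "AnythingLLM provides a chat UI."]
store = FAISS.from_texts(docs, embeddings)

# Retrieve the chunk most similar to a query
hits = store.similarity_search("Which tool runs models locally?", k=1)
print(hits[0].page_content)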
RAG agent construction steps
Define the problem statement.
Choose a programming language.
Install required libraries.
Design the RAG architecture (retrieval, augmentation, generation).
Collect data from sources (websites, documents, APIs).
Process data into embeddings using the selected embedding model.
Store embeddings in the vector database.
Query the vector store, retrieve relevant chunks, and feed them to the LLM (see the sketch after this list).
Test and validate the end‑to‑end pipeline.
Deploy the agent and monitor performance.
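A minimal sketch of steps 8–10, reusing the FAISS store from the vector store sketch above and an Ollama server running in server mode (the model name and prompt wording are illustrative):

import requests

question = "Which tool runs models locally?"

# 1. Retrieve the most relevant chunks from the vector store
context = "\n".join(d.page_content for d in store.similarity_search(question, k=2))

# 2. Augment the prompt with the retrieved context
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# 3. Generate an answer with the local model via Ollama's API
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2:7b", "prompt": prompt, "stream": False},
)
print(resp.json()["response"])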
Importing external documents
AnythingLLM’s UI allows importing website content and a variety of local file formats (PDF, DOCX, TXT, etc.).
Key observations
Running LLMs locally requires substantial RAM; practical base models are typically ≤13 B.
Chinese language support is limited in many publicly available models.
CPU‑only inference can exhibit noticeable latency.