Running Local LLMs: Ollama vs Hugging Face – A Hands‑On Comparison
This guide compares Ollama and Hugging Face for running large language models locally, detailing API and local execution methods, installation steps, model selection, resource requirements, integration with AnythingLLM, container deployment, embedding and vector store setup, and practical observations on performance and limitations.
Running LLMs locally: Ollama vs Hugging Face
Both Ollama and Hugging Face can be used to run large language models on a local machine. Hugging Face generally requires a Python environment (and an API token when using its hosted Inference API), while Ollama provides a single-command workflow that is approachable even for non-programmers.
Hugging Face local usage
Inference API (serverless)
import requests
API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-hf"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "Can you please let us know more details about your "})
Local execution with Transformers pipeline
# Use a pipeline as a high‑level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
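Once created, the pipeline can be called directly on a prompt. A minimal sketch of such a call (the prompt and max_new_tokens value are illustrative, not from the original source):

# Generate a short completion; prompt and parameters are illustrative
result = pipe("Explain Domain-Driven Design in one sentence.", max_new_tokens=50)
print(result[0]["generated_text"])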
Ollama installation and basic commands
Download URL: https://ollama.com/download
ollama run llama2:7b
Typical RAM requirements: 7B model ≥ 8 GB, 13B model ≥ 16 GB, 70B model ≥ 64 GB.
Sample interaction (no RAG, no fine‑tuning)
>>> what's the weather in Chengdu, China?
Currently, the weather in Chengdu, China is:
* Temperature: 24°C (75°F)
* Humidity: 60%
* Wind speed: 17 km/h (11 mph)
* Visibility: 10 km (6.2 miles)
* Sunrise: 6:30 AM
* Sunset: 7:00 PM
>>> what's today's date?
Today's date is March 14, 2023.
>>> what is DDD?
DDD (Domain‑Driven Design) is an approach to software development that emphasizes modeling the core business domain...
Switch to another model:
ollama run codellama
Available Ollama models
ollama run llama2 – Llama 2, 7B, ~3.8 GB
ollama run mistral – Mistral, 7B, ~4.1 GB
ollama run dolphin-phi – Dolphin Phi, 2.7B, ~1.6 GB
ollama run phi – Phi‑2, 2.7B, ~1.7 GB
ollama run neural-chat – Neural Chat, 7B, ~4.1 GB
ollama run starling-lm – Starling, 7B, ~4.1 GB
ollama run codellama – Code Llama, 7B, ~3.8 GB
ollama run llama2-uncensored – Llama 2 Uncensored, 7B, ~3.8 GB
ollama run llama2:13b – Llama 2, 13B, ~7.3 GB
ollama run llama2:70b – Llama 2, 70B, ~39 GB
ollama run orca-mini – Orca Mini, 3B, ~1.9 GB
ollama run vicuna – Vicuna, 7B, ~3.8 GB
ollama run llava – LLaVA, 7B, ~4.5 GB
ollama run gemma:2b – Gemma, 2B, ~1.4 GB
ollama run gemma:7b – Gemma, 7B, ~4.8 GB
AnythingLLM integration
Ollama can operate in two modes: chat mode (direct terminal interaction) and server mode (model runs as a backend service).
Start server mode
ollama serve
curl http://localhost:11434
Expected response: Ollama is running
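In server mode the model can also be queried over Ollama's HTTP API rather than the interactive terminal. A minimal sketch against the /api/generate endpoint, assuming llama2:7b has already been pulled (the prompt is illustrative):

import requests

# Ask the local Ollama server for a non-streaming completion
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2:7b", "prompt": "What is DDD?", "stream": False},
)
print(resp.json()["response"])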
Docker deployment of AnythingLLM
Reference URL: https://github.com/Mintplex-Labs/anything-llm/blob/master/docker/HOW_TO_USE_DOCKER.md
export STORAGE_LOCATION=$HOME/anythingllm && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
mintplexlabs/anythingllm
Access the UI at http://localhost:3001. Because AnythingLLM runs inside Docker, configure its Ollama base URL with host.docker.internal or 172.17.0.1 instead of localhost so the container can reach the Ollama service on the host.
Embedding model selection
Example embedding model: nomic‑embed‑text (https://ollama.com/library/nomic-embed-text) or the default model provided by AnythingLLM.
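A minimal sketch of generating a vector through Ollama's /api/embeddings endpoint, assuming nomic-embed-text has already been pulled with ollama pull nomic-embed-text (the sample text is illustrative):

import requests

# Request an embedding for one piece of text from the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Domain-Driven Design basics"},
)
vector = resp.json()["embedding"]
print(len(vector))  # dimensionality of the returned embedding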
Vector store setup
A local vector store can be built with LangChain, or a managed service such as Pinecone can be used.
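A minimal sketch of the LangChain route, assuming the langchain-community and faiss-cpu packages are installed (module paths differ between LangChain releases, and the sample texts are illustrative):

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

# Embed a few document chunks with the local Ollama embedding model
embeddings = OllamaEmbeddings(model="nomic-embed-text")
docs = ["Ollama runs models locally.", "AnythingLLM provides a chat UI."]
store = FAISS.from_texts(docs, embeddings)

# Retrieve the chunk most similar to a query
hits = store.similarity_search("Which tool runs models locally?", k=1)
print(hits[0].page_content)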
RAG agent construction steps
Define the problem statement.
Choose a programming language.
Install required libraries.
Design the RAG architecture (retrieval, augmentation, generation).
Collect data from sources (websites, documents, APIs).
Process data into embeddings using the selected embedding model.
Store embeddings in the vector database.
Query the vector store, retrieve relevant chunks, and feed them to the LLM (see the sketch after this list).
Test and validate the end‑to‑end pipeline.
Deploy the agent and monitor performance.
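A minimal sketch of steps 8–10, reusing the FAISS store from the vector store sketch above and an Ollama server running in server mode (the model name and prompt wording are illustrative):

import requests

question = "Which tool runs models locally?"

# 1. Retrieve the most relevant chunks from the vector store
context = "\n".join(d.page_content for d in store.similarity_search(question, k=2))

# 2. Augment the prompt with the retrieved context
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# 3. Generate an answer with the local model via Ollama's API
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2:7b", "prompt": prompt, "stream": False},
)
print(resp.json()["response"])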
Importing external documents
AnythingLLM’s UI allows importing website content and a variety of local file formats (PDF, DOCX, TXT, etc.).
Key observations
Running LLMs locally requires substantial RAM; practical base models are typically ≤13 B.
Chinese language support is limited in many publicly available models.
CPU‑only inference can exhibit noticeable latency.