Run Claude Code Locally or in the Cloud in 5 Minutes with Ollama, LM Studio, llama.cpp, and OpenRouter
This guide shows how to configure Claude Code to run on local or cloud models within five minutes, covering hardware requirements, recommended models, step‑by‑step installation for Ollama, llama.cpp, LM Studio, and cloud‑based options, plus performance and cost comparisons.
Hardware requirements
For programming use, 32 GB RAM is recommended. Models with 24 B+ parameters run comfortably on a MacBook Pro M1 (32 GB) and on an Nvidia DGX Spark (120 GB, GB10 GPU). 16 GB can run tiny models, but results in a slower, error‑prone experience.
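A quick back‑of‑the‑envelope check explains the 32 GB recommendation (a sketch; the bytes‑per‑weight figures for each quantization are approximations, and the 24 B parameter count is illustrative):

```shell
# Rough resident-memory estimate: parameters x bytes per weight.
# Approximate bytes/weight: F16 = 2, Q8_0 ~ 1, Q4_K_M ~ 0.56
params=24000000000                            # 24 B parameters
f16_gb=$(( params * 2 / 1000000000 ))         # full 16-bit weights
q8_gb=$(( params / 1000000000 ))              # 8-bit quantization
q4_gb=$(( params * 56 / 100 / 1000000000 ))   # 4-bit quantization
echo "F16=${f16_gb}GB Q8_0=${q8_gb}GB Q4_K_M=${q4_gb}GB"
```

At F16 the weights of a 24 B model alone exceed 32 GB, while 4–8‑bit quantizations leave room for the KV cache and the OS, which is why quantized ~24–30 B models are the sweet spot here.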
Recommended starter models
devstral-small-2 – 24 B – good starting point for coding quality
qwen3-coder:30b – 30 B – better coding ability, still usable on 32 GB
GLM4.7-flash:q8_0 – ~30 B (quantized) – excellent price‑performance trade‑off
Why use alternative models?
Running Claude Code against the official Anthropic API becomes expensive quickly. Third‑party alternatives can cut costs by up to 98 % (e.g., DeepSeek V3.2 is the cheapest cloud option; Ollama’s local models are free).
Solution 1 – Ollama local model
Time: 5 minutes | Cost: Free | Scenario: privacy, no internet required
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Download a model (e.g., devstral-small-2):
ollama pull devstral-small-2
Launch Claude Code with the model:
ollama launch claude --model devstral-small-2
Add environment variables (e.g., to ~/.zshrc or ~/.bashrc):
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY="your api key"
export ANTHROPIC_BASE_URL="http://localhost:11434"
Apply the configuration and run Claude Code:
source ~/.zshrc
claude --model devstral-small-2
Performance on the Mac M1 was acceptable with devstral-small-2. qwen3-coder (30 B) was too slow, while glm-4.7-flash (30 B, F16) matched the speed of Claude Opus 4.5.
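To avoid cluttering ~/.zshrc, the same exports can live in a dedicated file that is sourced only when needed (a sketch; the file name ~/.claude-ollama.env is an arbitrary choice, not an Ollama or Claude Code convention):

```shell
# Write the Claude Code -> Ollama settings to a standalone env file
cat > "$HOME/.claude-ollama.env" <<'EOF'
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_BASE_URL="http://localhost:11434"
EOF

# Load the settings only in the shell that runs Claude Code
. "$HOME/.claude-ollama.env"
echo "$ANTHROPIC_BASE_URL"
```

This keeps the Anthropic overrides out of shells that should keep talking to the official API.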
Solution 2 – llama.cpp
Time: 15‑20 minutes | Cost: Free | Scenario: any HuggingFace model
Compile llama.cpp.
macOS (Apple Silicon):
brew install cmake
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_METAL=ON
cmake --build llama.cpp/build --config Release -j
cp llama.cpp/build/bin/llama-* llama.cpp/
Linux (NVIDIA GPU):
sudo apt-get update && sudo apt-get install build-essential cmake git -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j
cp llama.cpp/build/bin/llama-* llama.cpp/
Start a server with a HuggingFace model (example: bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q4_K_M):
llama-server -hf bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q4_K_M \
--alias "Qwen3-Coder-REAP-25B-A3B-GGUF" \
--port 8000 \
--jinja \
--kv-unified \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on \
--batch-size 4096 --ubatch-size 1024 \
--ctx-size 64000
The --jinja flag is essential for tool calls.
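The q8_0 cache flags matter at a 64 k context. As a rough estimate (a sketch; the layer count, KV‑head count, and head dimension below are assumed values for a model of this class, not read from the GGUF file):

```shell
# KV cache ~ 2 (K and V) x layers x context x kv_heads x head_dim x bytes/element
layers=48 ctx=64000 kv_heads=8 head_dim=128   # assumed architecture values
elems=$(( 2 * layers * ctx * kv_heads * head_dim ))
f16_mb=$(( elems * 2 / 1000000 ))   # default f16 cache: 2 bytes/element
q8_mb=$(( elems / 1000000 ))        # q8_0 cache: ~1 byte/element
echo "KV cache: f16=${f16_mb}MB q8_0=${q8_mb}MB"
```

Quantizing the cache roughly halves its footprint, which is what leaves room for the model weights on a 32 GB machine.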
Connect Claude Code to the server:
export ANTHROPIC_BASE_URL="http://localhost:8000"
claude --model Qwen3-Coder-REAP-25B-A3B-GGUF
Solution 3 – LM Studio
Time: 5 minutes | Cost: Free | Scenario: privacy, no internet, graphical UI
Install LM Studio:
curl -fsSL https://lmstudio.ai/install.sh | bash
Download a model via the LM Studio UI (e.g., qwen3-coder).
Start the LM Studio server on a chosen port (e.g., 1234):
lms server start --port 1234
Set environment variables:
export ANTHROPIC_BASE_URL="http://localhost:1234"
export ANTHROPIC_AUTH_TOKEN="lmstudio"
Run Claude Code with the selected model:
claude --model qwen/qwen3-coder-30b
Solution 4 – Ollama cloud model
Time: 2 minutes | Cost: pay‑as‑you‑go | Scenario: local workflow with cloud compute
Pull a cloud‑enabled model (e.g., kimi-k2.5:cloud or minimax-m2.1:cloud):
ollama pull kimi-k2.5:cloud
ollama pull minimax-m2.1:cloud
Launch Claude Code with the cloud model:
ollama launch claude --model minimax-m2.1:cloud
The :cloud variant runs on Ollama’s infrastructure with the same CLI, eliminating API‑key management.
Solution 5 – Direct cloud‑provider API
Time: 2 minutes | Cost: pay‑as‑you‑go | Scenario: direct API access, more control
Configure Claude Code to use OpenRouter (or another provider) by setting environment variables:
export ANTHROPIC_BASE_URL="https://openrouter.ai/api"
export ANTHROPIC_API_KEY="" # leave empty for OpenRouter
export ANTHROPIC_MODEL="openai/gpt-oss-120b:free"
Run Claude Code with the chosen provider model:
claude --model openai/gpt-oss-120b:free
Similar configurations work for Minimax, DeepSeek, Kimi, or GLM by adjusting ANTHROPIC_BASE_URL, ANTHROPIC_MODEL, and ANTHROPIC_API_KEY.
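Since every provider differs only in these three variables, switching can be wrapped in a tiny helper (a sketch; the function name use_provider is our invention, and the example values are the OpenRouter settings from this section; other providers need their real endpoint, model ID, and key):

```shell
# use_provider BASE_URL MODEL [API_KEY] -- retarget Claude Code in one call
use_provider() {
  export ANTHROPIC_BASE_URL="$1"
  export ANTHROPIC_MODEL="$2"
  export ANTHROPIC_API_KEY="${3:-}"   # empty unless the provider needs a key
}

# Example: the OpenRouter configuration used above
use_provider "https://openrouter.ai/api" "openai/gpt-oss-120b:free"
echo "$ANTHROPIC_BASE_URL $ANTHROPIC_MODEL"
```

Defining one such function per provider (or reading the values from per‑provider env files) makes it easy to jump between backends in a single shell session.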
Conclusion
Claude Code is now highly flexible: it can remain on Anthropic’s API, run locally on a Mac M1 with devstral-small-2 (24 B) for privacy, leverage the larger GPU memory of an Nvidia DGX Spark for bigger models, or use inexpensive cloud providers such as Kimi, Minimax, DeepSeek, or GLM, which can be up to 98 % cheaper than Opus 4.5. The accompanying GitHub repository (https://github.com/luongnv89/claude-howto) contains the full scripts and configuration details.
AI Algorithm Path
A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
