Run Claude Code Locally with Qwen 3.5 to Skip Anthropic API Costs
This guide shows how to replace Anthropic's API with a locally hosted Qwen 3.5 model served by llama.cpp, pointing Claude Code at it via ANTHROPIC_BASE_URL. It covers hardware checks, build steps, model download, server launch, speed fixes, and usage instructions for secure, cost-free development.
Run Claude Code without Anthropic API
Set ANTHROPIC_BASE_URL to the address of a locally hosted llama.cpp server so Claude Code's requests are routed locally, avoiding external API costs and keeping data on-premises.
Hardware suitability
Select a model size that fits your GPU memory. Supported OSes: Windows, macOS (Metal), and Linux; NVIDIA GPUs give the best performance.
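As a rough sanity check before downloading, weight memory scales with parameter count times bits per weight. A minimal sketch; the bits-per-weight figures are approximate values I am assuming for llama.cpp quant types, not official numbers:

```python
# Rough VRAM estimate for a quantized GGUF model (illustrative only).
# Bits-per-weight values are approximations for llama.cpp quant types.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_XL": 4.8, "Q8_0": 8.5, "F16": 16.0}

def weight_gb(params_billion: float, quant: str, overhead_gb: float = 1.5) -> float:
    """Approximate GB needed for model weights plus a fixed runtime overhead."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9 + overhead_gb

print(round(weight_gb(35, "Q4_K_XL"), 1))  # → 22.5
```

The ~22.5 GB estimate for the 35B UD-Q4_K_XL model is consistent with the ~23 GB consumption on an RTX 4090 reported later in this guide.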
Step 1: Build llama.cpp
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git-all -y
git clone https://github.com/ggml-org/llama.cpp
Compile with the flag matching your hardware:
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON # NVIDIA
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_METAL=ON # macOS
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF # CPU only
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
Step 2: Download the quantized Qwen 3.5 model
hf download unsloth/Qwen3.5-35B-A3B-GGUF --local-dir unsloth/Qwen3.5-35B-A3B-GGUF --include "*UD-Q4_K_XL*"
If GPU memory is insufficient, replace UD-Q4_K_XL with Q2_K or use the 27B/9B variants.
Step 3: Launch local model server
./llama.cpp/llama-server \
--model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
--alias "unsloth/Qwen3.5-35B-A3B" \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--port 8001 \
--kv-unified \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on --fit on \
--ctx-size 131072
To skip the model’s thinking output and improve speed, add:
--chat-template-kwargs "{\"enable_thinking\": false}"
After starting, open http://localhost:8001 in a browser; the llama.cpp web UI should appear.
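Beyond the browser check, you can verify the server answers on its OpenAI-compatible chat endpoint. A minimal standard-library sketch; the model name must match the --alias passed to llama-server above:

```python
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a request against llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        "model": "unsloth/Qwen3.5-35B-A3B",  # must match the server's --alias
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8001", "Say hello in one word.")
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

If the request returns a completion, the server is ready for Claude Code to use.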
Step 4: Point Claude Code to the local service
Mac/Linux
export ANTHROPIC_BASE_URL="http://localhost:8001"
export ANTHROPIC_API_KEY="sk-no-key-required"
Persist these by adding the lines to ~/.bashrc or ~/.zshrc.
Windows PowerShell
$env:ANTHROPIC_BASE_URL="http://localhost:8001"
$env:ANTHROPIC_API_KEY="sk-no-key-required"
Make these permanent with setx ANTHROPIC_BASE_URL "http://localhost:8001" (and likewise for the key) or by editing your $PROFILE script.
Skip login prompt
Add the following keys to ~/.claude.json:
{
"hasCompletedOnboarding": true,
"primaryApiKey": "sk-dummy-key"
}
Or enable “Disable Login Prompt” in the Claude Code extension settings.
Common pitfalls
Speed slowdown
Claude Code’s new attribution header defeats KV-cache reuse, roughly halving throughput. Disable it by editing ~/.claude/settings.json:
{
"promptSuggestionEnabled": false,
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "0",
"CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
"CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
},
"plansDirectory": "./plans",
"effortLevel": "high"
}
Device recommendations
A MacBook Pro M4 Max with 32 GB RAM may lag on the 35B model; use the 27B variant instead.
RTX 4090 with 24 GB VRAM fits the 35B UD‑Q4_K_XL model, consuming about 23 GB.
If memory is tight, lower the --ctx-size parameter or switch to a smaller quantized version.
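Lowering --ctx-size helps because KV-cache memory grows linearly with context length. A rough sketch; the layer and head counts below are illustrative placeholders, not Qwen 3.5's real architecture:

```python
# KV-cache memory scales linearly with context length, which is why lowering
# --ctx-size frees GPU memory. Architecture numbers are placeholders.
def kv_cache_gb(ctx: int, n_layers: int = 48, n_kv_heads: int = 4,
                head_dim: int = 128, bytes_per_elem: float = 1.0) -> float:
    """Approximate KV-cache size in GB (q8_0 ≈ 1 byte/element, f16 = 2)."""
    # Factor of 2: one cache each for keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

full = kv_cache_gb(131072)  # the --ctx-size used above
half = kv_cache_gb(65536)
print(round(full, 2), round(half, 2))  # halving ctx halves KV-cache memory
```

This also shows why the q8_0 cache types in the launch command matter: with f16 (bytes_per_elem=2.0) the same context would need twice the memory.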
Usage
From the project directory run: claude --model unsloth/Qwen3.5-35B-A3B
To let Claude execute commands automatically, add the --dangerously-skip-permissions flag (use at your own risk). The Claude Code plugins for VS Code and Cursor also support in-editor usage.
This setup is suitable for processing sensitive internal codebases, avoiding third‑party API exposure and saving costs, though some Claude Code tools may be unavailable and coding performance varies across models.
Official documentation: https://unsloth.ai/docs/basics/claude-code
AI Engineering
Focused on cutting‑edge product and technology information and practical experience sharing in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).
