Run Claude Code Locally with Qwen 3.5 to Skip Anthropic API Costs

This guide shows how to replace Anthropic's API with a local Qwen 3.5 model served by llama.cpp, pointing Claude Code at it via ANTHROPIC_BASE_URL. It covers hardware checks, build steps, model download, server launch, speed fixes, and usage instructions for private, cost-free development.

Run Claude Code without Anthropic API

Set ANTHROPIC_BASE_URL to a locally hosted llama.cpp server to route Claude Code requests locally, avoiding external API costs and keeping data on‑premises.

Hardware suitability

Select a model size based on available GPU memory. Supported operating systems: Windows, macOS (Metal), and Linux. NVIDIA GPUs give the best performance.
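To see how much GPU memory you have to work with, a quick check (NVIDIA-specific; the fallback message covers Apple Silicon and CPU-only machines):

```shell
# Report GPU name and total VRAM on NVIDIA hardware; on macOS, check
# "About This Mac" or `system_profiler SPDisplaysDataType` instead.
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null \
  || echo "no NVIDIA GPU detected; use the Metal (macOS) or CPU-only build"
```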

Step 1: Build llama.cpp

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git-all -y
git clone https://github.com/ggml-org/llama.cpp

Compile with hardware flag:

cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON   # NVIDIA
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_METAL=ON   # macOS
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF   # CPU only
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
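A quick way to confirm the build and copy succeeded before moving on (this just checks that the expected binaries exist and are executable):

```shell
# Each binary should report "ok"; "missing" means the build or copy step failed.
for b in llama-cli llama-server llama-gguf-split; do
  if [ -x "llama.cpp/$b" ]; then echo "$b ok"; else echo "$b missing"; fi
done
```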

Step 2: Download quantized Qwen 3.5 model

hf download unsloth/Qwen3.5-35B-A3B-GGUF --local-dir unsloth/Qwen3.5-35B-A3B-GGUF --include "*UD-Q4_K_XL*"

If GPU memory is insufficient, replace UD-Q4_K_XL with Q2_K or use the 27B/9B variants.
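A rough way to sanity-check whether a quant fits: multiply the parameter count by the effective bits per weight and add a few GB of overhead. The numbers below are assumptions (UD-Q4_K_XL averages roughly 4.5 bits/weight; the definitive figure is the .gguf file size):

```shell
# Back-of-envelope VRAM estimate: params(B) * bits/8 + overhead for KV cache/buffers.
awk 'BEGIN {
  params_b = 35    # billions of parameters (assumed)
  bits     = 4.5   # effective bits per weight for UD-Q4_K_XL (assumed)
  overhead = 3     # GB for KV cache and compute buffers (assumed)
  printf "~%.1f GB VRAM\n", params_b * bits / 8 + overhead
}'
```

This comes out to about 22.7 GB, consistent with the roughly 23 GB observed on an RTX 4090.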

Step 3: Launch local model server

./llama.cpp/llama-server \
    --model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
    --alias "unsloth/Qwen3.5-35B-A3B" \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    --port 8001 \
    --kv-unified \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --flash-attn on --fit on \
    --ctx-size 131072

To skip the model’s thinking output and improve speed, add:

--chat-template-kwargs "{\"enable_thinking\": false}"

After starting, open http://localhost:8001 in a browser; the llama.cpp UI should appear.
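Besides the browser UI, you can smoke-test the endpoint from the shell; llama-server exposes a /health route and an OpenAI-compatible /v1/models listing:

```shell
# Prints "server up" once llama-server is answering on port 8001.
curl -sf http://localhost:8001/health >/dev/null \
  && echo "server up" \
  || echo "server not reachable yet"
curl -s http://localhost:8001/v1/models   # lists the alias set with --alias
```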

Step 4: Point Claude Code to the local service

Mac/Linux

export ANTHROPIC_BASE_URL="http://localhost:8001"
export ANTHROPIC_API_KEY="sk-no-key-required"

Persist by adding the lines to ~/.bashrc or ~/.zshrc.

Windows PowerShell

$env:ANTHROPIC_BASE_URL="http://localhost:8001"
$env:ANTHROPIC_API_KEY="sk-no-key-required"

Make permanent with setx ANTHROPIC_BASE_URL "http://localhost:8001" or by editing the $PROFILE script.

Skip login prompt

Add the following keys to Claude Code's configuration file (typically ~/.claude.json):

"hasCompletedOnboarding": true,
"primaryApiKey": "sk-dummy-key"
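To apply those keys without hand-editing, a small jq merge works (this assumes jq is installed and that ~/.claude.json is where your Claude Code version keeps this state; verify the path for your install):

```shell
# Merge the onboarding keys into ~/.claude.json, preserving any existing settings.
CFG="$HOME/.claude.json"
[ -f "$CFG" ] || echo '{}' > "$CFG"
jq '. + {hasCompletedOnboarding: true, primaryApiKey: "sk-dummy-key"}' "$CFG" > "$CFG.tmp" \
  && mv "$CFG.tmp" "$CFG"
```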

Or enable “Disable Login Prompt” in the Claude Code extension settings.

Common pitfalls

Speed slowdown

Claude Code’s newer attribution header defeats KV-cache (prompt-prefix) reuse, roughly halving throughput. Disable it by editing ~/.claude/settings.json:

{
  "promptSuggestionEnabled": false,
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "0",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  },
  "plansDirectory": "./plans",
  "effortLevel": "high"
}

Device recommendations

MacBook Pro M4 Max with 32 GB RAM may lag on the 35B model; use the 27B variant.

RTX 4090 with 24 GB VRAM fits the 35B UD‑Q4_K_XL model, consuming about 23 GB.

If memory is tight, lower the --ctx-size parameter or switch to a smaller quantized version.
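To see why --ctx-size matters, estimate the KV cache as 2 (K and V) × layers × context length × KV heads × head dimension × bytes per element. The architecture numbers below are assumptions for illustration only; llama-server prints the real allocation at startup:

```shell
# KV cache estimate at q8_0 (~1 byte/element) for a 131072-token context.
awk 'BEGIN {
  layers   = 48      # transformer layers (assumed)
  ctx      = 131072  # --ctx-size
  kv_heads = 4       # grouped-query KV heads (assumed)
  head_dim = 128     # dimension per head (assumed)
  bytes    = 1       # ~1 byte/element with q8_0 cache quantization
  printf "~%.1f GB KV cache\n", 2 * layers * ctx * kv_heads * head_dim * bytes / 1e9
}'
```

Halving --ctx-size halves this figure, which is often the easiest saving when a model almost fits.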

Usage

From the project directory, run:

claude --model unsloth/Qwen3.5-35B-A3B

To let Claude execute commands automatically, add the --dangerously-skip-permissions flag (use at your own risk). The Claude Code extensions for VS Code and Cursor also support in-editor use.

This setup suits work on sensitive internal codebases: it avoids third-party API exposure and saves costs, though some Claude Code tools may be unavailable and coding performance varies across models.

Official documentation: https://unsloth.ai/docs/basics/claude-code

Tags: GPU acceleration, model quantization, llama.cpp, Claude Code, local LLM deployment, Anthropic API, Qwen 3.5
Written by AI Engineering

Focused on cutting-edge product and technology coverage and practical experience sharing in the AI field: large models, MLOps/LLMOps, AI application development, and AI infrastructure.