Run Local LLM Agents on Claude Code, Codex and OpenClaw with Just 24 GB VRAM via Unsloth API
The article explains how Unsloth’s dual‑protocol API lets you run Claude Code, Codex and OpenClaw locally on a 24 GB GPU, details installation steps, hardware limits, configuration for each CLI, and shares real‑world performance pros and cons.
Overview
Unsloth has released a dual‑protocol API that combines OpenAI‑compatible and Anthropic‑compatible endpoints on a single port. By launching the service with one command, Claude Code, Codex, OpenClaw, OpenCode, Cursor and Cline can all connect to a locally hosted model without sending data to the cloud.
Unsloth API Details
The API sits on top of llama.cpp ’s llama-server and exposes three main routes: POST /v1/messages – Anthropic Messages API for Claude Code, Anthropic SDK, OpenClaw. POST /v1/chat/completions and /v1/responses – OpenAI‑compatible endpoints for OpenAI SDK, Codex, OpenCode, Cursor, Continue, Cline, Open WebUI. GET /v1/models – List currently loaded models.
Authentication mirrors OpenAI: include Authorization: Bearer sk‑unsloth‑… in the request header. The key is printed once when the server starts.
Key Features
Self‑healing tool calling : the server automatically fixes malformed tool‑call JSON before returning it, improving success rates.
Server‑side code execution : enable enable_tools: true and list enabled_tools (e.g., ["python", "bash"]) to run Bash or Python in a sandbox and stream results back.
Advanced web search : the model can fetch full web pages and read the main content, not just snippets.
Installation
Two steps are required: install Unsloth Studio and then install the Agent CLI you want to use.
Install Unsloth Studio (one line)
# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh
# Windows PowerShell
irm https://unsloth.ai/install.ps1 | iexLoad a GGUF model and start the API
unsloth run unsloth/Qwen3.6-27B-GGUF
# or run Gemma 4
unsloth run unsloth/gemma-4-26B-A4B-it-GGUFAfter startup the terminal prints the listening address (usually http://localhost:8000 or 8888) and the API key ( sk‑unsloth‑…), which must be saved.
Hardware Requirements
Unsloth provides a benchmark table; the most practical models for a 24 GB GPU are:
Gemma 4 26B‑A4B (MoE) – needs 28–30 GB, best on M‑series Macs with 32 GB unified memory.
Gemma 4 E4B (dense) – fits in 9–12 GB, runs on 8 GB cards.
Qwen3.6‑27B – requires ~18 GB, comfortable on 24 GB cards.
Qwen3.6‑35B‑A3B (MoE) – needs ~23 GB; 24 GB cards are borderline, 30 GB is comfortable.
On an M4 Pro with 48 GB RAM the author runs Qwen3.6‑27B (Q4_K_XL) at 32 K context, achieving about 25 tokens/s, which is sufficient for coding tasks.
⚠️ CUDA 13.2 has a bug that produces garbled output for GGUF models; N‑card users should stay on 13.1 or 12.x until the fix lands.
Connecting Claude Code
Install Claude Code
curl -fsSL https://claude.ai/install.sh | bash
# or
brew install --cask claude-codePoint to the Unsloth endpoint
export ANTHROPIC_BASE_URL="http://localhost:8888"
export ANTHROPIC_API_KEY="sk‑unsloth‑your‑key"Disable the attribution header that slows inference by 90%:
cat > ~/.claude/settings.json <<'EOF'
{
"env": {
"CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
}
}
EOFConnecting Codex
Install Codex
brew install --cask codex
# or
npm install -g @openai/codexConfigure the client
[model_providers.unsloth]
name = "unsloth"
base_url = "http://localhost:8888/v1"
wire_api = "responses"
env_key = "UNSLOTH_API_KEY"
[profiles.local]
model_provider = "unsloth"
model = "Qwen3.6-27B-GGUF" export UNSLOTH_API_KEY="sk‑unsloth‑your‑key"
codex --profile localRetrieve the model ID with curl http://localhost:8888/v1/models and copy the id field.
Connecting OpenClaw
Install OpenClaw
curl -fsSL https://openclaw.ai/install.sh | bashEdit ~/.openclaw/openclaw.json to point to the Unsloth endpoint:
{
"models": {
"mode": "merge",
"providers": {
"unsloth": {
"baseUrl": "http://localhost:8888/v1",
"api": "anthropic-messages",
"authHeader": true,
"apiKey": "sk‑unsloth‑your‑key",
"models": [
{"id": "Qwen3.6-27B-GGUF", "name": "Qwen3.6 Local"}
]
}
}
}
}Note that baseUrl must end with /v1 and api set to anthropic-messages routes requests to /v1/messages.
Real‑World Experience
The author spent a morning using Qwen3.6‑27B in Claude Code and noted:
Cold start takes ~30 seconds; subsequent prompts respond instantly.
Self‑healing tool calls dramatically reduce JSON errors that previously caused crashes.
All code stays on‑device, eliminating privacy concerns.
Running locally is cheaper than paying for API tokens.
Drawbacks observed:
At 27 B parameters, the model still struggles with very long or complex refactoring tasks compared to Claude Sonnet 4 or GPT‑5.
Tool‑call latency is slightly higher than cloud services, especially web‑search.
Codex requires wire_api = "responses"; using the deprecated chat mode yields HTTP 400 errors.
Recommendations
Use the setup as a daily “co‑pilot” for scaffolding, bug fixing, testing and documentation. For heavyweight tasks such as large‑scale architecture design, fall back to top‑tier cloud APIs like Claude or GPT‑5 and mix as needed.
One More Thing
The main barrier to running local agents has been protocol fragmentation. Unsloth’s dual‑protocol endpoint removes that barrier by serving both OpenAI and Anthropic APIs on the same port and providing built‑in self‑healing, tool execution and web‑search capabilities.
Install once and all three CLIs work out of the box – that is the ideal experience for local agents.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
