Artificial Intelligence 12 min read

Run Local LLM Agents on Claude Code, Codex and OpenClaw with Just 24 GB VRAM via Unsloth API

The article explains how Unsloth’s dual‑protocol API lets you run Claude Code, Codex and OpenClaw locally on a 24 GB GPU, details installation steps, hardware limits, configuration for each CLI, and shares real‑world performance pros and cons.

Old Zhang's AI Learning

May 9, 2026

Run Local LLM Agents on Claude Code, Codex and OpenClaw with Just 24 GB VRAM via Unsloth API

Overview

Unsloth has released a dual‑protocol API that combines OpenAI‑compatible and Anthropic‑compatible endpoints on a single port. By launching the service with one command, Claude Code, Codex, OpenClaw, OpenCode, Cursor and Cline can all connect to a locally hosted model without sending data to the cloud.

Unsloth API Details

The API sits on top of llama.cpp ’s llama-server and exposes three main routes: POST /v1/messages – Anthropic Messages API for Claude Code, Anthropic SDK, OpenClaw. POST /v1/chat/completions and /v1/responses – OpenAI‑compatible endpoints for OpenAI SDK, Codex, OpenCode, Cursor, Continue, Cline, Open WebUI. GET /v1/models – List currently loaded models.

Authentication mirrors OpenAI: include Authorization: Bearer sk‑unsloth‑… in the request header. The key is printed once when the server starts.

Key Features

Self‑healing tool calling : the server automatically fixes malformed tool‑call JSON before returning it, improving success rates.

Server‑side code execution : enable enable_tools: true and list enabled_tools (e.g., ["python", "bash"]) to run Bash or Python in a sandbox and stream results back.

Advanced web search : the model can fetch full web pages and read the main content, not just snippets.

Installation

Two steps are required: install Unsloth Studio and then install the Agent CLI you want to use.

Install Unsloth Studio (one line)

# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh
# Windows PowerShell
irm https://unsloth.ai/install.ps1 | iex

Load a GGUF model and start the API

unsloth run unsloth/Qwen3.6-27B-GGUF
# or run Gemma 4
unsloth run unsloth/gemma-4-26B-A4B-it-GGUF

After startup the terminal prints the listening address (usually http://localhost:8000 or 8888) and the API key ( sk‑unsloth‑…), which must be saved.

Hardware Requirements

Unsloth provides a benchmark table; the most practical models for a 24 GB GPU are:

Gemma 4 26B‑A4B (MoE) – needs 28–30 GB, best on M‑series Macs with 32 GB unified memory.

Gemma 4 E4B (dense) – fits in 9–12 GB, runs on 8 GB cards.

Qwen3.6‑27B – requires ~18 GB, comfortable on 24 GB cards.

Qwen3.6‑35B‑A3B (MoE) – needs ~23 GB; 24 GB cards are borderline, 30 GB is comfortable.

On an M4 Pro with 48 GB RAM the author runs Qwen3.6‑27B (Q4_K_XL) at 32 K context, achieving about 25 tokens/s, which is sufficient for coding tasks.

⚠️ CUDA 13.2 has a bug that produces garbled output for GGUF models; N‑card users should stay on 13.1 or 12.x until the fix lands.

Connecting Claude Code

Install Claude Code

curl -fsSL https://claude.ai/install.sh | bash
# or
brew install --cask claude-code

Point to the Unsloth endpoint

export ANTHROPIC_BASE_URL="http://localhost:8888"
export ANTHROPIC_API_KEY="sk‑unsloth‑your‑key"

Disable the attribution header that slows inference by 90%:

cat > ~/.claude/settings.json <<'EOF'
{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  }
}
EOF

Connecting Codex

Install Codex

brew install --cask codex
# or
npm install -g @openai/codex

Configure the client

[model_providers.unsloth]
name = "unsloth"
base_url = "http://localhost:8888/v1"
wire_api = "responses"
env_key = "UNSLOTH_API_KEY"

[profiles.local]
model_provider = "unsloth"
model = "Qwen3.6-27B-GGUF"

export UNSLOTH_API_KEY="sk‑unsloth‑your‑key"
codex --profile local

Retrieve the model ID with curl http://localhost:8888/v1/models and copy the id field.

Connecting OpenClaw

Install OpenClaw

curl -fsSL https://openclaw.ai/install.sh | bash

Edit ~/.openclaw/openclaw.json to point to the Unsloth endpoint:

{
  "models": {
    "mode": "merge",
    "providers": {
      "unsloth": {
        "baseUrl": "http://localhost:8888/v1",
        "api": "anthropic-messages",
        "authHeader": true,
        "apiKey": "sk‑unsloth‑your‑key",
        "models": [
          {"id": "Qwen3.6-27B-GGUF", "name": "Qwen3.6 Local"}
        ]
      }
    }
  }
}

Note that baseUrl must end with /v1 and api set to anthropic-messages routes requests to /v1/messages.

Real‑World Experience

The author spent a morning using Qwen3.6‑27B in Claude Code and noted:

Cold start takes ~30 seconds; subsequent prompts respond instantly.

Self‑healing tool calls dramatically reduce JSON errors that previously caused crashes.

All code stays on‑device, eliminating privacy concerns.

Running locally is cheaper than paying for API tokens.

Drawbacks observed:

At 27 B parameters, the model still struggles with very long or complex refactoring tasks compared to Claude Sonnet 4 or GPT‑5.

Tool‑call latency is slightly higher than cloud services, especially web‑search.

Codex requires wire_api = "responses"; using the deprecated chat mode yields HTTP 400 errors.

Recommendations

Use the setup as a daily “co‑pilot” for scaffolding, bug fixing, testing and documentation. For heavyweight tasks such as large‑scale architecture design, fall back to top‑tier cloud APIs like Claude or GPT‑5 and mix as needed.

One More Thing

The main barrier to running local agents has been protocol fragmentation. Unsloth’s dual‑protocol endpoint removes that barrier by serving both OpenAI and Anthropic APIs on the same port and providing built‑in self‑healing, tool execution and web‑search capabilities.

Install once and all three CLIs work out of the box – that is the ideal experience for local agents.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM Codex local inference Claude Code OpenClaw Unsloth 24GB VRAM

Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.