Testing Claude‑Opus‑4.6 Distilled Qwen3.5 9B Model Locally via LM Studio and Claude Code
The article evaluates the GGUF‑quantized, Claude‑Opus‑4.6‑distilled Qwen3.5 9B model on a 16 GB Mac Mini M4 using LM Studio. It details model sizes, performance metrics, deployment steps, and API integration with Claude Code, concluding that while the 9B version is usable, its capabilities remain limited compared to larger models.
Model selection and motivation
Jackrong's GGUF‑quantized Claude‑Opus‑4.6 distilled Qwen3.5 model (2B and 9B) was run on a 16 GB Mac Mini M4 with LM Studio because the model topped HuggingFace's trending list.
Why GGUF + LM Studio
GGUF reduces model file size (e.g., a 27B model shrinks from >50 GB to ~16 GB) and runs on CPUs and consumer‑grade GPUs. LM Studio provides a GUI, one‑click downloads, an OpenAI/Anthropic‑compatible API, a CLI, and remote LM Link, keeping the barrier to local testing low.
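As a rough sanity check on those numbers, the arithmetic below estimates file sizes from parameter count and bits per weight; the ~4.8 bits/weight figure for Q4_K_M is an approximation, not a value from the article.
params = 27e9                      # parameters in the 27B example above
fp16_gb = params * 16 / 8 / 1e9    # ~54 GB unquantized (FP16)
q4_gb = params * 4.8 / 8 / 1e9     # ~16 GB at Q4_K_M (~4.8 bits/weight, approximate)
print(f"FP16: {fp16_gb:.0f} GB, Q4_K_M: {q4_gb:.0f} GB")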
Quantization variants
Q2_K – 10.1 GB file, ~12 GB VRAM, extreme memory saving, lower precision.
Q3_K_S – 12.1 GB, ~14 GB VRAM, compromise when memory is tight.
Q3_K_M – 13.3 GB, ~15 GB VRAM, best precision within Q3.
Q4_K_S – 15.6 GB, ~17 GB VRAM, cost‑effective choice.
Q4_K_M (recommended) – 16.5 GB, ~18 GB VRAM, optimal balance of accuracy and size.
Q8_0 – 28.6 GB, ~30 GB VRAM, high‑precision when VRAM is abundant.
Community benchmark on RTX 3090 (24 GB)
VRAM usage ≈ 16.5 GB.
Generation speed of 29–35 tokens/s.
Full 262 K context window retained.
A crash caused by the Jinja chat template's use of the developer role has been fixed.
Training and distillation details
Data: ~3,280 high‑quality Claude Opus 4.6 reasoning examples plus supplemental data from TeichAI and Jackrong.
Training strategy: train_on_responses_only – loss computed only on the <think> reasoning segment and final answer, encouraging the model to mimic Claude’s structured chain‑of‑thought.
Fine‑tuning: Unsloth + LoRA with rank = 64, described as highly efficient; a rough sketch of this setup follows.
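A minimal sketch of that recipe, assuming Unsloth's published API (FastLanguageModel, train_on_responses_only) and Qwen-style chat-template markers; the base-model name, dataset file, and all hyperparameters other than rank 64 are illustrative guesses, not the author's actual training script.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only

# Hypothetical distillation data: chat transcripts pre-formatted into a "text" column.
dataset = load_dataset("json", data_files="opus_distill.jsonl", split="train")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-9B",  # assumed base-model identifier
    max_seq_length=16384,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,  # LoRA rank reported in the article
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(dataset_text_field="text",
                   per_device_train_batch_size=2,
                   num_train_epochs=1),
)
# Mask the loss to the assistant turn only, i.e. the <think> segment plus the final answer.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",       # assumed Qwen chat markers
    response_part="<|im_start|>assistant\n",
)
trainer.train()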
Model behavior
The distilled model wraps its reasoning in <think>…</think> tags, e.g.:
<think>
Let me analyze this request carefully:
1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step‑by‑step solution plan.
5. Execute the reasoning sequentially and verify consistency...
</think>
This chain‑of‑thought pattern yields higher reasoning efficiency than the original Qwen3.5‑27B, which can “loop” on simple problems.
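If you need the answer without the reasoning (for logging or for piping into other tools), a small helper like the sketch below (not from the article) can split the two parts:
import re

def split_think(text: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer) around the <think> block."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()  # no reasoning block emitted
    return match.group(1).strip(), text[match.end():].strip()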
In AI‑code‑agent scenarios (Claude Code, OpenCode), the model ran continuously for over 9 minutes, automatically reading errors, fixing code, and writing a README without crashing.
Observed limitations
Generation speed on the author’s test suite was about 13 tokens/s, slower than the RTX 3090 benchmark.
Occasional failure to locate or invoke required skills, leading to unnecessary web‑search attempts even when local resources exist.
Overall capability of the 9B version is limited; the 27B variant provides a markedly better experience.
LM Studio deployment steps
Step 1 – Download model
Search for Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF in LM Studio or use the CLI:
lms get Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF --file Qwen3.5-9B.Q4_K_M.gguf
Alternatively, download with huggingface-cli or modelscope to the LM Studio model directory.
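For the huggingface-cli route, the Python equivalent below sketches the idea; the ~/.lmstudio/models path is LM Studio's assumed default model directory, so verify it in your LM Studio settings first.
from pathlib import Path
from huggingface_hub import hf_hub_download

repo = "Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"
target = Path.home() / ".lmstudio" / "models" / repo  # assumed LM Studio model directory
hf_hub_download(repo_id=repo, filename="Qwen3.5-9B.Q4_K_M.gguf", local_dir=str(target))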
Step 2 – Load model
After download, select the model in LM Studio and adjust:
GPU Offload: enable fully if a dedicated GPU is present.
Context Length: start with 8192 or 16384; the author used 262144.
Max Concurrent Predictions: keep the default of 1.
Step 3 – Chat
Open the chat window; the model automatically wraps its reasoning in <think>…</think> tags.
Advanced: Local API server
Enable the Developer tab, start the server (default port 1234), then call it with the OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lmstudio")
response = client.chat.completions.create(
model="Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF",
messages=[{"role": "user", "content": "Write a thread-safe LRU cache in Python"}],
temperature=0.6,
top_p=0.95,
max_tokens=8192,
)
print(response.choices[0].message.content)
For Claude Code integration, set the following environment variables:
ANTHROPIC_AUTH_TOKEN=lm-studio-local
ANTHROPIC_BASE_URL=http://localhost:1234
ANTHROPIC_DEFAULT_HAIKU_MODEL=qwen3.5-9b-claude-4.6-opus-reasoning-distilled
ANTHROPIC_DEFAULT_OPUS_MODEL=qwen3.5-9b-claude-4.6-opus-reasoning-distilled
ANTHROPIC_DEFAULT_SONNET_MODEL=qwen3.5-9b-claude-4.6-opus-reasoning-distilled
ANTHROPIC_MODEL=qwen3.5-9b-claude-4.6-opus-reasoning-distilled
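One way to apply these variables without polluting the shell profile is to export them only for the Claude Code process; the launcher below is a hedged sketch that assumes claude is the Claude Code CLI entry point.
import os
import subprocess

env = os.environ | {
    "ANTHROPIC_AUTH_TOKEN": "lm-studio-local",
    "ANTHROPIC_BASE_URL": "http://localhost:1234",
    "ANTHROPIC_MODEL": "qwen3.5-9b-claude-4.6-opus-reasoning-distilled",
}
subprocess.run(["claude"], env=env)  # launch Claude Code against the local server
Tool‑calling example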
Prompt:
Create a Python script named debug_test.py in the current directory that computes the average of the numbers 1 through 10, but deliberately leave a logic error in the code (for example, dividing by … or a misspelled variable name). Then run the script, capture the error log, analyze the cause and fix it automatically, and finally run it again to confirm it prints the correct average.
The model successfully executed the sequence write_file, run_shell_command, read_file, write_file, run_shell_command without errors.
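For illustration only (the article does not show the model's generated code), the planted bug and its fix might look like this:
# debug_test.py, hypothetical reconstruction
numbers = list(range(1, 11))
total = sum(numbers)
# Planted bug: misspelled variable name, raising NameError on the first run.
# average = totl / len(numbers)
average = total / len(numbers)  # fixed after reading the traceback
print(average)  # 5.5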
Conclusion
The 9B GGUF‑quantized Claude‑Opus‑4.6 distilled Qwen3.5 model runs on modest hardware (e.g., a 24 GB GPU or a Mac with ≥32 GB unified memory) with a 262 K context window and tool‑calling ability, but users should expect occasional slowdowns and incomplete skill handling. For more demanding workloads, the larger 27B model is recommended.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.