Testing Claude‑Opus‑4.6 Distilled Qwen3.5 9B Model Locally via LM Studio and Claude Code
The article evaluates the GGUF‑quantized, Claude‑Opus‑4.6‑distilled Qwen3.5 9B model on a 16 GB Mac Mini M4 using LM Studio. It details model sizes, performance metrics, deployment steps, and API integration with Claude Code, concluding that while the 9B version is usable, its capabilities remain limited compared to larger models.
Model selection and motivation
Jackrong's GGUF‑quantized Claude‑Opus‑4.6 distilled Qwen3.5 model (2B and 9B) was run on a 16 GB Mac Mini M4 with LM Studio because the model topped HuggingFace's trending list.
Why GGUF + LM Studio
GGUF reduces model file size (e.g., a 27B model shrinks from >50 GB to ~16 GB) and runs on CPUs and consumer‑grade GPUs. LM Studio provides a GUI, one‑click downloads, an OpenAI/Anthropic‑compatible API, a CLI, and remote LM Link, keeping the barrier to local testing low.
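As a rough sanity check on those numbers, the arithmetic below estimates file sizes from parameter count and bits per weight; the ~4.8 bits/weight figure for Q4_K_M is an approximation, not a value from the article.
params = 27e9                      # parameters in the 27B example above
fp16_gb = params * 16 / 8 / 1e9    # ~54 GB unquantized (FP16)
q4_gb = params * 4.8 / 8 / 1e9     # ~16 GB at Q4_K_M (~4.8 bits/weight, approximate)
print(f"FP16: {fp16_gb:.0f} GB, Q4_K_M: {q4_gb:.0f} GB")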
Quantization variants
Q2_K – 10.1 GB file, ~12 GB VRAM, extreme memory saving, lower precision.
Q3_K_S – 12.1 GB, ~14 GB VRAM, compromise when memory is tight.
Q3_K_M – 13.3 GB, ~15 GB VRAM, best precision within Q3.
Q4_K_S – 15.6 GB, ~17 GB VRAM, cost‑effective choice.
Q4_K_M (recommended) – 16.5 GB, ~18 GB VRAM, optimal balance of accuracy and size.
Q8_0 – 28.6 GB, ~30 GB VRAM, high‑precision when VRAM is abundant.
Community benchmark on RTX 3090 (24 GB)
VRAM usage ≈ 16.5 GB.
Generation speed of 29–35 tokens/s.
Full 262 K context window retained.
A crash caused by the Jinja chat template's use of the developer role has been fixed.
Training and distillation details
Data: ~3,280 high‑quality Claude Opus 4.6 reasoning examples plus supplemental data from TeichAI and Jackrong.
Training strategy: train_on_responses_only – loss computed only on the <think> reasoning segment and final answer, encouraging the model to mimic Claude’s structured chain‑of‑thought.
Fine‑tuning: Unsloth + LoRA with rank = 64, described as highly efficient; a rough sketch of this setup follows.
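A minimal sketch of that recipe, assuming Unsloth's published API (FastLanguageModel, train_on_responses_only) and Qwen-style chat-template markers; the base-model name, dataset file, and all hyperparameters other than rank 64 are illustrative guesses, not the author's actual training script.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only

# Hypothetical distillation data: chat transcripts pre-formatted into a "text" column.
dataset = load_dataset("json", data_files="opus_distill.jsonl", split="train")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-9B",  # assumed base-model identifier
    max_seq_length=16384,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,  # LoRA rank reported in the article
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(dataset_text_field="text",
                   per_device_train_batch_size=2,
                   num_train_epochs=1),
)
# Mask the loss to the assistant turn only, i.e. the <think> segment plus the final answer.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",       # assumed Qwen chat markers
    response_part="<|im_start|>assistant\n",
)
trainer.train()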
Model behavior
The distilled model wraps its reasoning in <think>…</think> tags, e.g.:
<think>
Let me analyze this request carefully:
1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step‑by‑step solution plan.
5. Execute the reasoning sequentially and verify consistency...
</think>
This chain‑of‑thought pattern yields higher reasoning efficiency than the original Qwen3.5‑27B, which can “loop” on simple problems.
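If you need the answer without the reasoning (for logging or for piping into other tools), a small helper like the sketch below (not from the article) can split the two parts:
import re

def split_think(text: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer) around the <think> block."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()  # no reasoning block emitted
    return match.group(1).strip(), text[match.end():].strip()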
In AI‑code‑agent scenarios (Claude Code, OpenCode), the model ran continuously for over 9 minutes, automatically reading errors, fixing code, and writing a README without crashing.
Observed limitations
Generation speed on the author’s test suite was about 13 tokens/s, slower than the RTX 3090 benchmark.
Occasional failure to locate or invoke required skills, leading to unnecessary web‑search attempts even when local resources exist.
Overall capability of the 9B version is limited; the 27B variant provides a markedly better experience.
LM Studio deployment steps
Step 1 – Download model
Search for Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF in LM Studio or use the CLI:
lms get Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF --file Qwen3.5-9B.Q4_K_M.gguf
Alternatively, download with huggingface-cli or modelscope to the LM Studio model directory.
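For the huggingface-cli route, the Python equivalent below sketches the idea; the ~/.lmstudio/models path is LM Studio's assumed default model directory, so verify it in your LM Studio settings first.
from pathlib import Path
from huggingface_hub import hf_hub_download

repo = "Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"
target = Path.home() / ".lmstudio" / "models" / repo  # assumed LM Studio model directory
hf_hub_download(repo_id=repo, filename="Qwen3.5-9B.Q4_K_M.gguf", local_dir=str(target))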
Step 2 – Load model
After download, select the model in LM Studio and adjust:
GPU Offload: enable fully if a dedicated GPU is present.
Context Length: start with 8192 or 16384; the author used 262144.
Max Concurrent Predictions: keep the default of 1.
Step 3 – Chat
Open the chat window; the model automatically wraps its reasoning in <think>…</think> tags.
Advanced: Local API server
Enable the Developer tab, start the server (default port 1234), then call it with the OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lmstudio")
response = client.chat.completions.create(
model="Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF",
messages=[{"role": "user", "content": "Write a thread-safe LRU cache in Python"}],
temperature=0.6,
top_p=0.95,
max_tokens=8192,
)
print(response.choices[0].message.content)
For Claude Code integration, set the following environment variables:
ANTHROPIC_AUTH_TOKEN=lm-studio-local
ANTHROPIC_BASE_URL=http://localhost:1234
ANTHROPIC_DEFAULT_HAIKU_MODEL=qwen3.5-9b-claude-4.6-opus-reasoning-distilled
ANTHROPIC_DEFAULT_OPUS_MODEL=qwen3.5-9b-claude-4.6-opus-reasoning-distilled
ANTHROPIC_DEFAULT_SONNET_MODEL=qwen3.5-9b-claude-4.6-opus-reasoning-distilled
ANTHROPIC_MODEL=qwen3.5-9b-claude-4.6-opus-reasoning-distilled
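One way to apply these variables without polluting the shell profile is to export them only for the Claude Code process; the launcher below is a hedged sketch that assumes claude is the Claude Code CLI entry point.
import os
import subprocess

env = os.environ | {
    "ANTHROPIC_AUTH_TOKEN": "lm-studio-local",
    "ANTHROPIC_BASE_URL": "http://localhost:1234",
    "ANTHROPIC_MODEL": "qwen3.5-9b-claude-4.6-opus-reasoning-distilled",
}
subprocess.run(["claude"], env=env)  # launch Claude Code against the local server
Tool‑calling example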
Prompt:
Create a Python script named debug_test.py in the current directory that computes the average of the numbers 1 through 10, but deliberately leave a logic error in the code (for example, dividing by … or a misspelled variable name). Then run the script, capture the error log, analyze the cause and fix it automatically, and finally run it again to confirm it prints the correct average.
The model successfully executed the sequence write_file, run_shell_command, read_file, write_file, run_shell_command without errors.
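For illustration only (the article does not show the model's generated code), the planted bug and its fix might look like this:
# debug_test.py, hypothetical reconstruction
numbers = list(range(1, 11))
total = sum(numbers)
# Planted bug: misspelled variable name, raising NameError on the first run.
# average = totl / len(numbers)
average = total / len(numbers)  # fixed after reading the traceback
print(average)  # 5.5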
Conclusion
The 9B GGUF‑quantized Claude‑Opus‑4.6 distilled Qwen3.5 model runs on modest hardware (e.g., a 24 GB GPU or a Mac with ≥32 GB unified memory) with a 262 K context window and tool‑calling ability, but users should expect occasional slowdowns and incomplete skill handling. For more demanding workloads, the larger 27B model is recommended.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.