Qwopus3.5‑v3: From Reason‑Then‑Act to Act‑Then‑Refine – Claude‑Opus Distillation Turns Qwen3.5 into a Tool‑Using Agent
The newly released Qwopus3.5‑v3 combines higher‑quality reasoning chains, dedicated tool‑calling reinforcement learning, and an act‑then‑refine paradigm. It delivers a 5‑point HumanEval boost, a 1.43‑point MMLU‑Pro gain, 31.7% faster inference, and 24% lower token cost, while remaining runnable on an RTX 3090 or a 16 GB MacBook, with easy deployment via GGUF in LM Studio, Ollama, or llama.cpp.
Core Change in v3
v1 taught Qwen to mimic Claude’s reasoning, v2 focused on faster, cheaper inference, and v3 adds the ability to use tools – shifting from “reason‑then‑act” to “act‑then‑refine”. This is a qualitative leap from pure thinking to actionable behavior.
Download Popularity Shows 9B Is the Sweet Spot
Jackrong released three sizes (4B, 9B, 27B) in nine variants. The 9B GGUF version leads with 10.9k downloads, about fifteen times the runner‑up, indicating that the 9B size balances capability and resource requirements.
Three Major Upgrades in v3
Structured Reasoning Optimization: v2 relied on third‑party distilled chains that could be “fake”. v3 trains on verified, process‑level reasoning chains, improving generalization and making the reasoning style explicit and step‑by‑step.
Tool‑Calling Reinforcement Learning: New RL training targets tool‑calling stability and accuracy, benefiting agent frameworks such as OpenClaw and enabling reliable multi‑step tasks like file operations or API calls.
“Act‑Then‑Refine” Paradigm: The model no longer expects a perfect first answer; it iteratively corrects its output, which is especially useful for complex, multi‑step problems.
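The act‑then‑refine idea can be sketched as a simple loop: produce a draft, run a check, and revise on failure. The sketch below is purely illustrative — `draft_answer`, `verify`, and `refine` are stand‑in functions, not the model's actual API:

```python
# Minimal "act-then-refine" loop with mocked components.
# All three helpers are hypothetical stand-ins for illustration only.

def draft_answer(task: str) -> str:
    """Stand-in for the model's first attempt (act)."""
    return task.upper()  # deliberately imperfect first draft

def verify(answer: str) -> bool:
    """Stand-in for a checker; here, require lowercase output."""
    return answer.islower()

def refine(answer: str) -> str:
    """Stand-in for a correction step driven by checker feedback."""
    return answer.lower()

def act_then_refine(task: str, max_rounds: int = 3) -> str:
    answer = draft_answer(task)      # act first
    for _ in range(max_rounds):
        if verify(answer):           # stop once the check passes
            break
        answer = refine(answer)      # then refine
    return answer

print(act_then_refine("hello world"))  # → hello world
```

The key contrast with reason‑then‑act is that verification happens after an attempt, so errors in the first draft are recoverable rather than fatal.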
Benchmark Results
HumanEval (9B) : Qwopus3.5‑9B‑v3 achieves 87.80% pass@1, roughly 5 percentage points higher than the original Qwen3.5‑9B and 4.87–5.49 points higher in the stricter Plus evaluation.
MMLU‑Pro : Accuracy rises from 80.36% (Qwen3.5‑9B) to 81.79%, a 1.43‑point improvement, fixing the 7.2% drop observed in v2.
Inference Efficiency: Average reasoning‑chain length drops from 7116 to 5313 characters (−25.3%). Tokens per correct answer fall from 7938 to 6032 (−24%). Throughput per 10 k characters rises from 1.26 to 1.66, i.e., inference is 31.7% faster.
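The three efficiency deltas follow directly from the before/after figures quoted above:

```python
# Sanity-check the efficiency deltas from the reported before/after numbers.
chain_before, chain_after = 7116, 5313    # reasoning-chain length (chars)
tokens_before, tokens_after = 7938, 6032  # tokens per correct answer
thr_before, thr_after = 1.26, 1.66        # throughput per 10k characters

print(round((chain_after - chain_before) / chain_before * 100, 1))   # → -25.3
print(round((tokens_after - tokens_before) / tokens_before * 100, 1))  # → -24.0
print(round((thr_after - thr_before) / thr_before * 100, 1))         # → 31.7
```

Note the throughput delta is positive: the 31.7% figure is a speedup, not a reduction.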
How to Run the Model
The GGUF format works with LM Studio, Ollama, or llama.cpp. Model address: Jackrong/Qwopus3.5-9B-v3-GGUF.
Mac: 16 GB RAM (MLX version recommended)
Windows/Linux: GPU with ≥8 GB VRAM (e.g., RTX 3060/4060)
Quantization: Q6 provides the best tool‑calling accuracy.
Example with Ollama:
# Download and run
ollama run hf.co/Jackrong/Qwopus3.5-9B-v3-GGUF:Q6_K
LM Studio can also be used; the latest 0.4.9 version supports the model.
ToolCall‑15 Evaluation
ToolCall‑15 (github.com/stevibe/ToolCall-15) is a benchmark covering 15 scenarios across five capability categories (tool selection, parameter precision, multi‑step chaining, refusal, error recovery). The v3 model passes all 15 tests, matching the perfect score previously achieved only by the 27B version.
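To make the five capability categories concrete, here is a toy dispatcher that exercises tool selection, parameter precision, and refusal. The tool names and schemas are invented for this sketch and are not taken from ToolCall‑15:

```python
# Toy tool dispatcher: selects a tool, validates required parameters,
# refuses unknown tools. Tool names/schemas are hypothetical.

TOOLS = {
    "get_weather": {"required": {"city"}},
    "read_file":   {"required": {"path"}},
}

def dispatch(call: dict) -> str:
    name, args = call.get("name"), call.get("arguments", {})
    if name not in TOOLS:                        # refusal: unknown tool
        return "refuse: unknown tool"
    missing = TOOLS[name]["required"] - args.keys()
    if missing:                                  # hook for error recovery
        return f"error: missing {sorted(missing)}"
    return f"ok: {name}({args})"

print(dispatch({"name": "get_weather", "arguments": {"city": "Tokyo"}}))
print(dispatch({"name": "launch_rocket"}))       # → refuse: unknown tool
```

Multi‑step chaining and error recovery would sit one layer above this, feeding `error:` results back to the model for a corrected call — the same act‑then‑refine loop applied to tools.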
Summary
Across the three generations, Jackrong’s distilled series evolves from proving that small models can inherit large‑model reasoning (v1), to showing that inference efficiency can be dramatically improved (v2), and finally to demonstrating that distilled models can act as agents with robust tool‑calling (v3). The data—HumanEval +5 pp, MMLU‑Pro +1.43 pp, 31.7% faster inference, 24% lower token cost—support this claim. For users seeking a locally runnable model that writes code, calls tools, and stays resource‑friendly, Qwopus3.5‑9B‑v3 is currently the most compelling choice.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
