Opus‑Distilled Qwen3.5‑Coder Scores 100/100 Tool Calls, 1.4‑2.2× Faster with MTP, 128K Context on Consumer GPU
This article introduces Qwopus3.5‑4B‑Coder‑MTP‑GGUF, a 4‑billion‑parameter agent model fine‑tuned for code debugging, tool calling, and structured reasoning. It covers the three training techniques (Trace Inversion, high‑quality agent trajectory data, and Curriculum SFT), MTP acceleration, benchmark results, quantization options, and step‑by‑step local deployment.
Introduction
Qwopus3.5‑4B‑Coder‑MTP‑GGUF is a 4 B‑parameter agent model derived from Qwen3.5 and optimized for three scenarios: code debugging, tool calling, and structured reasoning. It is small enough to run on a laptop with 8 GB of memory, with no need for A100 GPUs or cloud services.
Training Methodology
Trace Inversion: learns from execution traces by working backward to the reasoning path that produced them.
High‑Quality Agent Trajectory Data: directly imitates the behavior patterns of strong agents.
Curriculum SFT: fine‑tunes in progressive stages, curriculum style.
MTP Acceleration
The core idea of Multi‑Token Prediction (MTP) is to predict the current token and the next token simultaneously, effectively giving the model a two‑step look‑ahead. Unlike speculative decoding, MTP is baked into the model during training and does not need a separate draft model. Code‑heavy tasks contain highly repetitive patterns, so the look‑ahead predictions are accepted often, which yields a reported 1.4‑2.2× speedup without accuracy loss.
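To make the link between hit rate and speedup concrete, here is a minimal sketch of a common back‑of‑the‑envelope model. It assumes each look‑ahead token is accepted independently with a fixed probability and that acceptance stops at the first miss. The function name, the acceptance probabilities, and the number of look‑ahead tokens are illustrative assumptions, not values from the article, and the model ignores the cost of the extra prediction heads and of verification.

```python
def expected_tokens_per_pass(accept_prob: float, draft_tokens: int) -> float:
    """Expected number of tokens emitted per forward pass.

    Assumption: each look-ahead token is accepted independently with
    probability `accept_prob`, and acceptance stops at the first miss.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    # The regular next token always counts (the i = 0 term). The i-th
    # look-ahead token counts only if all i look-ahead tokens up to and
    # including it were accepted, which happens with probability p**i.
    return sum(accept_prob ** i for i in range(draft_tokens + 1))


if __name__ == "__main__":
    # Illustrative hit rates only; these are not measurements.
    for p in (0.4, 0.7, 0.9):
        print(p, expected_tokens_per_pass(p, draft_tokens=1))
```

Under this simplified model, a single look‑ahead token can at most double throughput, so the ratio is an upper bound on the speedup rather than a prediction of wall‑clock gains.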
Supported inference engines include llama.cpp, vLLM, and MLX.
Benchmark Results (benchlocal)
Four dimensions were evaluated:
BugFind: 71/100 (+19), demonstrating the effectiveness of Trace Inversion for debugging logic.
ToolCall: 100/100 (+10), indicating perfect format compliance for tool calls.
HermesAgent: 64/100 (+3), a modest gain in multi‑turn memory and workspace management.
InstructFollow: 93/100 (0 change), showing that fine‑tuning did not degrade the base model’s instruction‑following ability.
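The deltas in parentheses imply the scores of the comparison model. The short sketch below recovers them by subtraction, assuming each delta is measured against the base model on the same 100‑point scale (the article implies this but does not state it outright). The variable and function names are illustrative.

```python
# Scores and deltas exactly as listed in the benchmark results above.
results = {
    "BugFind": (71, 19),
    "ToolCall": (100, 10),
    "HermesAgent": (64, 3),
    "InstructFollow": (93, 0),
}


def baseline_scores(scores):
    """Recover the implied baseline score: tuned score minus reported delta."""
    return {name: score - delta for name, (score, delta) in scores.items()}


baselines = baseline_scores(results)
for name, (score, delta) in results.items():
    print(f"{name:15s} baseline={baselines[name]:3d} tuned={score:3d} delta={delta:+d}")
```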
Agent Workflow
The execution chain is: user instruction → <think> (reasoning) → tool call → result retrieval → answer generation, with possible multi‑turn loops. The benchmark dimensions map directly to these stages.
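The execution chain above can be sketched as a small loop. This is a minimal illustration, not the model's actual protocol: the `<tool_call>{...}</tool_call>` wrapper with `name` and `arguments` fields is an assumed format, and `run_agent`, `parse_tool_call`, and `scripted_model` are hypothetical helpers. The scripted model stands in for a real inference call so the example runs without a server.

```python
import json
import re
from typing import Callable, Dict, List, Optional, Tuple

Message = Dict[str, str]

# Assumed wire format for tool calls; the real chat template may differ.
_TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_call(reply: str) -> Optional[Tuple[str, dict]]:
    """Return (tool name, arguments) if the reply contains a tool call."""
    match = _TOOL_CALL.search(reply)
    if match is None:
        return None
    call = json.loads(match.group(1))
    return call["name"], call.get("arguments", {})


def run_agent(
    generate: Callable[[List[Message]], str],
    tools: Dict[str, Callable[..., str]],
    instruction: str,
    max_turns: int = 5,
) -> str:
    """Loop: generate -> optional tool call -> feed result back -> answer."""
    messages: List[Message] = [{"role": "user", "content": instruction}]
    for _ in range(max_turns):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)
        if call is None:
            return reply  # no tool requested, so this is the final answer
        name, args = call
        if name in tools:
            result = tools[name](**args)
        else:
            result = f"error: unknown tool {name!r}"
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_turns")


def scripted_model(messages: List[Message]) -> str:
    """Stand-in for the real model: one tool call, then a final answer."""
    if messages[-1]["role"] == "user":
        return ('<think>I should add the numbers.</think>'
                '<tool_call>{"name": "add", "arguments": {"a": 2, "b": 3}}</tool_call>')
    return "The sum is " + messages[-1]["content"]


answer = run_agent(scripted_model, {"add": lambda a, b: str(a + b)}, "What is 2 + 3?")
print(answer)
```

The `max_turns` cap is a design choice for the sketch: it keeps a model that never stops requesting tools from looping forever.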
Quantization Formats (GGUF)
GGUF provides quantization levels from 2‑bit to 16‑bit. The recommended default is Q4_K_M (2.78 GB), which balances size and speed with minimal accuracy loss. Larger formats (Q5_K_M, Q8_0, BF16) are best reserved for precision‑critical tasks.
Typical selection guidance is shown in the accompanying diagram.
Installation and Usage
llama.cpp (recommended for local deployment):
./llama-server \
  -m Qwopus3.5-4B-Coder-MTP-Q4_K_M.gguf \
  --ctx-size 131072 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 32768
Note: although the model was trained with a 32 K context, YaRN RoPE scaling enables 128 K (or even 256 K) contexts. The --rope-scaling yarn flag is essential; changing only --ctx-size leads to corrupted long‑context output.
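Once the server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes llama-server is listening on its default local address and exposing its OpenAI‑compatible chat endpoint; adjust `SERVER_URL` if the server was started with a different host or port. The helper names and the temperature value are illustrative.

```python
import json
import urllib.request

# Assumption: llama-server runs locally on its default host and port.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_chat_request(prompt: str, url: str = SERVER_URL) -> urllib.request.Request:
    """Build a POST request carrying a single-turn chat payload."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # illustrative value, not a recommendation
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server from the command above running, `print(ask("Help me debug this code..."))` would send one request and print the model's reply.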
vLLM:
pip install vllm
vllm serve "Jackrong/Qwopus3.5-4B-Coder-MTP-GGUF"
Transformers (Python example):
from transformers import pipeline
# Load the model from the Hugging Face Hub as a chat-style pipeline.
pipe = pipeline("image-text-to-text", model="Jackrong/Qwopus3.5-4B-Coder-MTP-GGUF")
# A single user turn in the chat-message format the pipeline expects.
messages = [{"role": "user", "content": [{"type": "text", "text": "Help me debug this code..."}]}]
result = pipe(text=messages)
Local testing was performed on a Mac Mini with 16 GB RAM using the Q4_K_M quantization.
Context Extension
128 K: --rope-scale 4 --yarn-orig-ctx 32768
256 K: theoretically possible, but 128 K is recommended for stability.
For most agent workflows, 128 K context is sufficient.
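The 128 K setting follows from simple arithmetic: the scale factor is the target context divided by the original training context (131072 / 32768 = 4). The helper below is a hypothetical convenience function that derives the llama.cpp flags from that ratio. By the same ratio, 256 K would correspond to a scale of 8, although the article does not give that flag value explicitly.

```python
from typing import List


def yarn_flags(target_ctx: int, orig_ctx: int = 32768) -> List[str]:
    """Derive llama.cpp context flags from target and original context sizes."""
    if orig_ctx <= 0 or target_ctx < orig_ctx:
        raise ValueError("target_ctx must be at least orig_ctx, and orig_ctx positive")
    scale = target_ctx / orig_ctx  # e.g. 131072 / 32768 = 4.0
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",  # ":g" drops a trailing ".0"
        "--yarn-orig-ctx", str(orig_ctx),
    ]


print(" ".join(yarn_flags(131072)))
```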
Suitable and Unsuitable Scenarios
Ideal for:
Running agent workflows locally without API costs.
Stable tool‑calling capability in a small model.
Development on a laptop or Mac with 8 GB of memory.
Code‑debug assistance via editor plugins.
Simple repetitive automation tasks.
Not ideal for:
Complex tasks requiring strong reasoning power (only 4 B parameters).
Long‑form text generation.
High‑concurrency production services.
Conclusion
Qwopus3.5‑4B‑Coder‑MTP‑GGUF is a purpose‑built small model for agent‑coding scenarios, with a perfect tool‑call score and a 19‑point gain in bug finding on the benchlocal suite. Combined with MTP acceleration and a full range of GGUF quantizations, it offers a practical solution for developers who want to run agents locally on consumer‑grade hardware.
As a community model, it has not undergone a full safety audit, and it may emit <think> tags that front‑ends need to filter. The model is released under Apache‑2.0, which permits commercial use as well as learning and local experimentation.
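A front‑end can filter the reasoning tags with a small post‑processing step. The sketch below is one possible approach, assuming the reasoning is wrapped in literal `<think>` and `</think>` tags; `strip_think` is a hypothetical helper, and it also drops an unclosed block left behind by a truncated response.

```python
import re

# Matches a complete reasoning block plus any whitespace that follows it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
# Matches an unclosed block running to the end of the text (truncated output).
_UNCLOSED_THINK = re.compile(r"<think>.*\Z", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think> reasoning blocks so only the visible answer remains."""
    cleaned = _THINK_BLOCK.sub("", text)
    cleaned = _UNCLOSED_THINK.sub("", cleaned)
    return cleaned.strip()


print(strip_think("<think>Check the loop bounds.</think>\nThe bug is an off-by-one error."))
```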
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
