Opus‑Distilled Qwen3.5‑Coder Scores 100/100 Tool Calls, 1.4‑2.2× Faster with MTP, 128K Context on Consumer GPU

This article introduces Qwopus3.5‑4B‑Coder‑MTP‑GGUF, a 4‑billion‑parameter agent model fine‑tuned for code debugging, tool calling, and structured reasoning. It covers the three training techniques (Trace Inversion, high‑quality agent trajectory data, and Curriculum SFT), MTP acceleration, benchmark results, quantization options, and step‑by‑step local deployment.

Old Zhang's AI Learning

Introduction

Qwopus3.5‑4B‑Coder‑MTP‑GGUF is a 4B‑parameter agent model derived from Qwen3.5 and optimized for three scenarios: code debugging, tool calling, and structured reasoning. It is small enough to run on a laptop with 8 GB of memory, with no need for A100 GPUs or cloud services.

Training Methodology

Trace Inversion: starts from execution traces and works backward to learn the reasoning path that produced them.

High‑Quality Agent Trajectory Data: directly imitates the behavior patterns of strong agents.

Curriculum SFT: fine‑tunes progressively, in curriculum style.

MTP Acceleration

Multi‑Token Prediction (MTP) has the model predict the current token and the next token in the same step, effectively giving it a two‑step look‑ahead. Unlike conventional speculative decoding, which relies on a separate draft model, MTP is built into the model during training. Code‑heavy tasks contain many repetitive patterns, so the prediction hit rate is high, and the reported result is a 1.4‑2.2× speedup with no accuracy loss.
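To see how the prediction hit rate translates into speedup, here is a back‑of‑the‑envelope sketch. It assumes each drafted token is accepted independently with the same probability and that the extra prediction head costs nothing. Both are simplifications, and `draft_tokens` is an illustrative parameter, not a setting taken from the model card.

```python
def expected_tokens_per_step(accept_rate: float, draft_tokens: int = 1) -> float:
    """Expected tokens emitted per forward pass under draft-and-verify decoding.

    Assumption (not from the article): each drafted token is accepted
    independently with probability `accept_rate`, and verification stops at
    the first rejection. One token is always produced by the main head.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    # 1 + p + p^2 + ... + p^k: the i-th drafted token only counts if all
    # earlier drafted tokens were accepted as well.
    return sum(accept_rate ** i for i in range(draft_tokens + 1))


if __name__ == "__main__":
    for p in (0.4, 0.6, 0.8):
        print(f"accept_rate={p:.1f}: ~{expected_tokens_per_step(p):.2f} tokens per step")
```

Under this idealized model, a single drafted token gives at most `1 + p` tokens per step, so a hit rate of 0.4 corresponds to roughly 1.4×. Measured speedups will differ once the real cost of the extra head and of verification is included.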

Supported inference engines include llama.cpp, vLLM, and MLX.

Benchmark Results (benchlocal)

Four dimensions were evaluated (the changes in parentheses appear to be relative to the base model):

BugFind: 71/100 (+19), demonstrating the effectiveness of Trace Inversion for debugging logic.

ToolCall: 100/100 (+10), indicating perfect format compliance for tool calls.

HermesAgent: 64/100 (+3), a modest gain in multi‑turn memory and workspace management.

InstructFollow: 93/100 (no change), showing that fine‑tuning did not degrade the base model’s instruction‑following ability.
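If the deltas are read as changes against the base model on the same 100‑point scale (an assumption; the article does not name the reference explicitly), the implied baseline scores follow by subtraction. A small sketch:

```python
# Scores and deltas as reported in the article: (score out of 100, change).
RESULTS = {
    "BugFind": (71, 19),
    "ToolCall": (100, 10),
    "HermesAgent": (64, 3),
    "InstructFollow": (93, 0),
}


def implied_baseline(results):
    """Return score - delta per dimension.

    Assumption: each delta is measured against the base model on the same
    100-point scale.
    """
    return {name: score - delta for name, (score, delta) in results.items()}


if __name__ == "__main__":
    for name, base in implied_baseline(RESULTS).items():
        score, delta = RESULTS[name]
        print(f"{name:<15} base={base:>3}  tuned={score:>3}  ({delta:+d})")
```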

Agent Workflow

The execution chain is: user instruction → <think> (reasoning) → tool call → result retrieval → answer generation, with possible multi‑turn loops. The benchmark dimensions map directly to these stages.
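The chain above can be sketched as a small loop. This is a minimal illustration, not the model's actual protocol: the JSON tool‑call convention, the `generate` callback, and the `tools` registry are all hypothetical names introduced for this example.

```python
import json
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_agent(
    user_instruction: str,
    generate: Callable[[List[Message]], str],
    tools: Dict[str, Callable[..., str]],
    max_turns: int = 5,
) -> str:
    """Loop: model reply -> optional tool call -> tool result -> final answer.

    Hypothetical convention: a reply that parses as a JSON object with a
    "tool" key is a tool call; anything else is the final answer.
    """
    messages: List[Message] = [{"role": "user", "content": user_instruction}]
    for _ in range(max_turns):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # plain text: treat it as the final answer
        if not isinstance(call, dict) or "tool" not in call:
            return reply
        tool = tools.get(call["tool"])
        if tool is None:
            result = f"error: unknown tool {call['tool']!r}"
        else:
            result = tool(**call.get("arguments", {}))
        # Feed the tool result back so the next turn can use it.
        messages.append({"role": "tool", "content": result})
    return "error: max_turns reached without a final answer"


if __name__ == "__main__":
    # A scripted stand-in for the model: one tool call, then an answer.
    scripted = iter(['{"tool": "add", "arguments": {"a": 2, "b": 3}}', "The sum is 5."])
    print(run_agent("What is 2 + 3?", lambda msgs: next(scripted),
                    {"add": lambda a, b: str(a + b)}))
```

The `max_turns` cap is a design choice for the sketch: it keeps a model that never stops calling tools from looping forever.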

Quantization Formats (GGUF)

GGUF builds are provided from 2‑bit to 16‑bit precision. The recommended default is Q4_K_M (2.78 GB), which balances size and speed with minimal accuracy loss. Larger formats (Q5_K_M, Q8_0, BF16) are best reserved for precision‑critical tasks.

Typical selection guidance is shown in the accompanying diagram.
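As a rough sanity check on the quoted file size, the average number of bits stored per parameter can be estimated from the file size and the nominal parameter count. The sketch assumes "GB" means 10^9 bytes and uses the nominal 4B figure, so the result is only approximate.

```python
def effective_bits_per_weight(file_size_gb: float, n_params_billion: float) -> float:
    """Average bits stored per parameter.

    Assumptions: "GB" means 10**9 bytes, and the whole file is treated as
    weights. Metadata in the file and a parameter count above the nominal
    figure would both make the true value lower.
    """
    if n_params_billion <= 0:
        raise ValueError("n_params_billion must be positive")
    return file_size_gb * 1e9 * 8 / (n_params_billion * 1e9)


if __name__ == "__main__":
    # Q4_K_M size from the article, nominal 4B parameters.
    print(f"{effective_bits_per_weight(2.78, 4.0):.2f} bits per weight")
```

A figure above 4 is expected here, since Q4_K_M keeps some tensors at higher precision and stores per‑block scales alongside the 4‑bit values.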

Installation and Usage

llama.cpp (recommended for local deployment):

./llama-server \
  -m Qwopus3.5-4B-Coder-MTP-Q4_K_M.gguf \
  --ctx-size 131072 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 32768

Note: the model was trained with a 32 K context; YaRN RoPE scaling extends this to 128 K (or even 256 K). The --rope-scaling yarn flag is essential: raising --ctx-size alone produces corrupted output at long context.

vLLM :

pip install vllm
vllm serve "Jackrong/Qwopus3.5-4B-Coder-MTP-GGUF"

Transformers (Python example):

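from transformers import pipeline

# Task name and model ID are as given in the source.
pipe = pipeline("image-text-to-text", model="Jackrong/Qwopus3.5-4B-Coder-MTP-GGUF")
# Chat-style input: a single user turn containing one text part.
messages = [{"role": "user", "content": [{"type": "text", "text": "Help me debug this code..."}]}]
result = pipe(text=messages)
print(result)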

Local testing was performed on a Mac Mini with 16 GB RAM using the Q4_K_M quantization.
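Once llama-server or vLLM is running, both expose an OpenAI‑compatible chat endpoint, so a client can be written with the standard library alone. The sketch below assumes llama-server's default port 8080 (vLLM defaults to 8000); the model name `qwopus` is a placeholder, and for vLLM it should be replaced with the name the server was started with.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "qwopus", max_tokens: int = 512) -> dict:
    """Request body for an OpenAI-compatible /v1/chat/completions endpoint.

    `model` is a placeholder name, not taken from the article.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST the request to a locally running server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with a server already running:
#   print(chat("Find the bug: for i in range(len(xs)): xs.pop(i)"))
```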

Context Extension

128 K: --rope-scale 4 --yarn-orig-ctx 32768

256 K: theoretically possible, but 128 K is recommended for stability.

For most agent workflows, 128 K context is sufficient.
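The scale factor in these settings is the target context divided by the original training context. A small helper, sketched here under the assumption of a 32 K original context as stated in the article, builds the matching llama-server flags; `yarn_flags` is a hypothetical name for this example.

```python
from typing import List


def yarn_flags(target_ctx: int, orig_ctx: int = 32768) -> List[str]:
    """Build the llama-server context flags for a YaRN-extended window.

    The scale is target_ctx / orig_ctx, e.g. 131072 / 32768 = 4.
    """
    if target_ctx <= orig_ctx:
        raise ValueError("YaRN scaling is only needed when target_ctx exceeds orig_ctx")
    scale = target_ctx / orig_ctx
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(orig_ctx),
    ]


if __name__ == "__main__":
    print(" ".join(yarn_flags(131072)))
```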

Suitable and Unsuitable Scenarios

Ideal for:

Running agent workflows locally without API costs.

Stable tool‑calling capability in a small model.

Laptop or Mac development with 8 GB of memory.

Code‑debug assistance via editor plugins.

Simple repetitive automation tasks.

Not ideal for:

Complex tasks requiring strong reasoning power (only 4 B parameters).

Long‑form text generation.

High‑concurrency production services.

Conclusion

Qwopus3.5‑4B‑Coder‑MTP‑GGUF is a purpose‑built small model that excels in agent‑coding scenarios, achieving a perfect tool‑call score and a 19‑point gain in bug finding. Combined with MTP acceleration and a full range of GGUF quantizations, it offers a practical option for developers who want to run agents locally on consumer‑grade hardware.

As a community model, it has not undergone a full safety audit, and it may emit <think> tags that front‑ends need to filter out. It is released under Apache‑2.0, which permits commercial use as well as learning and local experimentation.
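Filtering the reasoning tags in a front‑end can be as simple as a regular expression. This sketch assumes the reasoning is wrapped in a matched <think>...</think> pair; an unclosed tag is deliberately left untouched instead of guessing where the reasoning ends.

```python
import re

# Non-greedy match so that several separate think blocks are each removed.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove matched <think>...</think> blocks and surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


if __name__ == "__main__":
    raw = "<think>Check the loop bounds first.</think>\nThe bug is an off-by-one in the loop."
    print(strip_think(raw))
```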

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Quantization, agent, Benchmark, Local Deployment, MTP, GGUF, Qwen3.5
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.