Qwen3.5-27B Outperforms the 397B Model in Tool Calling – Q6 Quantization Is Optimal

Using the open‑source ToolCall‑15 benchmark, the author shows that the 27‑billion‑parameter Qwen3.5 model consistently scores full marks while the 397‑billion‑parameter version fails on several tasks, and that the Q6 quantized variant offers the best trade‑off between size and tool‑calling accuracy.


Tool calling (function calling) is the key capability that lets large language models move from mere chat to practical work; without it, an agent is useless. To evaluate this ability the author uses the open‑source ToolCall-15 benchmark (github.com/stevibe/ToolCall-15), which is designed as follows:

15 scenarios, 5 capability groups (3 per group)

12 tools, all visible to the model each run

Deterministic simulated tool responses

Temperature fixed at 0 to remove randomness

All tests executed, no cherry‑picking
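The design above can be sketched as a small deterministic harness. This is an illustrative sketch only; the names (`Scenario`, `run_scenario`, `stub_model`, the two stand-in tools) are hypothetical and not ToolCall-15's actual API.

```python
# Illustrative sketch of a deterministic tool-calling harness in the spirit
# of ToolCall-15. All identifiers here are hypothetical, not the benchmark's
# real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    expected_tool: str    # the tool the model must pick
    expected_args: dict   # the exact arguments required to pass
    canned_response: str  # deterministic simulated tool output

# All 12 tools are visible to the model each run; two stand in for the set here.
TOOLS = {"web_search": None, "calculator": None}

def run_scenario(scenario: Scenario, call_model: Callable) -> bool:
    """Pass/fail scoring: right tool with right arguments.
    Temperature is pinned to 0 by the caller, so reruns are reproducible."""
    tool, args = call_model(scenario.name, list(TOOLS), temperature=0)
    return tool == scenario.expected_tool and args == scenario.expected_args

# A stub model that always answers correctly, for demonstration.
def stub_model(prompt, tools, temperature):
    return "calculator", {"expression": "372520 * 0.02"}

tc15 = Scenario("TC-15", "calculator", {"expression": "372520 * 0.02"}, "7450.4")
print(run_scenario(tc15, stub_model))  # True: right tool, right arguments
```

Because tool responses are canned and temperature is fixed, a model's score is stable across reruns, which is what makes the per-model comparison below meaningful.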

The benchmark results for the entire Qwen3.5 family are:

Qwen3.5‑27B : 15/15 (full score)

Qwen3.5‑27B distilled : 15/15 (full score, tool‑calling ability preserved)

Qwen3.5‑397B : 13/15 (two failures)

Qwen3.5‑122B : 14/15 (one failure)

Qwen3.5‑35B : 13/15 (two failures)

Small models (0.8B‑14B): many time‑outs and infinite tool‑calling loops

The most illustrative case is scenario 15 (TC‑15): “Search the population of Iceland and calculate 2%.” Small models either fabricate a number or loop until timeout; larger models retrieve the exact figure (372,520) but then ignore it and use an approximate value (~370,000); the 27B model correctly feeds the retrieved number into the calculation, 372,520 × 0.02 = 7,450.4, and passes the scenario.
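The arithmetic behind TC-15 is easy to check, and it shows why the approximate answer fails a strict grader:

```python
# TC-15: 2% of Iceland's population.
exact = 372_520 * 0.02    # using the retrieved figure, as the 27B model does
approx = 370_000 * 0.02   # using the rounded value larger models fell back to
print(round(exact, 1), round(approx, 1))  # 7450.4 7400.0
```

The gap (about 50) is small in absolute terms, but a benchmark that checks the exact computed value scores the approximation as a failure.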

After confirming 27B as the best‑performing size, the author examines quantization. All Unsloth quantized variants from Q2_K_XL to Q8_K_XL are tested. Both Q8 and Q6 reach 15/15, but Q6 uses less memory and runs faster. Lower quantization levels (Q5, Q4, Q3) drop to 14/15, and Q2 falls to 13/15: tool‑calling precision degrades steadily as quantization becomes more aggressive.

Temperature 0 is justified: Databricks research indicates that for function‑calling tasks, accuracy can differ by up to 10 % between temperature 0 and 0.7. Tool calling requires structured, deterministic output (correct tool, correct parameters, correct format), similar to generating JSON or code with low temperature.
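In practice, pinning temperature to 0 is a single field in the request. A sketch of a tools request in the OpenAI-compatible chat-completions format that most local inference servers accept; the model name and tool schema are illustrative, and no request is actually sent:

```python
# Illustrative tools request in the OpenAI-compatible chat-completions
# format. Model name and tool schema are examples, not from the article.
payload = {
    "model": "qwen3.5-27b",   # illustrative model identifier
    "temperature": 0,          # deterministic: no sampling randomness
    "messages": [
        {"role": "user",
         "content": "Search the population of Iceland and calculate 2%."},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web and return the top result.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
}
print(payload["temperature"])  # 0
```

With temperature 0, the model's tool choice, argument JSON, and output format are reproducible run to run, which is exactly what structured function calling needs.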

In summary, the 27‑billion‑parameter Qwen3.5 model is compact, fast, and excels at tool calling, while the Q6 quantized version provides the optimal balance of size, speed, and accuracy.

ToolCall-15 testing dashboard
Qwen3.5-27B quantization versions
Quantization version comparison
Tags: AI · model evaluation · tool calling · Qwen3.5 · LLM benchmark
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
