Artificial Intelligence 9 min read

Gemma‑4‑12B‑v2 (Fable 5 Clone) Achieves 3.5× Telecom Benchmark Boost

The author reproduces Anthropic’s Fable 5 using Gemma‑4‑12B‑v2, showing a 3.5× improvement on the telecom tau2‑bench versus the base model, details the agentic, coding, and general training data, compares quantization sizes, provides llama.cpp launch commands, and notes speed gains from speculative MTP decoding and current limitations.

Old Zhang's AI Learning

Jun 19, 2026

Gemma‑4‑12B‑v2 (Fable 5 Clone) Achieves 3.5× Telecom Benchmark Boost

Model Overview

New version

yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF

builds on the Gemma 4 12B base, fine‑tuned with Fable 5 and Composer 2.5 data (v2 tag) and claims a 3.5× improvement on the tau2‑bench.

Core Benchmark Data

Local comparison (same harness, Q8_0 quant, 20 tasks) shows: gemma-4-12B-it (official base) – ~15% tau2‑bench telecom score

🟢 Gemma‑4‑12B v2 (this model) – ~55% score

Telecom was chosen because its loop (read → diagnose → fix → verify) mirrors a real terminal debugging scenario, providing a stronger signal for agentic ability than retail‑style tasks.

Failure modes differ:

Base model often hands off to a human via transfer_to_human in 10 of 20 tasks.

v2 never hands off; it persists in the loop, solving more tasks.

State‑of‑the‑art models such as mimo‑v2.5‑pro and Claude Opus 4.8 score >90% on this benchmark but are orders of magnitude larger.

Three caveats:

MMLU‑Pro general‑knowledge score drops slightly – a natural trade‑off of focused fine‑tuning.

tau2‑bench retail tasks underperform the base model – retail is pure customer‑service lookup, misaligned with this model’s focus.

Local self‑scores are not directly comparable to official leaderboards – they are relative within the same harness.

Training Data Composition

🛠️ Agentic / terminal – multi‑step tool‑use trajectories (read → reason → act → verify) using Gemma 4’s native tool‑call protocol; primary driver of the tau2‑bench gain.

💻 Coding – Composer 2.5’s true chain‑of‑thought (teacher solution + code that passes tests) plus a “redo” set of hard Fable 5 problems.

📚 General – a small mix of general, reasoning, and instruction data to retain breadth.

Quantization Options

🟡 Q3_K_M – 5.7 GB, suitable for 8 GB VRAM users.

🔵 Q4_K_M – 6.87 GB, recommended starting point.

🟣 Q6_K – 9.11 GB, near‑lossless quality.

⚪ Q8_0 – 11.8 GB, almost full‑precision.

Running the Model (llama.cpp)

Use the latest llama.cpp build that recognises the gemma4_unified architecture.

llama-server.exe ^
  -m gemma4-v2-Q4_K_M.gguf ^
  --ctx-size 16384 ^
  --n-gpu-layers 99 ^
  --no-mmap -fa on ^
  --jinja ^
  --temp 1.0 --top-p 0.95 --top-k 64 ^
  --host 0.0.0.0 --port 18080

Common pitfalls:

Garbage output or runaway generation – add rep_pen 1.1 and keep temp 1.0.

Raw tokens like <|tool_call> or <|channel> appear – enable --jinja to parse Gemma 4’s native tool format.

One‑click apps (LM Studio, Jan, Ollama) can import the GGUF directly, select a quant, and run without command‑line work.

Agent Mode Usage

Through the OpenAI‑compatible tools field, define tool calls (with --jinja enabled). The model emits structured calls and operates in read/grep/edit/run loops automatically.

Sampling recommendation: temp 1.0, top_p 0.95, top_k 64. For deterministic code generation, use greedy sampling ( temp 0).

MTP Speculative Decoding

The repository includes an MTP/ folder containing a draft Gemma 4 multi‑token‑prediction model (converted to GGUF via unsloth) for speculative decoding.

Author’s measurements:

~88 tok/s → ~180 tok/s on a deterministic prompt.

Real coding/thinking scenarios see ~1.2‑1.3× acceleration.

Lossless speed‑up without precision loss.

Loader compatibility: works with llama.cpp commit 9e3b928fd (b9553). Newer commits b9702/b9717 crash with “invalid vector subscript” – an upstream regression.

Future Work

v3 is under development; the author expects telecom scores in the 60‑70% range.

A Qwen 3.6‑27B variant is being trained with the same coding + agentic recipe for users with larger VRAM budgets.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Quantization speculative decoding agentic AI llama.cpp Gemma-4-12B Fable 5 telecom benchmark

Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.