Which Small Language Model Wins After Fine‑Tuning? A Data‑Driven Benchmark
A comprehensive benchmark fine‑tunes twelve small language models on eight diverse tasks, compares them against a 120B teacher model, and reveals which models excel overall, which are most "plastic" for improvement, and how small models can rival much larger ones.
Building AI applications that run on edge, local, or device‑side hardware often raises the question: which small language model (SLM) should be fine‑tuned? This article answers that by benchmarking twelve SLMs across eight tasks and comparing them with a 120B teacher model.
TL;DR
Fine‑tuned small models can surpass large models: Qwen3‑4B matches or outperforms the 30× larger GPT‑OSS‑120B on seven of eight benchmarks, and even beats it by 19 points on SQuAD 2.0.
Best overall fine‑tuned model: Qwen3‑4B‑Instruct‑2507 consistently ranks first.
Most "plastic" (largest fine‑tuning gain): The smallest 1‑3B models gain the most relative improvement, narrowing the gap with larger models.
Introduction
We fine‑tuned 12 models (Qwen3, Llama‑3, SmolLM2, Gemma, Granite) on eight tasks covering classification, information extraction, and open/closed‑book QA, then compared them with a synthetic‑data teacher model (GPT‑OSS‑120B).
The study addresses four practical questions:
Which model is strongest after fine‑tuning?
Which model is most "plastic" (largest fine‑tuning gain)?
Which model has the best zero‑/few‑shot baseline?
Can the best student model catch up to the teacher?
Method
Models evaluated:
Qwen3 series: 8B, 4B‑Instruct‑2507, 1.7B, 0.6B (thinking mode disabled)
Llama series: 8B‑Instruct, 3B‑Instruct, 1B‑Instruct
SmolLM2 series: 1.7B‑Instruct, 135M‑Instruct
Gemma series: 3‑1B‑it, 3‑270M‑it
Granite: 3.3‑8B‑Instruct
Metrics:
Baseline score: zero‑/few‑shot performance using only prompts, with no fine‑tuning.
Fine‑tuned score: performance after training on 10k synthetic examples generated by the teacher model (a sketch of this generation step follows the list).
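One way to picture the distillation step is the sketch below, which asks the teacher for labeled examples. It assumes the GPT‑OSS‑120B teacher is served behind an OpenAI‑compatible endpoint; the endpoint, model name, and prompt template are illustrative assumptions, not details from the article.

```python
# Minimal sketch of teacher-driven synthetic data generation.
# Endpoint, model name, and prompt are assumptions, not details from the article.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical local server

TASK_PROMPT = (
    "You are generating training data for intent classification (Banking77). "
    "Write one realistic customer message and its intent label as JSON with "
    "keys 'text' and 'label'."
)

def generate_examples(n: int) -> list[dict]:
    examples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-oss-120b",                     # teacher model id is an assumption
            messages=[{"role": "user", "content": TASK_PROMPT}],
            temperature=1.0,                          # encourage diverse samples
        )
        # In practice malformed JSON outputs would need filtering or retries.
        examples.append(json.loads(resp.choices[0].message.content))
    return examples

# e.g. generate_examples(10_000) to approximate the 10k examples per task
```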
The eight benchmarks include classification (TREC, Banking77, Ecommerce, Mental Health), document understanding (docs), and QA (HotpotQA, Roman Empire QA, SQuAD 2.0). Rankings are computed per task, averaged, and reported with 95% confidence intervals (lower rank = better).
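The rank aggregation itself is simple enough to show concretely. The sketch below ranks models per task, averages the ranks, and attaches a 95% interval; the normal-approximation interval is an assumption (the article does not say how its intervals were computed), and the numbers are purely illustrative.

```python
# Sketch of the rank aggregation: rank models within each task, then average
# ranks across tasks and attach a 95% confidence interval per model.
import numpy as np
import pandas as pd

# rows = models, columns = tasks, values = task metric (higher is better);
# the scores here are illustrative, not results from the benchmark.
scores = pd.DataFrame(
    {"trec": [0.91, 0.88, 0.82], "squad2": [0.78, 0.74, 0.69]},
    index=["qwen3-4b", "llama-3.2-1b", "smollm2-1.7b"],
)

ranks = scores.rank(axis=0, ascending=False)   # rank 1 = best on that task
avg_rank = ranks.mean(axis=1)
ci95 = 1.96 * ranks.std(axis=1, ddof=1) / np.sqrt(ranks.shape[1])

summary = pd.DataFrame({"avg_rank": avg_rank, "ci95": ci95}).sort_values("avg_rank")
print(summary)
```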
Question 1: Which model is strongest after fine‑tuning?
Champion: Qwen3‑4B‑Instruct‑2507 (average rank 2.25 ± 1.03).
Qwen3 family dominates; the 4B version even outperforms the 8B variant, suggesting the July 2025 (2507) update makes it a better distillation student.
Question 2: Which model is most "plastic"?
Champion: Llama‑3.2‑1B‑Instruct (average rank 3.44 ± 1.31).
Plasticity (fine‑tuned score minus baseline score) is highest for the smallest models; they gain the most relative improvement, effectively narrowing the gap with larger models.
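As a concrete illustration, plasticity falls straight out of the two scores defined above. The numbers in this sketch are illustrative, not results from the benchmark.

```python
# Plasticity sketch: fine-tuned score minus baseline score for each model.
# Values are illustrative only.
baseline  = {"qwen3-8b": 0.71, "llama-3.2-1b": 0.38, "smollm2-135m": 0.22}
finetuned = {"qwen3-8b": 0.83, "llama-3.2-1b": 0.74, "smollm2-135m": 0.55}

plasticity = {m: finetuned[m] - baseline[m] for m in baseline}

# The smallest models show the largest gains in this toy example,
# mirroring the trend reported in the benchmark.
for model, gain in sorted(plasticity.items(), key=lambda kv: -kv[1]):
    print(f"{model:14s} +{gain:.2f}")
```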
Question 3: Which model has the best baseline?
Champion: Qwen3‑8B (average rank 1.75 ± 0.72).
Without any fine‑tuning, the 8B model consistently ranks near the top with the smallest variance, making it the most reliable out‑of‑the‑box performer.
Question 4: Can the best student catch up to the teacher?
Answer: Yes. Across the eight tasks, Qwen3‑4B‑Instruct‑2507 surpasses the 120B teacher on six, ties on one, and is marginally behind only on Banking77 (within the confidence interval). Notably, it beats the teacher by 19 points on SQuAD 2.0, demonstrating that fine‑tuning can embed domain knowledge into a small model.
Practical Model Selection Guide
Highest accuracy – choose Qwen3‑4B‑Instruct‑2507 (best overall after fine‑tuning).
Very tight compute (<2B parameters) – choose Llama‑3.2‑1B or Qwen3‑0.6B (highest plasticity).
Cannot fine‑tune – choose Qwen3‑8B (strongest zero‑/few‑shot baseline).
Edge deployment (mobile/IoT) – choose Qwen3‑0.6B (smallest size, still plastic).
Next Steps
Expand the model list with additional SLMs such as Qwen3.5, Phi‑4, and Mistral.
Increase benchmark repetitions to shrink confidence intervals.
Include additional tasks like summarization, code generation, and multi‑turn dialogue.
Training Details
All models were fine‑tuned with the same distillation pipeline: the teacher (GPT‑OSS‑120B) generated 10k synthetic examples per task; training used 4 epochs, a learning rate of 5e‑5 with linear decay, and LoRA rank 64. Training and test sets were fully disjoint.
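A minimal reproduction sketch of that per‑model training run is shown below, using Hugging Face PEFT and TRL. The epochs, learning rate, schedule, and LoRA rank come from the article; the library choice, target modules, batch size, and file names are assumptions.

```python
# Sketch of one fine-tuning run on teacher-generated data.
# Hyperparameters (epochs, lr, schedule, LoRA rank) follow the article;
# libraries, target modules, batch size, and file names are assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL of teacher-generated examples in SFT-ready format.
dataset = load_dataset("json", data_files="teacher_synthetic_10k.jsonl", split="train")

peft_config = LoraConfig(
    r=64,                         # LoRA rank from the article
    lora_alpha=128,               # assumption: 2x rank
    target_modules="all-linear",  # assumption
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen3-4b-banking77-lora",
    num_train_epochs=4,             # from the article
    learning_rate=5e-5,             # from the article
    lr_scheduler_type="linear",     # linear decay, from the article
    per_device_train_batch_size=8,  # assumption
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",
    train_dataset=dataset,
    args=training_args,
    peft_config=peft_config,
)
trainer.train()
```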
Conclusion
Base model quality varies, but fine‑tuning quickly narrows the gap. The benchmark shows Qwen3‑4B‑Instruct‑2507 is overall strongest and can achieve near‑teacher performance on a single consumer‑grade GPU with roughly 1/30 the inference cost, while very small models (e.g., Llama‑3.2‑1B) achieve remarkable gains due to high plasticity.
One‑liner: Fine‑tuning matters more than the choice of base model – a well‑tuned 1B model can outshine an 8B model that relies only on prompting.
Source: https://www.distillabs.ai/blog/we-benchmarked-12-small-language-models-across-8-tasks-to-find-the-best-base-model-for-fine-tuning