Which Small Language Model Wins After Fine‑Tuning? A Data‑Driven Benchmark

A comprehensive benchmark fine‑tunes twelve small language models on eight diverse tasks, compares them against a 120B teacher model, and reveals which models excel overall, which are most "plastic" for improvement, and how small models can rival much larger ones.


Building AI applications that run on edge, local, or device‑side hardware often raises the question: which small language model (SLM) should be fine‑tuned? This article answers that by benchmarking twelve SLMs across eight tasks and comparing them with a 120B teacher model.

TL;DR

Fine‑tuned small models can surpass large models: Qwen3‑4B matches or outperforms the 30× larger GPT‑OSS‑120B on seven of eight benchmarks, and even beats it by 19 points on SQuAD 2.0.

Best overall fine‑tuned model: Qwen3‑4B‑Instruct‑2507 consistently ranks first.

Most "plastic" (largest fine‑tuning gain): The smallest 1‑3B models gain the most relative improvement, narrowing the gap with larger models.

Introduction

We fine‑tuned 12 models (Qwen3, Llama‑3, SmolLM2, Gemma, Granite) on eight tasks covering classification, information extraction, and open/closed‑book QA, then compared them with a synthetic‑data teacher model (GPT‑OSS‑120B).

The study addresses four practical questions:

Which model is strongest after fine‑tuning?

Which model is most "plastic" (largest fine‑tuning gain)?

Which model has the best zero‑/few‑shot baseline?

Can the best student model catch up to the teacher?

Method

Models evaluated:

Qwen3 series: 8B, 4B‑Instruct‑2507, 1.7B, 0.6B (thinking mode disabled)

Llama series: 8B‑Instruct, 3B‑Instruct, 1B‑Instruct

SmolLM2 series: 1.7B‑Instruct, 135M‑Instruct

Gemma series: 3‑1B‑it, 3‑270M‑it

Granite: 3.3‑8B‑Instruct

Metrics:

Baseline score: zero‑/few‑shot performance using only prompts, with no training.

Fine‑tuned score: performance after training on 10k synthetic examples generated by the teacher model.

The eight benchmarks cover classification (TREC, Banking77, Ecommerce, Mental Health), document understanding (docs), and QA (HotpotQA, Roman Empire QA, SQuAD 2.0). Rankings are computed per task, averaged, and reported with 95% confidence intervals (lower rank = better).
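The ranking procedure described above can be sketched in plain Python. The function and the sample scores below are illustrative, not the benchmark's actual data, and the normal-approximation confidence interval is an assumption about how the intervals were computed:

```python
import math
import statistics

def average_ranks(scores_by_task):
    """Rank models per task (1 = best), then average each model's ranks.

    scores_by_task: {task: {model: score}}; higher score is better.
    Returns {model: (mean_rank, 95% CI half-width)}.
    """
    ranks = {}  # model -> list of per-task ranks
    for task, scores in scores_by_task.items():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(rank)

    summary = {}
    for model, rs in ranks.items():
        mean = statistics.mean(rs)
        # Normal-approximation 95% CI half-width; illustrative only.
        half = 1.96 * statistics.stdev(rs) / math.sqrt(len(rs)) if len(rs) > 1 else 0.0
        summary[model] = (mean, half)
    return summary

# Toy example with two tasks and three models (made-up scores):
scores = {
    "trec":  {"qwen3-4b": 0.95, "llama-1b": 0.90, "smollm2": 0.80},
    "squad": {"qwen3-4b": 0.88, "llama-1b": 0.91, "smollm2": 0.70},
}
print(average_ranks(scores))
```

With real data this is run over all eight tasks and twelve models; the averaging is what lets a model that is merely consistent (never first, never last) beat one that spikes on a single task.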

Question 1: Which model is strongest after fine‑tuning?

Champion: Qwen3‑4B‑Instruct‑2507 (average rank 2.25 ± 1.03).

The Qwen3 family dominates; the 4B version even outperforms the 8B variant, suggesting the July 2025 update (the "2507" suffix) improves distillation performance.

Question 2: Which model is most "plastic"?

Champion: Llama‑3.2‑1B‑Instruct (average rank 3.44 ± 1.31).

Plasticity (fine‑tuned score minus baseline score) is highest for the smallest models; they gain the most from fine‑tuning, effectively narrowing the gap with larger models.
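Plasticity as defined in this study is just the per-model difference between the fine-tuned and baseline scores. A minimal sketch, with hypothetical numbers rather than the benchmark's data:

```python
def plasticity(baseline, fine_tuned):
    """Absolute fine-tuning gain per model: fine-tuned score minus baseline score."""
    return {m: fine_tuned[m] - baseline[m] for m in baseline}

# Hypothetical scores, not the paper's data: the small model starts lower
# but gains far more from fine-tuning.
baseline   = {"llama-1b": 0.40, "qwen3-8b": 0.70}
fine_tuned = {"llama-1b": 0.78, "qwen3-8b": 0.85}
gains = plasticity(baseline, fine_tuned)
print(gains)
```

Ranking models by this gain (rather than by raw score) is what surfaces the 1B-class models as winners in this question.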

Question 3: Which model has the best baseline?

Champion: Qwen3‑8B (average rank 1.75 ± 0.72).

Without any fine‑tuning, the 8B model consistently ranks near the top with the smallest variance, making it the most reliable out‑of‑the‑box performer.

Question 4: Can the best student catch up to the teacher?

Answer: Yes. Qwen3‑4B‑Instruct‑2507 matches or exceeds the teacher on seven of eight benchmarks, tying on one and lagging slightly on Banking77 (within confidence bounds). Notably, it outperforms the teacher by 19 points on SQuAD 2.0.

Broken down across the eight tasks, the 4B student surpasses the 120B teacher on six, ties on one, and trails marginally on Banking77 (within the error margin). The 19‑point lead on SQuAD 2.0 demonstrates that fine‑tuning can embed domain knowledge into a small model that prompting alone cannot reach.

Practical Model Selection Table

Goal: Highest accuracy – Choose Qwen3‑4B‑Instruct‑2507 (best overall after fine‑tuning).

Very tight compute (<2B) – Choose Llama‑3.2‑1B or Qwen3‑0.6B (highest plasticity).

Cannot fine‑tune – Choose Qwen3‑8B (strongest zero‑/few‑shot).

Edge deployment (mobile/IoT) – Choose Qwen3‑0.6B (smallest size, still plastic).
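The selection rules above can be collapsed into a small helper function. This is purely illustrative: the constraint names and the function itself are my own, not something shipped with the benchmark:

```python
def pick_model(can_fine_tune=True, max_params_b=None, edge=False):
    """Map the article's model-selection rules to a model name (illustrative)."""
    if edge:
        return "Qwen3-0.6B"          # smallest size, still plastic
    if not can_fine_tune:
        return "Qwen3-8B"            # strongest zero-/few-shot baseline
    if max_params_b is not None and max_params_b < 2:
        return "Llama-3.2-1B"        # highest plasticity under tight compute
    return "Qwen3-4B-Instruct-2507"  # best overall after fine-tuning

print(pick_model())                     # Qwen3-4B-Instruct-2507
print(pick_model(can_fine_tune=False))  # Qwen3-8B
```

The rule ordering matters: the edge constraint dominates, then the ability to fine-tune, then the parameter budget.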

Next Steps

Expand model list: add upcoming SLMs such as Qwen3.5, Phi‑4, Mistral.

Increase benchmark repetitions to shrink confidence intervals.

Include additional tasks like summarization, code generation, and multi‑turn dialogue.

Training Details

All models were fine‑tuned with the same distillation pipeline: the teacher (GPT‑OSS‑120B) generated 10k synthetic examples per task; training used 4 epochs, a learning rate of 5e‑5 with linear decay, and LoRA rank 64. Training and test sets were fully disjoint.
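The stated schedule (5e-5 with linear decay over 4 epochs) corresponds to a standard linear ramp-down of the learning rate. A stdlib-only sketch; the step count, batch size, and absence of warmup are my assumptions, not details from the article:

```python
def linear_decay_lr(step, total_steps, base_lr=5e-5):
    """Linearly decay the learning rate from base_lr at step 0 to 0 at total_steps."""
    frac = max(0.0, 1.0 - step / total_steps)
    return base_lr * frac

# Assumed example: 10k examples, batch size 16 -> 625 steps/epoch, 4 epochs.
total = 4 * 625
print(linear_decay_lr(0, total))      # full 5e-05 at the start
print(linear_decay_lr(total, total))  # 0.0 at the end
```

In practice this schedule is usually obtained from the training framework rather than hand-rolled; the sketch just makes the decay shape concrete.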

Conclusion

Base model quality varies, but fine‑tuning quickly narrows the gap. The benchmark shows Qwen3‑4B‑Instruct‑2507 is overall strongest and can achieve near‑teacher performance on a single consumer‑grade GPU with roughly 1/30 the inference cost, while very small models (e.g., Llama‑3.2‑1B) achieve remarkable gains due to high plasticity.

One‑liner: Fine‑tuning matters more than the choice of base model: a well‑tuned 1B model can outshine an 8B model that relies only on prompting.

Original source: https://www.distillabs.ai/blog/we-benchmarked-12-small-language-models-across-8-tasks-to-find-the-best-base-model-for-fine-tuning

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: AI, LLM, Fine-tuning, Model comparison, benchmark, small language models
Written by PaperAgent: daily updates analyzing cutting-edge AI research papers.