Can a 5% Parameter LLM Rival Full‑Scale Models? Inside FairyR1‑32B

The Peking University team (PKU-DS-LAB) unveils FairyR1-32B, a 32-billion-parameter LLM built on DeepSeek-R1-Distill-Qwen-32B that uses self-merging, multi-teacher cross-distillation, and lightweight distillation to reach competitive math and code benchmark scores with only about 5% of the parameters of the full-scale DeepSeek-R1-671B.


Model Overview

FairyR1-32B is a 32‑billion‑parameter large language model built on the DeepSeek‑R1‑Distill‑Qwen‑32B base. The authors employ a "split‑merge distillation" pipeline that combines fine‑tuning with model‑merging to achieve high task performance while using only about 5% of the parameters of the full‑scale DeepSeek‑R1‑671B.

Key Technical Innovations

Self‑merging

Multi‑teacher cross‑distillation

Lightweight (low‑cost) distillation

Data Construction

Two domain‑specific datasets were curated:

Mathematics: raw problems drawn from the AI-MO/NuminaMath-1.5 dataset. Answers were generated by several teacher models, then filtered for correctness, token length (2k–8k tokens per example), and chain-of-thought quality; a simplified sketch of this filter follows the list. After multi-stage filtering, ≈6.6k math examples remained.

Programming: raw problems drawn from the open-thoughts/OpenThoughts-114k dataset. The same teacher-generation and filtering pipeline was applied, with a length filter of 4k–8k tokens per example, yielding ≈3.8k code examples.

Both subsets were used to train specialist models.
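
As a rough, self-contained sketch of that filtering stage: only the token windows (2k–8k for math, 4k–8k for code) come from the article; the exact correctness checker, the chain-of-thought heuristic, and the function names below are assumptions.

```python
from transformers import AutoTokenizer

# A simplified stand-in for the multi-stage filter described above.
# Only the token windows come from the article; the answer-matching
# rule and the chain-of-thought heuristic are assumptions.
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")

def keep_example(response: str, reference: str,
                 min_tok: int, max_tok: int) -> bool:
    """Keep a teacher response only if it passes all three filters."""
    # 1) Correctness: the final boxed answer must match the reference.
    if f"\\boxed{{{reference}}}" not in response:
        return False
    # 2) Length: the token count must fall inside the domain window.
    n_tokens = len(tokenizer.encode(response))
    if not (min_tok <= n_tokens <= max_tok):
        return False
    # 3) Chain of thought: require an explicit reasoning trace.
    return "</think>" in response

# Hypothetical candidate pool of (response, reference_answer) pairs.
# (This toy example is far too short to survive the length filter.)
candidates = [("<think>2 + 2 = 4, so ...</think> \\boxed{4}", "4")]
math_subset = [c for c in candidates if keep_example(*c, 2_000, 8_000)]
```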

Training Procedure

Two specialist models (one for math, one for code) were trained independently under identical hyper‑parameters (learning rate, batch size) for roughly five epochs. The training used standard mixed‑precision AdamW optimization on a single‑node GPU cluster (e.g., 8×A100 40 GB), with gradient accumulation to fit the long sequences.
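
A minimal PyTorch sketch of that loop, assuming bf16 autocast for the mixed precision (standard on A100s) and an HF-style model that returns a loss; all hyper-parameter values are placeholders, not the authors' settings.

```python
import torch

# Minimal sketch of the specialist training loop: AdamW, bf16 mixed
# precision, and gradient accumulation to fit long sequences. All
# hyper-parameter values below are illustrative placeholders.
def train_specialist(model, loader, epochs=5, accum_steps=8, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    model.train()
    for _ in range(epochs):                    # "roughly five epochs"
        for step, batch in enumerate(loader):
            with torch.autocast("cuda", dtype=torch.bfloat16):
                loss = model(**batch).loss / accum_steps
            loss.backward()
            # Effective batch size = accum_steps x micro-batch size.
            if (step + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
```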

Model Merging

After independent training, the specialist checkpoints were merged into a single unified model using Arcee Fusion, a merging method that aligns weight spaces and performs a weighted average to preserve the capabilities of both domains.
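
Arcee Fusion itself fuses weights more selectively, but a plain weighted average of the two specialists' state dicts gives the flavor of the step; the checkpoint paths and the 0.5/0.5 weights below are assumptions for illustration, not the authors' configuration.

```python
import torch

# Simplified stand-in for the merging step: a uniform weighted average
# of the two specialist checkpoints. The real Arcee Fusion method is
# more selective; paths and weights here are illustrative assumptions.
def merge_state_dicts(path_a: str, path_b: str,
                      w_a: float = 0.5, w_b: float = 0.5) -> dict:
    sd_a = torch.load(path_a, map_location="cpu")
    sd_b = torch.load(path_b, map_location="cpu")
    assert sd_a.keys() == sd_b.keys(), "specialists must share an architecture"
    return {k: w_a * sd_a[k] + w_b * sd_b[k] for k in sd_a}

merged = merge_state_dicts("math_specialist.pt", "code_specialist.pt")
torch.save(merged, "fairy_merged.pt")
```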

Evaluation Results

FairyR1-32B was benchmarked on several public tasks and compared with the full‑scale DeepSeek‑R1‑671B and the distilled DeepSeek‑R1‑Distill‑Qwen‑32B baseline.

Benchmark                FairyR1-32B   DeepSeek-R1-671B   R1-Distill-Qwen-32B
AIME 2024 (math)         80.4          79.8               72.6
AIME 2025 (math)         75.6          70.0               52.9
LiveCodeBench (code)     67.7          65.9               57.2
GPQA-Diamond (science)   59.6          71.5               62.1

The results demonstrate that, with only ~5% of the original parameter count (32 B of 671 B ≈ 4.8%), FairyR1-32B matches or slightly exceeds the full-scale model on the math and code benchmarks, while lagging on scientific QA (GPQA-Diamond).

Release

The model weights, training scripts, and merging code are publicly available on Hugging Face:

https://huggingface.co/PKU-DS-LAB/FairyR1-32B

Figures

[Figure: FairyR1 model illustration]
[Figure: Benchmark comparison chart]
[Figure: Model pipeline diagram]

Code example
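
A standard Hugging Face transformers snippet for loading and querying the released checkpoint; the prompt and generation settings are generic placeholders, not the authors' recommended configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released checkpoint from Hugging Face. Generation settings
# below are generic defaults, not the authors' recommendations.
model_id = "PKU-DS-LAB/FairyR1-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user",
             "content": "Prove that the sum of two even integers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```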

Tags: model compression, large language model, distillation
Written by AI Frontier Lectures, a leading AI knowledge platform.