Can a 5% Parameter LLM Rival Full‑Scale Models? Inside FairyR1‑32B
The Peking University team unveils FairyR1‑32B, a 32‑billion‑parameter LLM built on DeepSeek‑R1‑Distill‑Qwen‑32B that uses self‑merging, multi‑teacher cross‑distillation, and lightweight distillation to reach math and code benchmark scores competitive with the full‑scale DeepSeek‑R1‑671B while using only about 5% of its parameters.
Model Overview
FairyR1-32B is a 32‑billion‑parameter large language model built on the DeepSeek‑R1‑Distill‑Qwen‑32B base. The authors employ a "split‑merge distillation" pipeline that combines fine‑tuning with model‑merging to achieve high task performance while using only about 5% of the parameters of the full‑scale DeepSeek‑R1‑671B.
Key Technical Innovations
Self‑merging
Multi‑teacher cross‑distillation
Lightweight (low‑cost) distillation
Data Construction
Two domain‑specific datasets were curated:
Mathematics: raw data from the AI‑MO/NuminaMath‑1.5 dataset. Answers were generated by several teacher models, then filtered for correctness, token length (2k–8k tokens per example), and chain‑of‑thought quality. After multi‑stage filtering, ≈6.6k math examples remained.
Programming: raw data from the open‑thoughts/OpenThoughts‑114k dataset. The same teacher‑generation and filtering pipeline was applied, with a length filter of 4k–8k tokens per example, yielding ≈3.8k code examples.
Both subsets were used to train specialist models.
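As a rough illustration of this kind of data pipeline, the sketch below keeps teacher-generated answers that fall inside a token-length window and pass a correctness check. The helper names, thresholds, and toy data are illustrative assumptions, not the authors' actual code.

```python
# Illustrative filtering sketch (not the authors' code): keep teacher-generated
# answers that are correct and whose length falls inside a token window
# (2k-8k for the math subset; the code subset used 4k-8k).
from typing import Callable, Dict, List


def filter_examples(
    examples: List[Dict[str, str]],
    count_tokens: Callable[[str], int],
    is_correct: Callable[[Dict[str, str]], bool],
    min_tokens: int = 2_000,
    max_tokens: int = 8_000,
) -> List[Dict[str, str]]:
    """Return only the examples that pass both the length and correctness filters."""
    kept = []
    for ex in examples:
        n = count_tokens(ex["answer"])
        if min_tokens <= n <= max_tokens and is_correct(ex):
            kept.append(ex)
    return kept


if __name__ == "__main__":
    # Toy stand-ins: a whitespace "tokenizer" and an exact-match answer check.
    toy = [
        {"question": "1+1?", "answer": "2 " * 2_500, "reference": "2"},
        {"question": "2+2?", "answer": "5", "reference": "4"},
    ]
    kept = filter_examples(
        toy,
        count_tokens=lambda s: len(s.split()),
        is_correct=lambda ex: ex["answer"].strip().split()[0] == ex["reference"],
    )
    print(f"kept {len(kept)} of {len(toy)} examples")
```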
Training Procedure
Two specialist models (one for math, one for code) were trained independently under identical hyperparameters (learning rate, batch size) for roughly five epochs. Training used standard mixed‑precision AdamW optimization on a single‑node GPU cluster (e.g., 8×A100 40 GB), with gradient accumulation to keep a reasonable effective batch size despite the long sequences.
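For concreteness, here is a hedged sketch of what such a specialist fine-tuning configuration might look like with Hugging Face's TrainingArguments API; the specific values are illustrative assumptions, not the paper's reported hyperparameters.

```python
# Illustrative specialist fine-tuning configuration (values are assumptions,
# not the exact settings used for FairyR1).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="fairy-math-specialist",   # hypothetical output path
    num_train_epochs=5,                   # "roughly five epochs"
    per_device_train_batch_size=1,        # long chain-of-thought sequences
    gradient_accumulation_steps=16,       # recover a larger effective batch size
    learning_rate=1e-5,                   # illustrative value
    bf16=True,                            # mixed-precision training
    optim="adamw_torch",                  # standard AdamW optimizer
    logging_steps=10,
)
```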
Model Merging
After independent training, the specialist checkpoints were merged into a single unified model using the Arcee Fusion merging tool, which aligns the weight spaces of the two specialists and computes a weighted average so that the capabilities of both domains are preserved.
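The snippet below is a minimal sketch of the underlying idea of weighted parameter averaging between two checkpoints that share an architecture; the actual merge was performed with the dedicated merging tool, and the alpha value here is purely illustrative.

```python
# Minimal sketch of weighted parameter averaging between two specialist
# checkpoints with identical architectures. This illustrates the idea only;
# it is not the merging tool the authors used.
import torch


def merge_state_dicts(math_sd, code_sd, alpha=0.5):
    """Return a new state dict: alpha * math + (1 - alpha) * code, per tensor."""
    merged = {}
    for name, math_param in math_sd.items():
        code_param = code_sd[name]
        merged[name] = alpha * math_param + (1.0 - alpha) * code_param
    return merged


if __name__ == "__main__":
    # Toy tensors standing in for model weights.
    math_sd = {"layer.weight": torch.ones(2, 2)}
    code_sd = {"layer.weight": torch.zeros(2, 2)}
    print(merge_state_dicts(math_sd, code_sd, alpha=0.5)["layer.weight"])
```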
Evaluation Results
FairyR1-32B was benchmarked on several public tasks and compared with the full‑scale DeepSeek‑R1‑671B and the distilled DeepSeek‑R1‑Distill‑Qwen‑32B baseline.
Math (AIME 2024): 80.4 (FairyR1‑32B) vs. 79.8 (DeepSeek‑R1‑671B) vs. 72.6 (distilled baseline)
Math (AIME 2025): 75.6 vs. 70.0 vs. 52.9
Code (LiveCodeBench): 67.7 vs. 65.9 vs. 57.2
Science QA (GPQA‑Diamond): 59.6 vs. 71.5 vs. 62.1
The results demonstrate that, with only ~5% of the original parameter count, FairyR1‑32B matches or slightly exceeds the full‑scale model on math and code benchmarks, while lagging on scientific QA.
Release
The model weights, training scripts, and merging code are publicly available on Hugging Face:
https://huggingface.co/PKU-DS-LAB/FairyR1-32B
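A typical way to load the released weights with the transformers library is sketched below; the dtype, device mapping, and generation settings are assumptions rather than official recommendations, and the chat-template usage assumes the released tokenizer ships one.

```python
# Illustrative usage of the released checkpoint; generation settings are
# assumptions, not official defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PKU-DS-LAB/FairyR1-32B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 32B parameters: use bf16 across available GPUs
    device_map="auto",
)

messages = [{"role": "user", "content": "Solve: what is 12 * 17?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```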
