Can a 5% Parameter LLM Rival Full‑Scale Models? Inside FairyR1‑32B

The Peking University team (PKU-DS-LAB) unveils FairyR1-32B, a 32-billion-parameter LLM built on DeepSeek-R1-Distill-Qwen-32B that uses self-merging, multi-teacher cross-distillation, and lightweight distillation to reach competitive math and code benchmark scores with only about 5% of the parameters of the full-scale DeepSeek-R1-671B.


Model Overview

FairyR1-32B is a 32‑billion‑parameter large language model built on the DeepSeek‑R1‑Distill‑Qwen‑32B base. The authors employ a "split‑merge distillation" pipeline that combines fine‑tuning with model‑merging to achieve high task performance while using only about 5% of the parameters of the full‑scale DeepSeek‑R1‑671B.

Key Technical Innovations

Self‑merging

Multi‑teacher cross‑distillation

Lightweight (low‑cost) distillation

Data Construction

Two domain‑specific datasets were curated:

Mathematics: raw problems drawn from the AI-MO/NuminaMath-1.5 dataset. Answers were generated by several teacher models, then filtered for correctness, token length (2k–8k tokens per example), and chain-of-thought quality; a simplified sketch of this filter follows the list. After multi-stage filtering, ≈6.6k math examples remained.

Programming: raw problems drawn from the open-thoughts/OpenThoughts-114k dataset. The same teacher-generation and filtering pipeline was applied, with a length filter of 4k–8k tokens per example, yielding ≈3.8k code examples.

Both subsets were used to train specialist models.
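
As a rough, self-contained sketch of that filtering stage: only the token windows (2k–8k for math, 4k–8k for code) come from the article; the exact correctness checker, the chain-of-thought heuristic, and the function names below are assumptions.

```python
from transformers import AutoTokenizer

# A simplified stand-in for the multi-stage filter described above.
# Only the token windows come from the article; the answer-matching
# rule and the chain-of-thought heuristic are assumptions.
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")

def keep_example(response: str, reference: str,
                 min_tok: int, max_tok: int) -> bool:
    """Keep a teacher response only if it passes all three filters."""
    # 1) Correctness: the final boxed answer must match the reference.
    if f"\\boxed{{{reference}}}" not in response:
        return False
    # 2) Length: the token count must fall inside the domain window.
    n_tokens = len(tokenizer.encode(response))
    if not (min_tok <= n_tokens <= max_tok):
        return False
    # 3) Chain of thought: require an explicit reasoning trace.
    return "</think>" in response

# Hypothetical candidate pool of (response, reference_answer) pairs.
# (This toy example is far too short to survive the length filter.)
candidates = [("<think>2 + 2 = 4, so ...</think> \\boxed{4}", "4")]
math_subset = [c for c in candidates if keep_example(*c, 2_000, 8_000)]
```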

Training Procedure

Two specialist models (one for math, one for code) were trained independently under identical hyper‑parameters (learning rate, batch size) for roughly five epochs. The training used standard mixed‑precision AdamW optimization on a single‑node GPU cluster (e.g., 8×A100 40 GB), with gradient accumulation to fit the long sequences.
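
A minimal PyTorch sketch of that loop, assuming bf16 autocast for the mixed precision (standard on A100s) and an HF-style model that returns a loss; all hyper-parameter values are placeholders, not the authors' settings.

```python
import torch

# Minimal sketch of the specialist training loop: AdamW, bf16 mixed
# precision, and gradient accumulation to fit long sequences. All
# hyper-parameter values below are illustrative placeholders.
def train_specialist(model, loader, epochs=5, accum_steps=8, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    model.train()
    for _ in range(epochs):                    # "roughly five epochs"
        for step, batch in enumerate(loader):
            with torch.autocast("cuda", dtype=torch.bfloat16):
                loss = model(**batch).loss / accum_steps
            loss.backward()
            # Effective batch size = accum_steps x micro-batch size.
            if (step + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
```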

Model Merging

After independent training, the specialist checkpoints were merged into a single unified model using Arcee Fusion, a merging method that aligns weight spaces and performs a weighted average to preserve the capabilities of both domains.
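
Arcee Fusion itself fuses weights more selectively, but a plain weighted average of the two specialists' state dicts gives the flavor of the step; the checkpoint paths and the 0.5/0.5 weights below are assumptions for illustration, not the authors' configuration.

```python
import torch

# Simplified stand-in for the merging step: a uniform weighted average
# of the two specialist checkpoints. The real Arcee Fusion method is
# more selective; paths and weights here are illustrative assumptions.
def merge_state_dicts(path_a: str, path_b: str,
                      w_a: float = 0.5, w_b: float = 0.5) -> dict:
    sd_a = torch.load(path_a, map_location="cpu")
    sd_b = torch.load(path_b, map_location="cpu")
    assert sd_a.keys() == sd_b.keys(), "specialists must share an architecture"
    return {k: w_a * sd_a[k] + w_b * sd_b[k] for k in sd_a}

merged = merge_state_dicts("math_specialist.pt", "code_specialist.pt")
torch.save(merged, "fairy_merged.pt")
```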

Evaluation Results

FairyR1-32B was benchmarked on several public tasks and compared with the full‑scale DeepSeek‑R1‑671B and the distilled DeepSeek‑R1‑Distill‑Qwen‑32B baseline.

Benchmark                FairyR1-32B   DeepSeek-R1-671B   R1-Distill-Qwen-32B
AIME 2024 (math)         80.4          79.8               72.6
AIME 2025 (math)         75.6          70.0               52.9
LiveCodeBench (code)     67.7          65.9               57.2
GPQA-Diamond (science)   59.6          71.5               62.1

The results demonstrate that, with only ~5% of the original parameter count (32 B of 671 B ≈ 4.8%), FairyR1-32B matches or slightly exceeds the full-scale model on the math and code benchmarks, while lagging on scientific QA (GPQA-Diamond).

Release

The model weights, training scripts, and merging code are publicly available on Hugging Face:

https://huggingface.co/PKU-DS-LAB/FairyR1-32B

Figures

[Figure: FairyR1 model illustration]
[Figure: Benchmark comparison chart]
[Figure: Model pipeline diagram]

Code example
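
A standard Hugging Face transformers snippet for loading and querying the released checkpoint; the prompt and generation settings are generic placeholders, not the authors' recommended configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released checkpoint from Hugging Face. Generation settings
# below are generic defaults, not the authors' recommendations.
model_id = "PKU-DS-LAB/FairyR1-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user",
             "content": "Prove that the sum of two even integers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```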

Tags: model compression, large language model, distillation
Written by AI Frontier Lectures, a leading AI knowledge platform.