How DistilQwen2.5‑R1 Boosts Small‑Model Reasoning with Innovative Knowledge Distillation

This article introduces the DistilQwen2.5‑R1 series, which uses a novel knowledge‑distillation pipeline (chain‑of‑thought data evaluation, improvement, and validation) to transfer deep reasoning abilities from large models such as DeepSeek‑R1 to compact models. The distilled models achieve strong performance on math, code, and scientific benchmarks, and open‑source checkpoints and deployment guides are provided for practical use.


Introduction

With the open‑source release of large‑scale deep‑reasoning models such as DeepSeek‑R1 and QwQ‑32B, the "large model + slow thinking" paradigm has become standard for extending LLM capabilities. However, deploying these models on resource‑constrained devices remains challenging, prompting the need for effective knowledge‑distillation techniques to transfer their capabilities to smaller models.

DistilQwen2.5‑R1 Series

Building on the DistilQwen2.5 family, the DistilQwen2.5‑R1 series incorporates a small amount of DeepSeek‑R1 chain‑of‑thought (CoT) data and applies a series of innovative distillation strategies to strengthen the deep‑thinking ability of compact models. Experiments show that models such as DistilQwen2.5‑R1‑7B outperform other open‑source distilled models, including OpenThinker‑7B.

Knowledge Distillation Technique

The training framework consists of two stages:

Stage 1 – CoT data "evaluation‑improvement‑validation" mechanism.

Stage 2 – Preference optimization using diverse cognitive‑trajectory data.

Because large and small models often follow different reasoning trajectories, directly distilling raw CoT data can be ineffective. The framework first assesses the difficulty of each CoT example (simple, medium, hard) based on whether a small model can follow the reasoning to reach the answer. Simple examples are expanded, hard examples are simplified, and medium examples are retained. The refined dataset is then used for supervised fine‑tuning (SFT) to endow the small model with a solid reasoning foundation.
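
As a concrete illustration, the per‑example refinement rule can be sketched as follows. This is a minimal sketch under stated assumptions, not the released pipeline: expand_cot() and simplify_cot() stand for rewriting requests sent to the teacher model, and the field names are illustrative.

# Minimal sketch of the difficulty-based refinement rule described above.
# expand_cot() and simplify_cot() are hypothetical helpers standing in for
# teacher-model rewriting prompts; they are assumptions, not released code.
def refine_cot(example, label, expand_cot, simplify_cot):
    if label == "simple":                              # too easy for the student
        example["cot"] = expand_cot(example["cot"])    # add intermediate steps
    elif label == "hard":                              # too hard to follow
        example["cot"] = simplify_cot(example["cot"])  # shorten / prune the chain
    return example                                     # "medium" chains are kept as-is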

CoT Data "Evaluation‑Improvement‑Validation" Mechanism

An LLM‑as‑a‑Judge paradigm evaluates each CoT chain, improves it according to its difficulty level, and re‑validates the result. Only data that become medium‑difficulty after improvement are kept for training.
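
A hedged sketch of how the full loop might wrap the refinement rule above is shown below; judge_difficulty() stands in for the LLM‑as‑a‑Judge call that labels a chain "simple", "medium", or "hard", and its prompt is an assumption rather than a detail from the article.

# Sketch of the evaluate-improve-validate loop, reusing refine_cot() from above.
# judge_difficulty() represents the LLM-as-a-Judge call; its prompt is assumed.
def evaluate_improve_validate(raw_examples, judge_difficulty, expand_cot, simplify_cot):
    kept = []
    for example in raw_examples:
        label = judge_difficulty(example)                               # evaluation
        example = refine_cot(example, label, expand_cot, simplify_cot)  # improvement
        if judge_difficulty(example) == "medium":                       # validation (re-judge)
            kept.append(example)                                        # only medium-difficulty data are kept for SFT
    return kept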

Preference Optimization with Multiple Cognitive Trajectories

In Stage 2, the refined medium‑difficulty CoT data are paired with deliberately corrupted (incorrect) chains to form preference pairs. These pairs, which deviate from correct reasoning to varying degrees, are used with the DPO algorithm to further strengthen the small model's reasoning ability.
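
The pairs follow the usual DPO format of (prompt, chosen, rejected), with the refined chain as "chosen" and a corrupted chain as "rejected". Below is a minimal sketch of the standard DPO objective over such pairs; the corrupt_cot() helper and the beta value are illustrative assumptions, not details taken from the article.

import torch.nn.functional as F

def build_preference_pair(example, corrupt_cot):
    # "chosen" is the refined medium-difficulty chain; "rejected" is a deliberately
    # corrupted version of it (corrupt_cot() is a hypothetical helper).
    return {
        "prompt": example["question"],
        "chosen": example["cot"],
        "rejected": corrupt_cot(example["cot"]),
    }

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO objective: push the policy to prefer the correct chain over the
    # corrupted one, relative to a frozen reference model. Inputs are summed token
    # log-probabilities for each response; beta=0.1 is illustrative only.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

In practice, such (prompt, chosen, rejected) triples can be fed to an off‑the‑shelf DPO trainer; the loss above is the component that encodes the preference between the two trajectories.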

Model Evaluation

The DistilQwen2.5‑R1 models are evaluated on four benchmarks covering mathematics (AIME2024, MATH‑500), code (LiveCodeBench V2), and scientific QA (GPQA‑Diamond). Across the 3B, 7B, 14B, and 32B parameter scales, the series consistently outperforms the original Qwen2.5 models and other state‑of‑the‑art distilled models; in particular, the 7B variant achieves higher scores than OpenThinker‑7B while using only open‑source training data.

Pass@k experiments, in which multiple candidate answers are sampled per question, show that increasing the number of generated answers substantially improves accuracy, allowing the 7B model to approach the performance of the 32B model.
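
For reference, Pass@k is the probability that at least one of k sampled answers is correct. The article does not give the estimator it uses; the snippet below shows the common unbiased estimator with purely illustrative numbers.

from math import comb

def pass_at_k(n, c, k):
    # Unbiased Pass@k estimator: generate n answers per question, count c correct,
    # and estimate the chance that a random subset of k answers contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers (not benchmark results): with 30 correct out of 100 samples,
# Pass@1 is 0.30 while Pass@8 is roughly 0.95 -- hence the large gains from
# sampling more answers per question.
print(pass_at_k(100, 30, 1))   # 0.30
print(pass_at_k(100, 30, 8))   # ~0.95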

Model Download and Usage

All checkpoints are publicly available on Hugging Face and ModelScope. Example code for loading and inference on Alibaba Cloud PAI (requiring transformers>=4.37.0) is provided:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the distilled checkpoint and its tokenizer
model_name = "alibaba-pai/DistilQwen2.5-R1-7B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Replace the placeholder with your own question
prompt = "xxxxx"
messages = [
    {"role": "system", "content": "Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions..."},
    {"role": "user", "content": prompt},
]

# Apply the chat template and generate a response
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)

# Keep only the newly generated tokens and decode them
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Downloading via huggingface_hub is also illustrated for all model sizes.
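
For example, a full checkpoint can be fetched with huggingface_hub's snapshot_download; the target directory below is arbitrary, and the other sizes are available under the repo IDs listed on the model pages.

from huggingface_hub import snapshot_download

# Download all files of a checkpoint for local/offline use
snapshot_download(
    repo_id="alibaba-pai/DistilQwen2.5-R1-7B",
    local_dir="./DistilQwen2.5-R1-7B",
)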

Conclusion and Future Work

The DistilQwen2.5‑R1 series demonstrates that a small amount of high‑quality CoT data, combined with systematic evaluation‑improvement‑validation and preference‑based optimization, can endow compact models with strong deep‑reasoning capabilities while significantly reducing deployment costs. Future work will expand the family to cover more domains and scales, further promoting cost‑effective LLM adoption.

Diagram of the DistilQwen2.5‑R1 distillation framework
Tags: model compression, large language models, AI inference, knowledge distillation, benchmark evaluation
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
