How DistilQwen2.5 Boosts LLM Efficiency with Dual‑Stage Knowledge Distillation
This article introduces DistilQwen2.5, a lightweight LLM series built on Qwen2.5 that uses a novel dual-stage knowledge distillation framework, instruction-data optimization, and parameter-fusion techniques to achieve higher performance while drastically reducing computational cost and deployment overhead.
Introduction
High computational cost and complexity limit the adoption of large language models (LLMs) in resource‑constrained environments such as mobile and edge devices. To retain model performance while improving efficiency and lowering deployment cost, the DistilQwen2.5 series was released.
Distillation Framework
DistilQwen2.5 employs a dual-stage knowledge distillation pipeline: an instruction data collection layer feeds two optimization stages, black-box instruction-following optimization and white-box knowledge fusion.
Instruction Data Collection Layer
Large-scale, high-quality instruction data are gathered from public datasets (Magpie, OpenHermes, MAmmoTH2) and private synthetic data. Chinese-English balance is achieved by expanding the data with Qwen-Max, and a task classifier (trained on 33 task types with 30K examples) provides fine-grained task labels. An LLM-as-Judge then evaluates instruction difficulty using a Model-Fit-Difficulty (MFD) score, filtering out low-value samples.
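To make the MFD filtering step concrete, the sketch below scores each candidate instruction with a judge model and keeps only samples the student does not already fit well. The judge prompt, the 1-10 score scale, and the thresholds are illustrative assumptions rather than the released implementation:

def mfd_score(judge_llm, instruction, student_answer, reference):
    # Ask the judge model how far the student's answer falls short of the reference.
    # Prompt wording and the score scale are assumptions for illustration.
    prompt = (
        "Rate from 1 (student answer matches the reference) to 10 (student answer fails badly).\n"
        f"Instruction: {instruction}\nStudent answer: {student_answer}\n"
        f"Reference answer: {reference}\nScore:"
    )
    return float(judge_llm(prompt).strip())

def filter_by_mfd(samples, judge_llm, low=3.0, high=9.0):
    # Keep samples that are neither trivial for the student nor hopelessly hard.
    return [s for s in samples
            if low <= mfd_score(judge_llm, s["instruction"], s["student_answer"], s["reference"]) <= high]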
Instruction‑Following Optimization (Black‑Box Distillation)
Instruction expansion: a teacher model generates semantically similar new instructions while preserving the original task type.
Instruction selection: an agent filters instructions based on information content, usefulness, and generalization potential.
Instruction rewriting: an agent rewrites selected instructions, encouraging chain‑of‑thought outputs for complex tasks.
The student model then learns from the enhanced instruction-response pairs without accessing the teacher's internal representations, as sketched below.
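In practice, the black-box stage reduces to supervised fine-tuning of the student on teacher-generated responses. A minimal sketch with HuggingFace transformers follows; the student checkpoint name and data handling are assumptions for illustration, and a production setup would batch the data and mask prompt tokens so only response tokens contribute to the loss:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed student checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def sft_step(instruction, teacher_response):
    # One supervised step: the student learns to reproduce the teacher's response.
    messages = [{"role": "user", "content": instruction},
                {"role": "assistant", "content": teacher_response}]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    batch = tokenizer(text, return_tensors="pt")
    # Plain causal-LM loss over the sequence; only teacher-written text supervises the student.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()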
Knowledge Fusion Optimization (White‑Box Distillation)
White-box distillation aligns the teacher's logit distribution with the student's, providing richer supervision than the black-box token-level loss. To make this scalable, only the top-k (k ≈ 10) token probabilities are stored offline, token alignment resolves vocabulary mismatches between teacher and student, and the divergence loss is computed only over these top-k entries.
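Conceptually, the loss at each position only needs the teacher's stored top-k probabilities and the student's log-probabilities at the matching, vocabulary-aligned token ids. Below is a minimal sketch of such a top-k distillation loss; the tensor shapes and the renormalization over kept tokens are assumptions about how the stored logits are consumed, not the exact released formulation:

import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits, teacher_topk_ids, teacher_topk_probs):
    # student_logits:     [seq_len, vocab_size] from the student forward pass
    # teacher_topk_ids:   [seq_len, k] token ids stored offline for the teacher (k ≈ 10)
    # teacher_topk_probs: [seq_len, k] the corresponding teacher probabilities
    log_probs = F.log_softmax(student_logits, dim=-1)
    student_topk = torch.gather(log_probs, -1, teacher_topk_ids)
    # Renormalize the teacher mass over the kept tokens so it forms a distribution.
    teacher = teacher_topk_probs / teacher_topk_probs.sum(-1, keepdim=True)
    # Cross-entropy against the truncated teacher distribution (equals KL up to a constant).
    return -(teacher * student_topk).sum(-1).mean()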
Evaluation
DistilQwen2.5 was evaluated on multiple instruction-following benchmarks (AlpacaEval 2.0, MT-Bench, IFEval) across four model sizes (0.5B, 1.5B, 3B, and 7B). Results show consistent gains over the original Qwen2.5 models in overall win rates and in fine-grained abilities such as code generation, math reasoning, and role-play.
Comparisons with other LLM families (Llama, Phi-3, Mistral) demonstrate that DistilQwen2.5 offers a superior performance-to-parameter ratio, often matching or surpassing models with twice the parameter count.
Knowledge-fusion experiments show that compressing the teacher logits to the top-k probabilities speeds up the distillation process by at least 4× and reduces logit storage to roughly 1/1000 of the original size.
Practical Deployment
DistilQwen2.5 checkpoints are open‑sourced on HuggingFace and ModelScope. Example code for loading the 7 B‑Instruct variant on Alibaba Cloud PAI‑DSW (transformers ≥ 4.37.0) is provided:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the distilled checkpoint; torch_dtype="auto" and device_map="auto" pick
# a suitable precision and device (device_map requires the accelerate package).
model_name = "alibaba-pai/DistilQwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Chinese prompt: "Please give me a brief introduction to West Lake in Hangzhou."
prompt = "请给我简单介绍一下杭州西湖。"
messages = [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate up to 512 new tokens and strip the prompt so only the reply is decoded.
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
response = tokenizer.batch_decode([output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)], skip_special_tokens=True)[0]
print(response)
The dataset DistilQwen_100K (100K JSON records covering math, code, knowledge Q&A, and creative generation) is also released to mitigate catastrophic forgetting during fine-tuning.
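To reduce the risk of catastrophic forgetting, the released data can be blended into a downstream fine-tuning corpus. A minimal sketch using the HuggingFace datasets library; the dataset id, file name, and mixing ratio are assumptions to verify against the published resources:

from datasets import load_dataset, concatenate_datasets

# The dataset id below is an assumption; check HuggingFace/ModelScope for the published name.
general = load_dataset("alibaba-pai/DistilQwen_100k", split="train")
domain = load_dataset("json", data_files="my_task_data.json", split="train")

# Keep a slice of general instructions alongside the domain data so the fine-tuned model
# retains broad skills (math, code, knowledge Q&A, creative generation).
mixed = concatenate_datasets([domain, general.shuffle(seed=42).select(range(20_000))])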
Conclusion and Future Work
DistilQwen2.5 demonstrates that LLMs can be substantially compressed without sacrificing capability, enabling broader deployment in low‑resource scenarios. Future directions include further model optimization for specialized domains such as deep reasoning and continued expansion of open‑source resources.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
