How DistilQwen2 Boosts LLM Performance with Knowledge Distillation
This article introduces DistilQwen2, a lightweight language model derived from Qwen2 via knowledge distillation, detailing its data collection, instruction‑data optimization, training strategies, extensive benchmark evaluations, and practical deployment guides for developers and enterprises.
Background
Large language models (LLMs) such as Qwen2 have become research hotspots, but their high computational cost limits deployment on resource‑constrained devices. Knowledge distillation offers a way to compress models while preserving performance.
DistilQwen2 Overview
DistilQwen2 is a parameter‑efficient LLM built on Qwen2 using instruction‑following knowledge distillation. By analyzing Qwen2, enriching instruction data, and exploring multiple distillation algorithms, the model achieves stronger instruction compliance with far fewer parameters, making it suitable for mobile and edge scenarios.
Data Collection and Diversity
We gathered public datasets (Magpie, OpenHermes, MAmmoTH2) and private synthetic data. Instruction data are bilingual (Chinese and English) and are scored for difficulty with an LLM-as-a-Judge pipeline that compares teacher-model answers against student-model answers to compute a Model-Fit-Difficulty (MFD) score; low-value instructions are filtered out.
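As a rough illustration of the filtering idea (the judging prompts and threshold are not published in this article, and judge_score is a hypothetical helper), an MFD filter might look like:

def mfd_score(instruction, teacher_answer, student_answer, judge_score):
    # MFD here is the gap between how the judge rates the teacher's answer
    # and the student's answer: a large gap means the instruction can still
    # teach the student something new.
    return judge_score(instruction, teacher_answer) - judge_score(instruction, student_answer)

def filter_instructions(samples, judge_score, threshold=1.0):
    # Drop low-value instructions: those the student already answers about
    # as well as the teacher does. The threshold is illustrative.
    return [s for s in samples
            if mfd_score(s["instruction"], s["teacher"], s["student"], judge_score) > threshold]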
Three diversity dimensions are considered:
Task diversity: instructions are labeled with 33 task types by a classifier trained on roughly 30k annotated samples (86% agreement with ChatGPT labels, 93% accuracy against human annotation).
Length diversity: sampling weights drawn from a normal distribution over instruction length keep the length distribution balanced, long tail included (see the sketch after this list).
Language diversity: Chinese data are expanded with Qwen-max to match the English volume.
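A minimal sketch of the length-balancing step referenced above; the mean, standard deviation, and use of character length are assumptions for illustration, not published values:

import numpy as np

def sample_by_length(samples, n, mean=200.0, std=120.0, seed=0):
    # Weight each instruction by a normal density over its length so that
    # mid-range lengths dominate while the long tail is still represented.
    rng = np.random.default_rng(seed)
    lengths = np.array([len(s["instruction"]) for s in samples], dtype=float)
    weights = np.exp(-0.5 * ((lengths - mean) / std) ** 2)
    weights /= weights.sum()
    idx = rng.choice(len(samples), size=n, replace=False, p=weights)
    return [samples[i] for i in idx]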
Instruction Data Optimization
Teacher models expand the data via prompting; multi-turn dialogues are constructed by prompting the teacher to continue from its previous answer. Teacher responses are then optimized for format, style, and length, and a self-distillation step (using Qwen2-7B-Instruct) reduces the distribution gap between teacher and student.
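A hedged sketch of the multi-turn construction, where teacher_chat is a hypothetical function mapping a message list to the teacher's next reply and the continuation prompt is illustrative:

def build_multi_turn(instruction, teacher_chat, turns=3):
    messages = [{"role": "user", "content": instruction}]
    for _ in range(turns):
        answer = teacher_chat(messages)
        messages.append({"role": "assistant", "content": answer})
        # Force the teacher to continue from its own previous answer.
        messages.append({"role": "user",
                         "content": "Please continue and expand on your previous answer."})
    return messages[:-1]  # drop the trailing synthetic user turn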
Distillation Training
Two training stages are used: supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO). DPO treats teacher outputs as preferred and student outputs as rejected under a Bradley-Terry model, with a length-normalization term added to avoid overly short replies.
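A minimal sketch of that objective, assuming the normalization divides each response's summed token log-probability by its token length (the exact form used for DistilQwen2 is not spelled out here); teacher answers play the chosen role and student answers the rejected role:

import torch.nn.functional as F

def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected,
             len_chosen, len_rejected, beta=0.1):
    # Length-normalized implicit rewards relative to the frozen reference model.
    r_chosen = beta * (logp_policy_chosen - logp_ref_chosen) / len_chosen
    r_rejected = beta * (logp_policy_rejected - logp_ref_rejected) / len_rejected
    # Bradley-Terry preference probability, maximized via the sigmoid log-loss.
    return -F.logsigmoid(r_chosen - r_rejected).mean()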
Evaluation – Instruction Following
DistilQwen2-1.5B-Instruct and DistilQwen2-7B-Instruct were benchmarked on AlpacaEval 2.0 (length-controlled), MT-Bench (single- and multi-turn), and IFEval (loose and strict). Both models consistently outperformed the original Qwen2-Instruct models of the same size.
Model AlpacaEval2.0 MT‑Bench MT‑Bench(single) IFEval(loose) IFEval(strict)
Qwen2‑1.5B‑Instruct 5.22 5.85 6.45 41.37 28.10
DistilQwen2‑1.5B‑Instruct 8.28 6.42 7.12 49.76 36.04
Qwen2‑7B‑Instruct 24.33 8.27 8.68 66.67 52.31
DistilQwen2‑7B‑Instruct 25.35 8.40 9.03 71.46 60.26
Evaluation – General Ability
We also measured knowledge (MMLU, CEval, CMMLU) and reasoning/coding (GSM8K, HumanEval, MBPP) abilities. The DistilQwen2 models matched or exceeded the original Qwen2 scores on almost every task and achieved higher averages at both sizes.
Model MMLU CEval CMMLU GSM8K HumanEval MBPP Avg
Qwen2‑1.5B‑Instruct 55.58 68.87 69.70 59.06 46.34 30.40 54.99
DistilQwen2‑1.5B‑Instruct 56.07 69.24 69.78 60.27 51.83 32.80 56.66
Qwen2‑7B‑Instruct 69.77 81.51 80.29 86.66 78.05 53.04 74.89
DistilQwen2‑7B‑Instruct 69.80 81.28 81.20 86.66 84.15 56.00 76.52
Practical Deployment on Alibaba Cloud PAI
Using the transformers library (v≥4.37.0), the model can be loaded and invoked on PAI‑DSW. Example code:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "alibaba-pai/DistilQwen2-1.5B-Instruct"
# torch_dtype="auto" and device_map="auto" pick a suitable precision and device.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "请给我简单介绍一下杭州西湖。"  # "Please give me a brief introduction to West Lake in Hangzhou."
messages = [{"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}]
# Render the chat template and append the generation prompt for the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated reply is decoded.
generated_ids = [output_ids[len(input_ids):]
                 for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
The checkpoints are publicly available on HuggingFace and ModelScope under alibaba-pai/DistilQwen2-1.5B-Instruct and alibaba-pai/DistilQwen2-7B-Instruct.
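Where direct HuggingFace access is slow, the checkpoint can be fetched from ModelScope first, for example with snapshot_download (a small sketch; verify the repo id in your region before use):

from modelscope import snapshot_download

# Download the checkpoint to a local directory, then point model_name in the
# transformers code above at this path instead of the hub id.
model_dir = snapshot_download("alibaba-pai/DistilQwen2-1.5B-Instruct")
print(model_dir)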
Conclusion and Future Work
DistilQwen2 demonstrates that knowledge distillation can deliver high‑quality, instruction‑following LLMs with a fraction of the parameters, enabling efficient deployment on mobile and edge devices. Future directions include expanding distillation algorithms, refining fine‑tuning strategies for specific tasks, and enriching open‑source tooling.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.