How DistilQwen2 Boosts LLM Performance with Knowledge Distillation
This article introduces DistilQwen2, a lightweight language model derived from Qwen2 via knowledge distillation, detailing its data collection, instruction‑data optimization, training strategies, extensive benchmark evaluations, and practical deployment guides for developers and enterprises.
Background
Large language models (LLMs) such as Qwen2 have become research hotspots, but their high computational cost limits deployment on resource‑constrained devices. Knowledge distillation offers a way to compress models while preserving performance.
DistilQwen2 Overview
DistilQwen2 is a parameter‑efficient LLM built on Qwen2 using instruction‑following knowledge distillation. By analyzing Qwen2, enriching instruction data, and exploring multiple distillation algorithms, the model achieves stronger instruction compliance with far fewer parameters, making it suitable for mobile and edge scenarios.
Data Collection and Diversity
We gathered public datasets (Magpie, OpenHermes, MAmmoTH2) and private synthetic data. Instruction data are bilingual (Chinese and English) and are scored for difficulty with an LLM-as-a-Judge pipeline that compares teacher-model answers against student-model answers to compute a Model-Fit-Difficulty (MFD) score; low-value instructions are filtered out.
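As a rough illustration of the filtering idea (the judging prompts and threshold are not published in this article, and judge_score is a hypothetical helper), an MFD filter might look like:

def mfd_score(instruction, teacher_answer, student_answer, judge_score):
    # MFD here is the gap between how the judge rates the teacher's answer
    # and the student's answer: a large gap means the instruction can still
    # teach the student something new.
    return judge_score(instruction, teacher_answer) - judge_score(instruction, student_answer)

def filter_instructions(samples, judge_score, threshold=1.0):
    # Drop low-value instructions: those the student already answers about
    # as well as the teacher does. The threshold is illustrative.
    return [s for s in samples
            if mfd_score(s["instruction"], s["teacher"], s["student"], judge_score) > threshold]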
Three diversity dimensions are considered:
Task diversity: instructions are labeled with 33 task types by a classifier trained on roughly 30k annotated samples (86% agreement with ChatGPT labels, 93% accuracy against human annotation).
Length diversity: sampling weights drawn from a normal distribution over instruction length keep the length distribution balanced, long tail included (see the sketch after this list).
Language diversity: Chinese data are expanded with Qwen-max to match the English volume.
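A minimal sketch of the length-balancing step referenced above; the mean, standard deviation, and use of character length are assumptions for illustration, not published values:

import numpy as np

def sample_by_length(samples, n, mean=200.0, std=120.0, seed=0):
    # Weight each instruction by a normal density over its length so that
    # mid-range lengths dominate while the long tail is still represented.
    rng = np.random.default_rng(seed)
    lengths = np.array([len(s["instruction"]) for s in samples], dtype=float)
    weights = np.exp(-0.5 * ((lengths - mean) / std) ** 2)
    weights /= weights.sum()
    idx = rng.choice(len(samples), size=n, replace=False, p=weights)
    return [samples[i] for i in idx]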
Instruction Data Optimization
Teacher models expand the data via prompting; multi-turn dialogues are constructed by prompting the teacher to continue from its previous answer. Teacher responses are then optimized for format, style, and length, and a self-distillation step (using Qwen2-7B-Instruct) reduces the distribution gap between teacher and student.
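A hedged sketch of the multi-turn construction, where teacher_chat is a hypothetical function mapping a message list to the teacher's next reply and the continuation prompt is illustrative:

def build_multi_turn(instruction, teacher_chat, turns=3):
    messages = [{"role": "user", "content": instruction}]
    for _ in range(turns):
        answer = teacher_chat(messages)
        messages.append({"role": "assistant", "content": answer})
        # Force the teacher to continue from its own previous answer.
        messages.append({"role": "user",
                         "content": "Please continue and expand on your previous answer."})
    return messages[:-1]  # drop the trailing synthetic user turn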
Distillation Training
Two training stages are used: supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO). DPO treats teacher outputs as preferred and student outputs as rejected under a Bradley-Terry model, with a length-normalization term added to avoid overly short replies.
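A minimal sketch of that objective, assuming the normalization divides each response's summed token log-probability by its token length (the exact form used for DistilQwen2 is not spelled out here); teacher answers play the chosen role and student answers the rejected role:

import torch.nn.functional as F

def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected,
             len_chosen, len_rejected, beta=0.1):
    # Length-normalized implicit rewards relative to the frozen reference model.
    r_chosen = beta * (logp_policy_chosen - logp_ref_chosen) / len_chosen
    r_rejected = beta * (logp_policy_rejected - logp_ref_rejected) / len_rejected
    # Bradley-Terry preference probability, maximized via the sigmoid log-loss.
    return -F.logsigmoid(r_chosen - r_rejected).mean()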
Evaluation – Instruction Following
DistilQwen2-1.5B-Instruct and DistilQwen2-7B-Instruct were benchmarked on AlpacaEval 2.0 (length-controlled), MT-Bench (single- and multi-turn), and IFEval (loose and strict). Both models consistently outperformed the original Qwen2-Instruct models of the same size.
Model AlpacaEval2.0 MT‑Bench MT‑Bench(single) IFEval(loose) IFEval(strict)
Qwen2‑1.5B‑Instruct 5.22 5.85 6.45 41.37 28.10
DistilQwen2‑1.5B‑Instruct 8.28 6.42 7.12 49.76 36.04
Qwen2‑7B‑Instruct 24.33 8.27 8.68 66.67 52.31
DistilQwen2‑7B‑Instruct 25.35 8.40 9.03 71.46 60.26
Evaluation – General Ability
We also measured knowledge (MMLU, CEval, CMMLU) and reasoning/coding (GSM8K, HumanEval, MBPP) abilities. The DistilQwen2 models matched or exceeded the original Qwen2 scores on almost every task and achieved higher averages at both sizes.
Model MMLU CEval CMMLU GSM8K HumanEval MBPP Avg
Qwen2‑1.5B‑Instruct 55.58 68.87 69.70 59.06 46.34 30.40 54.99
DistilQwen2‑1.5B‑Instruct 56.07 69.24 69.78 60.27 51.83 32.80 56.66
Qwen2‑7B‑Instruct 69.77 81.51 80.29 86.66 78.05 53.04 74.89
DistilQwen2‑7B‑Instruct 69.80 81.28 81.20 86.66 84.15 56.00 76.52
Practical Deployment on Alibaba Cloud PAI
Using the transformers library (v≥4.37.0), the model can be loaded and invoked on PAI‑DSW. Example code:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "alibaba-pai/DistilQwen2-1.5B-Instruct"
# torch_dtype="auto" and device_map="auto" pick a suitable precision and device.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "请给我简单介绍一下杭州西湖。"  # "Please give me a brief introduction to West Lake in Hangzhou."
messages = [{"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}]
# Render the chat template and append the generation prompt for the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated reply is decoded.
generated_ids = [output_ids[len(input_ids):]
                 for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
The checkpoints are publicly available on HuggingFace and ModelScope under alibaba-pai/DistilQwen2-1.5B-Instruct and alibaba-pai/DistilQwen2-7B-Instruct.
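Where direct HuggingFace access is slow, the checkpoint can be fetched from ModelScope first, for example with snapshot_download (a small sketch; verify the repo id in your region before use):

from modelscope import snapshot_download

# Download the checkpoint to a local directory, then point model_name in the
# transformers code above at this path instead of the hub id.
model_dir = snapshot_download("alibaba-pai/DistilQwen2-1.5B-Instruct")
print(model_dir)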
Conclusion and Future Work
DistilQwen2 demonstrates that knowledge distillation can deliver high‑quality, instruction‑following LLMs with a fraction of the parameters, enabling efficient deployment on mobile and edge devices. Future directions include expanding distillation algorithms, refining fine‑tuning strategies for specific tasks, and enriching open‑source tooling.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.