Unlocking Small LLM Power: Variable‑Length Chain Distillation with DistillQwen‑ThoughtY
This article introduces a variable‑length chain‑of‑thought distillation technique built on Alibaba Cloud PAI’s EasyDistill toolkit, presents the high‑quality OmniThought‑0528 dataset, details the training of the DistillQwen‑ThoughtY 4B/8B/32B models, and provides code and usage examples for researchers and practitioners.
Background
The rapid breakthroughs of large language models (LLMs) have transformed natural language processing, but long chain‑of‑thought (CoT) reasoning models such as OpenAI o1 and DeepSeek‑R1 face two practical challenges: massive model size leading to high deployment cost, and redundant reasoning paths that reduce efficiency and accuracy.
Variable‑Length CoT Distillation with EasyDistill
Using Alibaba Cloud AI Platform (PAI)'s open‑source EasyDistill toolkit, we propose a variable‑length CoT distillation method that compresses the reasoning ability of large teachers into compact student models. Based on this method, we released the largest high‑quality variable‑length CoT dataset, OmniThought‑0528, and the DistillQwen‑ThoughtY series of distilled models.
OmniThought‑0528 Dataset Construction
Building on the earlier OmniThought dataset, we collected reasoning problems from multiple public sources covering mathematics, coding, and science. For each problem we generated multiple CoT answers using DeepSeek‑R1 and QwQ‑32B as teachers, filtered them with an "LLM‑as‑a‑judge" pipeline, and annotated each chain with Reasoning Verbosity (RV) and Cognitive Difficulty (CD) scores. The final dataset contains 365,000 entries, each formatted as JSON with a question field and a reasoning field (including Cognitive_Difficulty, Reasoning_Verbosity, and full_response).
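To make the format concrete, here is a minimal sketch of what one record might look like and how the CD score could be used to filter chains. The field names come from the description above; the exact nesting, score ranges, and example values are assumptions, not the official schema.

```python
import json

# Illustrative OmniThought-0528-style record; the nesting and the
# 1-10 score ranges are assumptions for demonstration only.
record = {
    "question": "What is the sum of the first 10 positive integers?",
    "reasoning": [
        {
            "Cognitive_Difficulty": 2,   # assumed difficulty score
            "Reasoning_Verbosity": 3,    # assumed verbosity score
            "full_response": "<think>1+2+...+10 = 10*11/2 = 55</think>\nThe answer is 55.",
        }
    ],
}

def select_chains(rec, max_cd):
    """Keep only chains at or below a target cognitive difficulty,
    e.g. to match the capacity of a small student model."""
    return [c for c in rec["reasoning"] if c["Cognitive_Difficulty"] <= max_cd]

print(json.dumps(select_chains(record, max_cd=5), indent=2))
```

Filtering by RV/CD like this is what lets the distillation pipeline match chain length and difficulty to the student model's capacity.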
Generating and Scoring CoT Data with EasyDistill
EasyDistill provides a simple workflow to generate and evaluate CoT data.
git clone https://github.com/modelscope/easydistill
cd easydistill
pip install -r requirements.txt
Example configuration for CoT generation:
{
"job_type": "cot_generation_api",
"dataset": {
"input_path": "./cot_question.json",
"output_path": "./cot_question_with_answer.json"
},
"inference": {
"base_url": "ENDPOINT",
"api_key": "TOKEN",
"stream": true,
"prompt": "Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. ...",
"max_new_tokens": 1024
}
}
Configuration for CoT evaluation:
{
"job_type": "cot_eval_api",
"dataset": {
"input_path": "cot_input.json",
"output_path": "cot_output.json"
},
"inference": {
"base_url": "ENDPOINT",
"api_key": "TOKEN",
"max_new_tokens": 8196
}
}
Run the evaluation:
python ./eval/data_eval.py --config ./configs/cot_eval_api.json
DistillQwen‑ThoughtY Model Training
Using the OmniThought and OmniThought‑0528 datasets, we trained three DistillQwen‑ThoughtY models (4B, 8B, 32B) on Qwen‑3 student backbones and DeepSeek‑R1‑0528 teachers. Training used 8 A800 80 GB GPUs for the 4B/8B models and 4 × 32 A800 80 GB GPUs for the 32B model. Hyper‑parameters: learning rate 5e‑5, 3 epochs, sequence length 8192, batch size 1 with gradient accumulation 8, cosine LR schedule.
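For reference, the hyper‑parameters above imply an effective global batch size of 64 sequences per step on the 8‑GPU runs, and they could be collected into a training config in the spirit of EasyDistill's JSON configs. The job_type and field names below are assumptions for illustration; the actual kd_black_box_local.json schema may differ.

```python
import json

# Effective global batch size for the 4B/8B runs:
# per-device batch 1 x gradient accumulation 8 x 8 GPUs = 64 sequences/step.
gpus, per_device_batch, grad_accum = 8, 1, 8
effective_batch = gpus * per_device_batch * grad_accum

# Hypothetical config mirroring the stated hyper-parameters.
train_config = {
    "job_type": "kd_black_box_train_local",   # assumed name
    "training": {
        "learning_rate": 5e-5,
        "num_train_epochs": 3,
        "max_seq_length": 8192,
        "per_device_train_batch_size": per_device_batch,
        "gradient_accumulation_steps": grad_accum,
        "lr_scheduler_type": "cosine",
    },
}
print(effective_batch)  # 64
print(json.dumps(train_config, indent=2))
```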
Compared with baseline models and previous DistillQwen‑ThoughtX, the new series shows significant gains on mathematics, code, and general reasoning benchmarks, demonstrating the effectiveness of variable‑length CoT distillation.
Training Script
accelerate launch --num_processes n \
--config_file ./configs/train-config/muti_gpu.yaml ./easydistill/black-box/train.py \
--config ./configs/kd_black_box_local.json
Downloading and Using the Models
The distilled models are publicly available on Hugging Face and ModelScope.
from huggingface_hub import snapshot_download
model_name = "alibaba-pai/DistillQwen-ThoughtY-4B"
snapshot_download(repo_id=model_name, cache_dir="./DistillQwen-ThoughtY-4B/")
model_name = "alibaba-pai/DistillQwen-ThoughtY-8B"
snapshot_download(repo_id=model_name, cache_dir="./DistillQwen-ThoughtY-8B/")
model_name = "alibaba-pai/DistillQwen-ThoughtY-32B"
snapshot_download(repo_id=model_name, cache_dir="./DistillQwen-ThoughtY-32B/")
Example with ModelScope:
from modelscope import AutoModelForCausalLM, AutoTokenizer
model_name = "pai/DistillQwen-ThoughtY-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
prompt = "Solve ∫x e^x dx. Show your reasoning step‑by‑step."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
inputs = tokenizer([text], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32768)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
OmniThought‑0528 Dataset Access
The dataset is also open‑source on Hugging Face and ModelScope.
from datasets import load_dataset
OmniThought = load_dataset("alibaba-pai/OmniThought-0528")
With ModelScope:
from modelscope.msdatasets import MsDataset
ds = MsDataset.load('PAI/OmniThought-0528')
Model Deployment on PAI
DistillQwen‑ThoughtY models are listed in the PAI‑Model Gallery. Users can deploy them directly via the gallery interface (see https://x.sm.cn/3hs5QWH for details).
Conclusion
By leveraging the EasyDistill framework and the OmniThought‑0528 dataset, we demonstrate that variable‑length chain‑of‑thought distillation can substantially improve the reasoning capabilities of compact LLMs across mathematics, science, and code generation tasks. The models, data, and toolkit are fully open‑source, inviting the community to further explore efficient LLM distillation.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.