How TAPIR Boosts Small LLMs with Task‑Aware Curriculum Planning

The paper introduces TAPIR, a task‑aware curriculum planning framework that distills instruction‑following abilities from black‑box LLM teachers into smaller student models. TAPIR filters for prompts the student finds difficult, rebalances the task distribution, enhances response styles, and optimizes iteratively across multiple training rounds, achieving superior performance on benchmark evaluations.

Alibaba Cloud Big Data AI Platform

Background

Large language models have made great progress in following open‑domain instructions, and instruction fine‑tuning is key to turning a text‑completion model into a strong dialogue model. Existing works that use powerful black‑box teacher models (e.g., GPT‑4, Qwen‑max) for automatic distillation often ignore the diversity of tasks and difficulty variance in the fine‑tuning dataset, leading to imbalanced knowledge and poor performance on complex tasks. To address these challenges, the authors propose TAPIR (Task‑Aware Curriculum Planning for Instruction Refinement), which uses multi‑task curriculum planning to distill instruction‑following ability from black‑box LLMs into smaller student models.

Algorithm Flow

The TAPIR framework follows a multi‑round distillation process:

Dataset difficulty filtering: An open‑source instruction dataset (e.g., Alpaca) is filtered by computing a Model Fitting Difficulty (MFD) score, selecting instruction pairs that are hard for the student model to answer.

Multi‑task curriculum instruction distillation: Based on a predefined task‑type ratio, a teacher model (e.g., ChatGPT) expands the seed dataset, generating more instruction‑response pairs of similar difficulty and increasing the sampling probability of reasoning tasks to alleviate capability conflicts.

Multi‑task response style enhancement: For certain tasks, specific prompts rewrite the teacher’s responses to obtain finer‑grained or format‑specific answers (e.g., chain‑of‑thought, code comments), helping the student model learn complex tasks.

Multi‑round model optimization: After each round, a judge model evaluates the student’s responses, providing reward scores that guide the sampling of a new seed dataset with higher difficulty, gradually increasing task challenge.
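The four steps above can be sketched as a single round of the distillation loop. This is a minimal illustration, not the authors' code: `mfd_score`, `expand`, `restyle`, and the constant `THRESHOLD` are assumed interfaces standing in for the judge model, the teacher model, the style‑rewriting prompts, and the paper's difficulty cutoff.

```python
THRESHOLD = 2.0  # assumed MFD cutoff, not the paper's value

def tapir_round(seed_pairs, mfd_score, expand, restyle, task_ratio):
    """One TAPIR round over a list of {"task": ...} instruction pairs.

    mfd_score(pair) -> float : judge-scored teacher/student quality gap
    expand(pair, n) -> list  : teacher generates n pairs of similar difficulty
    restyle(pair)   -> pair  : style-enhanced teacher response (e.g., CoT)
    task_ratio : dict mapping task type -> relative sampling weight
    """
    # 1. Difficulty filtering: keep only pairs the student finds hard.
    hard = [p for p in seed_pairs if mfd_score(p) > THRESHOLD]
    # 2. Task-aware expansion: over-sample under-weighted task types.
    expanded = []
    for p in hard:
        n = max(1, round(3 * task_ratio.get(p["task"], 1.0)))
        expanded.extend(expand(p, n))
    # 3. Response style enhancement before fine-tuning the student.
    return [restyle(p) for p in expanded]
```

In a real run the callables would wrap LLM API calls; here they are plain functions so the control flow of one round is visible in isolation.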

Difficulty Resampling

Difficulty resampling addresses the uneven difficulty distribution in the training set. The MFD score, calculated as the quality gap between student and teacher responses, identifies hard instruction pairs (gap > threshold) to be added to the seed dataset, ensuring the student model encounters increasingly challenging tasks.
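The MFD computation can be sketched as follows, assuming a `judge(instruction, answer)` interface that returns a scalar quality score (the paper uses an LLM judge; the `threshold` default here is illustrative, not the paper's value):

```python
def mfd_score(instruction, student_answer, teacher_answer, judge):
    """Model Fitting Difficulty as a judge-rated quality gap between
    the teacher's and the student's responses to the same instruction."""
    return judge(instruction, teacher_answer) - judge(instruction, student_answer)

def select_hard_pairs(dataset, student, judge, threshold=1.0):
    """Build the seed set from (instruction, teacher_answer) pairs
    whose MFD exceeds the threshold."""
    seed = []
    for instruction, teacher_answer in dataset:
        gap = mfd_score(instruction, student(instruction), teacher_answer, judge)
        if gap > threshold:
            seed.append((instruction, teacher_answer))
    return seed
```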

Task Resampling

Task resampling balances the distribution of task types. A DeBERTa‑v3 classifier tags each instruction with its task type; the dataset is then re‑sampled to achieve a more uniform task mix, especially boosting logical reasoning and programming tasks. The teacher model expands the seed data according to these sampled probabilities, producing new instruction‑answer pairs of comparable difficulty.
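A minimal sketch of the re-sampling step, assuming each pair already carries a `"task"` tag (produced by the DeBERTa‑v3 classifier in the paper) and that `target_ratio` maps task types to target fractions:

```python
import random
from collections import defaultdict

def resample_by_task(pairs, target_ratio, total, rng=None):
    """Re-sample tagged instruction pairs so the task mix matches
    target_ratio (task -> fraction, summing to 1)."""
    rng = rng or random.Random(0)
    by_task = defaultdict(list)
    for p in pairs:
        by_task[p["task"]].append(p)
    out = []
    for task, frac in target_ratio.items():
        pool = by_task.get(task)
        if pool:
            # Sampling with replacement lets rare tasks (e.g., logical
            # reasoning, programming) be boosted to their target share.
            out.extend(rng.choices(pool, k=round(frac * total)))
    return out
```

Sampling with replacement is the key design choice here: a task type with few examples can still reach its target fraction, which is how under-represented reasoning and coding tasks get boosted.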

Multi‑Round Iterative Optimization

During each iteration, the Model Fitting Difficulty is recomputed for the newly fine‑tuned student, and the proportion of difficult samples is gradually increased, controlled by a predefined constant. The loss for round r is defined accordingly, and after each round an update rule re‑selects the seed dataset for the next iteration.
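The seed-update rule can be sketched as ranking pairs by their recomputed difficulty and taking a fraction that grows each round. `base_frac` and `step` below are illustrative stand-ins for the paper's predefined constant, and `"mfd"` is the recomputed difficulty score for each pair:

```python
def next_seed(scored_pairs, round_idx, base_frac=0.2, step=0.1):
    """Select the next round's seed set: rank pairs by MFD (hardest
    first) and take a fraction that grows linearly with the round."""
    frac = min(1.0, base_frac + step * round_idx)
    ranked = sorted(scored_pairs, key=lambda p: p["mfd"], reverse=True)
    k = max(1, int(frac * len(ranked)))
    return ranked[:k]
```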

Experimental Results

Experiments show that TAPIR‑trained student models outperform larger instruction‑tuned models and other distillation baselines despite using far less training data. On the AlpacaEval 2.0 benchmark, TAPIR achieves a 7.80 win rate, surpassing Vicuna 13B and LLaMA2‑Chat 13B while using roughly half their parameters and training data. On MT‑Bench, TAPIR outperforms LLaMA2 7B Chat across the role‑play, reasoning, math, coding, and humanities sub‑tasks. Additional experiments on the Qwen1.5‑Chat series confirm that TAPIR consistently improves instruction‑following ability across model scales.


Paper Information

Title: Distilling Instruction‑following Abilities of Large Language Models with Task‑aware Curriculum Planning

Authors: Yuanhao Yue, Chengyu Wang, Jun Huang, Peng Wang

PDF: https://arxiv.org/pdf/2405.13448

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: knowledge distillation, curriculum learning, instruction tuning, LLM distillation, TAPIR
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
