How Mixed Data Shapes LLaMA SFT: Scaling Trends, Conflict Zones, and the DMT Remedy
This article investigates how mixing data from mathematical reasoning, code generation, and general instruction-following tasks influences supervised fine‑tuning of LLaMA models, revealing distinct scaling curves and resource‑dependent performance conflicts, and it examines a Dual‑stage Mixed Fine‑tuning (DMT) strategy that mitigates catastrophic forgetting while improving overall capability.
Data Composition Problem in Supervised Fine‑Tuning (SFT)
During SFT of large language models, datasets covering distinct abilities (mathematical reasoning, code generation, and general instruction following) are combined to broaden model capabilities. Because each ability’s data differs in source, domain, distribution, and scale, the data composition problem asks how model performance varies with data quantity, data proportion, model size, and training strategy.
Target Abilities and Evaluation Sets
Mathematical Reasoning : GSM8K‑RFT (7.5K problems, 110K answers) – evaluated on the GSM8K test set.
Code Generation : Code Alpaca (20K instruction‑code pairs) – evaluated with HumanEval.
General Instruction Following : ShareGPT (≈90K multi‑turn dialogues) – evaluated with MT‑Bench.
Research Questions
The study adopts a data‑scaling perspective and investigates four questions (RQ1‑RQ4) by varying data volume, data mix, model parameters (LLaMA 7B, 13B, 33B), and SFT strategies.
RQ1 – Scaling Trends with Increasing Data
Experimental setup : For each ability, training data were sampled at ratios {1, 1/4, 1/16, 1/64, 1/256} and used to fine‑tune LLaMA models of different sizes.
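As a concrete illustration of this setup, the sketch below subsamples each ability's training set at the stated ratios. The file names, JSON format, and random seed are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the RQ1 sampling setup (assumed file names and JSON layout).
import json
import random

RATIOS = [1, 1/4, 1/16, 1/64, 1/256]

def subsample(examples, ratio, seed=42):
    """Draw a random fraction `ratio` of the SFT examples."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * ratio))
    return rng.sample(examples, k)

# Hypothetical ability -> file mapping; actual paths depend on how the
# GSM8K-RFT, Code Alpaca, and ShareGPT dumps are stored locally.
ability_files = {
    "math": "gsm8k_rft.json",
    "code": "code_alpaca.json",
    "general": "sharegpt.json",
}

for ability, path in ability_files.items():
    with open(path) as f:
        data = json.load(f)
    for r in RATIOS:
        subset = subsample(data, r)
        print(f"{ability}: ratio={r:.4f} -> {len(subset)} examples")
        # each subset would then drive a separate SFT run
```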
Key findings
Each ability exhibits a distinct scaling curve. Math and general abilities improve monotonically with more data. Code ability shows irregular behavior for 7B/13B models but a log‑linear trend for 33B.
When data are abundant, larger models consistently outperform smaller ones.
RQ2 – Does Direct Mixing of the Three Ability Datasets Cause Conflict?
Experimental setup
Single‑source condition : Fine‑tune LLaMA on each ability dataset alone using the same data ratios.
Mixed‑source condition : Combine the three abilities in equal proportion at the same ratios and fine‑tune.
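A minimal sketch of the mixed‑source condition follows: take the same sampling fraction from each ability and concatenate. The helper name, shuffling step, and variable names are my assumptions, not the paper's code.

```python
import random

def mix_equal(datasets, ratio, seed=42):
    """datasets: dict mapping ability name -> list of SFT examples.
    Takes the same fraction `ratio` of each ability and concatenates them."""
    rng = random.Random(seed)
    mixed = []
    for examples in datasets.values():
        k = max(1, int(len(examples) * ratio))
        mixed.extend(rng.sample(examples, k))
    rng.shuffle(mixed)  # interleave the abilities within one SFT run
    return mixed

# e.g. mix_equal({"math": gsm8k_rft, "code": code_alpaca, "general": sharegpt}, 1/256)
```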
Key findings
Mixed data yields performance gains in low‑resource regimes (e.g., 1/256) but incurs conflicts in high‑resource regimes (full data), where each ability’s scaling curve underperforms the single‑source baseline.
Higher model capacity amplifies the low‑resource gains for math and general abilities.
RQ3 – Factors Driving the Observed Conflicts
Experimental setup
Fix general data, scale specific (code, math) data.
Fix specific data, scale general data.
Fix 1/64 of general data, scale specific data.
Key findings
When task formats and data distributions differ markedly (e.g., math vs. general), the proportion of each dataset has little effect on performance.
When distributions are more similar (code vs. general), the data ratio can cause noticeable performance fluctuations.
Even with severely limited general data, scaling specific‑ability data does not significantly harm general ability.
RQ4 – Influence of Different SFT Training Strategies
Training strategies explored
Multi‑task learning : Directly mix all ability datasets and treat each as a separate task.
Sequential training : Fine‑tune each ability dataset in order, placing general data last.
Mixed‑sequential training : Multi‑task learn on specific abilities first, then fine‑tune on general data.
Dual‑stage Mixed Fine‑tuning (DMT) : Stage 1 – multi‑task learn on the specific abilities (code, math); Stage 2 – fine‑tune on a mixture of general data plus a small proportion k of specific data (k ∈ {1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/256}); see the data‑construction sketch after this list.
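The sketch below illustrates only the DMT data construction for the two stages. `run_sft` is a hypothetical wrapper around a standard SFT loop, not an interface from the paper, and the shuffling and seed are my assumptions.

```python
import random

def dmt_stages(math_data, code_data, general_data, k=1/256, seed=42):
    rng = random.Random(seed)

    # Stage 1: multi-task SFT on the specific abilities (math + code) only.
    stage1 = list(math_data) + list(code_data)
    rng.shuffle(stage1)

    # Stage 2: general data plus a small fraction k of each specific ability;
    # this k-fraction replay is what counters forgetting of math/code.
    stage2 = list(general_data)
    for specific in (math_data, code_data):
        n = max(1, int(len(specific) * k))
        stage2.extend(rng.sample(list(specific), n))
    rng.shuffle(stage2)
    return stage1, stage2

# Usage (hypothetical run_sft fine-tunes a checkpoint on a list of examples):
# stage1, stage2 = dmt_stages(gsm8k_rft, code_alpaca, sharegpt, k=1/256)
# ckpt = run_sft("llama-7b", stage1)   # stage 1: specific abilities
# ckpt = run_sft(ckpt, stage2)         # stage 2: general + k * specific
```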
Key findings
Multi‑task learning preserves specific abilities but causes a sharp drop in general ability.
Both sequential strategies retain general ability but suffer catastrophic forgetting of specific abilities, especially math.
The DMT strategy with a very small k (e.g., 1/256) improves math and code performance across 7B, 13B, and 33B models and also yields modest gains for general ability.
Discussion
Semantic Representation Visualization
Layer‑15 representations of LLaMA‑13B and DMT‑13B show clear separation for math ability, while code and general abilities still exhibit overlapping clusters, indicating partial semantic collapse.
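A rough way to reproduce this kind of layer‑wise visualization is sketched below using Hugging Face transformers and scikit‑learn; the checkpoint name, the mean‑over‑tokens pooling, and the t‑SNE settings are assumptions on my part, not the paper's exact recipe.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model_name = "huggyllama/llama-13b"   # placeholder 13B checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

@torch.no_grad()
def layer15_embedding(text: str) -> np.ndarray:
    """Mean-pooled hidden state of transformer layer 15 for one prompt."""
    inputs = tok(text, return_tensors="pt")
    out = model(**inputs)
    # hidden_states[0] is the embedding-layer output, so index 15 = layer 15
    return out.hidden_states[15].mean(dim=1).squeeze(0).numpy()

# prompts: dict ability -> list of held-out prompts, e.g.
# prompts = {"math": [...], "code": [...], "general": [...]}
# embs, labels = [], []
# for ability, texts in prompts.items():
#     for t in texts:
#         embs.append(layer15_embedding(t)); labels.append(ability)
# coords = TSNE(n_components=2, perplexity=30).fit_transform(np.stack(embs))
# for ability in prompts:
#     idx = [i for i, l in enumerate(labels) if l == ability]
#     plt.scatter(coords[idx, 0], coords[idx, 1], label=ability)
# plt.legend(); plt.show()
```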
Ablation of Code and Math Samples from General Data
Using InsTag to filter out math and code samples from ShareGPT (reducing the set from 90K to 63K) does not affect the low‑resource performance boost, suggesting that diversity rather than specific content drives the gain.
Effect of Specific‑Ability Ratio k in Stage 2 of DMT
Increasing k from 0 to 1/256 improves both specific and general abilities. Raising k beyond 1/4 degrades general performance, confirming the high‑resource conflict observed in RQ2. The optimal k depends on the desired balance between abilities.
Conclusion
Mixing multiple ability datasets during SFT exhibits a “high‑resource conflict, low‑resource gain” pattern. The proposed DMT strategy (first multi‑task fine‑tuning on the specific abilities, then fine‑tuning on a mixture of general data plus a small fraction k of specific data) preserves general capability while mitigating catastrophic forgetting of the specialized skills. The proportion k in the second stage should be tuned to the target application.
Paper: https://arxiv.org/pdf/2310.05492.pdf
