How Mixed Data Shapes LLaMA SFT: Scaling Trends, Conflict Zones, and the DMT Remedy
This article investigates how mixing data from mathematical reasoning, code generation, and general instruction-following tasks influences supervised fine‑tuning of LLaMA models, revealing distinct scaling curves and resource‑dependent performance conflicts, and it examines a Dual‑stage Mixed Fine‑tuning (DMT) strategy that mitigates catastrophic forgetting while improving overall capability.
Data Composition Problem in Supervised Fine‑Tuning (SFT)
During SFT of large language models, datasets covering distinct abilities (mathematical reasoning, code generation, and general instruction following) are combined to broaden model capabilities. Because each ability’s data differs in source, domain, distribution, and scale, the data composition problem asks how model performance varies with data quantity, data proportion, model size, and training strategy.
Target Abilities and Evaluation Sets
Mathematical Reasoning : GSM8K‑RFT (7.5K problems, 110K answers) – evaluated on the GSM8K test set.
Code Generation : Code Alpaca (20K instruction‑code pairs) – evaluated with HumanEval.
General Instruction Following : ShareGPT (≈90K multi‑turn dialogues) – evaluated with MT‑Bench.
Research Questions
The study adopts a data‑scaling perspective and investigates four questions (RQ1‑RQ4) by varying data volume, data mix, model parameters (LLaMA 7B, 13B, 33B), and SFT strategies.
RQ1 – Scaling Trends with Increasing Data
Experimental setup : For each ability, training data were sampled at ratios {1, 1/4, 1/16, 1/64, 1/256} and used to fine‑tune LLaMA models of different sizes.
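As a concrete illustration of this setup, the sketch below subsamples each ability's training set at the stated ratios. The file names, JSON format, and random seed are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the RQ1 sampling setup (assumed file names and JSON layout).
import json
import random

RATIOS = [1, 1/4, 1/16, 1/64, 1/256]

def subsample(examples, ratio, seed=42):
    """Draw a random fraction `ratio` of the SFT examples."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * ratio))
    return rng.sample(examples, k)

# Hypothetical ability -> file mapping; actual paths depend on how the
# GSM8K-RFT, Code Alpaca, and ShareGPT dumps are stored locally.
ability_files = {
    "math": "gsm8k_rft.json",
    "code": "code_alpaca.json",
    "general": "sharegpt.json",
}

for ability, path in ability_files.items():
    with open(path) as f:
        data = json.load(f)
    for r in RATIOS:
        subset = subsample(data, r)
        print(f"{ability}: ratio={r:.4f} -> {len(subset)} examples")
        # each subset would then drive a separate SFT run
```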
Key findings
Each ability exhibits a distinct scaling curve. Math and general abilities improve monotonically with more data. Code ability shows irregular behavior for 7B/13B models but a log‑linear trend for 33B.
When data are abundant, larger models consistently outperform smaller ones.
RQ2 – Does Direct Mixing of the Three Ability Datasets Cause Conflict?
Experimental setup
Single‑source condition : Fine‑tune LLaMA on each ability dataset alone using the same data ratios.
Mixed‑source condition : Combine the three abilities in equal proportion at the same ratios and fine‑tune.
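A minimal sketch of the mixed‑source condition follows: take the same sampling fraction from each ability and concatenate. The helper name, shuffling step, and variable names are my assumptions, not the paper's code.

```python
import random

def mix_equal(datasets, ratio, seed=42):
    """datasets: dict mapping ability name -> list of SFT examples.
    Takes the same fraction `ratio` of each ability and concatenates them."""
    rng = random.Random(seed)
    mixed = []
    for examples in datasets.values():
        k = max(1, int(len(examples) * ratio))
        mixed.extend(rng.sample(examples, k))
    rng.shuffle(mixed)  # interleave the abilities within one SFT run
    return mixed

# e.g. mix_equal({"math": gsm8k_rft, "code": code_alpaca, "general": sharegpt}, 1/256)
```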
Key findings
Mixed data yields performance gains in low‑resource regimes (e.g., 1/256) but incurs conflicts in high‑resource regimes (full data), where each ability’s scaling curve underperforms the single‑source baseline.
Higher model capacity amplifies the low‑resource gains for math and general abilities.
RQ3 – Factors Driving the Observed Conflicts
Experimental setup
Fix general data, scale specific (code, math) data.
Fix specific data, scale general data.
Fix 1/64 of general data, scale specific data.
Key findings
When task formats and data distributions differ markedly (e.g., math vs. general), the proportion of each dataset has little effect on performance.
When distributions are more similar (code vs. general), the data ratio can cause noticeable performance fluctuations.
Even with severely limited general data, scaling specific‑ability data does not significantly harm general ability.
RQ4 – Influence of Different SFT Training Strategies
Training strategies explored
Multi‑task learning : Directly mix all ability datasets and treat each as a separate task.
Sequential training : Fine‑tune each ability dataset in order, placing general data last.
Mixed‑sequential training : Multi‑task learn on specific abilities first, then fine‑tune on general data.
Dual‑stage Mixed Fine‑tuning (DMT) : Stage 1 – multi‑task learn on the specific abilities (code, math); Stage 2 – fine‑tune on a mixture of general data plus a small proportion k of specific data (k ∈ {1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/256}); see the data‑construction sketch after this list.
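The sketch below illustrates only the DMT data construction for the two stages. `run_sft` is a hypothetical wrapper around a standard SFT loop, not an interface from the paper, and the shuffling and seed are my assumptions.

```python
import random

def dmt_stages(math_data, code_data, general_data, k=1/256, seed=42):
    rng = random.Random(seed)

    # Stage 1: multi-task SFT on the specific abilities (math + code) only.
    stage1 = list(math_data) + list(code_data)
    rng.shuffle(stage1)

    # Stage 2: general data plus a small fraction k of each specific ability;
    # this k-fraction replay is what counters forgetting of math/code.
    stage2 = list(general_data)
    for specific in (math_data, code_data):
        n = max(1, int(len(specific) * k))
        stage2.extend(rng.sample(list(specific), n))
    rng.shuffle(stage2)
    return stage1, stage2

# Usage (hypothetical run_sft fine-tunes a checkpoint on a list of examples):
# stage1, stage2 = dmt_stages(gsm8k_rft, code_alpaca, sharegpt, k=1/256)
# ckpt = run_sft("llama-7b", stage1)   # stage 1: specific abilities
# ckpt = run_sft(ckpt, stage2)         # stage 2: general + k * specific
```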
Key findings
Multi‑task learning preserves specific abilities but causes a sharp drop in general ability.
Both sequential strategies retain general ability but suffer catastrophic forgetting of specific abilities, especially math.
The DMT strategy with a very small k (e.g., 1/256) improves math and code performance across 7B, 13B, and 33B models and also yields modest gains for general ability.
Discussion
Semantic Representation Visualization
Layer‑15 representations of LLaMA‑13B and DMT‑13B show clear separation for math ability, while code and general abilities still exhibit overlapping clusters, indicating partial semantic collapse.
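A rough way to reproduce this kind of layer‑wise visualization is sketched below using Hugging Face transformers and scikit‑learn; the checkpoint name, the mean‑over‑tokens pooling, and the t‑SNE settings are assumptions on my part, not the paper's exact recipe.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model_name = "huggyllama/llama-13b"   # placeholder 13B checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

@torch.no_grad()
def layer15_embedding(text: str) -> np.ndarray:
    """Mean-pooled hidden state of transformer layer 15 for one prompt."""
    inputs = tok(text, return_tensors="pt")
    out = model(**inputs)
    # hidden_states[0] is the embedding-layer output, so index 15 = layer 15
    return out.hidden_states[15].mean(dim=1).squeeze(0).numpy()

# prompts: dict ability -> list of held-out prompts, e.g.
# prompts = {"math": [...], "code": [...], "general": [...]}
# embs, labels = [], []
# for ability, texts in prompts.items():
#     for t in texts:
#         embs.append(layer15_embedding(t)); labels.append(ability)
# coords = TSNE(n_components=2, perplexity=30).fit_transform(np.stack(embs))
# for ability in prompts:
#     idx = [i for i, l in enumerate(labels) if l == ability]
#     plt.scatter(coords[idx, 0], coords[idx, 1], label=ability)
# plt.legend(); plt.show()
```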
Ablation of Code and Math Samples from General Data
Using InsTag to filter out math and code samples from ShareGPT (reducing the set from 90K to 63K) does not affect the low‑resource performance boost, suggesting that diversity rather than specific content drives the gain.
Effect of Specific‑Ability Ratio k in Stage 2 of DMT
Increasing k from 0 to 1/256 improves both specific and general abilities. Raising k beyond 1/4 degrades general performance, confirming the high‑resource conflict observed in RQ2. The optimal k depends on the desired balance between abilities.
Conclusion
Mixing multiple ability datasets during SFT exhibits a “high‑resource conflict, low‑resource gain” pattern. The proposed DMT strategy (first multi‑task fine‑tuning on the specific abilities, then fine‑tuning on a mixture of general data plus a small fraction k of specific data) preserves general capability while mitigating catastrophic forgetting of the specialized skills. The proportion k in the second stage should be tuned to the target application.
Paper: https://arxiv.org/pdf/2310.05492.pdf
