Can 10% of Instruction Data Match Full-Scale Fine-Tuning? The SPICE Solution
The SPICE method leverages Fisher Information Matrix submodularity and a novel gradient‑conflict penalty to select a small, high‑quality subset of instruction‑tuning data, achieving comparable or superior performance to full‑data fine‑tuning while dramatically reducing training cost.
Instruction tuning is typically the final step before deploying large language models: the model is fine‑tuned on large collections of (instruction, response) pairs. However, more data does not always improve performance; redundant, noisy, or conflicting samples increase training cost and can even degrade learning.
Why Fisher Information?
Recent work proposes measuring the “information amount” of each sample with the Fisher Information Matrix (FIM). A sample whose gradient provides a large, independent direction for parameter updates is considered valuable.
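A minimal sketch of this idea, using a toy linear model rather than the paper's setup: under a diagonal approximation of the FIM, a sample's information score reduces to the squared norm of its per‑sample gradient. The loss function and model here are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a linear predictor with parameters w (illustrative only).
w = rng.normal(size=5)

def sample_gradient(x, y, w):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one (x, y) pair."""
    return (w @ x - y) * x

def fisher_score(grad):
    """Diagonal-FIM proxy: sum of squared gradient entries, i.e. ||g||^2."""
    return float(np.sum(grad ** 2))

# Score a small batch of candidate samples by their individual information.
X = rng.normal(size=(4, 5))
y = rng.normal(size=4)
scores = [fisher_score(sample_gradient(x, t, w)) for x, t in zip(X, y)]
```

Samples with larger scores contribute stronger parameter-update directions; the greedy selection described next builds on this per-sample quantity.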
Submodularity and Greedy Selection
The FIM‑based objective is submodular, guaranteeing that a simple greedy algorithm can achieve near‑optimal solutions. In practice, however, the marginal information gain often drops sharply after a few selections.
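The greedy procedure can be sketched on any submodular objective. Below we use weighted set coverage as a stand-in for the FIM objective (coverage is a classic submodular function, chosen here only for readability); note how the marginal gains shrink as the selected set grows.

```python
# Greedy maximization of a submodular set function (weighted coverage as a
# stand-in for the FIM-based objective; not the paper's code).
def coverage(selected, sets, weights):
    covered = set().union(*(sets[i] for i in selected)) if selected else set()
    return sum(weights[e] for e in covered)

def greedy_select(sets, weights, k):
    selected, gains = [], []
    for _ in range(k):
        base = coverage(selected, sets, weights)
        best, best_gain = None, 0.0
        for i in range(len(sets)):
            if i in selected:
                continue
            gain = coverage(selected + [i], sets, weights) - base
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no candidate adds positive value -> stop early
            break
        selected.append(best)
        gains.append(best_gain)
    return selected, gains

sets = [{0, 1, 2}, {2, 3}, {3, 4}, {0, 4}]
weights = {e: 1.0 for e in range(5)}
picked, gains = greedy_select(sets, weights, 3)
# Marginal gains are non-increasing, the signature of submodularity.
```

For a monotone submodular objective, this greedy loop is guaranteed to reach at least a (1 − 1/e) fraction of the optimal value.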
Gradient Conflict as the Missing Factor
Analysis shows that the rapid decay is caused by gradient conflict: different samples produce gradients pointing in inconsistent or opposite directions, causing their information contributions to cancel out.
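The cancellation effect is easy to see numerically: two gradients that are individually large but nearly opposite in direction sum to a vector far smaller than either one. A two-line illustration (not from the paper):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two near-opposite gradients: each is individually informative,
# but their contributions largely cancel when combined.
g1 = np.array([1.0, 2.0, -1.0])
g2 = np.array([-0.9, -2.1, 1.2])

conflict = cos_sim(g1, g2)          # close to -1: strong conflict
combined = np.linalg.norm(g1 + g2)  # far smaller than either gradient's norm
```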
SPICE: Submodular Penalized Information‑Conflict Selection
SPICE augments the Fisher‑information greedy objective with a conflict penalty. The method consists of three steps:
1. Decompose the marginal gain into a base term (individual information) and an interaction term (effect of already‑selected samples).
2. Estimate gradient conflict efficiently by maintaining the average gradient direction of the current set and measuring cosine similarity with candidate samples.
3. Apply a soft penalty proportional to the conflict score, together with adaptive early stopping and a proxy‑model mechanism to keep computation tractable.
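The three steps above can be combined into a single penalized greedy loop. The sketch below is our own reconstruction under stated assumptions: `lam` (penalty weight) and `tau` (early-stop threshold) are hypothetical names, the base term is the squared gradient norm, and the penalty fires only when a candidate points against the running mean gradient. The paper's exact decomposition may differ.

```python
import numpy as np

def spice_select(grads, lam=2.0, tau=1e-3, k=None):
    """Conflict-penalized greedy selection (illustrative reconstruction)."""
    n = len(grads)
    k = k or n
    selected = []
    mean_g = np.zeros_like(grads[0])  # running mean gradient of selected set
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            base = float(grads[i] @ grads[i])  # individual information term
            if selected:
                denom = np.linalg.norm(grads[i]) * np.linalg.norm(mean_g) + 1e-12
                cos = float(grads[i] @ mean_g / denom)
                penalty = lam * max(0.0, -cos)  # penalize conflicting directions
            else:
                penalty = 0.0
            score = base - penalty
            if score > best_score:
                best, best_score = i, score
        if best is None or best_score < tau:  # adaptive early stopping
            break
        selected.append(best)
        # Incrementally update the mean gradient of the selected set.
        mean_g = mean_g + (grads[best] - mean_g) / len(selected)
    return selected

# Two aligned gradients are kept; the conflicting third triggers early stop.
grads = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([-1.0, 0.0])]
chosen = spice_select(grads)
```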
Efficient Proxy Selection
Because computing gradients on the full model is expensive, SPICE uses a smaller model with the same architecture as a proxy and updates the proxy at reasonable intervals, exploiting the observation that data‑selection patterns transfer across model scales.
Experimental Validation
Experiments on ~97.5K instruction examples (math reasoning, code generation, general instruction following) with LLaMA‑2‑7B and Qwen‑2‑7B show that selecting only ~10% of the data with SPICE matches or exceeds full‑data fine‑tuning on benchmarks such as GSM8K, MMLU, IFEval, and HumanEval, while cutting training cost dramatically.
Key findings:
Lower gradient conflict correlates with slower marginal‑gain decay and larger cumulative Fisher information.
SPICE‑selected subsets start with higher initial loss (harder examples) but converge faster and more stably.
Conclusion
SPICE demonstrates that accounting for both sample‑model information and sample‑sample gradient interactions yields a simple, practical solution for instruction‑tuning data selection, enabling small, high‑quality subsets to replace large, noisy corpora.
