Twin Networks Reveal How to Optimize Data Mixtures for Large Language Models

This article presents TANDEM, a bi‑level data‑mixture optimization framework that uses twin networks to automatically adjust domain‑specific training data ratios, offering theoretical guarantees, broader applicability, and significant performance gains across pre‑training, fine‑tuning, and e‑commerce product‑understanding tasks.

JD Retail Technology

Background

Large language models (LLMs) rely heavily on the composition of training data from various domains and tasks. Traditional approaches manually tune data mixing ratios or use costly trial‑and‑error, which is inefficient and often sub‑optimal.

Method Overview (TANDEM)

The authors formulate data‑mixture ratio selection as a bi‑level optimization problem and simplify it to a single‑level penalty formulation. They solve it with a pair of twin models:

A proxy model u, trained for K steps on the training set of the current data mixture.

A reference model w, trained for K steps on both the training and validation sets of the same mixture.

The per‑domain loss difference between the two models quantifies how much additional data from that domain would help: a larger difference increases the domain's weight, while a smaller difference decreases it.

The iterative process updates the mixture coefficients α until convergence, after which the final data mixture is used to train the target LLM.
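The weight‑update step described above can be sketched as a multiplicative update on the mixture coefficients. This is a minimal illustration, not the paper's implementation: the function name, the learning rate, and the per‑domain loss values are all hypothetical, and the exact update rule used by TANDEM may differ.

```python
import numpy as np

def update_mixture(alpha, proxy_losses, ref_losses, lr=1.0):
    """One TANDEM-style mixture update (illustrative sketch).

    alpha:        current mixing ratios over domains (sums to 1)
    proxy_losses: per-domain validation loss of the proxy model u
    ref_losses:   per-domain validation loss of the reference model w

    A larger loss gap (proxy minus reference) suggests that extra data
    from that domain helps, so its weight is increased multiplicatively.
    """
    gap = np.asarray(proxy_losses) - np.asarray(ref_losses)
    new_alpha = alpha * np.exp(lr * gap)   # exponentiated-gradient step
    return new_alpha / new_alpha.sum()     # renormalize onto the simplex

# Toy example with three domains; domain 0 shows the largest loss gap.
alpha = np.array([1 / 3, 1 / 3, 1 / 3])
alpha = update_mixture(alpha,
                       proxy_losses=[2.0, 1.5, 1.2],
                       ref_losses=[1.4, 1.3, 1.15])
print(alpha)  # domain 0's share grows, domain 2's shrinks
```

In a full run this update would be applied repeatedly, retraining the twin models under the new alpha each round, until the coefficients stop changing.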

Theoretical Guarantees

The paper provides convergence analysis for the penalty‑based single‑level reformulation, ensuring that the learned mixture ratios approach a stationary point of the original bi‑level problem.
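The general shape of such a reformulation can be written as follows. This is a hedged reconstruction from the standard bi‑level‑to‑penalty pattern, not the paper's exact statement: the symbols (mixture weights alpha on the simplex Delta, model parameters w, penalty coefficient gamma) are assumptions.

```latex
% Bi-level formulation: choose mixture weights \alpha to minimize the
% validation loss of the model trained under that mixture.
\min_{\alpha \in \Delta}\;
  \mathcal{L}_{\mathrm{val}}\!\bigl(w^{*}(\alpha)\bigr)
\quad \text{s.t.} \quad
  w^{*}(\alpha) \in \arg\min_{w}\, \mathcal{L}_{\mathrm{train}}(w, \alpha)

% Penalty-based single-level reformulation, with penalty weight \gamma > 0
% on the inner-problem optimality gap:
\min_{\alpha \in \Delta,\; w}\;
  \mathcal{L}_{\mathrm{val}}(w)
  + \gamma \Bigl[\, \mathcal{L}_{\mathrm{train}}(w, \alpha)
  - \min_{w'} \mathcal{L}_{\mathrm{train}}(w', \alpha) \Bigr]
```

The convergence analysis then relates stationary points of the penalized single‑level objective to those of the original bi‑level problem.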

Comparison with Existing Methods

TANDEM unifies prior data‑mixing strategies such as DoReMi (static reference model) and DoGE (reference model updated after a single step). Table 1 shows that TANDEM subsumes these methods as special cases.

When Does Data‑Mixture Adjustment Help?

The paper proves that uniform sampling is already optimal when all domains have balanced and abundant data; the significant gains from mixture adjustment arise when data are imbalanced or when training proceeds in multiple stages (pre‑training followed by fine‑tuning).
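This dichotomy can be seen in a toy version of the multiplicative update: when every domain shows the same loss gap, the uniform mixture is a fixed point; when one domain's gap is larger, its share grows. The update rule and the numbers below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def step(alpha, gaps, lr=1.0):
    """Multiplicative-weights mixture update (illustrative only)."""
    new = alpha * np.exp(lr * np.asarray(gaps))
    return new / new.sum()

uniform = np.array([0.25, 0.25, 0.25, 0.25])

# Balanced case: identical loss gaps across domains, so the update
# leaves the uniform mixture unchanged (it is a fixed point).
balanced = step(uniform, gaps=[0.3, 0.3, 0.3, 0.3])

# Imbalanced case: one under-represented domain shows a much larger
# gap, so its share grows at the expense of the others.
imbalanced = step(uniform, gaps=[0.9, 0.3, 0.3, 0.3])
print(balanced, imbalanced)
```

The balanced case illustrates why mixture adjustment adds little when data are plentiful and even, while the imbalanced case shows where it pays off.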

Experimental Results

General Tasks

Experiments on pre‑training and supervised fine‑tuning across multiple model scales demonstrate that TANDEM consistently outperforms baseline mixing strategies. Figures 1‑3 and Tables 1‑2 illustrate superior validation performance and faster convergence.

E‑commerce Product Understanding

Applying TANDEM to JD.com’s product‑understanding pipeline (including product identification, attribute extraction, brand detection, etc.) yields noticeable improvements over the default data mixture, as shown in Figure 2.

Conclusion

TANDEM offers a principled, automated way to optimize domain‑specific data mixtures for LLM training, with theoretical backing and empirical validation across diverse scenarios.

large language models, NeurIPS, bi-level optimization, data mixture optimization, twin networks
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
