DataFlex: An Industrial‑Grade Dynamic Data Training System for Large Models
DataFlex, built on LLaMA‑Factory, provides a unified, reproducible infrastructure that dynamically selects, mixes, and re‑weights training data. By treating data itself as a controllable dimension of optimization, it delivers measurable gains in training efficiency and model performance for large‑scale AI models.
Motivation
Once model capacity is fixed, the decisive factor in large‑model training becomes the data: which samples the model sees, in what proportion each source contributes, and how often specific samples are presented. Yet existing academic methods for data selection, mixing, and sample re‑weighting are scattered across independent repositories, which makes integration, reproducibility, and fair comparison difficult.
Design Principles
Uniformity – the framework unifies three representative data‑centric training paradigms (dynamic selection, dynamic mixing, dynamic weighting) within a single system.
Compatibility – it builds on LLaMA‑Factory and can be inserted into existing large‑scale training pipelines without introducing a separate workflow.
Scalability – new data‑centric algorithms can be implemented and compared with low engineering overhead.
Architecture
DataFlex extends LLaMA‑Factory with a three‑layer architecture:
Base Layer – inherits model management, data processing, optimizer support, and other generic training capabilities from LLaMA‑Factory.
Trainer Layer – abstracts the training loop into three data‑centric modes (selection, mixing, weighting) so that the trainer handles both parameter updates and data‑related decisions.
Component Layer – hosts concrete algorithm components (selectors, mixers, weighters) that expose a unified interface to the trainer (sketched below).
This design enables lightweight replacement of the training core while preserving existing models, datasets, and hyper‑parameters.
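As a rough illustration of that component‑layer contract, the abstract interfaces might look like the sketch below. The class and method names here are assumptions made for exposition, not DataFlex's actual API; the project documentation defines the real signatures.

```python
# Illustrative sketch of the component-layer contract. Class and method
# names are assumptions for exposition, not DataFlex's actual API.
from abc import ABC, abstractmethod
from typing import Dict, List


class Selector(ABC):
    """Chooses which sample indices the trainer should see next."""

    @abstractmethod
    def select(self, model, dataset, step: int) -> List[int]:
        ...


class Mixer(ABC):
    """Returns the sampling ratio for each data source at a given step."""

    @abstractmethod
    def mix(self, model, sources: List[str], step: int) -> Dict[str, float]:
        ...


class Weighter(ABC):
    """Maps per-sample losses to per-sample training weights."""

    @abstractmethod
    def weight(self, losses, step: int):
        ...
```

Because each component family exposes a single narrow decision method, the trainer layer can swap algorithms without any change to the training loop itself.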
Core Trainers
Dynamic Select Trainer – filters high‑value samples on the fly, reducing waste on low‑value data and improving training efficiency (see the loop sketch after this list).
Dynamic Mix Trainer – dynamically adjusts the sampling ratio of multiple data sources according to the model’s current learning state.
Dynamic Weight Trainer – assigns varying training weights to samples, allowing the model to focus on critical, difficult, or representative examples.
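To make the dynamic‑select mode concrete, the loop below sketches a trainer that periodically re‑selects its working subset. It assumes a Hugging Face‑style model whose forward pass returns a loss attribute and reuses the hypothetical Selector interface from the previous sketch; it is not DataFlex's literal trainer code.

```python
from torch.utils.data import DataLoader, Subset


def dynamic_select_train(model, dataset, selector, optimizer,
                         total_steps=10_000, update_every=1_000,
                         batch_size=8):
    """Sketch of a dynamic-select loop: re-score and re-select the
    working subset every `update_every` steps (hypothetical API)."""
    step = 0
    while step < total_steps:
        # Ask the selector for the current high-value subset.
        indices = selector.select(model, dataset, step)
        loader = DataLoader(Subset(dataset, indices), batch_size=batch_size)
        for batch in loader:
            loss = model(**batch).loss  # HF-style forward pass (assumption)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step >= total_steps or step % update_every == 0:
                break  # refresh the subset or stop
```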
Integrated Algorithms
DataFlex provides plug‑in implementations of representative methods, including LESS, DoReMi, ODM, and loss re‑weighting. All methods share a common interface, enabling controlled, reproducible comparative experiments.
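For a sense of scale, a loss re‑weighting component behind that common interface could be as small as the sketch below. The softmax‑over‑losses rule and the temperature knob are illustrative choices, not the exact scheme DataFlex ships.

```python
import torch


class SoftmaxLossWeighter:
    """Illustrative weighter: up-weight harder (higher-loss) samples via
    a softmax over per-sample losses. An assumed example, not DataFlex's
    shipped loss re-weighting algorithm."""

    def __init__(self, temperature: float = 1.0):
        self.temperature = temperature

    def weight(self, losses: torch.Tensor, step: int) -> torch.Tensor:
        # Detach so the weights carry no gradient; rescale so the weights
        # sum to the batch size, preserving the overall loss scale.
        w = torch.softmax(losses.detach() / self.temperature, dim=0)
        return w * losses.numel()


# Usage inside a weighting trainer (sketch):
#   per_sample = loss_fn(logits, labels, reduction="none")
#   loss = (weighter.weight(per_sample, step) * per_sample).mean()
```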
Usage
Configuration compatibility – add DataFlex parameters to existing LLaMA‑Factory YAML files (see the example configuration after this list).
Command consistency – invoke dataflex-cli instead of llamafactory-cli.
Feature preservation – all original LLaMA‑Factory functionalities remain available.
Seamless fallback – set train_type: static to revert to the original static training mode.
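Concretely, a DataFlex run might use a configuration like the one below. The LLaMA‑Factory fields shown are standard; train_type: static comes from the fallback note above, while the dynamic_select value and the dataset alias are illustrative guesses, so consult the documentation for the real parameter names.

```yaml
# Standard LLaMA-Factory fields (unchanged)
model_name_or_path: meta-llama/Llama-3.2-3B
stage: sft
do_train: true
finetuning_type: lora
dataset: openhermes_subset        # hypothetical dataset alias
template: llama3
output_dir: saves/llama3-dataflex
per_device_train_batch_size: 8
learning_rate: 1.0e-4
num_train_epochs: 3

# DataFlex addition: `static` reverts to plain LLaMA-Factory training;
# the `dynamic_select` value here is an illustrative guess.
train_type: dynamic_select
```

The run would then presumably launch with dataflex-cli train config.yaml, mirroring the usual llamafactory-cli train config.yaml invocation.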
Experimental Evaluation
Data Selection & Sample Weighting
On the OpenHermes‑2.5 subset, dynamic data‑centric methods consistently outperformed the static full‑data baseline for both Mistral‑7B and Llama‑3.2‑3B, confirming the importance of real‑time, data‑aware selection when model capacity is limited.
Data Mixing
Using SlimPajama (6B and 30B token regimes), DoReMi and ODM demonstrated clear advantages. For the 6B token regime, ODM achieved higher accuracy on general‑ability benchmarks than static mixing, while DoReMi reduced overall perplexity, indicating that dynamic domain‑aware mixing yields tangible training gains.
System Efficiency
With the LESS method on a single GPU, training time dropped from 30,239 s to 28,734 s while accuracy rose from 40.38 % to 42.37 %. On an 8‑GPU H20 cluster, total training time decreased by 57.13 %. For offline selection methods such as TSDS, DataFlex achieved stable 1 %–3.5 % speed‑ups across different data scales.
Resources
Technical report: https://arxiv.org/abs/2603.26164
Documentation: https://opendcai.github.io/DataFlex-Doc/
GitHub repository: https://github.com/OpenDCAI/DataFlex