DataFlex: An Industrial‑Grade Dynamic Data Training System for Large Models
DataFlex, built on LLaMA‑Factory, provides a unified, reproducible infrastructure that dynamically selects, mixes, and re‑weights training data. By treating data itself as a controllable dimension of optimization, it delivers measurable gains in training efficiency and model performance for large‑scale AI models.
Motivation
Once model capacity is fixed, the decisive factor in large‑model training becomes the data: which samples the model sees, in what proportion each source contributes, and how often specific samples are presented. Yet existing academic methods for data selection, mixing, and sample re‑weighting are scattered across independent repositories, which makes integration, reproducibility, and fair comparison difficult.
Design Principles
Uniformity – the framework unifies three representative data‑centric training paradigms (dynamic selection, dynamic mixing, dynamic weighting) within a single system.
Compatibility – it builds on LLaMA‑Factory and can be inserted into existing large‑scale training pipelines without introducing a separate workflow.
Scalability – new data‑centric algorithms can be implemented and compared with low engineering overhead.
Architecture
DataFlex extends LLaMA‑Factory with a three‑layer architecture:
Base Layer – inherits model management, data processing, optimizer support, and other generic training capabilities from LLaMA‑Factory.
Trainer Layer – abstracts the training loop into three data‑centric modes (selection, mixing, weighting) so that the trainer handles both parameter updates and data‑related decisions.
Component Layer – hosts concrete algorithm components (selectors, mixers, weighters) that expose a unified interface to the trainer (sketched below).
This design enables lightweight replacement of the training core while preserving existing models, datasets, and hyper‑parameters.
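As a rough illustration of that component‑layer contract, the abstract interfaces might look like the sketch below. The class and method names here are assumptions made for exposition, not DataFlex's actual API; the project documentation defines the real signatures.

```python
# Illustrative sketch of the component-layer contract. Class and method
# names are assumptions for exposition, not DataFlex's actual API.
from abc import ABC, abstractmethod
from typing import Dict, List


class Selector(ABC):
    """Chooses which sample indices the trainer should see next."""

    @abstractmethod
    def select(self, model, dataset, step: int) -> List[int]:
        ...


class Mixer(ABC):
    """Returns the sampling ratio for each data source at a given step."""

    @abstractmethod
    def mix(self, model, sources: List[str], step: int) -> Dict[str, float]:
        ...


class Weighter(ABC):
    """Maps per-sample losses to per-sample training weights."""

    @abstractmethod
    def weight(self, losses, step: int):
        ...
```

Because each component family exposes a single narrow decision method, the trainer layer can swap algorithms without any change to the training loop itself.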
Core Trainers
Dynamic Select Trainer – filters high‑value samples on the fly, reducing waste on low‑value data and improving training efficiency (see the loop sketch after this list).
Dynamic Mix Trainer – dynamically adjusts the sampling ratio of multiple data sources according to the model’s current learning state.
Dynamic Weight Trainer – assigns varying training weights to samples, allowing the model to focus on critical, difficult, or representative examples.
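To make the dynamic‑select mode concrete, the loop below sketches a trainer that periodically re‑selects its working subset. It assumes a Hugging Face‑style model whose forward pass returns a loss attribute and reuses the hypothetical Selector interface from the previous sketch; it is not DataFlex's literal trainer code.

```python
from torch.utils.data import DataLoader, Subset


def dynamic_select_train(model, dataset, selector, optimizer,
                         total_steps=10_000, update_every=1_000,
                         batch_size=8):
    """Sketch of a dynamic-select loop: re-score and re-select the
    working subset every `update_every` steps (hypothetical API)."""
    step = 0
    while step < total_steps:
        # Ask the selector for the current high-value subset.
        indices = selector.select(model, dataset, step)
        loader = DataLoader(Subset(dataset, indices), batch_size=batch_size)
        for batch in loader:
            loss = model(**batch).loss  # HF-style forward pass (assumption)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step >= total_steps or step % update_every == 0:
                break  # refresh the subset or stop
```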
Integrated Algorithms
DataFlex provides plug‑in implementations of representative methods, including LESS, DoReMi, ODM, and loss re‑weighting. All methods share a common interface, enabling controlled, reproducible comparative experiments.
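For a sense of scale, a loss re‑weighting component behind that common interface could be as small as the sketch below. The softmax‑over‑losses rule and the temperature knob are illustrative choices, not the exact scheme DataFlex ships.

```python
import torch


class SoftmaxLossWeighter:
    """Illustrative weighter: up-weight harder (higher-loss) samples via
    a softmax over per-sample losses. An assumed example, not DataFlex's
    shipped loss re-weighting algorithm."""

    def __init__(self, temperature: float = 1.0):
        self.temperature = temperature

    def weight(self, losses: torch.Tensor, step: int) -> torch.Tensor:
        # Detach so the weights carry no gradient; rescale so the weights
        # sum to the batch size, preserving the overall loss scale.
        w = torch.softmax(losses.detach() / self.temperature, dim=0)
        return w * losses.numel()


# Usage inside a weighting trainer (sketch):
#   per_sample = loss_fn(logits, labels, reduction="none")
#   loss = (weighter.weight(per_sample, step) * per_sample).mean()
```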
Usage
Configuration compatibility – add DataFlex parameters to existing LLaMA‑Factory YAML files (see the example configuration after this list).
Command consistency – invoke dataflex-cli instead of llamafactory-cli.
Feature preservation – all original LLaMA‑Factory functionalities remain available.
Seamless fallback – set train_type: static to revert to the original static training mode.
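Concretely, a DataFlex run might use a configuration like the one below. The LLaMA‑Factory fields shown are standard; train_type: static comes from the fallback note above, while the dynamic_select value and the dataset alias are illustrative guesses, so consult the documentation for the real parameter names.

```yaml
# Standard LLaMA-Factory fields (unchanged)
model_name_or_path: meta-llama/Llama-3.2-3B
stage: sft
do_train: true
finetuning_type: lora
dataset: openhermes_subset        # hypothetical dataset alias
template: llama3
output_dir: saves/llama3-dataflex
per_device_train_batch_size: 8
learning_rate: 1.0e-4
num_train_epochs: 3

# DataFlex addition: `static` reverts to plain LLaMA-Factory training;
# the `dynamic_select` value here is an illustrative guess.
train_type: dynamic_select
```

The run would then presumably launch with dataflex-cli train config.yaml, mirroring the usual llamafactory-cli train config.yaml invocation.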
Experimental Evaluation
Data Selection & Sample Weighting
On the OpenHermes‑2.5 subset, dynamic data‑centric methods consistently outperformed the static full‑data baseline for both Mistral‑7B and Llama‑3.2‑3B, confirming the importance of real‑time, data‑aware selection when model capacity is limited.
Data Mixing
Using SlimPajama (6B and 30B token regimes), DoReMi and ODM demonstrated clear advantages. For the 6B token regime, ODM achieved higher accuracy on general‑ability benchmarks than static mixing, while DoReMi reduced overall perplexity, indicating that dynamic domain‑aware mixing yields tangible training gains.
System Efficiency
With the LESS method on a single GPU, training time dropped from 30,239 s to 28,734 s while accuracy rose from 40.38 % to 42.37 %. On an 8‑GPU H20 cluster, total training time decreased by 57.13 %. For offline selection methods such as TSDS, DataFlex achieved stable 1 %–3.5 % speed‑ups across different data scales.
Resources
Technical report: https://arxiv.org/abs/2603.26164
Documentation: https://opendcai.github.io/DataFlex-Doc/
GitHub repository: https://github.com/OpenDCAI/DataFlex