Exploration and Practice of Large‑Model Data Construction
This presentation details engineering‑focused approaches to building, mixing, and filtering data for large language models, covering data preparation, pre‑training mix strategies such as DoReMi, DoGE and online sampling, post‑training data quality selection methods, and practical Q&A on scaling laws and PDF processing.
Overview: The talk, titled “Exploration and Practice of Large Model Data Construction,” introduces the importance of data engineering for large‑scale models, outlining the workflow from data preparation through pre‑training, post‑training, and dynamic data versioning.
01 From an Engineering Perspective on Data Construction
The engineering process involves three stages—data preparation, pre‑training, and post‑training—each with challenges such as long pre‑training cycles and the need for continuous data version adjustments. Pilot models (e.g., ~1B‑parameter models) are used to monitor the effect of data changes and guide adjustments.
1. Data Preparation and Training Flow
Real‑world training deviates from the ideal three‑step pipeline, requiring dynamic data version updates and iterative refinement during both pre‑training and post‑training.
2. Factors Influencing Data Version Updates
Changes in data sources affect composition and quality.
Introducing new data and adjusting domain ratios are essential but costly.
Natural vs. ideal data mix discrepancies require continual re‑balancing.
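The re‑balancing step above amounts to computing per‑domain sampling weights that move the natural mix toward the target mix. A minimal sketch, where the domain names and ratios are purely illustrative:

```python
def rebalance_weights(natural, target):
    """Per-domain sampling weight: target share divided by natural share.
    Domains under-represented relative to the target get weight > 1."""
    return {d: target[d] / natural[d] for d in natural}

# Hypothetical natural (as-crawled) vs. target (ideal) domain shares.
natural = {"web": 0.70, "code": 0.10, "math": 0.05, "books": 0.15}
target  = {"web": 0.50, "code": 0.17, "math": 0.25, "books": 0.08}

weights = rebalance_weights(natural, target)
# math is up-sampled 5x, web is down-sampled below 1x.
```

In practice these weights feed a weighted sampler over the training shards, so under‑represented domains are drawn more often without physically duplicating data.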
3. Pilot Model Applications
Small pilot models (≈1B parameters) are trained to experiment with data deduplication, cleaning, and mix adjustments; multi‑level pilots may be employed for finer control.
02 Pre‑training Data Mix Strategies
Effective data mixing (DoReMi, DoGE, online domain sampling) significantly improves training efficiency and downstream performance. DoReMi optimizes mix weights via a small reference model and a Group‑DRO proxy; DoGE uses a bi‑level optimizer; online sampling treats data domains as multi‑armed bandit arms.
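The Group‑DRO‑style update at the heart of DoReMi can be sketched as a multiplicative‑weights step that up‑weights domains where the proxy model lags the reference model. This is a toy illustration; the excess‑loss values and hyperparameters are invented:

```python
import math

def doremi_update(weights, excess_loss, eta=1.0, smoothing=1e-3):
    """One multiplicative-weights step: up-weight domains whose proxy-model
    loss exceeds the reference-model loss (positive excess loss)."""
    scaled = [w * math.exp(eta * max(l, 0.0)) for w, l in zip(weights, excess_loss)]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    # Mix with the uniform distribution so every domain keeps being sampled.
    k = len(probs)
    return [(1 - smoothing) * p + smoothing / k for p in probs]

w = [0.25, 0.25, 0.25, 0.25]    # initial domain weights
excess = [0.4, 0.0, 0.1, -0.2]  # proxy loss minus reference loss, per domain
w = doremi_update(w, excess)    # domain 0 gains weight, domain 3 does not
```

The final training mix is then the average of these weights over the reference run, used to sample data for the full‑size model.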
3. Characteristics and Limitations of Existing Methods
Changes in the data set directly affect model performance, yet many mixing methods optimize a training‑loss objective and ignore downstream evaluation signals, yielding sub‑optimal results on specific tasks.
4. Engineering Practices for Mix Adjustment
Trigger mix updates proactively during early and mid‑training based on evaluation metrics.
Use pilot models together with mix‑optimization methods such as DoReMi to guide adjustments.
Avoid high‑cost algorithms as training progresses.
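Treating domains as bandit arms, as in the online‑sampling approach mentioned earlier, can be sketched with an EXP3‑style sampler. The reward signal here (per‑batch loss decrease) and all constants are assumptions for illustration:

```python
import math, random

class DomainBandit:
    """EXP3-style online domain sampler: arms are data domains,
    reward is any training signal, e.g. per-batch loss decrease."""
    def __init__(self, domains, gamma=0.1):
        self.domains = domains
        self.gamma = gamma
        self.log_w = {d: 0.0 for d in domains}

    def probs(self):
        m = max(self.log_w.values())          # subtract max for stability
        w = {d: math.exp(v - m) for d, v in self.log_w.items()}
        total = sum(w.values())
        k = len(self.domains)
        return {d: (1 - self.gamma) * w[d] / total + self.gamma / k
                for d in self.domains}

    def sample(self):
        p, r, acc = self.probs(), random.random(), 0.0
        for d in self.domains:
            acc += p[d]
            if r <= acc:
                return d
        return self.domains[-1]

    def update(self, domain, reward):
        # Importance-weighted reward keeps the estimate unbiased.
        p = self.probs()[domain]
        self.log_w[domain] += self.gamma * reward / (p * len(self.domains))

bandit = DomainBandit(["web", "code", "math"])
arm = bandit.sample()                 # pick a domain for the next batch
bandit.update(arm, reward=0.05)       # reward it by the observed loss drop
```

Because each update is a constant‑time bookkeeping step, this style of online adjustment stays cheap even late in training, in line with the advice above to avoid high‑cost algorithms as training progresses.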
5. Experience Summary
Increasing the proportion of Chinese data to roughly 40% improved performance even on English benchmarks.
Raising math and code data ratios (e.g., 25% math, 17% code) improves specialized abilities.
Specialized parsers are needed for extracting formulas and tables from PDFs.
03 Post‑training Data Quality Filtering
Transitioning from quantity‑centric to quality‑centric data selection is crucial. Various filtering methods—CherryLLM, Superfiltering, MoDS, NUGGETS, LESS—evaluate data based on instruction‑following difficulty, reward model scores, or gradient impact.
Classification of Methods
Methods split into model‑based (high compute) and metric‑based (fast but potentially noisy) approaches.
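A representative metric‑based approach is the Instruction‑Following Difficulty (IFD) score used by CherryLLM‑style methods: the ratio of the response loss conditioned on the instruction to the unconditioned response loss. The per‑sample losses below are hypothetical stand‑ins for what a small scoring model would produce:

```python
def ifd_score(loss_with_instruction, loss_answer_only):
    """CherryLLM-style IFD: conditioned response loss / unconditioned
    response loss. Scores near 1 mean the instruction barely helps the
    model predict the answer; such samples are treated as harder and
    more informative for instruction tuning."""
    return loss_with_instruction / loss_answer_only

# Hypothetical per-sample losses from a small scoring model.
samples = [
    {"id": 1, "cond": 2.1, "uncond": 2.3},  # instruction helps a little
    {"id": 2, "cond": 0.4, "uncond": 2.2},  # instruction makes it trivial
    {"id": 3, "cond": 2.0, "uncond": 2.0},  # instruction provides no help
]
ranked = sorted(samples, key=lambda s: ifd_score(s["cond"], s["uncond"]),
                reverse=True)
# Keep only the top fraction by IFD score for fine-tuning.
```

This illustrates the trade‑off named above: the metric is cheap once losses are computed, but it inherits whatever noise the small scoring model introduces.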
Ideal Metric Exploration
Desired metrics should be precise, deterministic, and dynamically decreasing as training progresses.
Engineering Practices
Classify data by logical, cognitive, and comprehension levels.
Iteratively optimize mix across single‑skill and cross‑skill stages.
Multiple refinement rounds improve overall model performance.
04 Q&A Session
Key takeaways: scaling laws apply to data volume but quality is paramount; math data mainly comes from problem banks; training order of domains matters; pilot models can differ structurally from main models; PDF processing requires specialized layout, table, and formula recognizers.
Thank you for attending.
DataFunSummit