
Exploration and Practice of Large‑Model Data Construction

This presentation details engineering‑focused approaches to building, mixing, and filtering data for large language models, covering data preparation, pre‑training mix strategies such as DoReMi, DoGE and online sampling, post‑training data quality selection methods, and practical Q&A on scaling laws and PDF processing.

DataFunSummit

Overview: The talk introduces the importance of data engineering for large-scale models, outlining the workflow from data preparation through pre-training and post-training, with dynamic data versioning throughout.

01 From an Engineering Perspective on Data Construction

The engineering process involves three stages: data preparation, pre-training, and post-training. Each stage has its own challenges, such as long pre-training cycles and the need for continuous data-version adjustments. Small pilot models (e.g., 1B-parameter runs) are trained to monitor the effect of data changes and guide adjustments.

1. Data Preparation and Training Flow

Real‑world training deviates from the ideal three‑step pipeline, requiring dynamic data version updates and iterative refinement during both pre‑training and post‑training.

2. Factors Influencing Data Version Updates

Changes in data sources affect composition and quality.

Introducing new data and adjusting domain ratios are essential but costly.

Natural vs. ideal data mix discrepancies require continual re‑balancing.

3. Pilot Model Applications

Small pilot models (≈1B parameters) are trained to experiment with data deduplication, cleaning, and mix adjustments; multi‑level pilots may be employed for finer control.

02 Pre‑training Data Mix Strategies

Effective data mixing (DoReMi, DoGE, online domain sampling) significantly improves training efficiency and downstream performance. DoReMi trains a small proxy model with Group DRO against a reference model's losses to produce domain weights; DoGE optimizes weights with a bi-level formulation; online sampling treats data domains as multi-armed bandit arms.
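The multiplicative-weights idea behind DoReMi-style reweighting can be sketched as follows; the function name, the clipped excess-loss signal, and the smoothing constant are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def update_domain_weights(weights, proxy_losses, ref_losses, lr=1.0, smoothing=1e-3):
    """One DoReMi-style update: upweight domains where the proxy model
    lags the reference model (clipped excess loss), via an exponentiated
    gradient step. All arguments are arrays of shape (num_domains,)."""
    excess = np.maximum(proxy_losses - ref_losses, 0.0)  # clipped excess loss
    w = weights * np.exp(lr * excess)                    # multiplicative update
    w = w / w.sum()                                      # renormalize to a distribution
    # Smooth toward uniform so no domain's weight collapses to zero
    uniform = np.full_like(w, 1.0 / len(w))
    return (1 - smoothing) * w + smoothing * uniform
```

Domains where the proxy trails the reference receive more weight, while the smoothing term keeps every domain represented.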

3. Characteristics and Limitations of Existing Methods

Changes in the dataset affect model performance, yet many mixing methods ignore downstream evaluation, which leads to sub-optimal results on specific tasks.

4. Engineering Practices for Mix Adjustment

Trigger mix updates proactively during early and mid‑training based on evaluation metrics.

Use pilot models (e.g., DoReMi) to guide adjustments.

Avoid high‑cost algorithms as training progresses.
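Online domain sampling as a multi-armed bandit can be sketched with an EXP3-style sampler; the class name and the per-batch reward signal (e.g., loss reduction) are hypothetical, not a specific paper's algorithm:

```python
import math
import random

class DomainBandit:
    """EXP3-style sampler treating each data domain as a bandit arm."""

    def __init__(self, num_domains, gamma=0.1):
        self.gamma = gamma                      # exploration rate
        self.log_weights = [0.0] * num_domains  # log-space for numerical stability

    def probs(self):
        m = max(self.log_weights)
        exp_w = [math.exp(lw - m) for lw in self.log_weights]
        total = sum(exp_w)
        n = len(exp_w)
        # Mix the softmax distribution with uniform exploration
        return [(1 - self.gamma) * w / total + self.gamma / n for w in exp_w]

    def sample(self):
        return random.choices(range(len(self.log_weights)), weights=self.probs())[0]

    def update(self, domain, reward):
        # Importance-weighted reward keeps the estimate unbiased
        p = self.probs()[domain]
        self.log_weights[domain] += self.gamma * reward / (p * len(self.log_weights))
```

Each training step samples a domain for the next batch and feeds back a reward, so the mix adapts online without retraining a pilot model.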

5. Experience Summary

Increasing Chinese data proportion (~40%) boosts performance on English benchmarks.

Raising math and code data ratios (e.g., 25% math, 17% code) improves specialized abilities.

Specialized parsers are needed for extracting formulas and tables from PDFs.
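Assembling a concrete mix from target ratios like those above can be sketched with a hypothetical helper that assigns the remaining budget to general web text:

```python
def build_mix(targets, total=1.0):
    """Given target fractions for specialized domains (e.g. math, code),
    assign the remainder of the budget to general web text.
    `targets` maps domain name -> fraction; raises if the budget is exceeded."""
    assigned = sum(targets.values())
    if assigned > total:
        raise ValueError("target ratios exceed the budget")
    mix = dict(targets)
    mix["web"] = total - assigned  # remainder goes to general web text
    return mix
```

With the ratios cited in the talk (40% Chinese, 25% math, 17% code), the remaining 18% would fall to general web text under this scheme.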

03 Post‑training Data Quality Filtering

Transitioning from quantity‑centric to quality‑centric data selection is crucial. Various filtering methods—CherryLLM, Superfiltering, MoDS, NUGGETS, LESS—evaluate data based on instruction‑following difficulty, reward model scores, or gradient impact.

Classification of Methods

Methods split into model‑based (high compute) and metric‑based (fast but potentially noisy) approaches.
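A metric-based filter in the spirit of CherryLLM's instruction-following difficulty (IFD) score might look like the sketch below; the per-sample loss keys, the degenerate-sample cutoff, and the kept fraction are illustrative assumptions:

```python
def select_by_ifd(samples, top_frac=0.1):
    """Rank samples by an IFD-style score (loss on the answer given the
    instruction, divided by loss on the answer alone, assumed precomputed
    per sample) and keep the hardest fraction. `samples` is a list of dicts
    with hypothetical keys 'cond_loss' and 'uncond_loss'."""
    scored = [(s["cond_loss"] / s["uncond_loss"], s) for s in samples]
    # Drop degenerate samples where the instruction actively hurts (score > 1)
    scored = [(sc, s) for sc, s in scored if sc <= 1.0]
    scored.sort(key=lambda t: t[0], reverse=True)  # hardest (closest to 1) first
    k = max(1, int(len(scored) * top_frac))
    return [s for _, s in scored[:k]]
```

This is the fast, metric-based side of the split: two forward passes per sample suffice, at the cost of a noisier signal than model-based scoring.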

Ideal Metric Exploration

Desired metrics should be precise, deterministic, and dynamically decreasing as training progresses.

Engineering Practices

Classify data by logical, cognitive, and comprehension levels.

Iteratively optimize mix across single‑skill and cross‑skill stages.

Multiple refinement rounds improve overall model performance.
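One round of the iterative refinement described above could be sketched as follows; the skill names, evaluation target, and step size are hypothetical:

```python
def refine_mix(mix, skill_scores, target=0.7, step=0.05):
    """One refinement round: upweight skill buckets that evaluate below
    `target`, downweight the rest, then renormalize. `mix` maps skill
    name -> data fraction; `skill_scores` maps skill name -> eval score."""
    adjusted = {}
    for skill, weight in mix.items():
        if skill_scores[skill] < target:
            adjusted[skill] = weight * (1 + step)  # underperforming: add data
        else:
            adjusted[skill] = weight * (1 - step)  # on target: free up budget
    total = sum(adjusted.values())
    return {k: v / total for k, v in adjusted.items()}
```

Running several such rounds, re-evaluating between each, mirrors the multi-round refinement the talk reports improving overall model performance.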

04 Q&A Session

Key takeaways from the Q&A:

Scaling laws apply to data volume, but quality is paramount.

Math data mainly comes from problem banks.

The training order of domains matters.

Pilot models can differ structurally from the main model.

PDF processing requires specialized layout, table, and formula recognizers.

Thank you for attending.

Tags: Data Engineering, AI, large language models, Model Scaling, Pretraining, data mixing, post-training
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
