Exploration and Practice of Large‑Model Data Construction
This presentation details engineering‑focused approaches to building, mixing, and filtering data for large language models. It covers data preparation; pre‑training data‑mixing strategies such as DoReMi, DoGE, and online sampling; post‑training methods for selecting high‑quality data; and a practical Q&A on scaling laws and PDF processing.