How Tiered Data Governance Supercharges LLM Training: The UltraData L0‑L4 Framework

The article presents a tiered data‑management system (L0‑L4) for large language models, explains its motivation, details each tier's processing steps, and validates its effectiveness through extensive experiments that show consistent performance gains across multiple domains and training strategies.


Data‑Driven AI Evolution and the Need for Tiered Data Governance

AI development has progressed through four paradigms—symbolic learning, supervised learning, unsupervised learning, and feedback learning—culminating in a data‑driven approach where scaling data size drives model capability. As public data resources approach saturation, a new “data‑model co‑evolution” stage is proposed, where models actively improve data quality and governance.

UltraData Tiered Data Management (L0‑L4)

The UltraData framework defines five hierarchical data tiers that align data processing cost and quality with the needs of different training phases:

L0 – Raw Data : Unprocessed petabyte‑scale corpora (e.g., Common Crawl, PDFs) stored for downstream processing.

L1 – Filtered Data : Basic heuristic cleaning and deduplication; syntactically clean but semantically heterogeneous.

L2 – Selected Data : High‑information samples chosen by model‑based scoring or lightweight embedding classifiers.

L3 – Refined Data : Structured, high‑quality data generated by rewriting, synthesis, and human annotation; suitable for SFT, RL, and mid‑training.

L4 – Organized Data : Fully curated, verified, and indexed data ready for retrieval‑augmented generation (RAG) applications.
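The five tiers above can be sketched as a toy promotion pipeline. The `Tier` enum, the `promote` function, and the length/score thresholds are illustrative assumptions for this sketch, not the paper's actual implementation:

```python
from enum import IntEnum

class Tier(IntEnum):
    """The five UltraData tiers, ordered by processing investment."""
    L0_RAW = 0        # unprocessed corpora (Common Crawl, PDFs)
    L1_FILTERED = 1   # heuristic cleaning + deduplication
    L2_SELECTED = 2   # model-scored, high-information samples
    L3_REFINED = 3    # rewritten/synthesized, human-annotated
    L4_ORGANIZED = 4  # verified and indexed for RAG

def promote(docs, tier):
    """Apply a placeholder governance rule to move docs up one tier.

    Real L1/L2 stages use richer heuristics and learned scorers;
    the predicates here only stand in for them.
    """
    if tier is Tier.L1_FILTERED:
        seen, out = set(), []
        for d in docs:
            key = d["text"].strip().lower()
            # heuristic length filter + exact-match deduplication
            if len(key) > 20 and key not in seen:
                seen.add(key)
                out.append(d)
        return out
    if tier is Tier.L2_SELECTED:
        # keep only docs a (hypothetical) quality model rated highly
        return [d for d in docs if d.get("quality_score", 0.0) >= 0.7]
    raise NotImplementedError(f"{tier.name} requires an LLM-based stage")
```

The cheap stages (L1, L2) are plain filters over existing text, while L3/L4 deliberately raise `NotImplementedError` here: rewriting, synthesis, and indexing cannot be expressed as a filter, which is exactly the cost asymmetry the framework's design exploits.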

The design balances three motivations: (1) public data is nearing a plateau, requiring finer‑grained governance; (2) different training stages demand distinct data quality, quantity, and distribution; (3) governance cost must be matched to model benefit, using lightweight methods early and more expensive LLM‑based labeling later.

Experimental Validation of Tiered Governance

Four domains (English web, Chinese web, mathematics, code) were evaluated using the MiniCPM‑1.2B model. For each tier (L1‑L3) a 10 B‑token sample was drawn and benchmarked, consistently showing L3 > L2 > L1 performance.

Two training schedules were compared:

Mixed training : a single 120 B‑token phase mixing L1‑L3 data in a 1:1:1 ratio.

Tiered training : three consecutive 40 B‑token phases using L1, then L2, then L3 data.
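The two schedules can be made concrete with a small token-budget sketch. Both consume the same 120 B tokens; only the ordering of quality differs (function names and the dict layout are assumptions of this sketch):

```python
def mixed_schedule(total_tokens=120e9):
    """One phase mixing L1/L2/L3 data in a 1:1:1 ratio."""
    share = total_tokens / 3
    return [{"L1": share, "L2": share, "L3": share}]

def tiered_schedule(total_tokens=120e9, tiers=("L1", "L2", "L3")):
    """Consecutive equal phases of monotonically increasing quality."""
    per_phase = total_tokens / len(tiers)
    return [{tier: per_phase} for tier in tiers]
```

Under a fixed budget, tiered training is a curriculum: the same tokens are seen, but the highest-value L3 data arrives last, when the model can extract the most from it.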

Tiered training yielded an average gain of 1.49 pp over mixed training, with larger improvements in later stages (up to +3.31 pp), confirming that progressively increasing data quality boosts learning efficiency.

UltraData‑Math Dataset Construction

The tiered pipeline was applied to build a large‑scale mathematics pre‑training corpus:

L0‑Parser : UltraData‑Math‑Parser (based on magic‑html) normalizes MathML, KaTeX, and AsciiMath to LaTeX.

L1 : 170.5 B tokens of raw web math content after basic cleaning.

L2 : 33.7 B tokens of high‑quality math data selected by model‑based scoring.

L3 : 88 B tokens of synthesized multi‑format data (Q&A, dialogues, textbooks) generated by UltraData‑Math‑Generator.
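The funnel shape of the pipeline is worth making explicit: selection shrinks the corpus, while synthesis grows it again. A quick check on the reported token counts (the helper below is illustrative):

```python
def retention(tokens_in, tokens_out):
    """Fraction of tokens surviving a governance stage."""
    return tokens_out / tokens_in

tokens = {"L1": 170.5e9, "L2": 33.7e9, "L3": 88.0e9}

# Model-based selection keeps roughly a fifth of the filtered corpus.
selection_rate = retention(tokens["L1"], tokens["L2"])  # ~0.198

# L3 exceeds L2 because it is generated, not filtered: the 33.7 B
# selected tokens seed 88 B tokens of synthesized Q&A, dialogues,
# and textbooks.
expansion = tokens["L3"] / tokens["L2"]  # ~2.6x
```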

Models trained on UltraData‑Math‑L3 outperformed leading open‑source math datasets (Nemotron‑CC, MegaMath, FineMath) on benchmarks:

MATH: +0.64

GSM8K, Math‑Bench, R‑Bench‑T: substantial gains

MBPP (code generation): +49.27 pp

General knowledge (MMLU) remained strong.

Full‑Stack Governance Tools

The UltraData platform bundles open‑source utilities for each tier:

UltraData‑Parser : raw‑data parsing (web, PDF, etc.)

UltraData‑Cleaner : heuristic filtering and deduplication

UltraData‑Selector : model‑based scoring and lightweight embedding classification

UltraData‑Generator : data rewriting, synthesis, and human‑in‑the‑loop refinement

UltraData‑Organizer : knowledge‑base construction, verification, and indexing for RAG
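Because each tool consumes and produces a document collection, the whole stack composes as a simple stage chain. The `run_pipeline` helper and the lambda stand-ins below are hypothetical; they only mirror the Parser → Cleaner → Selector ordering, not the tools' real interfaces:

```python
def run_pipeline(raw_docs, stages):
    """Chain governance stages; each takes and returns a list of docs."""
    docs = raw_docs
    for stage in stages:
        docs = stage(docs)
    return docs

# Toy stand-ins for UltraData-Parser, -Cleaner, and -Selector:
parse  = lambda docs: [{"text": d} for d in docs]            # raw strings -> records
clean  = lambda docs: [d for d in docs if len(d["text"]) > 10]  # heuristic filter
select = lambda docs: sorted(docs, key=lambda d: len(d["text"]),
                             reverse=True)[:2]               # keep "best" 2
```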

All tools are released on HuggingFace Spaces, enabling immediate experimentation without local deployment.

Key Findings

Tiered data governance provides a systematic way to allocate high‑value data to later training phases, maximizing marginal performance gains under a fixed token budget.

Empirical results on MiniCPM‑1.2B demonstrate that progressive quality improvements (L1→L2→L3) consistently translate into higher benchmark scores across diverse domains.

The UltraData‑Math pipeline produces the largest publicly available high‑quality math dataset, delivering state‑of‑the‑art performance on multiple mathematical reasoning benchmarks.

Written by

PaperAgent

Daily updates, analyzing cutting-edge AI research papers
