Breaking the Data Ceiling: UltraData’s 2.4 TB Tiered Dataset with the Largest L3 Math Library
UltraData presents a five‑level tiered data‑management system (L0‑L4) for large‑language‑model training and releases what it describes as the world’s largest open L3 mathematics dataset (2.4 TB). Experiments with MiniCPM‑1.2B show consistent performance gains across web, multilingual, math and code domains, and the project open‑sources a suite of governance tools alongside a community portal.
Background and Motivation – The authors observe that public high‑quality data resources are approaching saturation, so further model scaling becomes increasingly difficult without more refined data governance. Different training stages (pre‑training, mid‑training, instruction tuning, etc.) demand data of different quality, quantity, and distribution, and cost‑effective governance must balance these needs.
Tiered Data‑Management Framework (L0‑L4) – A systematic pipeline is defined:
L0 – Raw Data : Unprocessed petabyte‑scale corpora (e.g., Common Crawl, PDFs) stored as a reserve, not used directly for training.
L1 – Filtered Data : Basic heuristic cleaning and deduplication (e.g., FineWeb, DCAD) producing format‑consistent but semantically heterogeneous data.
L2 – Selected Data : Model‑based scoring and labeling to retain high‑information‑density samples (e.g., Ultra‑FineWeb, FineWeb‑edu).
L3 – Refined Data : Re‑writing, synthesis, and human annotation to generate structured, high‑quality content (e.g., Ultra‑Chat, UltraFeedback) suitable for mid‑training, SFT, RL.
L4 – Organized Data : Final validation, structuring and indexing (e.g., Wikidata, UltraData‑Arxiv) for direct RAG use.
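The five tiers above can be sketched as an ordered enumeration, with each tier mapped to the processing that produces it. This is an illustrative model only: the names `Tier`, `PROCESSING` and `steps_to_reach` are hypothetical and not part of any UltraData tool, and the sketch assumes the tiers are strictly cumulative, which the description implies but does not state outright.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Data tiers from the framework; a higher value means more processing."""
    L0_RAW = 0
    L1_FILTERED = 1
    L2_SELECTED = 2
    L3_REFINED = 3
    L4_ORGANIZED = 4


# Processing that produces each tier, paraphrased from the tier list.
PROCESSING = {
    Tier.L1_FILTERED: "heuristic cleaning and deduplication",
    Tier.L2_SELECTED: "model-based scoring and labeling",
    Tier.L3_REFINED: "rewriting, synthesis, and human annotation",
    Tier.L4_ORGANIZED: "validation, structuring, and indexing",
}


def steps_to_reach(target: Tier) -> list[str]:
    """Return the steps applied to raw (L0) data to reach `target`.

    Assumption: every tier builds on the one directly below it.
    """
    return [PROCESSING[t] for t in Tier if Tier.L0_RAW < t <= target]


if __name__ == "__main__":
    for step in steps_to_reach(Tier.L3_REFINED):
        print(step)
```

Using an `IntEnum` keeps the tiers comparable, so a check such as `tier >= Tier.L2_SELECTED` reads naturally when deciding which data a training stage may draw on.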
UltraData‑Math Dataset – Using the tiered pipeline, the team built UltraData‑Math, the largest open L3 mathematics dataset (≈2.4 TB, 290 B tokens). It consists of:
L1 : 170.5 B tokens of web‑derived math content.
L2 : 33.7 B tokens of high‑quality math data selected by model‑based scoring.
L3 : 88 B tokens of multi‑format data (Q&A, multi‑turn dialogue, textbook‑style material).
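As a quick sanity check, the per‑tier token counts listed above can be summed and compared with the headline figure of roughly 290 B tokens. The snippet uses only the numbers quoted in this article; the variable names are illustrative.

```python
# Token counts in billions, as quoted for UltraData-Math.
tokens_b = {"L1": 170.5, "L2": 33.7, "L3": 88.0}

total_b = sum(tokens_b.values())

# Share of the total contributed by each tier, in percent.
shares = {tier: 100 * count / total_b for tier, count in tokens_b.items()}

if __name__ == "__main__":
    print(f"total: {total_b:.1f} B tokens")
    for tier, share in shares.items():
        print(f"{tier}: {share:.1f}%")
```

The three tiers add up to 292.2 B tokens, which is consistent with the rounded "≈290 B" headline; L1 web data makes up more than half of the total.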
Experimental Validation – Data Quality – The authors sampled L1‑L3 data from four domains (English web, Chinese web, math, code) and, for each sample, trained MiniCPM‑1.2B in a 10 B‑token fast‑validation run before evaluating it. Results showed a monotonic performance increase from L1 to L3 across all domains, indicating that the tiered quality labels correlate with downstream model ability.
Training Strategy Comparison – Two strategies were compared on MiniCPM‑1.2B trained for 120 B tokens:
Mix Training : A single‑stage mix of L1, L2, L3 data in a 1:1:1 ratio.
Tiered Training : Three consecutive 40 B‑token stages using L1, then L2, then L3 data.
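The two strategies can be written down as token schedules to make the comparison concrete. This is a minimal sketch with hypothetical helper names (`mix_schedule`, `tiered_schedule`); it assumes the 1:1:1 ratio is measured in tokens and says nothing about how the authors actually implemented data loading.

```python
def mix_schedule(total_b: float, tiers: list[str]) -> list[dict[str, float]]:
    """One training stage that draws equally (by tokens) from every tier."""
    per_tier = total_b / len(tiers)
    return [{tier: per_tier for tier in tiers}]


def tiered_schedule(total_b: float, tiers: list[str]) -> list[dict[str, float]]:
    """One training stage per tier, in the given order, each of equal length."""
    per_stage = total_b / len(tiers)
    return [{tier: per_stage} for tier in tiers]


def tokens_per_tier(schedule: list[dict[str, float]]) -> dict[str, float]:
    """Total tokens (in billions) consumed from each tier over a schedule."""
    totals: dict[str, float] = {}
    for stage in schedule:
        for tier, count in stage.items():
            totals[tier] = totals.get(tier, 0.0) + count
    return totals


if __name__ == "__main__":
    tiers = ["L1", "L2", "L3"]
    print(mix_schedule(120, tiers))
    print(tiered_schedule(120, tiers))
```

Under these assumptions both schedules consume 40 B tokens from each tier, so the comparison isolates the ordering of the data rather than its composition.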
Tiered training finished with a 1.49 pp higher average score across benchmarks and consistently outperformed mix training in English, Chinese, math and code evaluations. The training curves showed the two strategies at parity early on (both rising from ≈24.7 to ≈28.3), after which tiered training kept climbing (28.35 → 31.66, +3.31 pp) while mix training slowed (28.26 → 30.17, +1.91 pp).
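The reported figures can be cross‑checked with a few lines of arithmetic. The snippet below uses only the average scores quoted above; the dictionary layout is illustrative.

```python
# (score at the point where the curves diverge, final score), as quoted.
curves = {
    "tiered": (28.35, 31.66),
    "mix": (28.26, 30.17),
}

# Late-stage gain of each strategy, in percentage points.
gains = {name: round(end - start, 2) for name, (start, end) in curves.items()}

# Gap between the two strategies at the end of training.
final_gap = round(curves["tiered"][1] - curves["mix"][1], 2)

if __name__ == "__main__":
    print(gains)
    print(final_gap)
```

The final gap between the two curves matches the 1.49 pp average advantage reported for tiered training, so the quoted numbers are mutually consistent.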
Domain‑Specific Gains – On the MATH benchmark, UltraData‑Math‑L3 scored 28.72, surpassing trafilatura‑based pipelines (28.08) and magic‑html (26.58). In full‑scale training, UltraData‑Math‑L3 outperformed leading open math datasets (Nemotron‑CC, MegaMath, FineMath) and set new records on MATH (+3.62 pp), GSM8K, Math‑Bench and R‑Bench‑T. Code generation (MBPP) improved by 49.27 pp, while general knowledge (MMLU) remained robust.
Tooling and Community Release – The tiered pipeline is encapsulated in open‑source tools:
UltraData‑Parser (L0) : Parses HTML, PDF, MathML, KaTeX, AsciiMath into LaTeX.
UltraData‑Cleaner (L1) : Heuristic filtering and deduplication.
UltraData‑Selector (L2) : Model‑based labeling and lightweight embedding classifier for quality ranking.
UltraData‑Generator (L3) : Multi‑style rewriting, synthesis, and knowledge‑grounded content creation.
UltraData‑Organizer (L4) : Structured knowledge‑base construction and verification.
All tools are hosted on the UltraData community site (https://ultradata.openbmb.cn/) and deployed on HuggingFace Spaces for instant demo access.
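To give a feel for what an L1‑style cleaning step involves, here is a minimal sketch of heuristic filtering followed by exact deduplication. It is not UltraData‑Cleaner's API: every function name and threshold below is a hypothetical choice for illustration, and production pipelines typically add near‑duplicate detection and many more heuristics.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_heuristics(text: str, min_chars: int = 50,
                      max_symbol_ratio: float = 0.3) -> bool:
    """Reject very short documents and documents dominated by symbols.

    Both thresholds are arbitrary illustrative values.
    """
    if len(text) < min_chars:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def clean(docs: list[str]) -> list[str]:
    """Keep the first copy of each document that passes the heuristics."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text rather than the raw text lets copies that differ only in case or spacing collapse into one entry, while the original formatting of the kept document is preserved.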
Impact and Outlook – By linking data quality tiers to model training phases, UltraData demonstrates a concrete “data‑model co‑evolution” pathway that yields measurable performance improvements without increasing total token budget. The open release of datasets, tools, and the community portal aims to foster collaborative data governance and accelerate progress toward more data‑efficient AGI research.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
