Can Cleaned Web Data Rival Proprietary Corpora for LLM Training?

This article analyzes whether large‑scale web crawls, when meticulously filtered and deduplicated, can match or surpass the performance of high‑quality curated datasets in training large language models, covering dataset composition, processing pipelines, experimental results, scaling‑law implications, and future data‑efficiency strategies.


Motivation

The authors investigate whether high‑quality proprietary corpora are necessary for large language model (LLM) pre‑training, or whether a carefully cleaned web‑only dataset can achieve comparable performance. Their study is based on the TII paper "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only".

Data Types for Pre‑training

Web data – massive, multilingual, petabyte‑scale collections such as CommonCrawl, freely available from Amazon S3.

Curated high‑quality corpora – domain‑specific resources (books, code, technical reports, dialogues, etc.) that are often proprietary.

Web‑Only Approach (RefinedWeb)

The TII team built the RefinedWeb dataset from CommonCrawl, applied aggressive filtering and deduplication, and trained Falcon‑40B on it. At release, the model ranked first on the Hugging Face Open LLM leaderboard, matching or surpassing models trained on curated corpora.

Web Data Processing Methodology

CommonCrawl Characteristics

Highly noisy: contains adult, violent, spam, and machine‑generated content.

Petabyte‑scale: billions of pages require heuristic pre‑filtering.

Two formats: raw WARC (full HTML) and WET (pre‑extracted plain text). The authors preferred WARC so they could apply their own extraction and cleaning rather than rely on the WET text.
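
Working from WARC means pages must first be pulled out of the archive. A minimal sketch using the warcio library (not mentioned in the article; any WARC reader would do), yielding raw HTML payloads for the extraction step described below:

```python
from warcio.archiveiterator import ArchiveIterator

def iter_html_payloads(warc_path):
    """Yield (url, raw_html_bytes) for each HTTP response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request and metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            yield url, record.content_stream().read()
```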

URL Filtering

Compiled a 4.6 M‑entry blacklist (mostly porn sites).

Built a keyword‑based URL filter; added finer‑grained rules to avoid false positives on medical or cultural sites (see the sketch after this list).

Explicitly retained high‑quality domains such as Wikipedia and arXiv.
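
A toy illustration of such a two‑stage filter; the keyword weights, allowlist, and threshold here are invented for the example, and the real rules are far more elaborate:

```python
from urllib.parse import urlparse

# Hypothetical weighted keywords; production lists are much larger and tuned.
BLOCKED_KEYWORDS = {"casino": 1.0, "xxx": 1.0, "pills": 0.5}
ALLOWED_DOMAINS = {"en.wikipedia.org", "arxiv.org"}  # always retained

def url_is_clean(url, blocked_domains, threshold=1.0):
    domain = urlparse(url).netloc.lower()
    if domain in ALLOWED_DOMAINS:
        return True   # explicit allowlist overrides everything
    if domain in blocked_domains:
        return False  # hard-blocked domain (the 4.6M-entry blacklist)
    # Soft keyword score rather than single-word bans, so that e.g. a
    # medical page mentioning one blocked word is not dropped outright.
    score = sum(w for kw, w in BLOCKED_KEYWORDS.items() if kw in url.lower())
    return score < threshold
```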

Text Extraction

Only the main article body is kept; navigation, headers, footers, ads, and other boilerplate are removed using the trafilatura library.
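
In code, the extraction step is essentially a single call; a minimal example (the URL is illustrative):

```python
import trafilatura

html = trafilatura.fetch_url("https://example.com/article")  # or a WARC payload
text = trafilatura.extract(html)  # main body only; nav, ads, and boilerplate stripped
```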

Cleaning Pipeline

Language Identification: a FastText n‑gram classifier (as used in CCNet) identifies each document's language; non‑English pages are filtered out (≈50 % of documents removed).
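
A sketch of this step with fastText's off‑the‑shelf 176‑language model (lid.176.bin, downloadable from fasttext.cc); the confidence threshold is an assumption, not the paper's value:

```python
import fasttext

lid = fasttext.load_model("lid.176.bin")  # pre-trained language-ID model

def keep_english(text, min_confidence=0.65):  # threshold is illustrative
    labels, scores = lid.predict(text.replace("\n", " "))  # fastText expects one line
    return labels[0] == "__label__en" and scores[0] >= min_confidence
```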

Rule‑Based Filtering: Discard lines with excessive punctuation, profanity, or other suspicious tokens, keeping the rules conservative to minimize bias.

ML‑Based Quality Scoring: Use Wikipedia‑linked pages as positive samples and random pages as negatives to train a classifier; retain only high‑scoring pages.
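
A minimal version of such a classifier using fastText's supervised mode; the file name, label scheme, and hyperparameters are illustrative, not the paper's exact setup:

```python
import fasttext

# quality_train.txt: one document per line, prefixed with
# "__label__good " (Wikipedia-linked pages) or "__label__bad " (random pages).
clf = fasttext.train_supervised(input="quality_train.txt", epoch=5, wordNgrams=2)

def is_high_quality(text, threshold=0.8):  # cutoff is an assumption
    labels, scores = clf.predict(text.replace("\n", " "))
    return labels[0] == "__label__good" and scores[0] >= threshold
```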

Deduplication:

Exact deduplication at the substring level: duplicated spans of 50 or more consecutive tokens are removed.

Fuzzy deduplication: documents are reduced to n‑gram sets and compared via MinHash/SimHash signatures, bucketed into 20 groups for scalable hashing (see the sketch after this list).

URL‑level deduplication by splitting the crawl into 100 shards, deduping within each shard, then cross‑checking.
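
A sketch of the fuzzy‑deduplication step with the datasketch library; the Jaccard threshold, shingle size, and number of permutations are illustrative, not the paper's settings:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128, shingle=5):
    """Build a MinHash signature from overlapping 5-token shingles."""
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(len(tokens) - shingle + 1, 1)):
        m.update(" ".join(tokens[i:i + shingle]).encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard threshold is illustrative

def is_duplicate(doc_id, text):
    m = minhash_of(text)
    if lsh.query(m):        # any indexed near-duplicate above the threshold?
        return True
    lsh.insert(doc_id, m)   # first occurrence: keep it and index it
    return False
```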

Experimental Results

Falcon‑40B trained on ~5 TB of filtered CommonCrawl (≈600 GB released) outperformed models trained on The Pile and other curated corpora on zero‑shot benchmarks.

Higher deduplication rates consistently boosted performance.

Training multiple epochs on web data degraded generalization, whereas curated data (e.g., code, arXiv) sometimes benefited from longer training.

The takeaway: a well‑cleaned, web‑only dataset can rival or exceed proprietary corpora for LLM pre‑training.

Non‑Web High‑Quality Datasets (Reference)

Academic & Specialized Corpora

PubMed Central, arXiv, FreeLaw, USPTO Backgrounds, PubMed Abstracts, PhilPapers, NIH ExPORTER.

Book Corpora

Books3, Project Gutenberg, BookCorpus2.

Dialogue & Conversational Data

OpenSubtitles, Ubuntu IRC, EuroParl, YouTube Subtitles, Hacker News.

Code Corpora

GitHub (code extraction pipeline), The Stack (≈3 TB deduped, 30 languages).

Download URL: https://huggingface.co/datasets/bigcode/the-stack-dedup

Cross‑Language Dataset (ROOTS)

1.6 TB covering 59 languages (46 natural, 13 programming) used for BLOOM training. 62 % comes from community‑curated sources; 38 % from OSCAR after community‑guided filtering.

Download URL: https://huggingface.co/datasets/bigscience-data
Cleaning URL: https://github.com/bigscience-workshop/data-preparation

Scaling‑Law Perspective

Compute, not data scarcity, is the current bottleneck. Rough estimates suggest that ~24 TB of high‑quality web data could train a ~1.3‑trillion‑parameter model, requiring ~400× the compute of LLaMA‑65B: training compute scales with parameters × tokens, so ~20× the parameters on ~20× the tokens means ~400× the FLOPs. Efficient hardware (e.g., NVIDIA DGX GH200) and smarter data utilization are needed.
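
A back‑of‑the‑envelope check of that estimate, using the common C ≈ 6·N·D approximation for training FLOPs; the figures for the hypothetical large model are assumptions for illustration:

```python
# C ≈ 6 * N * D, with N = parameters and D = training tokens.
llama_N, llama_D = 65e9, 1.4e12  # LLaMA-65B: 65B params, 1.4T tokens
big_N, big_D = 1.3e12, 28e12     # hypothetical 1.3T-param model, ~20x the tokens

ratio = (big_N * big_D) / (llama_N * llama_D)  # the factor of 6 cancels
print(f"compute ratio: {ratio:.0f}x")          # ~400x
```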

Beyond Text: Multimodal Training

Incorporating visual information can reduce the amount of text required, potentially breaking traditional scaling‑law constraints.

Improving Data Utilization

Prioritize pages with >70 % text density; discard pages shorter than 10 Chinese characters.

Remove profanity, hate speech, personal information, and malformed HTML.

Normalize scripts (traditional → simplified Chinese) and strip code, CSS, JavaScript.

Apply aggressive deduplication (SimHash, MinHash) and filter repetitive or machine‑generated content.
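
A compact sketch of the first two heuristics above; reading "text density" as the extracted‑text‑to‑raw‑HTML ratio is an assumption, and the function name is made up:

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK unified ideographs

def passes_heuristics(extracted_text, raw_html):
    # Text density: share of the raw HTML that survives extraction (>70%).
    density = len(extracted_text) / max(len(raw_html), 1)
    if density <= 0.70:
        return False
    # Minimum length: at least 10 Chinese characters.
    return len(CJK.findall(extracted_text)) >= 10
```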

References

trafilatura: https://trafilatura.readthedocs.io/en/latest/

NVIDIA DGX A100 320 GB system: https://resources.nvidia.com/en-us-dgx-systems/nvidia-dgx-a100-system-40gb-datasheet-web-us
