How LLaMA Preprocesses Training Data with CCNet Before Model Training
Before training large language models such as LLaMA, Meta AI applies a multi-stage pipeline based on CCNet: it takes Common Crawl web data in WET format, deduplicates paragraphs, identifies and filters languages with fastText, scores pages by their similarity to Wikipedia, and finally filters them with a linear model trained on pages cited by Wikipedia.
1. Data Crawling and Storage
LLaMA is trained on unlabelled web text, and Common Crawl contributes about 3.3 TB of it. Common Crawl publishes its data as WARC, WAT, and WET files; LLaMA consumes the WET files, which contain only the plain text extracted from each page, without images or other resources.
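To make the WET layout concrete, here is a minimal sketch of a reader, assuming the usual WARC record shape: a `WARC/1.0` version line, `Name: value` headers, a blank line, then `Content-Length` bytes of text. The function name `iter_wet_records` and the sample bytes are hypothetical, and this is not the reader CCNet itself uses.

```python
import io

def iter_wet_records(stream):
    """Yield (headers, text) pairs from an uncompressed binary WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.strip():
            continue                    # blank separator between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break                   # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length counts bytes, so read bytes before decoding.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body.decode("utf-8", errors="replace")

# Hypothetical two-record sample for illustration only.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 12\r\n"
    b"\r\n"
    b"Hello, world"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"Content-Length: 3\r\n"
    b"\r\n"
    b"abc"
    b"\r\n\r\n"
)

for headers, text in iter_wet_records(io.BytesIO(SAMPLE)):
    print(headers.get("WARC-Target-URI"), len(text))
```

For a downloaded `.wet.gz` file, a stream opened with `gzip.open(path, "rb")` could presumably be passed in place of the `BytesIO` object.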
2. Deduplication
Duplicate paragraphs can make up as much as 70% of the raw corpus. Each Common Crawl snapshot is ~300 TB, and the WET files for a single snapshot are ~30 TB. Meta AI splits the data into 5 GB shards (CCNet shards). Within each shard, every paragraph is normalized before it is compared: all characters are lower-cased, every digit is replaced by the placeholder "0", and all Unicode punctuation and diacritic marks are removed.
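The three normalization rules can be sketched with the standard library alone. This is an illustrative version that relies on Unicode general categories, not CCNet's actual code; the function name `normalize` is an assumption.

```python
import unicodedata

def normalize(text: str) -> str:
    """Lower-case, map digits to '0', drop punctuation and diacritics."""
    text = text.lower()
    # NFD splits an accented letter into base letter + combining mark,
    # so the mark can be dropped on its own.
    text = unicodedata.normalize("NFD", text)
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M"):    # combining marks (diacritics)
            continue
        if category.startswith("P"):    # any Unicode punctuation
            continue
        out.append("0" if category == "Nd" else ch)
    return "".join(out)

print(normalize("Café No. 42!"))  # -> cafe no 00
```

Because the output is only used for comparison, two paragraphs that differ in case, numbers, or punctuation collapse to the same string and are treated as duplicates.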
Each normalized paragraph is then represented by the first 64 bits of its SHA-1 digest, and the hashes are written to a binary file. Comparing these hash files across shards identifies duplicate paragraphs so they can be removed. Because hashes are computed per shard, the shards can be processed in parallel, which raises throughput.
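A minimal sketch of hash-based deduplication follows. It keeps the hashes in an in-memory `set` rather than in binary files, and it uses a simple lower-case and strip as a stand-in for the full normalization; both choices, along with the function names, are assumptions made for brevity.

```python
import hashlib

def paragraph_hash(normalized: str) -> bytes:
    """First 64 bits (8 bytes) of the SHA-1 digest."""
    return hashlib.sha1(normalized.encode("utf-8")).digest()[:8]

def dedup_shard(paragraphs, seen):
    """Keep paragraphs whose hash is not yet in `seen`; update `seen`."""
    kept = []
    for paragraph in paragraphs:
        # Stand-in normalization; the real pipeline normalizes more aggressively.
        h = paragraph_hash(paragraph.lower().strip())
        if h in seen:
            continue
        seen.add(h)
        kept.append(paragraph)
    return kept

seen = set()  # hashes collected from shards processed so far
shard_a = ["Hello world", "Accept all cookies"]
shard_b = ["accept all cookies", "Something new"]
print(dedup_shard(shard_a, seen))
print(dedup_shard(shard_b, seen))
```

Storing 8 bytes per paragraph instead of the paragraph itself is what makes comparing many shards against each other affordable.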
3. Language Identification and Filtering
CCNet applies a fastText n-gram classifier that supports 176 languages. For each page, the classifier outputs a confidence score in the range [0, 1] for each language, and it processes roughly 1,000 documents per second on a single CPU core. If every language score for a page is below 0.5, the page is considered ambiguous and is discarded.
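The thresholding rule can be sketched independently of the classifier. The snippet below assumes the per-language scores have already been produced and are passed in as a dictionary; `pick_language` is a hypothetical helper, not part of fastText or CCNet.

```python
from typing import Dict, Optional

def pick_language(scores: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the top-scoring language, or None if no score reaches the threshold."""
    if not scores:
        return None
    lang, score = max(scores.items(), key=lambda item: item[1])
    return lang if score >= threshold else None

print(pick_language({"en": 0.93, "fr": 0.04}))  # -> en
print(pick_language({"en": 0.41, "nl": 0.38}))  # -> None
```

A page for which the function returns `None` corresponds to the ambiguous case described above and would be dropped.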
4. Per‑Page Quality Assessment
After language filtering, a small model evaluates how similar each page is to Wikipedia. Each page is tokenized, and a 5-gram Kneser-Ney language model trained on Wikipedia computes the perplexity of each paragraph. The text is then split into groups according to the perplexity distribution: low-perplexity text, which reads like Wikipedia and is treated as high quality, is kept, while high-perplexity text (low quality or off-topic) is removed.
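As a rough illustration of the scoring step, perplexity is the exponential of the negative mean log-probability per token. The sketch below assumes natural-log token probabilities are already available from some language model; the three bucket names and the cut-off values are illustrative assumptions, not values taken from the text above.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def bucket(ppl, low_cut, high_cut):
    """Assign a perplexity to one of three quality buckets."""
    if ppl <= low_cut:
        return "head"     # most Wikipedia-like
    if ppl <= high_cut:
        return "middle"
    return "tail"         # least Wikipedia-like

# Hypothetical paragraph: four tokens, each with probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)
print(round(ppl, 6), bucket(ppl, low_cut=10.0, high_cut=100.0))
```

In practice the cut-offs would be derived from the observed perplexity distribution, for example from its percentiles, rather than fixed by hand.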
5. Additional Filtering (Beyond CCNet)
LLaMA’s paper describes a final filtering step not present in CCNet. Researchers construct a dataset containing pages cited by Wikipedia and a set of randomly sampled pages. A linear model is trained to predict whether a page is Wikipedia‑cited; pages classified as not cited are discarded, further improving overall data quality.
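The idea of a linear "cited versus random" classifier can be sketched as a tiny logistic regression over bag-of-words counts. Everything here is an assumption made for illustration: the features, the training loop, the hyperparameters, and the toy data are hypothetical and are not taken from the LLaMA paper, which states only that a linear model was used.

```python
import math
from collections import Counter

def featurize(text):
    """Bag-of-words counts over whitespace tokens."""
    return Counter(text.lower().split())

def score(text, weights, bias):
    """Estimated probability that a page resembles Wikipedia-cited pages."""
    feats = featurize(text)
    z = bias + sum(weights.get(tok, 0.0) * n for tok, n in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=50, lr=0.5):
    """Logistic regression by SGD; label 1 = cited, 0 = random page."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            err = label - score(text, weights, bias)
            bias += lr * err
            for tok, n in featurize(text).items():
                weights[tok] = weights.get(tok, 0.0) + lr * err * n
    return weights, bias

# Toy training set, invented for this sketch.
EXAMPLES = [
    ("peer reviewed study of climate data", 1),
    ("historical archive of census records", 1),
    ("buy cheap pills online now", 0),
    ("click here to win free prizes", 0),
]

weights, bias = train(EXAMPLES)
pages = ["study of census data", "win cheap prizes now"]
kept = [p for p in pages if score(p, weights, bias) >= 0.5]
print(kept)
```

Pages that score below 0.5 play the role of "classified as not cited" and are filtered out.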
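Common Crawl file formats
- WARC (Web ARChive): a file format for storing and transferring web resources such as HTML pages, images, and video files. WARC files typically contain HTTP responses and metadata that record the information collected by a web crawler.
- WAT (Web Archive Transformation): a metadata file format that describes the web content recorded in WARC files. WAT files typically contain URLs, domain names, and other metadata about the records.
- WET (WARC Encapsulated Text): a file format that holds HTML pages converted to plain text. WET files typically contain the text extracted from HTML pages, but not images or other resources.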
Network Intelligence Research Center (NIRC)
NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
