Tagged articles
1 article
Network Intelligence Research Center (NIRC)
May 10, 2023 · Artificial Intelligence

How LLaMA Preprocesses Training Data with CCNet Before Model Training

Before training large language models such as LLaMA, Meta AI runs web data from Common Crawl through the multi‑stage CCNet pipeline. The pipeline reads the crawled text in WET format, deduplicates paragraphs, identifies and filters languages with fastText, and then refines the content further, first by scoring its similarity to Wikipedia and then with a linear model that classifies whether a page is cited as a Wikipedia reference.
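
As a rough illustration of the paragraph deduplication step, here is a minimal Python sketch. It is not CCNet's actual code: the function names, the normalization rule (lowercasing and collapsing whitespace), and the 8-byte truncated SHA-1 fingerprint are all assumptions made for this example.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", paragraph.strip().lower())


def paragraph_key(paragraph: str) -> bytes:
    """Return a compact fingerprint: the first 8 bytes of the SHA-1 digest."""
    return hashlib.sha1(normalize(paragraph).encode("utf-8")).digest()[:8]


def dedup_paragraphs(documents):
    """Drop every paragraph already seen in this or an earlier document."""
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            if not para.strip():
                continue  # skip empty lines
            key = paragraph_key(para)
            if key in seen:
                continue  # duplicate paragraph, e.g. a repeated menu or footer
            seen.add(key)
            kept.append(para)
        result.append("\n".join(kept))
    return result
```

Calling `dedup_paragraphs` on two pages that share a navigation line keeps that line only in the first page. Storing short fingerprints instead of full paragraphs keeps the memory cost of the `seen` set low, at the price of a small chance of hash collisions.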

CCNet · LLaMA · data preprocessing