Network Intelligence Research Center (NIRC)
May 10, 2023 · Artificial Intelligence
How LLaMA's Training Data Is Preprocessed with CCNet
Before training large language models such as LLaMA, Meta AI runs Common Crawl web data through the multi-stage CCNet pipeline: it reads crawl snapshots in WET (extracted plain-text) format, deduplicates paragraphs, identifies and filters languages with fastText, and then filters for quality, first by similarity to Wikipedia and then with a linear classifier trained to recognize pages cited as references in Wikipedia.
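To make the paragraph-deduplication stage concrete, here is a minimal sketch of hash-based deduplication across documents. It is loosely modeled on the idea of hashing normalized paragraphs and is not the CCNet implementation: the function names (`normalize`, `paragraph_key`, `dedup_documents`), the normalization rules, and the 8-byte key length are all assumptions chosen for illustration.

```python
import hashlib
import re
import unicodedata


def normalize(paragraph: str) -> str:
    """Illustrative normalization (an assumption, not CCNet's exact rules):
    lowercase, map every digit to 0, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKD", paragraph.lower())
    text = re.sub(r"\d", "0", text)
    return re.sub(r"\s+", " ", text).strip()


def paragraph_key(paragraph: str) -> bytes:
    """Short fingerprint of a normalized paragraph (first 8 bytes of SHA-1)."""
    return hashlib.sha1(normalize(paragraph).encode("utf-8")).digest()[:8]


def dedup_documents(documents):
    """Drop any paragraph whose fingerprint was already seen in this or an
    earlier document; documents left with no paragraphs are dropped."""
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            if not para.strip():
                continue  # skip blank lines
            key = paragraph_key(para)
            if key in seen:
                continue  # duplicate paragraph, e.g. a repeated nav bar
            seen.add(key)
            kept.append(para)
        if kept:
            result.append("\n".join(kept))
    return result


docs = [
    "Home | About\nLLaMA is a language model.",
    "home  |  about\nCCNet cleans Common Crawl.",
]
cleaned = dedup_documents(docs)
```

In this toy input, the navigation line repeated in the second document differs only in case and spacing, so it hashes to the same key and is dropped, while the unique sentences survive. Language identification and quality filtering would run as later stages over the deduplicated text.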
CCNet · LLaMA · data preprocessing
