Understanding Supervised, Unsupervised, Self‑Supervised, Semi‑Supervised, and Reinforcement Learning for Large Language Model Training
The article explains various learning paradigms (supervised, unsupervised, self‑supervised, semi‑supervised, and reinforcement), describes dataset types and quality considerations, outlines preprocessing steps like filtering, deduplication, and tokenization, and discusses scaling laws linking model size, data volume, and compute resources, with concrete examples and code.
Learning Paradigms
Supervised learning : The model is trained on labeled examples by minimizing a loss computed from predicted and ground-truth values. Because annotation is costly, it is typically applied to relatively small datasets.
Unsupervised learning : The model discovers structure in unlabeled data (e.g., clustering, PCA). The loss is defined by the intrinsic data distribution rather than external labels.
Self‑supervised learning : A subset of unsupervised learning that automatically creates pseudo‑labels from the data itself. The canonical example is causal language modeling (CLM), where the task is to predict the next token.
Input processing : Extract a continuous text segment from massive raw corpora, tokenize it, and build an input sequence.
Original text: I love natural language processing
The input sequence decomposes into: I → love → natural → language → processing
Automatic pseudo‑label generation : Use the next token in the sequence as the label.
Output prediction : Perform a forward pass and output a probability distribution over the vocabulary for the next token.
Loss computation and back‑propagation : Compute cross‑entropy loss between the predicted distribution and the pseudo‑label, then back‑propagate to update parameters.
Self‑supervised learning combines the low cost of unlabeled data with a clear optimization target.
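To make the four steps concrete, here is a minimal sketch of a single CLM training step in plain NumPy; the toy vocabulary is illustrative, and the random logits stand in for a real model's forward pass:
import numpy as np

sentence = ["I", "love", "natural", "language", "processing"]
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}
ids = [vocab[w] for w in sentence]
inputs, labels = ids[:-1], ids[1:]        # pseudo-labels: the next token

logits = np.random.randn(len(inputs), len(vocab))                   # fake model output
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
loss = -np.log(probs[np.arange(len(labels)), labels]).mean()        # cross-entropy
print(f"cross-entropy loss: {loss:.3f}")  # back-propagation would then update weights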
Semi‑supervised learning : Train first on a small labeled set, generate high‑confidence pseudo‑labels for unlabeled samples, then train on the combined data with a weighted loss (supervised + unsupervised).
Reinforcement learning (RL) : An agent interacts with an environment and receives reward signals. The objective is to maximize cumulative reward rather than minimize a static loss.
Dataset Types
General text data : Used for foundational pre‑training to learn basic language understanding.
Domain‑specific text data : Fine‑tunes the model on specialized knowledge (e.g., finance, medicine).
Private text data : Tailors the model to internal enterprise knowledge during downstream fine‑tuning.
Unlabeled data : Massive corpora such as web pages, books, scientific articles, and code. Typical scale reaches trillions of tokens.
Labeled data : High‑quality manually annotated samples for supervised fine‑tuning (from thousands to millions of examples).
Common unlabeled sources include CommonCrawl, dialogue corpora, books, scientific texts, and GitHub code. Labeled examples include supervised fine‑tuning (SFT) datasets and RL reward datasets.
Data Quality and Model Performance
Higher‑quality data yields lower loss and better downstream metrics. Experiments compare filtered corpora (e.g., OpenWebText, C4) with raw web data (MassiveWeb Unfiltered). Applying a pipeline of quality filtering, exact deduplication, and fuzzy deduplication progressively lowers the loss.
Typical preprocessing steps:
Filtering : Classifier‑based (e.g., BERT scorer) or rule‑based methods using language, length, perplexity, or keyword lists.
# Example of a simple rule-based filter in Python
# (perplexity() is a placeholder for a language-model scorer)
cleaned = []
for text in documents:  # documents: an iterable of raw strings
    if len(text) < 200:
        continue  # discard short documents
    if perplexity(text) > 50:
        continue  # discard high-perplexity documents
    cleaned.append(text)
Deduplication : Sentence‑level, document‑level, or dataset‑level using MinHash/SimHash (a full SimHash example appears in the Data Preprocessing Pipeline section below).
Tokenization : Convert text into tokens or sub‑words (BPE, SentencePiece, WordPiece). Sub‑word tokenization mitigates the unknown‑word problem.
Scaling Laws
Loss L of large language models follows a power‑law relationship with model parameters N, training tokens D, and compute C:
N (parameters) : Larger models achieve lower loss; e.g., increasing from 100 M to 1 B parameters yields super‑linear loss reduction.
C (compute) : More FLOPs improve performance following an inverse power‑law.
D (data) : More tokens consistently lower loss up to a point; diminishing returns appear when compute is fixed.
Compute‑optimal training suggests scaling N and D proportionally (e.g., an 8× increase in parameters requires at least a 5× increase in tokens).
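A back-of-the-envelope sketch of that proportionality, assuming the Kaplan-style relation D ∝ N^0.74 (the exponent comes from the OpenAI 2020 fits; treat the numbers as illustrative):
def tokens_needed(d_base, n_scale, exponent=0.74):
    # Data required after scaling parameters by n_scale, if D grows as N^0.74
    return d_base * n_scale ** exponent

print(tokens_needed(300e9, 8) / 300e9)  # 8x parameters -> ~4.7x tokens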
Open‑Source Datasets
Pile : 22 high‑quality subsets (~825 GiB of diverse English text). URL: https://pile.eleuther.ai/
ROOTS : 46 natural languages and 13 programming languages (~1.6 TB). URL: https://data.baai.ac.cn/dataset
RefinedWeb : Filtered CommonCrawl data (≈11.67 % of >1 PB raw). URL: https://huggingface.co/datasets/tiiuae/falcon-refinedweb
SlimPajama : Cleaned and deduplicated version of RedPajama (~627 B tokens). URL: https://huggingface.co/datasets/cerebras/SlimPajama-627B
Typical preprocessing for these corpora includes NFC normalization, short‑document filtering, multi‑stage deduplication, interleaving of sources, and train/holdout splits to avoid leakage.
Data Preprocessing Pipeline
Data Filtering
Classifier‑based : Train a BERT‑style scorer on high‑quality reference texts (e.g., Wikipedia, books) and filter out low‑scoring samples.
Rule‑based : Apply heuristics such as language detection, length thresholds, perplexity limits, statistical feature thresholds, and keyword blacklists.
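A minimal stand-in for the classifier-based route, using scikit-learn's TF-IDF plus logistic regression instead of a BERT scorer (the training texts and the 0.5 threshold are placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["An encyclopedia-style reference paragraph about linguistics.",
         "BUY CHEAP pills NOW click here !!!"]
labels = [1, 0]                                # 1 = high quality, 0 = low quality
vec = TfidfVectorizer().fit(texts)
clf = LogisticRegression().fit(vec.transform(texts), labels)

score = clf.predict_proba(vec.transform(["candidate document text"]))[0, 1]
keep = score > 0.5                             # filter out low-scoring samples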
Deduplication
Sentence‑level : Remove identical or near‑identical sentences across documents.
Document‑level : Compute similarity scores (e.g., MinHash) and discard highly similar documents.
Dataset‑level : Ensure no overlap between training and holdout sets, especially when multiple sources (e.g., GitHub, Wikipedia) are combined.
Example using SimHash for batch deduplication (a stable MD5 hash replaces Python's built‑in hash(), which is salted per process):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import hashlib
import numpy as np

# Tokens of the sentence: "Software that cannot be reproduced does not count as open-source software"
words = ['不能', '复现', '的', '软件', '不算', '开源软件']

def encode_word(word):
    # Stable 128-bit hash of the word, expanded to a ±1 bit vector
    hash_int = int.from_bytes(hashlib.md5(word.encode('utf-8')).digest(), 'big')
    bits = bin(hash_int)[2:].zfill(128)
    return np.array([1 if b == '1' else -1 for b in bits])

embeddings = [encode_word(w) for w in words]
vector = np.sum(embeddings, axis=0)                       # column-wise sum of bit vectors
simhash = ''.join('1' if v > 0 else '0' for v in vector)  # 128-bit fingerprint
print(simhash)  # near-duplicates have a small Hamming distance between fingerprints
Tokenization
Tokenization converts raw text into a sequence of tokens or sub‑words. Common algorithms:
BPE (Byte‑Pair Encoding) : Iteratively merges the most frequent character pairs to build a sub‑word vocabulary.
SentencePiece : Learns a unigram language model or BPE without requiring pre‑tokenization.
WordPiece : Used in models like BERT; merges tokens based on likelihood maximization.
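A toy illustration of the core BPE step, counting adjacent symbol pairs and merging the most frequent one (a real implementation repeats this until the target vocabulary size is reached):
from collections import Counter

def count_pairs(corpus):
    # Count every adjacent symbol pair across all words
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(word, pair):
    # Replace each occurrence of the pair with a single merged symbol
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1]); i += 2
        else:
            out.append(word[i]); i += 1
    return out

corpus = [list("low"), list("lower"), list("lowest")]  # words as symbol lists
pairs = count_pairs(corpus)
best = max(pairs, key=pairs.get)
corpus = [merge_pair(w, best) for w in corpus]
print(best, corpus)  # e.g. ('l', 'o') -> ['lo', 'w'], ['lo', 'w', 'e', 'r'], ...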
After tokenization, corpora are typically stored in binary formats (e.g., .bin, .hdf5) and sharded (1 B, 2 B, 4 B token chunks) to enable parallel loading.
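A sketch of the storage step, writing token ids into fixed-size .bin shards with NumPy (the shard size, dtype, and filenames are illustrative; production shards hold 1 B-4 B tokens):
import numpy as np

SHARD_TOKENS = 1_000_000                                             # toy shard size
ids = np.random.randint(0, 50_000, size=2_500_000, dtype=np.uint16)  # stand-in ids
for i in range(0, len(ids), SHARD_TOKENS):
    shard = ids[i:i + SHARD_TOKENS]
    shard.tofile(f"shard_{i // SHARD_TOKENS:05d}.bin")  # reload with np.fromfile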
Dataset Quality Evaluation Criteria
Relevance : Data should be closely aligned with the target task.
Accuracy : Annotations must be correct and authoritative.
Diversity : Include varied domains, styles, and languages.
Consistency : Uniform formatting and tone across samples.
Cleanliness : Remove noise, duplicates, and malformed entries.
Appropriate scale : Sufficient quantity to avoid under‑fitting, but not so large that low‑quality data dominates.
Scaling‑Law Extensions
Empirical studies (OpenAI, 2020; DeepMind, 2022) show that the loss follows a sum of power laws: L(N, D, C) ≈ α·N^(−β) + γ·D^(−δ) + ε·C^(−ζ), with typical fitted exponents β ≈ 0.076 (parameters), δ ≈ 0.095 (data), and ζ ≈ 0.050 (compute). The three axes must be balanced: expanding only one dimension yields diminishing returns, and compute‑optimal scaling again calls for growing model size and token count together (see the parameter/token example above).
When compute C is fixed, there exists an optimal data size D beyond which loss improvement plateaus.
Example ratios for major models: GPT‑3 (175 B parameters) was trained on ~300 B tokens (≈1.7 × token‑to‑parameter ratio).
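A sketch of that trade-off under the common approximation that training compute is C ≈ 6·N·D FLOPs; the power-law coefficients below are placeholders, so only the shape of the curve, not the exact optimum, is meaningful:
import numpy as np

C = 1e23                          # fixed FLOP budget (illustrative)
N = np.logspace(8, 12, 200)       # candidate parameter counts
D = C / (6 * N)                   # tokens affordable at each size (C ~ 6*N*D)
loss = 0.1 * N**-0.076 + 0.1 * D**-0.095 + 1.69   # placeholder coefficients
best = np.argmin(loss)
print(f"optimal N ~ {N[best]:.2e} params, D ~ {D[best]:.2e} tokens")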
Key Open‑Source Data Processing Practices
NFC Normalization : Apply Unicode NFC (canonical composition) normalization and strip invalid or non‑printing characters.
Short‑Document Filtering : Discard documents shorter than 200 characters.
Global Deduplication : Build MinHashLSH indexes, query them for duplicates, construct graphs of duplicate clusters, and keep one representative per connected component (see the sketch after this list).
Interleaving & Shuffling : Mix multiple data sources using predefined weights to achieve desired diversity.
Train/Holdout Split : Create a holdout set for evaluation and ensure no overlap with the training set.
Deduplication Across Splits : Remove any document that appears in both train and holdout partitions.
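A condensed sketch of the normalization, filtering, and global-deduplication steps, using the datasketch library's MinHashLSH (the 0.5 threshold, the short length limit, and the toy documents are illustrative):
import unicodedata
from datasketch import MinHash, MinHashLSH

docs = {"a": "the quick brown fox", "b": "the quick brown foxes", "c": "hello world"}
lsh = MinHashLSH(threshold=0.5, num_perm=128)
kept = []
for key, text in docs.items():
    text = unicodedata.normalize("NFC", text)  # NFC normalization
    if len(text) < 5:                          # length filter (placeholder limit)
        continue
    m = MinHash(num_perm=128)                  # MinHash signature over tokens
    for token in text.split():
        m.update(token.encode("utf8"))
    if not lsh.query(m):                       # any near-duplicate already kept?
        lsh.insert(key, m)
        kept.append(key)
print(kept)  # one representative per duplicate cluster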
These steps are essential for building high‑quality LLM training corpora that yield strong generation and understanding capabilities.