Mastering LLM Fundamentals: Tokenizers, Layer Norm, and PEFT Explained
This article is a technical guide to large language model fundamentals: tokenizer construction methods (BPE, WordPiece, and SentencePiece), Layer Normalization and its variants RMS Norm and Deep Norm with code, and parameter‑efficient fine‑tuning (PEFT) techniques such as LoRA.
LLM Tokenizer Overview
Tokenizers split raw text into sub‑word units that large language models (LLMs) can process. Three widely used unsupervised methods are described:
Byte‑Pair Encoding (BPE) : Start with a character‑level vocabulary, count token frequencies, repeatedly merge the most frequent adjacent token pair, and update the vocabulary until a target size is reached. This iterative merging creates a dictionary that can represent both common words and rare or out‑of‑vocabulary tokens.
WordPiece : Similar to BPE, but instead of merging the most frequent pair it merges the pair that most increases the likelihood of the training data (roughly, the pair with the highest score freq(ab) / (freq(a) · freq(b))). At inference time each word is split by greedy longest‑match‑first lookup against the fixed vocabulary, with continuation pieces marked by a "##" prefix, as in BERT.
SentencePiece : A tokenization toolkit rather than a single algorithm: it treats the input as a raw character stream (encoding whitespace with a special marker), so no language‑specific pre‑tokenization is required, and it supports both BPE and unigram language model training. This makes it language‑agnostic and well suited to multilingual models, with flexible training configurations.
Each method improves handling of rare words and reduces the overall vocabulary size, which benefits model training and inference efficiency.
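The BPE training loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer; the toy corpus and the number of merges are made‑up values (WordPiece would differ only in the scoring rule used to pick the pair to merge):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a word-frequency corpus.

    corpus: dict mapping words to counts; each word starts as a tuple of characters.
    Returns the list of learned merges, in the order they were learned."""
    vocab = {tuple(word): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

merges = bpe_train({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
# -> [('l', 'o'), ('lo', 'w')]
```

Because every word in the toy corpus starts with "low", the first two merges build exactly that subword, showing how frequent character runs become vocabulary entries.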
Layer Normalization Variants
Layer Normalization (Layer Norm) stabilizes training by normalizing activations across the feature dimension for each sample.
Layer Norm Formula
Given an input vector x \in \mathbb{R}^d, the normalized output is:
\hat{x}_i = \frac{x_i - \mu}{\sigma} \quad\text{where}\quad \mu = \frac{1}{d}\sum_{j=1}^{d} x_j,\; \sigma = \sqrt{\frac{1}{d}\sum_{j=1}^{d}(x_j-\mu)^2}

Learnable scale \gamma and shift \beta are then applied: y_i = \gamma \hat{x}_i + \beta. In practice a small \epsilon is added alongside \sigma for numerical stability.
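As a quick numeric check of the Layer Norm formula above, in plain Python (with \gamma = 1, \beta = 0, and a small \epsilon added for stability, as practical implementations do):

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance, then scale/shift."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xj - mu) ** 2 for xj in x) / d)
    return [gamma * (xi - mu) / (sigma + eps) + beta for xi in x]

y = layer_norm([1.0, 2.0, 3.0, 4.0])
# The output has (approximately) zero mean and unit variance.
```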
RMS Norm
RMS Norm replaces the variance term with the root‑mean‑square of the input:
r = \sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}, \quad \hat{x}_i = \frac{x_i}{r}

A learnable scale \gamma is then applied: y_i = \gamma \hat{x}_i (there is no shift term). Because it skips mean subtraction, RMS Norm is computationally cheaper than Layer Norm; it was originally evaluated on recurrent networks and is now standard in many LLMs such as LLaMA.
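The same kind of numeric check for RMS Norm, in plain Python (\gamma = 1; note that, unlike Layer Norm, the mean is not removed, only the magnitude is normalized):

```python
import math

def rms_norm(x, gamma=1.0, eps=1e-5):
    """Scale a vector by its root-mean-square; no mean subtraction, no shift."""
    r = math.sqrt(sum(xj * xj for xj in x) / len(x) + eps)
    return [gamma * xi / r for xi in x]

y = rms_norm([1.0, 2.0, 3.0, 4.0])
# The output's mean square is (approximately) 1, but its mean is nonzero.
```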
Deep Norm
Deep Norm, introduced with DeepNet, stabilizes very deep Transformers by scaling the residual branch before Layer Norm: x_{l+1} = \mathrm{LN}(\alpha\, x_l + f(x_l)), where f is the sub‑layer, \alpha > 1 grows with network depth, and sub‑layer weights are down‑scaled by a related constant \beta at initialization. Bounding each layer's update in this way improves gradient flow, accelerates convergence, and reduces sensitivity to learning‑rate choices.
Deep Norm Code Example (PyTorch)
```python
import torch
import torch.nn as nn

class DeepNorm(nn.Module):
    """Simplified illustration: LayerNorm applied after every hidden layer.

    The full Deep Norm recipe additionally scales the residual branch by a
    depth-dependent constant alpha; that is omitted here for brevity."""
    def __init__(self, input_dim, hidden_dims, output_dim):
        super().__init__()
        self.layers = nn.ModuleList()
        self.norm_layers = nn.ModuleList()
        for hidden_dim in hidden_dims:
            self.layers.append(nn.Linear(input_dim, hidden_dim))
            self.norm_layers.append(nn.LayerNorm(hidden_dim))
            input_dim = hidden_dim
        self.output_layer = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        # Normalize after every hidden layer, then apply the nonlinearity.
        for layer, norm in zip(self.layers, self.norm_layers):
            x = torch.relu(norm(layer(x)))
        return self.output_layer(x)

# Example usage: batch of 32 samples, 100 input features, 10 outputs.
model = DeepNorm(100, [64, 32], 10)
output = model(torch.randn(32, 100))
```

Normalizing every hidden layer in this way gives smoother gradient propagation, reduced learning‑rate sensitivity, often better generalization, and a simpler training setup than batch normalization, since no running batch statistics need to be tracked.
Parameter‑Efficient Fine‑Tuning (PEFT)
PEFT techniques enable adapting large models with minimal additional parameters and reduced GPU memory consumption. Major families include:
LoRA (Low‑Rank Adaptation) : Adds low‑rank matrices A and B to frozen weight matrices, updating only these small adapters during fine‑tuning.
QLoRA (Quantized LoRA) : Combines LoRA with 4‑bit quantization to further shrink memory usage.
AdaLoRA, Prefix‑tuning, Prompt‑tuning, P‑tuning v2 : Various ways to inject trainable prompts or adapters at different layers, each with distinct trade‑offs in speed, memory, and performance.
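The LoRA update from the list above can be sketched with NumPy. This is a minimal illustration, not a real implementation: the shapes, the rank r = 4, and the scaling \alpha = 8 are made‑up values, and in practice a library such as `peft` wraps existing model layers instead:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 4     # frozen weight is d x k; adapter rank r << min(d, k)
alpha = 8.0               # LoRA scaling hyperparameter

W = rng.normal(size=(d, k))          # frozen pretrained weight (never updated)
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized
                                     # so the adapter starts as a no-op

def lora_forward(x):
    # y = W x + (alpha / r) * B A x  -- only A and B would receive gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
y = lora_forward(x)

trainable = A.size + B.size  # r*(d + k) = 4096 parameters
frozen = W.size              # d*k = 262144 parameters
```

With these shapes the adapter trains about 1.6% as many parameters as the frozen matrix, which is the source of LoRA's memory savings; zero‑initializing B guarantees fine‑tuning starts exactly from the pretrained model's behavior.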
Choosing between pre‑training and instruction fine‑tuning depends on the target domain, data availability, and desired inference latency. The article also discusses common pitfalls such as catastrophic forgetting, data scaling, and GPU memory budgeting for full‑parameter versus PEFT methods.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.