Why LoRA, QLoRA, Prompt & Prefix Tuning Are Changing Large‑Model Fine‑Tuning
This article explains the mathematical basis of LoRA, compares it with QLoRA, Prompt Tuning, Prefix Tuning and P‑tuning, shows practical PyTorch implementations, and provides mixed‑precision training tips so readers can choose the most memory‑efficient fine‑tuning method for their large language models.
Traditional Fine‑tuning Problems
Fine‑tuning a 7 B‑parameter model in FP16 requires roughly 84 GB of GPU memory (parameters, gradients, and optimizer states), which exceeds the 80 GB of a single A100 and makes large‑scale training impractical.
Model parameters: 7 B × 2 bytes = 14 GB
Gradients: 7 B × 2 bytes = 14 GB
Adam optimizer states: 7 B × 8 bytes = 56 GB
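The arithmetic above is easy to verify in a few lines (a quick sketch; decimal GB, as in the figures above):

```python
def finetune_memory_gb(n_params: float) -> dict:
    """Estimate FP16 full fine-tuning memory for a model with n_params parameters."""
    GB = 1e9  # decimal gigabytes, matching the article's figures
    params = n_params * 2  # FP16 weights: 2 bytes each
    grads = n_params * 2   # FP16 gradients: 2 bytes each
    adam = n_params * 8    # Adam: FP32 momentum + variance, 4 + 4 bytes
    return {
        "params_gb": params / GB,
        "grads_gb": grads / GB,
        "optimizer_gb": adam / GB,
        "total_gb": (params + grads + adam) / GB,
    }

mem = finetune_memory_gb(7e9)
```

For a 7 B model this yields 14 + 14 + 56 = 84 GB, confirming that full fine-tuning does not fit on a single 80 GB A100.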
LoRA Solution
Core assumption: the pretrained model already captures most knowledge, so fine‑tuning only needs a low‑rank additive update:

W_new = W + ΔW

Key insight: the update ΔW can be factorised into two small matrices:

ΔW = A × B

W shape: (d, k) e.g., (4096, 4096)
A shape: (d, r) e.g., (4096, 16)
B shape: (r, k) e.g., (16, 4096)
r ≪ min(d, k) is the rank
Mathematical Intuition
Think of a full‑size linear layer as a 4096 × 4096 brush where every pixel can be coloured independently. LoRA replaces it with 16 “basic brushes” (the columns of A) that are combined by B, drastically reducing degrees of freedom while preserving most expressive power.
Hand‑written LoRA implementation (PyTorch)
import torch
import torch.nn as nn
import math

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=16, alpha=16, dropout=0.1):
        super().__init__()
        # Frozen pretrained weight
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.linear.weight.requires_grad = False
        self.rank = rank
        self.alpha = alpha
        self.dropout = nn.Dropout(dropout)
        # Low-rank factors: A is initialised randomly, B to zero, so ΔW starts at 0
        self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
        self.scaling = alpha / rank

    def forward(self, x):
        original_output = self.linear(x)
        lora_output = self.dropout(x) @ self.lora_A @ self.lora_B * self.scaling
        return original_output + lora_output

Why LoRA works
Parameter reduction : A 4096 × 4096 matrix (≈16 M parameters) becomes two small matrices 4096 × 16 + 16 × 4096 (≈131 K parameters), a 99 % reduction.
Mathematical justification : Most weight matrices are low‑rank; singular‑value decomposition shows that a few dominant singular values capture the useful information.
Empirical evidence : Experiments show that LoRA with rank = 1–64 achieves >95 % of full‑fine‑tuning performance on many tasks.
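The low-rank claim can be probed numerically: take a matrix, keep only its top-r singular values, and measure how much of the energy survives. A small sketch on a synthetic near-low-rank matrix (not a real checkpoint):

```python
import torch

torch.manual_seed(0)
# Build a matrix that is approximately rank-8 plus small noise,
# mimicking the structure LoRA assumes for weight updates
d, k, true_rank = 256, 256, 8
W = torch.randn(d, true_rank) @ torch.randn(true_rank, k) + 0.01 * torch.randn(d, k)

U, S, Vh = torch.linalg.svd(W)
r = 16
W_approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

# Fraction of the matrix "energy" (squared Frobenius norm) captured at rank r
captured = (S[:r] ** 2).sum() / (S ** 2).sum()
rel_err = torch.linalg.norm(W - W_approx) / torch.linalg.norm(W)
```

When the underlying structure is low-rank, a rank-16 approximation captures essentially all of the energy, which is exactly the regime LoRA bets on for the update ΔW.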
QLoRA vs. LoRA
QLoRA (Quantized LoRA) adds 4‑bit quantisation of the base model and a double‑quantisation of the scaling factors, reducing the base model’s memory from ~14 GB (FP16) to ~3.5 GB (4‑bit) while keeping LoRA’s trainable parameters.
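The claimed savings follow directly from the storage width. A back-of-envelope sketch (decimal GB, one scale per 64-weight block, ignoring other overhead):

```python
n_params = 7e9
fp16_gb = n_params * 2 / 1e9    # 2 bytes per weight -> 14 GB
int4_gb = n_params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight -> 3.5 GB

# Without double quantisation: one FP16 scale (2 bytes) per 64-weight block
scales_fp16_gb = (n_params / 64) * 2 / 1e9
# With double quantisation: scales stored in 8-bit (1 byte) instead
scales_8bit_gb = (n_params / 64) * 1 / 1e9
```

The second quantisation matters because at a block size of 64 the FP16 scales alone would add a few hundred megabytes; halving them is nearly free accuracy-wise.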
Core innovations
4‑bit base model : The entire pretrained checkpoint is stored in 4‑bit, cutting memory dramatically.
Double quantisation : Scaling factors are further compressed to 8‑bit, avoiding the overhead of storing a full FP16 scale per block.
def double_quantization(weights, block_size=64):
    # First quantisation: FP16 weights -> 4-bit integers, one scale per block
    scales_fp16 = []
    quantized_4bit = []
    for block in weights.split(block_size):
        scale = block.abs().max() / 7  # symmetric 4-bit range [-7, 7]
        scales_fp16.append(scale)
        quantized = torch.round(block / scale).clamp(-7, 7)
        quantized_4bit.append(quantized)
    # Second quantisation: the FP16 scales themselves -> 8-bit
    scales_fp16 = torch.stack(scales_fp16)
    scale_scale = scales_fp16.abs().max() / 127
    scales_8bit = torch.round(scales_fp16 / scale_scale).clamp(-127, 127)
    return quantized_4bit, scales_8bit, scale_scale

Mixed‑precision training details
Forward pass: most ops in FP16, numerically sensitive ops in FP32.
Backward pass: gradients computed in FP16, accumulation in FP32.
Parameter update: performed in FP32 for stability.
Practical mixed‑precision training in PyTorch
import torch
from torch.cuda.amp import autocast, GradScaler

model = YourModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    # Forward pass under autocast: FP16 where safe, FP32 for sensitive ops
    with autocast():
        outputs = model(batch['input_ids'])
        loss = criterion(outputs, batch['labels'])
    # Scale the loss to avoid FP16 gradient underflow, then backprop
    scaler.scale(loss).backward()
    # Unscale before clipping so the clip threshold is in true gradient units
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()

Prompt Tuning, Prefix Tuning, P‑tuning
All three methods modify the input representation instead of the full weight matrix.
class PromptTuning(nn.Module):
    def __init__(self, model, prompt_length=20, embed_dim=768):
        super().__init__()
        self.model = model
        # Trainable soft prompt prepended to the input embeddings
        self.soft_prompt = nn.Parameter(torch.randn(prompt_length, embed_dim))

    def forward(self, input_ids):
        batch_size = input_ids.shape[0]
        inputs_embeds = self.model.get_input_embeddings()(input_ids)
        prompt_embeds = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        inputs_embeds = torch.cat([prompt_embeds, inputs_embeds], dim=1)
        return self.model(inputs_embeds=inputs_embeds)

class PrefixTuning(nn.Module):
    def __init__(self, model, prefix_length=20, num_layers=12, hidden_size=768):
        super().__init__()
        self.model = model
        self.prefix_length = prefix_length
        # One small MLP per layer maps prefix tokens to that layer's K and V
        self.prefix_encoder = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, hidden_size),
                          nn.Tanh(),
                          nn.Linear(hidden_size, 2 * hidden_size))
            for _ in range(num_layers)
        ])
        self.prefix_tokens = nn.Parameter(torch.randn(prefix_length, hidden_size))

    def get_prefix_states(self, batch_size):
        prefix_embeds = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1, -1)
        prefix_states = []
        for layer in self.prefix_encoder:
            prefix_kv = layer(prefix_embeds)
            prefix_k, prefix_v = prefix_kv.chunk(2, dim=-1)
            prefix_states.append((prefix_k, prefix_v))
        return prefix_states

    def forward(self, input_ids):
        batch_size = input_ids.shape[0]
        prefix_states = self.get_prefix_states(batch_size)
        # Hook each attention layer to add the prefix K,V (implementation omitted)
        return self.model(input_ids, prefix_states=prefix_states)

class PTuning(nn.Module):
    def __init__(self, model, pattern="[P0][P1][P2] {text} [P3]", hidden_size=768):
        super().__init__()
        import re  # local import for clarity; move to module top in real code
        self.model = model
        self.pattern = pattern
        # Extract placeholders such as [P0]; splitting on whitespace would merge
        # adjacent placeholders like [P0][P1][P2], so use a regex instead
        self.prompt_tokens = re.findall(r'\[P\d+\]', pattern)
        self.prompt_embeddings = nn.ParameterDict({
            token: nn.Parameter(torch.randn(hidden_size)) for token in self.prompt_tokens
        })
        # Encoder that smooths/correlates the prompt embeddings
        self.prompt_encoder = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                                            nn.ReLU(),
                                            nn.Linear(hidden_size, hidden_size))

    def forward(self, input_ids, text_span):
        # Build the mixed input sequence according to the pattern (logic omitted)
        pass

Choosing the right method
Abundant GPU memory (>40 GB): Prompt Tuning for simple tasks; LoRA (rank 32–64) for more demanding ones.
Limited memory (16–32 GB): LoRA with a low rank (8–16) is the primary choice; Prefix Tuning as a backup.
Very low memory (<16 GB): QLoRA is usually the only viable option, optionally combined with gradient checkpointing.
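The guidance above can be encoded as a toy selector (purely illustrative; the thresholds come from this article, not from any library API):

```python
def pick_method(gpu_memory_gb: float, simple_task: bool = False) -> str:
    """Map available GPU memory to the fine-tuning method suggested above."""
    if gpu_memory_gb > 40:
        return "Prompt Tuning" if simple_task else "LoRA (rank 32-64)"
    if gpu_memory_gb >= 16:
        return "LoRA (rank 8-16)"
    return "QLoRA (+ gradient checkpointing)"
```

For example, `pick_method(24)` returns the low-rank LoRA recommendation, while anything under 16 GB falls through to QLoRA.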
Engineering recommendations
Start with LoRA – it offers a good trade‑off between performance and implementation simplicity.
Iteratively increase rank or alpha only if the task demands higher capacity.
Monitor mixed‑precision stability (loss‑scale history, gradient norms) and adjust learning‑rate or scaling factors accordingly.
Run systematic experiments; the best method varies across datasets and model architectures.
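For the stability monitoring in point 3, the loss-scale history is available from `GradScaler.get_scale()`, and the gradient norm can be computed with a small helper before clipping (a minimal sketch; it reproduces the global L2 norm that `clip_grad_norm_` operates on):

```python
import torch
import torch.nn as nn

def grad_global_norm(model: nn.Module) -> float:
    """Global L2 norm over all gradients - the quantity clip_grad_norm_ clips."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += float(p.grad.detach().norm() ** 2)
    return total ** 0.5

# Tiny CPU example: one backward pass, then inspect the norm
model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
norm = grad_global_norm(model)
```

Logging this value each step (alongside the scaler's current scale) makes it easy to spot exploding gradients or repeated loss-scale backoffs before training diverges.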
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how, helping you master core skills—LLM, RAG, fine‑tuning, deployment—from zero to job offer, tailored for career‑switchers, autumn recruiters, and those seeking stable large‑model positions.