Master Parameter-Efficient Fine‑Tuning: LoRA & QLoRA Explained for Interviews
This article explains why full fine-tuning of large models is impractical, introduces parameter-efficient fine-tuning (PEFT) with LoRA and QLoRA, and provides mathematical foundations, implementation code, resource-usage analysis, interview answer templates, and practical deployment tips for real-world AI projects.
Background: Why Parameter‑Efficient Fine‑Tuning?
Full fine-tuning (FT) of a 7B model requires roughly 84 GB of GPU memory (FP16 parameters and gradients plus FP32 Adam optimizer states), far beyond the 16-32 GB of a single V100. It is also costly, prone to over-fitting on small datasets, and forces you to keep a separate full-size copy of the model for every downstream task. A rough estimate of that figure is sketched below.
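A back-of-the-envelope check of the 84 GB figure (a minimal sketch assuming FP16 weights and gradients plus FP32 Adam moments, ignoring activations and framework overhead):
def full_ft_memory_gb(num_params: float) -> float:
    """Rough full fine-tuning footprint: FP16 weight + FP16 grad + FP32 Adam m and v"""
    bytes_per_param = 2 + 2 + 4 + 4
    return num_params * bytes_per_param / 1e9
print(f"7B full FT: ~{full_ft_memory_gb(7e9):.0f} GB")  # ~84 GB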
Core Idea of PEFT
Parameter‑efficient fine‑tuning (PEFT) updates only a small set of additional parameters while keeping the original pretrained weights frozen, achieving performance close to full FT with dramatically lower memory and compute requirements.
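As a minimal illustration of this idea (the toy base model and adapter below are assumptions for the sketch, not the article's later LoRA code): freeze every pretrained parameter and train only a small added module.
import torch.nn as nn
base_model = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
for p in base_model.parameters():
    p.requires_grad = False  # pretrained weights stay frozen
adapter = nn.Sequential(nn.Linear(512, 16), nn.ReLU(), nn.Linear(16, 512))  # small trainable add-on
trainable = sum(p.numel() for p in adapter.parameters())
total = sum(p.numel() for p in base_model.parameters()) + trainable
print(f"Trainable ratio: {100 * trainable / total:.2f}%")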
High-Frequency Interview Questions
What is the mathematical principle behind LoRA? Why does low‑rank decomposition work for fine‑tuning?
How to choose the rank and alpha parameters in LoRA? Any heuristics?
What are the three innovations of QLoRA? How does 4‑bit quantisation preserve training accuracy?
Differences among Prefix Tuning, Prompt Tuning, and P‑tuning?
When should you use full fine‑tuning instead of LoRA?
How to evaluate the effectiveness of different fine‑tuning methods?
How to merge and deploy a LoRA‑fine‑tuned model?
Considerations for mixed‑precision training in fine‑tuning?
Deep Dive: LoRA Mathematics and Implementation
Step 1 – Theoretical Basis: Low Intrinsic Dimension
LoRA assumes that the weight updates needed to fine-tune a large model lie in a subspace of low intrinsic dimension, so the update ΔW ∈ ℝ^(d×k) can be expressed as a low-rank product A @ B with rank r << min(d, k).
# Traditional fine‑tuning: update full weight matrix
W_new = W_pretrained + ΔW
# LoRA assumption: ΔW can be factorised
# ΔW ∈ ℝ^(d×k) → A ∈ ℝ^(d×r) × B ∈ ℝ^(r×k)
W_new = W_pretrained + A @ B
Step 2 – Complete LoRA Layer Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class LoRALinear(nn.Module):
def __init__(self, in_features: int, out_features: int, rank: int = 16,
alpha: float = 16.0, dropout: float = 0.1, bias: bool = True):
super().__init__()
self.in_features = in_features
self.out_features = out_features
self.rank = rank
self.alpha = alpha
# frozen original weight
self.linear = nn.Linear(in_features, out_features, bias=bias)
self.linear.weight.requires_grad = False
if bias:
self.linear.bias.requires_grad = False
# LoRA matrices
self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
self.dropout = nn.Dropout(dropout)
self.scaling = alpha / rank
self._init_weights()
def _init_weights(self):
"""LoRA weight initialisation"""
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
nn.init.zeros_(self.lora_B)
def forward(self, x: torch.Tensor) -> torch.Tensor:
original_output = self.linear(x)
lora_output = self.dropout(x) @ self.lora_A @ self.lora_B * self.scaling
return original_output + lora_output
def merge_weights(self):
"""Merge LoRA weights into the frozen linear layer"""
if not self.linear.weight.requires_grad:
merged_weight = self.linear.weight.data + (self.lora_A @ self.lora_B * self.scaling).T
merged_linear = nn.Linear(self.in_features, self.out_features, bias=self.linear.bias is not None)
merged_linear.weight.data = merged_weight
if self.linear.bias is not None:
merged_linear.bias.data = self.linear.bias.data
return merged_linear
else:
raise ValueError("Original weight not frozen, cannot merge")
def get_delta_weight(self) -> torch.Tensor:
"""Return the LoRA incremental weight"""
return (self.lora_A @ self.lora_B * self.scaling).T
def extra_repr(self) -> str:
return f'in_features={self.in_features}, out_features={self.out_features}, rank={self.rank}, alpha={self.alpha}'
class LoRAConfig:
"""Configuration class for LoRA"""
def __init__(self, rank: int = 16, alpha: float = 16.0,
target_modules: list = None, dropout: float = 0.1, bias: str = "none"):
self.rank = rank
self.alpha = alpha
self.target_modules = target_modules or ["q_proj", "v_proj", "k_proj", "o_proj"]
self.dropout = dropout
self.bias = bias
def apply_lora_to_model(model: nn.Module, config: LoRAConfig):
"""Replace target linear modules in the model with LoRA layers"""
for name, module in model.named_modules():
if isinstance(module, nn.Linear):
parent_name = name.split('.')[-1]
if parent_name in config.target_modules:
lora_layer = LoRALinear(module.in_features, module.out_features,
rank=config.rank, alpha=config.alpha,
dropout=config.dropout, bias=module.bias is not None)
lora_layer.linear.weight.data = module.weight.data.clone()
if module.bias is not None:
lora_layer.linear.bias.data = module.bias.data.clone()
# replace module in the parent hierarchy
parent = model
atoms = name.split('.')[:-1]
for atom in atoms:
parent = getattr(parent, atom)
                setattr(parent, parent_name, lora_layer)
Step 3 – Parameter and Complexity Analysis
def analyze_lora_complexity():
"""Analyse LoRA parameter count and FLOPs"""
d_model = 4096
rank = 16
num_layers = 32
# Original attention parameters (q,k,v,o projections)
original_params_per_layer = 4 * d_model * d_model
original_total_params = original_params_per_layer * num_layers
# LoRA parameters (A and B matrices for each projection)
lora_params_per_layer = 4 * (d_model * rank + rank * d_model)
lora_total_params = lora_params_per_layer * num_layers
reduction_ratio = original_total_params / lora_total_params
print("=== LoRA Parameter Analysis ===")
print(f"Original model params: {original_total_params:,} ({original_total_params/1e9:.1f}B)")
print(f"LoRA params: {lora_total_params:,} ({lora_total_params/1e6:.1f}M)")
print(f"Reduction: {reduction_ratio:.1f}x")
# FLOPs
batch_size = 32
seq_len = 2048
original_flops = batch_size * seq_len * d_model * d_model * 4 * num_layers
lora_flops = batch_size * seq_len * d_model * rank * 2 * 4 * num_layers
compute_reduction = original_flops / lora_flops
print("
=== Compute FLOPs Analysis ===")
print(f"Original FLOPs: {original_flops:.2e}")
print(f"LoRA FLOPs: {lora_flops:.2e}")
print(f"Compute reduction: {compute_reduction:.1f}x")
# Memory (FP16 assumption)
original_memory_gb = original_total_params * 2 / 1e9
lora_memory_gb = (lora_total_params * 2) / 1e9
original_training_memory = original_memory_gb * 4 # params + grads + Adam states
lora_training_memory = original_memory_gb + lora_memory_gb * 4
memory_reduction = original_training_memory / lora_training_memory
print("
=== Memory Analysis ===")
print(f"Original training memory: {original_training_memory:.1f} GB")
print(f"LoRA training memory: {lora_training_memory:.1f} GB")
print(f"Memory saving: {memory_reduction:.1f}x")Key Principle Diagram (Textual)
Original weight update: for a single 4096×4096 projection, W → W' = W + ΔW updates ~16.8 M parameters. LoRA factorises ΔW = A @ B with A ∈ ℝ^(4096×16) and B ∈ ℝ^(16×4096), reducing trainable parameters to ~131 K (≈128× fewer). The forward pass becomes y = Wx + (dropout(x) @ A @ B) × (α/r).
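A quick numeric check of the figures above (a minimal sketch; the 4096 and 16 values mirror the example, not any specific model):
d, r = 4096, 16
full_update = d * d          # parameters in a dense ΔW
lora_update = d * r + r * d  # parameters in A and B
print(full_update, lora_update, full_update / lora_update)  # 16777216 131072 128.0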
Scenario Example: Customer‑Service Bot Fine‑Tuning
class CustomerServiceLoRATrainer:
"""LoRA trainer for a customer‑service chatbot"""
def __init__(self, base_model, tokenizer):
self.base_model = base_model
self.tokenizer = tokenizer
self.lora_config = None
self.training_stats = {'loss_history': [], 'eval_scores': []}
def setup_lora(self, rank=16, alpha=32, target_modules=None):
if target_modules is None:
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"]
self.lora_config = LoRAConfig(rank=rank, alpha=alpha,
target_modules=target_modules, dropout=0.1)
apply_lora_to_model(self.base_model, self.lora_config)
self._print_trainable_parameters()
def _print_trainable_parameters(self):
total_params = sum(p.numel() for p in self.base_model.parameters())
trainable_params = sum(p.numel() for p in self.base_model.parameters() if p.requires_grad)
print(f"Total params: {total_params:,}")
print(f"Trainable params: {trainable_params:,}")
print(f"Trainable ratio: {100 * trainable_params / total_params:.2f}%")
def train_on_customer_data(self, train_dataset, eval_dataset, epochs=3):
optimizer = torch.optim.AdamW([p for p in self.base_model.parameters() if p.requires_grad],
lr=1e-4, weight_decay=0.01)
for epoch in range(epochs):
epoch_loss = 0.0
self.base_model.train()
for batch in train_dataset:
inputs = self.tokenizer(batch['conversations'], return_tensors='pt',
padding=True, truncation=True, max_length=512)
outputs = self.base_model(**inputs, labels=inputs['input_ids'])
loss = outputs.loss
loss.backward()
torch.nn.utils.clip_grad_norm_(self.base_model.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()
epoch_loss += loss.item()
eval_score = self._evaluate(eval_dataset)
print(f"Epoch {epoch+1}: Loss={epoch_loss:.4f}, Eval Score={eval_score:.4f}")
self.training_stats['loss_history'].append(epoch_loss)
self.training_stats['eval_scores'].append(eval_score)
def _evaluate(self, eval_dataset):
self.base_model.eval()
total_score = 0.0
with torch.no_grad():
for batch in eval_dataset:
inputs = self.tokenizer(batch['conversations'], return_tensors='pt')
outputs = self.base_model(**inputs, labels=inputs['input_ids'])
total_score += torch.exp(outputs.loss).item()
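        # the returned "score" is mean perplexity over batches, so lower is better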
return total_score / len(eval_dataset)
def save_lora_weights(self, path):
lora_state_dict = {name: param.data for name, param in self.base_model.named_parameters()
if 'lora_' in name and param.requires_grad}
torch.save({'lora_state_dict': lora_state_dict,
'lora_config': self.lora_config,
'training_stats': self.training_stats}, path)
print(f"LoRA weights saved to: {path}")QLoRA: 4‑Bit Quantisation Breakthrough
Core Technical Innovation
class QLoRALinear(nn.Module):
"""QLoRA combines 4‑bit NF4 quantisation with LoRA"""
def __init__(self, in_features, out_features, rank=16, alpha=16.0):
super().__init__()
self.in_features = in_features
self.out_features = out_features
self.rank = rank
self.alpha = alpha
# 4‑bit frozen base weight (to be quantised later)
self.base_weight_4bit = None
self.weight_scales = None
# LoRA parameters kept in FP16/BF16
self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
self.scaling = alpha / rank
    def quantize_weight(self, weight: torch.Tensor):
        """NF4 4-bit quantisation (block size 64): nearest-level indices plus per-block absmax scales"""
        # the 16 NF4 quantisation levels, normalised to [-1, 1]
        nf4_values = torch.tensor([-1.0, -0.6962, -0.5251, -0.3949, -0.2844,
                                   -0.1848, -0.0911, 0.0, 0.0796, 0.1609,
                                   0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])
        block_size = 64
        weight_flat = weight.reshape(-1)
        quantized_blocks = []
        scale_blocks = []
        for i in range(0, weight_flat.numel(), block_size):
            block = weight_flat[i:i + block_size]
            scale = block.abs().max().clamp(min=1e-8)  # absmax scale maps the block into [-1, 1]
            scale_blocks.append(scale)
            normalized = block / scale
            # index of the nearest NF4 level for each value (0..15 fits in 4 bits)
            quantized = torch.argmin((normalized.unsqueeze(-1) - nf4_values).abs(), dim=-1)
            quantized_blocks.append(quantized)
        self.base_weight_4bit = torch.cat(quantized_blocks)
        self.weight_scales = torch.stack(scale_blocks)
def dequantize_weight(self):
"""Reconstruct FP16 weight from 4‑bit representation"""
        nf4_values = torch.tensor([-1.0, -0.6962, -0.5251, -0.3949, -0.2844,
                                   -0.1848, -0.0911, 0.0, 0.0796, 0.1609,
                                   0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])
block_size = 64
dequantized_blocks = []
for i, scale in enumerate(self.weight_scales):
start = i * block_size
end = min(start + block_size, self.base_weight_4bit.numel())
quantized_block = self.base_weight_4bit[start:end]
dequantized_vals = nf4_values[quantized_block]
dequantized_blocks.append(dequantized_vals * scale)
dequantized = torch.cat(dequantized_blocks)
return dequantized.view(self.in_features, self.out_features)
def forward(self, x):
base_weight = self.dequantize_weight()
base_output = F.linear(x, base_weight.T)
lora_output = (x @ self.lora_A @ self.lora_B) * self.scaling
        return base_output + lora_output
QLoRA Memory Savings Analysis
def analyze_qlora_memory():
"""Compare FP16 full‑FT memory with QLoRA memory"""
d_model = 4096
num_layers = 32
rank = 16
original_params = d_model * d_model * 4 * num_layers # FP16
original_memory_gb = original_params * 2 / 1e9
# QLoRA: 4‑bit base weight + FP16 LoRA weight
base_4bit_memory = original_params * 0.5 / 1e9
lora_memory = d_model * rank * 2 * 4 * num_layers * 2 / 1e9
qlora_memory_gb = base_4bit_memory + lora_memory
print("=== QLoRA Memory Analysis ===")
print(f"Original FP16 memory: {original_memory_gb:.1f} GB")
print(f"QLoRA total memory: {qlora_memory_gb:.1f} GB")
print(f"4‑bit base weight: {base_4bit_memory:.1f} GB")
print(f"LoRA weight: {lora_memory:.1f} GB")
print(f"Memory saving: {original_memory_gb / qlora_memory_gb:.1f}x")Interview Answer Templates
Q1: How to choose the LoRA rank?
Task complexity: simple tasks (e.g., sentiment analysis) work with rank = 8-16; complex tasks (e.g., code generation) need rank = 64-128.
Model size: larger models can afford a higher rank; smaller models benefit from lower values.
Data volume: limited data → lower rank to avoid over-fitting; abundant data → higher rank.
Heuristic formula: rank ≈ sqrt(min(input_dim, output_dim)) / 4.
In practice, start with rank = 16 and adjust based on validation performance; see the worked example below.
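A worked instance of that heuristic (a rough starting point only; suggest_rank is an illustrative helper, not a library function):
import math
def suggest_rank(input_dim: int, output_dim: int) -> int:
    # rank ≈ sqrt(min(input_dim, output_dim)) / 4, floored at a small minimum
    return max(4, int(math.sqrt(min(input_dim, output_dim)) / 4))
print(suggest_rank(4096, 4096))  # 16 -> start here, then sweep e.g. {8, 16, 32, 64}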
Q2: When might QLoRA perform poorly?
Precision-sensitive tasks: mathematical computation or code generation, where 4-bit quantisation noise hurts accuracy.
Very small fine-tuning datasets: quantisation noise can dominate.
Hardware incompatibility: some GPUs lack efficient 4-bit kernels, leading to slower training.
Inference-latency-critical scenarios: extra de-quantisation overhead may increase latency.
Mitigation: weigh accuracy versus efficiency for the target use‑case.
Q3: How to evaluate different fine‑tuning methods?
Task performance: accuracy, F1, BLEU, etc., on the target task.
Generalisation: gap between validation and test set results.
Training efficiency: convergence speed, GPU memory consumption, wall-clock time.
Deployment cost: final model size, inference speed, hardware requirements.
Design controlled experiments that vary only the fine‑tuning method.
Q4: How to deploy a LoRA‑fine‑tuned model?
Weight-merge deployment: merge the LoRA adapters into the base model and serve a standard checkpoint.
Separate-adapter deployment: keep LoRA weights external and load them at runtime, useful for multi-task switching.
Weight merging is simpler for a single-task service; separate adapters enable flexible multi-task scenarios. Both options are sketched below.
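A minimal deployment sketch building on the LoRALinear.merge_weights() method defined earlier (the base_model variable and file names are illustrative):
def merge_all_lora_layers(model: nn.Module) -> nn.Module:
    """Replace every LoRALinear with a plain nn.Linear holding W + scaling * (A @ B)^T"""
    for name, module in list(model.named_modules()):
        if isinstance(module, LoRALinear):
            parent = model
            for atom in name.split('.')[:-1]:
                parent = getattr(parent, atom)
            setattr(parent, name.split('.')[-1], module.merge_weights())
    return model
# Separate-adapter deployment: ship only the small LoRA tensors for each task
adapter_only = {n: p.data for n, p in base_model.named_parameters() if 'lora_' in n}
torch.save(adapter_only, 'customer_service_adapter.pt')
# Weight-merge deployment: bake the adapters in and serve a standard checkpoint
torch.save(merge_all_lora_layers(base_model).state_dict(), 'merged_model.pt')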
Review Checklist
LoRA core : low‑rank assumption, A/B matrix factorisation, rank/alpha selection.
QLoRA innovations: 4-bit NF4 quantisation, double quantisation of the block scales, paged optimisers (a double-quantisation sketch follows this list).
Parameter analysis : ~128× parameter reduction, 4‑8× memory saving, compute overhead breakdown.
Implementation details : weight initialisation, scaling factor, gradient handling, weight merging.
Engineering experience : rank‑selection heuristics, training tricks, deployment strategies.
Effect evaluation : multi‑metric assessment, comparative experiment design.
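For the double-quantisation item above, a minimal sketch of the idea (illustrative only: int8 codes are used here for simplicity, whereas QLoRA stores the block scales in an 8-bit float format, and the helper name is an assumption). The FP32 per-block scales produced by the first quantisation are themselves quantised in chunks, leaving one FP32 constant per chunk.
def double_quantize_scales(scales: torch.Tensor, chunk_size: int = 256):
    """Quantise the per-block absmax scales to 8 bits, keeping one FP32 constant per chunk"""
    q_chunks, chunk_consts = [], []
    for i in range(0, scales.numel(), chunk_size):
        chunk = scales[i:i + chunk_size]
        const = chunk.abs().max().clamp(min=1e-8)              # one FP32 constant per chunk
        q_chunks.append(torch.round(chunk / const * 127).to(torch.int8))
        chunk_consts.append(const)
    return torch.cat(q_chunks), torch.stack(chunk_consts)
q_scales, consts = double_quantize_scales(torch.rand(4096))   # e.g. scales from 4096 weight blocks
print(q_scales.dtype, consts.numel())                         # int8 codes, one constant per 256 scales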
Wu Shixiong's Large Model Academy
We continuously share large-model know-how to help you master core skills (LLM, RAG, fine-tuning, deployment) from zero to job offer, whether you are switching careers, preparing for autumn campus recruitment, or looking for a stable large-model position.