Master Parameter-Efficient Fine‑Tuning: LoRA & QLoRA Explained for Interviews

This article explains why full fine‑tuning of large models is impractical, introduces parameter‑efficient fine‑tuning (PEFT) with LoRA and QLoRA, provides mathematical foundations, implementation code, resource‑usage analysis, interview question templates, and practical deployment tips for real‑world AI projects.

Wu Shixiong's Large Model Academy

Background: Why Parameter‑Efficient Fine‑Tuning?

Full fine‑tuning (FT) of a 7B model needs roughly 84 GB of GPU memory just for the model parameters, gradients, and optimizer states. That exceeds the capacity of most single GPUs (a V100 tops out at 32 GB), makes training costly, invites over‑fitting on small datasets, and forces you to store and serve a separate full‑size model copy for every task.
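One bookkeeping that reproduces the ~84 GB figure above (a rough estimate that ignores activations and framework overhead): FP16 parameters (2 bytes) plus FP16 gradients (2 bytes) plus two FP32 Adam moments (4 + 4 bytes) comes to 12 bytes per parameter.

```python
# Rough full fine-tuning memory estimate (ignores activations/overheads):
# FP16 params (2 B) + FP16 grads (2 B) + FP32 Adam moments (4 + 4 B) = 12 B/param.
def full_ft_memory_gb(num_params: float, bytes_per_param: int = 12) -> float:
    """Memory in GB needed to hold params, grads, and Adam states."""
    return num_params * bytes_per_param / 1e9

print(f"{full_ft_memory_gb(7e9):.0f} GB")  # ~84 GB for a 7B model
```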

Core Idea of PEFT

Parameter‑efficient fine‑tuning (PEFT) updates only a small set of additional parameters while keeping the original pretrained weights frozen, achieving performance close to full FT with dramatically lower memory and compute requirements.
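The recipe can be sketched in a few lines: freeze every pretrained weight and train only a small add‑on module. The two‑layer adapter below is illustrative, not LoRA itself.

```python
import torch.nn as nn

# Minimal sketch of the PEFT recipe: freeze the pretrained weights, train
# only a small add-on module. The two-layer "adapter" here is illustrative.
base = nn.Linear(4096, 4096)               # stands in for a pretrained layer
for p in base.parameters():
    p.requires_grad = False                # frozen

adapter = nn.Sequential(nn.Linear(4096, 16, bias=False),
                        nn.Linear(16, 4096, bias=False))  # trainable

frozen = sum(p.numel() for p in base.parameters())
trainable = sum(p.numel() for p in adapter.parameters())
print(f"trainable fraction: {trainable / (frozen + trainable):.2%}")  # 0.78%
```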

Interview High‑Frequency Question List

What is the mathematical principle behind LoRA? Why does low‑rank decomposition work for fine‑tuning?

How to choose the rank and alpha parameters in LoRA? Any heuristics?

What are the three innovations of QLoRA? How does 4‑bit quantisation preserve training accuracy?

Differences among Prefix Tuning, Prompt Tuning, and P‑tuning?

When should you use full fine‑tuning instead of LoRA?

How to evaluate the effectiveness of different fine‑tuning methods?

How to merge and deploy a LoRA‑fine‑tuned model?

Considerations for mixed‑precision training in fine‑tuning?

Deep Dive: LoRA Mathematics and Implementation

Step 1 – Theoretical Basis: Low Intrinsic Dimension

LoRA assumes that the fine‑tuning trajectory of a large model lies in a low‑intrinsic‑dimension subspace, allowing the weight update ΔW to be expressed as a low‑rank product A @ B where rank << min(d, k).

# Traditional fine‑tuning: update full weight matrix
W_new = W_pretrained + ΔW

# LoRA assumption: ΔW can be factorised
# ΔW ∈ ℝ^(d×k) → A ∈ ℝ^(d×r) × B ∈ ℝ^(r×k)
W_new = W_pretrained + A @ B
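The assumption is easy to sanity‑check numerically: if ΔW genuinely has rank r, a rank‑r truncated SVD reconstructs it almost exactly. A sketch with synthetic data:

```python
import torch

# Numerical sanity check of the low-rank assumption: if ΔW truly has rank r,
# a rank-r truncated SVD reconstructs it almost exactly.
torch.manual_seed(0)
d, k, r = 256, 256, 8
A = torch.randn(d, r)
B = torch.randn(r, k)
delta_W = A @ B                                  # rank-8 by construction

U, S, Vh = torch.linalg.svd(delta_W)
approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]
rel_err = ((delta_W - approx).norm() / delta_W.norm()).item()
print(f"rank-{r} reconstruction error: {rel_err:.1e}")
```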

Step 2 – Complete LoRA Layer Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 16,
                 alpha: float = 16.0, dropout: float = 0.1, bias: bool = True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.alpha = alpha
        # frozen original weight
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        self.linear.weight.requires_grad = False
        if bias:
            self.linear.bias.requires_grad = False
        # LoRA matrices
        self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / rank
        self._init_weights()

    def _init_weights(self):
        """LoRA weight initialisation"""
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        original_output = self.linear(x)
        lora_output = self.dropout(x) @ self.lora_A @ self.lora_B * self.scaling
        return original_output + lora_output

    def merge_weights(self):
        """Merge LoRA weights into the frozen linear layer"""
        if not self.linear.weight.requires_grad:
            merged_weight = self.linear.weight.data + (self.lora_A @ self.lora_B * self.scaling).T
            merged_linear = nn.Linear(self.in_features, self.out_features, bias=self.linear.bias is not None)
            merged_linear.weight.data = merged_weight
            if self.linear.bias is not None:
                merged_linear.bias.data = self.linear.bias.data
            return merged_linear
        else:
            raise ValueError("Original weight not frozen, cannot merge")

    def get_delta_weight(self) -> torch.Tensor:
        """Return the LoRA incremental weight"""
        return (self.lora_A @ self.lora_B * self.scaling).T

    def extra_repr(self) -> str:
        return f'in_features={self.in_features}, out_features={self.out_features}, rank={self.rank}, alpha={self.alpha}'

class LoRAConfig:
    """Configuration class for LoRA"""
    def __init__(self, rank: int = 16, alpha: float = 16.0,
                 target_modules: list = None, dropout: float = 0.1, bias: str = "none"):
        self.rank = rank
        self.alpha = alpha
        self.target_modules = target_modules or ["q_proj", "v_proj", "k_proj", "o_proj"]
        self.dropout = dropout
        self.bias = bias

def apply_lora_to_model(model: nn.Module, config: LoRAConfig):
    """Replace target linear modules in the model with LoRA layers"""
    for name, module in list(model.named_modules()):  # snapshot: we mutate the tree below
        if isinstance(module, nn.Linear):
            parent_name = name.split('.')[-1]
            if parent_name in config.target_modules:
                lora_layer = LoRALinear(module.in_features, module.out_features,
                                        rank=config.rank, alpha=config.alpha,
                                        dropout=config.dropout, bias=module.bias is not None)
                lora_layer.linear.weight.data = module.weight.data.clone()
                if module.bias is not None:
                    lora_layer.linear.bias.data = module.bias.data.clone()
                # replace module in the parent hierarchy
                parent = model
                atoms = name.split('.')[:-1]
                for atom in atoms:
                    parent = getattr(parent, atom)
                setattr(parent, parent_name, lora_layer)

Step 3 – Parameter and Complexity Analysis

def analyze_lora_complexity():
    """Analyse LoRA parameter count and FLOPs"""
    d_model = 4096
    rank = 16
    num_layers = 32
    # Original attention parameters (q,k,v,o projections)
    original_params_per_layer = 4 * d_model * d_model
    original_total_params = original_params_per_layer * num_layers
    # LoRA parameters (A and B matrices for each projection)
    lora_params_per_layer = 4 * (d_model * rank + rank * d_model)
    lora_total_params = lora_params_per_layer * num_layers
    reduction_ratio = original_total_params / lora_total_params
    print("=== LoRA Parameter Analysis ===")
    print(f"Original model params: {original_total_params:,} ({original_total_params/1e9:.1f}B)")
    print(f"LoRA params: {lora_total_params:,} ({lora_total_params/1e6:.1f}M)")
    print(f"Reduction: {reduction_ratio:.1f}x")
    # FLOPs: full-rank projection matmuls vs. the low-rank LoRA branch (x @ A @ B)
    batch_size = 32
    seq_len = 2048
    original_flops = batch_size * seq_len * d_model * d_model * 4 * num_layers
    lora_flops = batch_size * seq_len * d_model * rank * 2 * 4 * num_layers
    compute_reduction = original_flops / lora_flops
    print("\n=== Compute FLOPs Analysis ===")
    print(f"Original FLOPs: {original_flops:.2e}")
    print(f"LoRA FLOPs: {lora_flops:.2e}")
    print(f"Compute reduction: {compute_reduction:.1f}x")
    # Memory (FP16 assumption)
    original_memory_gb = original_total_params * 2 / 1e9
    lora_memory_gb = (lora_total_params * 2) / 1e9
    original_training_memory = original_memory_gb * 4  # rough: params + grads + two Adam moments, all FP16
    lora_training_memory = original_memory_gb + lora_memory_gb * 4  # frozen base + LoRA params/grads/moments
    memory_reduction = original_training_memory / lora_training_memory
    print("\n=== Memory Analysis ===")
    print(f"Original training memory: {original_training_memory:.1f} GB")
    print(f"LoRA training memory: {lora_training_memory:.1f} GB")
    print(f"Memory saving: {memory_reduction:.1f}x")

Key Principle Diagram (Textual)

Original weight update: W → ΔW → W′ requires ~16.8 M parameters for a 4096×4096 matrix. LoRA factorises ΔW = A @ B with A ∈ ℝ^(4096×16) and B ∈ ℝ^(16×4096), reducing the trainable parameters to ~131 K (≈128× fewer). The forward pass becomes y = Wx + (dropout(x) @ A @ B) × (α/r).
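The figures above follow from two lines of arithmetic (d = k = 4096, r = 16):

```python
# Quick arithmetic check of the parameter counts above (d = k = 4096, r = 16).
d = k = 4096
r = 16
full_update = d * k              # dense ΔW parameters
lora_update = d * r + r * k      # A and B parameters
print(full_update, lora_update, full_update // lora_update)
# 16777216 131072 128
```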

Scenario Example: Customer‑Service Bot Fine‑Tuning

class CustomerServiceLoRATrainer:
    """LoRA trainer for a customer‑service chatbot"""
    def __init__(self, base_model, tokenizer):
        self.base_model = base_model
        self.tokenizer = tokenizer
        self.lora_config = None
        self.training_stats = {'loss_history': [], 'eval_scores': []}

    def setup_lora(self, rank=16, alpha=32, target_modules=None):
        if target_modules is None:
            target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                             "gate_proj", "up_proj", "down_proj"]
        self.lora_config = LoRAConfig(rank=rank, alpha=alpha,
                                      target_modules=target_modules, dropout=0.1)
        apply_lora_to_model(self.base_model, self.lora_config)
        self._print_trainable_parameters()

    def _print_trainable_parameters(self):
        total_params = sum(p.numel() for p in self.base_model.parameters())
        trainable_params = sum(p.numel() for p in self.base_model.parameters() if p.requires_grad)
        print(f"Total params: {total_params:,}")
        print(f"Trainable params: {trainable_params:,}")
        print(f"Trainable ratio: {100 * trainable_params / total_params:.2f}%")

    def train_on_customer_data(self, train_dataset, eval_dataset, epochs=3):
        optimizer = torch.optim.AdamW([p for p in self.base_model.parameters() if p.requires_grad],
                                      lr=1e-4, weight_decay=0.01)
        for epoch in range(epochs):
            epoch_loss = 0.0
            self.base_model.train()
            for batch in train_dataset:
                inputs = self.tokenizer(batch['conversations'], return_tensors='pt',
                                        padding=True, truncation=True, max_length=512)
                outputs = self.base_model(**inputs, labels=inputs['input_ids'])
                loss = outputs.loss
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.base_model.parameters(), 1.0)
                optimizer.step()
                optimizer.zero_grad()
                epoch_loss += loss.item()
            eval_score = self._evaluate(eval_dataset)
            print(f"Epoch {epoch+1}: Loss={epoch_loss:.4f}, Eval Score={eval_score:.4f}")
            self.training_stats['loss_history'].append(epoch_loss)
            self.training_stats['eval_scores'].append(eval_score)

    def _evaluate(self, eval_dataset):
        self.base_model.eval()
        total_score = 0.0
        with torch.no_grad():
            for batch in eval_dataset:
                inputs = self.tokenizer(batch['conversations'], return_tensors='pt')
                outputs = self.base_model(**inputs, labels=inputs['input_ids'])
                total_score += torch.exp(outputs.loss).item()
        return total_score / len(eval_dataset)

    def save_lora_weights(self, path):
        lora_state_dict = {name: param.data for name, param in self.base_model.named_parameters()
                           if 'lora_' in name and param.requires_grad}
        torch.save({'lora_state_dict': lora_state_dict,
                    'lora_config': self.lora_config,
                    'training_stats': self.training_stats}, path)
        print(f"LoRA weights saved to: {path}")
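A matching loader is not shown in the article; here is a minimal sketch assuming the checkpoint layout produced by save_lora_weights above (the name load_lora_weights is a hypothetical helper):

```python
import torch

# Hypothetical loader matching save_lora_weights above: restore LoRA tensors
# into a model by parameter name. The checkpoint layout is the one that
# save_lora_weights produces.
def load_lora_weights(model, path):
    # weights_only=False because the checkpoint also stores a config object
    checkpoint = torch.load(path, map_location="cpu", weights_only=False)
    own_params = dict(model.named_parameters())
    for name, tensor in checkpoint["lora_state_dict"].items():
        own_params[name].data.copy_(tensor)   # copy adapters back in place
    return checkpoint.get("training_stats", {})
```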

QLoRA: 4‑Bit Quantisation Breakthrough

Core Technical Innovation

class QLoRALinear(nn.Module):
    """QLoRA combines 4‑bit NF4 quantisation with LoRA"""
    def __init__(self, in_features, out_features, rank=16, alpha=16.0):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.alpha = alpha
        # 4‑bit frozen base weight (to be quantised later)
        self.base_weight_4bit = None
        self.weight_scales = None
        # LoRA parameters kept in FP16/BF16
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        self.scaling = alpha / rank

    # 16-entry NF4 grid from the QLoRA paper (quantiles of a standard normal)
    NF4_VALUES = torch.tensor([-1.0, -0.6962, -0.5251, -0.3949, -0.2844,
                               -0.1848, -0.0911, 0.0, 0.0796, 0.1609,
                               0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])

    def quantize_weight(self, weight: torch.Tensor):
        """NF4 4-bit quantisation (block size 64): normalise each block by
        its absmax, then snap to the nearest of the 16 grid values."""
        block_size = 64
        weight_flat = weight.view(-1)
        quantized_blocks, scale_blocks = [], []
        for i in range(0, weight_flat.numel(), block_size):
            block = weight_flat[i:i+block_size]
            scale = block.abs().max().clamp(min=1e-8)  # normalise to [-1, 1]
            scale_blocks.append(scale)
            normalized = block / scale
            # nearest-neighbour lookup into the NF4 grid (indices fit in 4 bits)
            quantized = (normalized.unsqueeze(1) - self.NF4_VALUES).abs().argmin(dim=1)
            quantized_blocks.append(quantized)
        self.base_weight_4bit = torch.cat(quantized_blocks)
        self.weight_scales = torch.stack(scale_blocks)

    def dequantize_weight(self):
        """Reconstruct a full-precision weight from the 4-bit indices and per-block scales"""
        block_size = 64
        dequantized_blocks = []
        for i, scale in enumerate(self.weight_scales):
            start = i * block_size
            end = min(start + block_size, self.base_weight_4bit.numel())
            indices = self.base_weight_4bit[start:end]
            dequantized_blocks.append(self.NF4_VALUES[indices] * scale)
        return torch.cat(dequantized_blocks).view(self.in_features, self.out_features)

    def forward(self, x):
        base_weight = self.dequantize_weight()
        base_output = F.linear(x, base_weight.T)
        lora_output = (x @ self.lora_A @ self.lora_B) * self.scaling
        return base_output + lora_output
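The blockwise absmax scheme used by QLoRALinear.quantize_weight can be exercised stand‑alone to see how much round‑trip error 4‑bit quantisation introduces. The symmetric 15‑level grid below is illustrative rather than the exact NF4 table:

```python
import torch

# Stand-alone round-trip check for blockwise absmax 4-bit quantisation,
# the scheme QLoRALinear.quantize_weight uses. The symmetric 15-level grid
# is illustrative, not the exact NF4 table.
torch.manual_seed(0)
grid = torch.tensor([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
                     -0.0911, 0.0, 0.0911, 0.1848, 0.2844, 0.3949,
                     0.5251, 0.6962, 1.0])
w = torch.randn(4096)
block_size = 64
roundtrip = torch.empty_like(w)
for i in range(0, w.numel(), block_size):
    block = w[i:i + block_size]
    scale = block.abs().max()                        # normalise block to [-1, 1]
    idx = (block / scale).unsqueeze(1).sub(grid).abs().argmin(dim=1)
    roundtrip[i:i + block_size] = grid[idx] * scale  # dequantise

rel_err = ((w - roundtrip).norm() / w.norm()).item()
print(f"round-trip relative error: {rel_err:.3f}")
```

The error is a few percent of the weight norm, which is the noise LoRA's full‑precision adapters must compensate for during training.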

QLoRA Memory Savings Analysis

def analyze_qlora_memory():
    """Compare FP16 full‑FT memory with QLoRA memory"""
    d_model = 4096
    num_layers = 32
    rank = 16
    original_params = d_model * d_model * 4 * num_layers  # FP16
    original_memory_gb = original_params * 2 / 1e9
    # QLoRA: 4‑bit base weight + FP16 LoRA weight
    base_4bit_memory = original_params * 0.5 / 1e9
    lora_memory = d_model * rank * 2 * 4 * num_layers * 2 / 1e9
    qlora_memory_gb = base_4bit_memory + lora_memory
    print("=== QLoRA Memory Analysis ===")
    print(f"Original FP16 memory: {original_memory_gb:.1f} GB")
    print(f"QLoRA total memory: {qlora_memory_gb:.1f} GB")
    print(f"4‑bit base weight: {base_4bit_memory:.1f} GB")
    print(f"LoRA weight: {lora_memory:.1f} GB")
    print(f"Memory saving: {original_memory_gb / qlora_memory_gb:.1f}x")

Interview Answer Templates

Q1: How do you choose the LoRA rank?

Task complexity: simple tasks (e.g., sentiment analysis) work well with rank=8‑16; complex tasks (e.g., code generation) may need rank=64‑128.

Model size: larger models can afford a higher rank; smaller models benefit from lower values.

Data volume: with limited data, prefer a lower rank to avoid over‑fitting; with abundant data, a higher rank pays off.

Rule of thumb: rank ≈ sqrt(min(input_dim, output_dim)) / 4.

In practice, start with rank=16 and adjust based on validation performance.
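The rule of thumb above can be packaged as a tiny helper. It is purely a heuristic starting point, not a formula from the LoRA paper:

```python
import math

# The rule-of-thumb rank ≈ sqrt(min(in_dim, out_dim)) / 4 as a helper.
# A heuristic starting point only; validate on a held-out set.
def suggest_rank(in_dim: int, out_dim: int) -> int:
    return max(4, round(math.sqrt(min(in_dim, out_dim)) / 4))

print(suggest_rank(4096, 4096))  # 16
```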

Q2: When might QLoRA perform poorly?

Precision‑sensitive tasks: mathematical computation or code generation, where 4‑bit quantisation noise hurts accuracy.

Very small fine‑tuning datasets: quantisation noise can dominate.

Hardware incompatibility: some GPUs lack efficient 4‑bit kernels, leading to slower training.

Inference‑latency‑critical scenarios: extra de‑quantisation overhead may increase latency.

Mitigation: weigh accuracy versus efficiency for the target use‑case.

Q3: How to evaluate different fine‑tuning methods?

Task performance: accuracy, F1, BLEU, etc., on the target task.

Generalisation: gap between validation and test‑set results.

Training efficiency: convergence speed, GPU memory consumption, wall‑clock time.

Deployment cost: final model size, inference speed, hardware requirements.

Design controlled experiments that vary only the fine‑tuning method.

Q4: How to deploy a LoRA‑fine‑tuned model?

Weight‑merge deployment: merge the LoRA adapters into the base model and serve a standard checkpoint.

Separate‑adapter deployment: keep LoRA weights external and load them at runtime, useful for multi‑task switching.

Weight merging is simpler for a single‑task service; separate adapters enable flexible multi‑task scenarios.
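Weight‑merge deployment follows the same algebra as LoRALinear.merge_weights: fold (A @ B × scaling)ᵀ into the base weight, after which the merged layer reproduces the adapter forward pass exactly. A minimal sketch with random weights (dropout omitted, as at inference time):

```python
import torch
import torch.nn as nn

# Weight-merge deployment in miniature, following LoRALinear.merge_weights:
# fold (A @ B * scaling).T into the base weight so inference needs no extra
# matmuls. Dropout is identity at eval time, so it is omitted here.
torch.manual_seed(0)
d, r, scaling = 64, 8, 2.0
base = nn.Linear(d, d, bias=False)
A = torch.randn(d, r) * 0.01      # LoRA A (in x r)
B = torch.randn(r, d)             # LoRA B (r x out)

x = torch.randn(4, d)
adapter_out = base(x) + (x @ A @ B) * scaling    # unmerged forward pass

merged = nn.Linear(d, d, bias=False)
merged.weight.data = base.weight.data + (A @ B * scaling).T
merged_out = merged(x)

print(torch.allclose(adapter_out, merged_out, atol=1e-5))  # True
```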

Review Checklist

LoRA core: low‑rank assumption, A/B matrix factorisation, rank/alpha selection.

QLoRA innovations: 4‑bit NF4 quantisation, dual‑weight scheme, optimiser paging.

Parameter analysis: ~128× parameter reduction, 4‑8× memory saving, compute overhead breakdown.

Implementation details: weight initialisation, scaling factor, gradient handling, weight merging.

Engineering experience: rank‑selection heuristics, training tricks, deployment strategies.

Effect evaluation: multi‑metric assessment, comparative experiment design.

Tags: model compression, LoRA, QLoRA, low-rank adaptation, parameter-efficient fine-tuning
Written by Wu Shixiong's Large Model Academy

We continuously share large‑model know‑how, helping you master core skills—LLM, RAG, fine‑tuning, deployment—from zero to job offer, tailored for career‑switchers, autumn recruiters, and those seeking stable large‑model positions.