Why LoRA, QLoRA, Prompt & Prefix Tuning Are Changing Large‑Model Fine‑Tuning
This article explains the mathematical basis of LoRA, compares it with QLoRA, Prompt Tuning, Prefix Tuning and P‑tuning, shows practical PyTorch implementations, and provides mixed‑precision training tips so readers can choose the most memory‑efficient fine‑tuning method for their large language models.
Traditional Fine‑tuning Problems
Fine‑tuning a 7 B‑parameter model in FP16 requires roughly 84 GB of GPU memory (parameters, gradients, and optimizer states), which exceeds the 80 GB of a single A100 and makes large‑scale training impractical.
Model parameters: 7 B × 2 bytes = 14 GB
Gradients: 7 B × 2 bytes = 14 GB
Adam optimizer states: 7 B × 8 bytes = 56 GB
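The arithmetic above is easy to verify in a few lines (a quick sketch; decimal GB, as in the figures above):

```python
def finetune_memory_gb(n_params: float) -> dict:
    """Estimate FP16 full fine-tuning memory for a model with n_params parameters."""
    GB = 1e9  # decimal gigabytes, matching the article's figures
    params = n_params * 2  # FP16 weights: 2 bytes each
    grads = n_params * 2   # FP16 gradients: 2 bytes each
    adam = n_params * 8    # Adam: FP32 momentum + variance, 4 + 4 bytes
    return {
        "params_gb": params / GB,
        "grads_gb": grads / GB,
        "optimizer_gb": adam / GB,
        "total_gb": (params + grads + adam) / GB,
    }

mem = finetune_memory_gb(7e9)
```

For a 7 B model this yields 14 + 14 + 56 = 84 GB, confirming that full fine-tuning does not fit on a single 80 GB A100.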
LoRA Solution
Core assumption: the pretrained model already captures most knowledge, so fine‑tuning only needs a low‑rank additive update:

W_new = W + ΔW

Key insight: the update ΔW can be factorised into two small matrices:

ΔW = A × B

W shape: (d, k) e.g., (4096, 4096)
A shape: (d, r) e.g., (4096, 16)
B shape: (r, k) e.g., (16, 4096)
r ≪ min(d, k) is the rank
Mathematical Intuition
Think of a full‑size linear layer as a 4096 × 4096 brush where every pixel can be coloured independently. LoRA replaces it with 16 “basic brushes” (the columns of A) that are combined by B, drastically reducing degrees of freedom while preserving most expressive power.
Hand‑written LoRA implementation (PyTorch)
import torch
import torch.nn as nn
import math

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=16, alpha=16, dropout=0.1):
        super().__init__()
        # Frozen pretrained weight
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.linear.weight.requires_grad = False
        self.rank = rank
        self.alpha = alpha
        self.dropout = nn.Dropout(dropout)
        # Low-rank factors: A is initialised randomly, B to zero, so ΔW starts at 0
        self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
        self.scaling = alpha / rank

    def forward(self, x):
        original_output = self.linear(x)
        lora_output = self.dropout(x) @ self.lora_A @ self.lora_B * self.scaling
        return original_output + lora_output

Why LoRA works
Parameter reduction : A 4096 × 4096 matrix (≈16 M parameters) becomes two small matrices 4096 × 16 + 16 × 4096 (≈131 K parameters), a 99 % reduction.
Mathematical justification : Most weight matrices are low‑rank; singular‑value decomposition shows that a few dominant singular values capture the useful information.
Empirical evidence : Experiments show that LoRA with rank = 1–64 achieves >95 % of full‑fine‑tuning performance on many tasks.
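The low-rank claim can be probed numerically: take a matrix, keep only its top-r singular values, and measure how much of the energy survives. A small sketch on a synthetic near-low-rank matrix (not a real checkpoint):

```python
import torch

torch.manual_seed(0)
# Build a matrix that is approximately rank-8 plus small noise,
# mimicking the structure LoRA assumes for weight updates
d, k, true_rank = 256, 256, 8
W = torch.randn(d, true_rank) @ torch.randn(true_rank, k) + 0.01 * torch.randn(d, k)

U, S, Vh = torch.linalg.svd(W)
r = 16
W_approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

# Fraction of the matrix "energy" (squared Frobenius norm) captured at rank r
captured = (S[:r] ** 2).sum() / (S ** 2).sum()
rel_err = torch.linalg.norm(W - W_approx) / torch.linalg.norm(W)
```

When the underlying structure is low-rank, a rank-16 approximation captures essentially all of the energy, which is exactly the regime LoRA bets on for the update ΔW.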
QLoRA vs. LoRA
QLoRA (Quantized LoRA) adds 4‑bit quantisation of the base model and a double‑quantisation of the scaling factors, reducing the base model’s memory from ~14 GB (FP16) to ~3.5 GB (4‑bit) while keeping LoRA’s trainable parameters.
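The claimed savings follow directly from the storage width. A back-of-envelope sketch (decimal GB, one scale per 64-weight block, ignoring other overhead):

```python
n_params = 7e9
fp16_gb = n_params * 2 / 1e9    # 2 bytes per weight -> 14 GB
int4_gb = n_params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight -> 3.5 GB

# Without double quantisation: one FP16 scale (2 bytes) per 64-weight block
scales_fp16_gb = (n_params / 64) * 2 / 1e9
# With double quantisation: scales stored in 8-bit (1 byte) instead
scales_8bit_gb = (n_params / 64) * 1 / 1e9
```

The second quantisation matters because at a block size of 64 the FP16 scales alone would add a few hundred megabytes; halving them is nearly free accuracy-wise.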
Core innovations
4‑bit base model : The entire pretrained checkpoint is stored in 4‑bit, cutting memory dramatically.
Double quantisation : Scaling factors are further compressed to 8‑bit, avoiding the overhead of storing a full FP16 scale per block.
def double_quantization(weights, block_size=64):
    # First quantisation: FP16 weights -> 4-bit integers, one scale per block
    scales_fp16 = []
    quantized_4bit = []
    for block in weights.split(block_size):
        scale = block.abs().max() / 7  # symmetric 4-bit range [-7, 7]
        scales_fp16.append(scale)
        quantized = torch.round(block / scale).clamp(-7, 7)
        quantized_4bit.append(quantized)
    # Second quantisation: the FP16 scales themselves -> 8-bit
    scales_fp16 = torch.stack(scales_fp16)
    scale_scale = scales_fp16.abs().max() / 127
    scales_8bit = torch.round(scales_fp16 / scale_scale).clamp(-127, 127)
    return quantized_4bit, scales_8bit, scale_scale

Mixed‑precision training details
Forward pass: most ops in FP16, numerically sensitive ops in FP32.
Backward pass: gradients computed in FP16, accumulation in FP32.
Parameter update: performed in FP32 for stability.
Practical mixed‑precision training in PyTorch
import torch
from torch.cuda.amp import autocast, GradScaler

model = YourModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    # Forward pass under autocast: FP16 where safe, FP32 for sensitive ops
    with autocast():
        outputs = model(batch['input_ids'])
        loss = criterion(outputs, batch['labels'])
    # Scale the loss to avoid FP16 gradient underflow, then backprop
    scaler.scale(loss).backward()
    # Unscale before clipping so the clip threshold is in true gradient units
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()

Prompt Tuning, Prefix Tuning, P‑tuning
All three methods modify the input representation instead of the full weight matrix.
class PromptTuning(nn.Module):
    def __init__(self, model, prompt_length=20, embed_dim=768):
        super().__init__()
        self.model = model
        # Trainable soft prompt prepended to the input embeddings
        self.soft_prompt = nn.Parameter(torch.randn(prompt_length, embed_dim))

    def forward(self, input_ids):
        batch_size = input_ids.shape[0]
        inputs_embeds = self.model.get_input_embeddings()(input_ids)
        prompt_embeds = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        inputs_embeds = torch.cat([prompt_embeds, inputs_embeds], dim=1)
        return self.model(inputs_embeds=inputs_embeds)

class PrefixTuning(nn.Module):
    def __init__(self, model, prefix_length=20, num_layers=12, hidden_size=768):
        super().__init__()
        self.model = model
        self.prefix_length = prefix_length
        # One small MLP per layer maps prefix tokens to that layer's K and V
        self.prefix_encoder = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, hidden_size),
                          nn.Tanh(),
                          nn.Linear(hidden_size, 2 * hidden_size))
            for _ in range(num_layers)
        ])
        self.prefix_tokens = nn.Parameter(torch.randn(prefix_length, hidden_size))

    def get_prefix_states(self, batch_size):
        prefix_embeds = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1, -1)
        prefix_states = []
        for layer in self.prefix_encoder:
            prefix_kv = layer(prefix_embeds)
            prefix_k, prefix_v = prefix_kv.chunk(2, dim=-1)
            prefix_states.append((prefix_k, prefix_v))
        return prefix_states

    def forward(self, input_ids):
        batch_size = input_ids.shape[0]
        prefix_states = self.get_prefix_states(batch_size)
        # Hook each attention layer to add the prefix K,V (implementation omitted)
        return self.model(input_ids, prefix_states=prefix_states)

class PTuning(nn.Module):
    def __init__(self, model, pattern="[P0][P1][P2] {text} [P3]", hidden_size=768):
        super().__init__()
        import re  # local import for clarity; move to module top in real code
        self.model = model
        self.pattern = pattern
        # Extract placeholders such as [P0]; splitting on whitespace would merge
        # adjacent placeholders like [P0][P1][P2], so use a regex instead
        self.prompt_tokens = re.findall(r'\[P\d+\]', pattern)
        self.prompt_embeddings = nn.ParameterDict({
            token: nn.Parameter(torch.randn(hidden_size)) for token in self.prompt_tokens
        })
        # Encoder that smooths/correlates the prompt embeddings
        self.prompt_encoder = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                                            nn.ReLU(),
                                            nn.Linear(hidden_size, hidden_size))

    def forward(self, input_ids, text_span):
        # Build the mixed input sequence according to the pattern (logic omitted)
        pass

Choosing the right method
Abundant GPU memory (>40 GB): Prompt Tuning for simple tasks; LoRA (rank 32–64) for more demanding ones.
Limited memory (16–32 GB): LoRA with a low rank (8–16) is the primary choice; Prefix Tuning as a backup.
Very low memory (<16 GB): QLoRA is usually the only viable option, optionally combined with gradient checkpointing.
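The guidance above can be encoded as a toy selector (purely illustrative; the thresholds come from this article, not from any library API):

```python
def pick_method(gpu_memory_gb: float, simple_task: bool = False) -> str:
    """Map available GPU memory to the fine-tuning method suggested above."""
    if gpu_memory_gb > 40:
        return "Prompt Tuning" if simple_task else "LoRA (rank 32-64)"
    if gpu_memory_gb >= 16:
        return "LoRA (rank 8-16)"
    return "QLoRA (+ gradient checkpointing)"
```

For example, `pick_method(24)` returns the low-rank LoRA recommendation, while anything under 16 GB falls through to QLoRA.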
Engineering recommendations
Start with LoRA – it offers a good trade‑off between performance and implementation simplicity.
Iteratively increase rank or alpha only if the task demands higher capacity.
Monitor mixed‑precision stability (loss‑scale history, gradient norms) and adjust learning‑rate or scaling factors accordingly.
Run systematic experiments; the best method varies across datasets and model architectures.
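For the stability monitoring in point 3, the loss-scale history is available from `GradScaler.get_scale()`, and the gradient norm can be computed with a small helper before clipping (a minimal sketch; it reproduces the global L2 norm that `clip_grad_norm_` operates on):

```python
import torch
import torch.nn as nn

def grad_global_norm(model: nn.Module) -> float:
    """Global L2 norm over all gradients - the quantity clip_grad_norm_ clips."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += float(p.grad.detach().norm() ** 2)
    return total ** 0.5

# Tiny CPU example: one backward pass, then inspect the norm
model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
norm = grad_global_norm(model)
```

Logging this value each step (alongside the scaler's current scale) makes it easy to spot exploding gradients or repeated loss-scale backoffs before training diverges.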
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how, helping you master core skills—LLM, RAG, fine‑tuning, deployment—from zero to job offer, tailored for career‑switchers, autumn recruiters, and those seeking stable large‑model positions.