Unlocking Efficient LLM Fine‑Tuning: LoRA, QLoRA, and DoRA Compared
This article examines three parameter‑efficient fine‑tuning (PEFT) techniques (LoRA, QLoRA, and DoRA), explaining their core mechanisms, walking through implementation code and benchmark results, weighing memory and speed trade‑offs, and offering guidance on which method best fits different hardware and accuracy requirements.
Background
Large language models (LLMs) such as LLaMA, Mistral, and Qwen often contain tens of billions of parameters, making full‑parameter fine‑tuning prohibitively expensive in terms of GPU memory (e.g., a 65B model in float16 needs ~130 GB) and compute time.
Researchers therefore focus on Parameter‑Efficient Fine‑Tuning (PEFT), which updates only a small subset of parameters while keeping the bulk of the pretrained weights frozen.
LoRA – Low‑Rank Adaptation
Core Idea
LoRA freezes the original weight matrix W and adds two small trainable matrices A and B. During the forward pass the output becomes:

output = W·x + (alpha/r) · (B·A)·x

Only A and B are updated; W remains unchanged. Typical ranks r are 4, 8, or 16, and alpha scales the strength of the update.
Parameter reduction is dramatic: for a (4096, 4096) matrix, full parameters are ~16.7 M, while LoRA with r=8 uses only 65 k trainable parameters (≈99.6 % fewer).
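A quick back‑of‑the‑envelope check of that count (the layer size and rank are the figures quoted above, not tied to any particular model):

# Sketch: trainable parameters, full update vs. LoRA (r = 8), for a 4096 x 4096 weight.
d_out, d_in, r = 4096, 4096, 8

full_params = d_out * d_in            # 16,777,216 if W itself were trained
lora_params = r * d_in + d_out * r    # A is (r, d_in), B is (d_out, r) -> 65,536

print(f"{100 * (1 - lora_params / full_params):.1f}% fewer trainable parameters")  # ~99.6%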
Reference Implementation (Microsoft)
# Source: github.com/microsoft/LoRA — loralib/layers.py
class Linear(nn.Linear, LoRALayer):
    def __init__(self, in_features: int, out_features: int, r: int = 0,
                 lora_alpha: int = 1, lora_dropout: float = 0.,
                 merge_weights: bool = True, **kwargs):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)
        if r > 0:
            # Low-rank factors: A is (r, in_features), B is (out_features, r)
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
            self.scaling = self.lora_alpha / self.r
            # Freeze the pretrained weight; only lora_A and lora_B receive gradients
            self.weight.requires_grad = False

    def forward(self, x: torch.Tensor):
        if self.r > 0 and not self.merged:
            result = F.linear(x, self.weight, bias=self.bias)
            # Add the scaled low-rank update: (alpha / r) * B A x
            result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1) @
                       self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return F.linear(x, self.weight, bias=self.bias)

Applying LoRA to Attention
# Source: github.com/microsoft/LoRA — README
import loralib as lora
# Before: qkv_proj = nn.Linear(d_model, 3*d_model)
# After: apply LoRA to Q and V, freeze K
qkv_proj = lora.MergedLinear(
    d_model, 3 * d_model,
    r=8,
    enable_lora=[True, False, True]  # Q=LoRA, K=frozen, V=LoRA
)
lora.mark_only_lora_as_trainable(model)

Benchmark (LoRA paper)
| Method | Trainable Params | BLEU | NIST | MET | ROUGE-L | CIDEr |
|----------------------|------------------|------|------|-----|----------|-------|
| Full Fine‑Tuning | 117M | 68.2 | 8.62 | 46.2| 71.0 | 2.47 |
| Adapter (Houlsby) | 1.0M | 66.3 | 8.41 | 45.0| 69.8 | 2.40 |
| Prefix Tuning | 0.35M | 68.1 | 8.59 | 46.3| 70.8 | 2.47 |
| LoRA (r=4)            | 0.77M            | 70.4 | 8.85 | 46.8| 71.8     | 2.53  |

LoRA matches or exceeds full fine‑tuning while training less than 1 % of the parameters, thanks to a regularization effect of the low‑rank constraint.
Key Hyper‑Parameters
Rank r: typically 4–16; larger values increase capacity and parameter count.
Scaling factor alpha: the LoRA update is multiplied by alpha/r, so alpha effectively controls how strongly the adapter perturbs the frozen weights; a common rule of thumb is alpha = 2 × r.
Target modules: usually the q_proj and v_proj projections in attention (see the configuration sketch after this list).
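To make these three knobs concrete, here is a minimal configuration sketch using HuggingFace PEFT; the model name and module names are assumptions and should be adjusted to the architecture being fine‑tuned:

# Sketch (not from the original sources): LoRA hyper-parameters via HuggingFace PEFT.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base model

config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # alpha = 2 * r rule of thumb
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically well under 1 % of parameters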
QLoRA – Quantized LoRA
While LoRA reduces trainable parameters, the base model must still be loaded in full precision, which remains memory‑intensive for very large models.
QLoRA first quantizes the base model to 4‑bit NF4 (a normal‑float‑oriented format) and then applies LoRA adapters in 16‑bit, allowing fine‑tuning on modest GPUs.
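A rough sense of why this helps, as a sketch that counts weight storage only (activations, optimizer states, and the adapters themselves are ignored):

# Sketch: base-weight memory, fp16 vs. 4-bit NF4, for a 7B-parameter model.
params = 7e9

print(f"fp16 weights: {params * 2 / 1e9:.1f} GB")    # ~14 GB (2 bytes per weight)
print(f"NF4  weights: {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB (4 bits per weight)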
Three Technical Innovations (QLoRA paper)
NF4 4‑bit format: optimized for normally distributed weights, offering higher information density than INT4 or FP4.
Double quantization: the quantization constants themselves are quantized, saving ~3 GB on a 65B model (see the arithmetic sketch after this list).
Paged optimizers: when GPU memory runs short, optimizer states are automatically paged out to CPU RAM and swapped back as needed.
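The ~3 GB figure for double quantization can be recovered from the block sizes reported in the QLoRA paper (64 weights per quantization block, 256 constants per second‑level block); a back‑of‑the‑envelope sketch:

# Sketch: memory saved by double quantization on a 65B-parameter model.
params = 65e9

# Plain 4-bit blockwise quantization: one fp32 absmax constant per 64 weights.
plain_overhead_bits = 32 / 64                      # 0.5 extra bits per parameter

# Double quantization: constants stored in 8 bits, plus one fp32 constant
# per block of 256 first-level constants.
dq_overhead_bits = 8 / 64 + 32 / (64 * 256)        # ~0.127 extra bits per parameter

saved_gb = params * (plain_overhead_bits - dq_overhead_bits) / 8 / 1e9
print(f"saved ~{saved_gb:.1f} GB")                 # ~3.0 GB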
Implementation Example
# Source: github.com/artidoro/qlora — qlora.py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# Step 1: 4‑bit NF4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Step 2: Load base model in 4‑bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Step 3: Prepare for k‑bit training
model = prepare_model_for_kbit_training(model)

# Step 4: Apply LoRA adapters (16‑bit) on top
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

Benchmark (QLoRA paper)
| Method | Model | Memory Usage | Vicuna Score vs ChatGPT |
|----------------------|-------|--------------|--------------------------|
| Full Fine‑Tuning (fp16) | 65B | >780 GB | — |
| LoRA (fp16) | 65B | ~130 GB | — |
| QLoRA (NF4) | 65B | ~48 GB | 99.3 % |
| QLoRA (NF4) | 33B | ~24 GB | 97.8 % |
| QLoRA (NF4)           | 7B    | ~5 GB        | ~87 %                    |

QLoRA enables fine‑tuning a 65B model on a single 48 GB GPU and a 7B model on a 5 GB GPU, achieving near‑full‑model performance.
DoRA – Weight‑Decomposed Low‑Rank Adaptation
LoRA still lags behind full fine‑tuning, especially at low rank, because its update changes the magnitude and direction of each weight vector in a coupled way rather than adjusting them independently, as full fine‑tuning can.
Core Insight
Any weight matrix W can be decomposed into a magnitude vector m (one scalar per output neuron) and a unit‑direction component V/||V||_c:

W = m × (V / ||V||_c)

Full fine‑tuning can adjust magnitude and direction independently, whereas LoRA ties them together, limiting expressiveness. DoRA decouples them: the magnitude m is learned as an independent parameter per output neuron, while the LoRA matrices modify only the direction:

W' = (m + Δm) × ((V + ΔV_LoRA) / ||V + ΔV_LoRA||_c)

During inference the magnitude and direction are merged back into a single weight matrix, incurring zero extra cost.
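A small numeric sketch (not taken from the DoRA repository) of that merge: once training is done, the learned magnitude and the re‑normalized, adapted direction collapse into one ordinary weight matrix, so inference is a single linear layer.

# Sketch: merging DoRA's magnitude and direction back into one weight matrix.
import torch
import torch.nn.functional as F

W = torch.randn(4, 3)                                # pretrained weight (d_out, d_in)
m = W.norm(p=2, dim=1, keepdim=True)                 # learned magnitude, one per output neuron
delta_V = 0.01 * torch.randn(4, 3)                   # stand-in for the trained LoRA update B·A

V = W + delta_V
W_merged = m * V / V.norm(p=2, dim=1, keepdim=True)  # single (d_out, d_in) matrix

x = torch.randn(3)
y = F.linear(x, W_merged)                            # same inference cost as the original layer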
DoRA Layer Implementation (NVlabs)
# Source: github.com/NVlabs/DoRA
import torch
import torch.nn as nn
import torch.nn.functional as F
class DoRALayer(nn.Module):
    def __init__(self, d_in, d_out, rank, lora_alpha):
        super().__init__()
        # Frozen pretrained weight (random here as a placeholder)
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Learnable magnitude, one scalar per output neuron
        self.m = nn.Parameter(self.weight.norm(p=2, dim=1, keepdim=True))
        # LoRA matrices for directional updates
        std = 1 / torch.sqrt(torch.tensor(rank).float())
        self.lora_A = nn.Parameter(torch.randn(d_in, rank) * std)
        self.lora_B = nn.Parameter(torch.zeros(rank, d_out))
        self.rank = rank
        self.scaling = lora_alpha / rank

    def forward(self, x):
        # Directional update from LoRA: (d_in, rank) @ (rank, d_out), transposed to (d_out, d_in)
        lora_update = (self.lora_A @ self.lora_B).T * self.scaling
        adapted = self.weight + lora_update
        # Normalize each output neuron's weight vector to a unit direction
        column_norms = adapted.norm(p=2, dim=1, keepdim=True)
        V_normalized = adapted / column_norms
        # Scale by the learned magnitude
        effective_weight = self.m * V_normalized
        return F.linear(x, effective_weight)

Enabling DoRA via HuggingFace PEFT
# Source: HuggingFace PEFT documentation
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=True,  # single flag activates DoRA
)
model = get_peft_model(model, lora_config)

Benchmark (DoRA paper)
| Method | BoolQ | PIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg |
|----------------------|-------|------|----------|------------|-------|-------|------|-------|
| Full Fine‑Tuning | 69.4 | 82.3 | 89.7 | 82.4 | 79.8 | 60.7 | 81.6 | 77.99 |
| LoRA (r=32) | 68.9 | 80.9 | 90.0 | 82.1 | 78.2 | 59.8 | 80.0 | 77.13 |
| DoRA (r=32)           | 70.0  | 83.6 | 91.0     | 83.0       | 81.4  | 65.8  | 83.4 | 79.75 |

DoRA consistently outperforms LoRA across these commonsense reasoning datasets, with an especially large gain on the harder ARC‑c set.
Memory Requirements
| Method | LLaMA 7B | LLaMA 13B | LLaMA 33B | LLaMA 65B |
|----------------------|----------|-----------|-----------|-----------|
| Full Fine‑Tuning (fp16) | ~28 GB | ~52 GB | ~130 GB | ~260 GB |
| LoRA (fp16) | ~14 GB | ~26 GB | ~65 GB | ~130 GB |
| QLoRA (NF4) | ~5 GB | ~8 GB | ~20 GB | ~48 GB |
| DoRA (fp16)          | ~14 GB   | ~26 GB    | ~65 GB    | ~130 GB   |

These are rough estimates of base‑weight memory; activations and optimizer state add more in practice. DoRA adds only a negligible magnitude vector, so its memory footprint is comparable to LoRA's.
Performance vs Full Fine‑Tuning
| Method | Commonsense Reasoning | Instruction Tuning | Memory |
|----------|----------------------|-------------------|--------|
| Full FT | Baseline | Baseline | Very High |
| LoRA | -0.86 avg | Comparable | Medium |
| QLoRA | ~Same as LoRA | 99.3 % of ChatGPT | Low |
| DoRA     | +2.62 avg over LoRA  | Better than LoRA  | Medium |

Training Speed (Relative)
| Method | Speed |
|--------|-------|
| LoRA | Fast |
| DoRA | Fast (near identical to LoRA) |
| QLoRA  | Moderate (quantize/dequantize overhead) |

Best Choice per Scenario
When to Use LoRA
GPU memory ≥16 GB, model size ≤13 B, and you need a stable, well‑supported solution. LoRA is the foundation of the HuggingFace PEFT ecosystem and the easiest entry point for newcomers.
When to Use QLoRA
Limited GPU memory but a need to fine‑tune very large models (30 B+). QLoRA makes a 65 B model trainable on a single 48 GB GPU, and even a 20 B+ model becomes feasible on a free Colab T4 (15 GB).
When to Use DoRA
Same parameter budget as LoRA but you want the highest possible accuracy. Enabling use_dora=True yields a drop‑in upgrade with zero inference overhead and noticeable gains on complex reasoning tasks.
When to Use QLoRA + DoRA (QDoRA)
Combine memory savings of QLoRA with DoRA’s accuracy boost; early experiments show QDoRA can match or surpass full fine‑tuning.
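A sketch of what that combination could look like; the assumption here is simply the QLoRA loading recipe from above plus use_dora=True, which requires a recent PEFT release that supports DoRA on bitsandbytes‑quantized layers:

# Sketch (assumed recipe, not from the original sources): QDoRA = 4-bit base model + DoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

qdora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    use_dora=True,  # DoRA decomposition on top of the quantized base model
)
model = get_peft_model(model, qdora_config)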
Conclusion
In the 2025 fine‑tuning landscape, LoRA serves as the base layer (simple, fast, mature ecosystem), QLoRA opens the door to consumer‑grade GPUs for large‑scale models with minimal accuracy loss, and DoRA offers a free upgrade that matches LoRA’s cost while delivering better results. The only decision left is which PEFT method aligns best with your hardware constraints and performance goals.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.