Unlocking Efficient LLM Fine‑Tuning: LoRA, QLoRA, and DoRA Compared
This article examines three parameter‑efficient fine‑tuning (PEFT) techniques (LoRA, QLoRA, and DoRA), explaining their core mechanisms, walking through implementation code and benchmark results, weighing memory and speed trade‑offs, and offering guidance on which method best fits different hardware and accuracy requirements.
Background
Large language models (LLMs) such as LLaMA, Mistral, and Qwen often contain tens of billions of parameters, making full‑parameter fine‑tuning prohibitively expensive in terms of GPU memory (e.g., a 65B model in float16 needs ~130 GB) and compute time.
Researchers therefore focus on Parameter‑Efficient Fine‑Tuning (PEFT), which updates only a small subset of parameters while keeping the bulk of the pretrained weights frozen.
LoRA – Low‑Rank Adaptation
Core Idea
LoRA freezes the original weight matrix W and adds two small trainable matrices A and B. During the forward pass the output becomes:

output = W·x + (alpha/r) · (B·A)·x

Only A and B are updated; W remains unchanged. Typical ranks r are 4, 8, or 16, and alpha scales the strength of the update.
Parameter reduction is dramatic: for a (4096, 4096) matrix, full parameters are ~16.7 M, while LoRA with r=8 uses only 65 k trainable parameters (≈99.6 % fewer).
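A quick back‑of‑the‑envelope check of that count (the layer size and rank are the figures quoted above, not tied to any particular model):

# Sketch: trainable parameters, full update vs. LoRA (r = 8), for a 4096 x 4096 weight.
d_out, d_in, r = 4096, 4096, 8

full_params = d_out * d_in            # 16,777,216 if W itself were trained
lora_params = r * d_in + d_out * r    # A is (r, d_in), B is (d_out, r) -> 65,536

print(f"{100 * (1 - lora_params / full_params):.1f}% fewer trainable parameters")  # ~99.6%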
Reference Implementation (Microsoft)
# Source: github.com/microsoft/LoRA — loralib/layers.py
class Linear(nn.Linear, LoRALayer):
    def __init__(self, in_features: int, out_features: int, r: int = 0,
                 lora_alpha: int = 1, lora_dropout: float = 0.,
                 merge_weights: bool = True, **kwargs):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)
        if r > 0:
            # Low-rank factors: A is (r, in_features), B is (out_features, r)
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
            self.scaling = self.lora_alpha / self.r
            # Freeze the pretrained weight; only lora_A and lora_B receive gradients
            self.weight.requires_grad = False

    def forward(self, x: torch.Tensor):
        if self.r > 0 and not self.merged:
            result = F.linear(x, self.weight, bias=self.bias)
            # Add the scaled low-rank update: (alpha / r) * B A x
            result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1) @
                       self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return F.linear(x, self.weight, bias=self.bias)

Applying LoRA to Attention
# Source: github.com/microsoft/LoRA — README
import loralib as lora
# Before: qkv_proj = nn.Linear(d_model, 3*d_model)
# After: apply LoRA to Q and V, freeze K
qkv_proj = lora.MergedLinear(
    d_model, 3 * d_model,
    r=8,
    enable_lora=[True, False, True]  # Q=LoRA, K=frozen, V=LoRA
)
lora.mark_only_lora_as_trainable(model)

Benchmark (LoRA paper)
| Method | Trainable Params | BLEU | NIST | MET | ROUGE-L | CIDEr |
|----------------------|------------------|------|------|-----|----------|-------|
| Full Fine‑Tuning | 117M | 68.2 | 8.62 | 46.2| 71.0 | 2.47 |
| Adapter (Houlsby) | 1.0M | 66.3 | 8.41 | 45.0| 69.8 | 2.40 |
| Prefix Tuning | 0.35M | 68.1 | 8.59 | 46.3| 70.8 | 2.47 |
| LoRA (r=4)            | 0.77M            | 70.4 | 8.85 | 46.8| 71.8     | 2.53  |

LoRA matches or exceeds full fine‑tuning while training less than 1 % of the parameters, thanks to a regularization effect of the low‑rank constraint.
Key Hyper‑Parameters
Rank r: typically 4–16; larger values increase capacity and parameter count.
Scaling factor alpha: the LoRA update is multiplied by alpha/r, so alpha effectively controls how strongly the adapter perturbs the frozen weights; a common rule of thumb is alpha = 2 × r.
Target modules: usually the q_proj and v_proj projections in attention (see the configuration sketch after this list).
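To make these three knobs concrete, here is a minimal configuration sketch using HuggingFace PEFT; the model name and module names are assumptions and should be adjusted to the architecture being fine‑tuned:

# Sketch (not from the original sources): LoRA hyper-parameters via HuggingFace PEFT.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base model

config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # alpha = 2 * r rule of thumb
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically well under 1 % of parameters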
QLoRA – Quantized LoRA
While LoRA reduces trainable parameters, the base model must still be loaded in full precision, which remains memory‑intensive for very large models.
QLoRA first quantizes the base model to 4‑bit NF4 (a normal‑float‑oriented format) and then applies LoRA adapters in 16‑bit, allowing fine‑tuning on modest GPUs.
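A rough sense of why this helps, as a sketch that counts weight storage only (activations, optimizer states, and the adapters themselves are ignored):

# Sketch: base-weight memory, fp16 vs. 4-bit NF4, for a 7B-parameter model.
params = 7e9

print(f"fp16 weights: {params * 2 / 1e9:.1f} GB")    # ~14 GB (2 bytes per weight)
print(f"NF4  weights: {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB (4 bits per weight)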
Three Technical Innovations (QLoRA paper)
NF4 4‑bit format: optimized for normally distributed weights, offering higher information density than INT4 or FP4.
Double quantization: the quantization constants themselves are quantized, saving ~3 GB on a 65B model (see the arithmetic sketch after this list).
Paged optimizers: when GPU memory runs short, optimizer states are automatically paged out to CPU RAM and swapped back as needed.
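The ~3 GB figure for double quantization can be recovered from the block sizes reported in the QLoRA paper (64 weights per quantization block, 256 constants per second‑level block); a back‑of‑the‑envelope sketch:

# Sketch: memory saved by double quantization on a 65B-parameter model.
params = 65e9

# Plain 4-bit blockwise quantization: one fp32 absmax constant per 64 weights.
plain_overhead_bits = 32 / 64                      # 0.5 extra bits per parameter

# Double quantization: constants stored in 8 bits, plus one fp32 constant
# per block of 256 first-level constants.
dq_overhead_bits = 8 / 64 + 32 / (64 * 256)        # ~0.127 extra bits per parameter

saved_gb = params * (plain_overhead_bits - dq_overhead_bits) / 8 / 1e9
print(f"saved ~{saved_gb:.1f} GB")                 # ~3.0 GB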
Implementation Example
# Source: github.com/artidoro/qlora — qlora.py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# Step 1: 4‑bit NF4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Step 2: Load base model in 4‑bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Step 3: Prepare for k‑bit training
model = prepare_model_for_kbit_training(model)

# Step 4: Apply LoRA adapters (16‑bit) on top
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

Benchmark (QLoRA paper)
| Method | Model | Memory Usage | Vicuna Score vs ChatGPT |
|----------------------|-------|--------------|--------------------------|
| Full Fine‑Tuning (fp16) | 65B | >780 GB | — |
| LoRA (fp16) | 65B | ~130 GB | — |
| QLoRA (NF4) | 65B | ~48 GB | 99.3 % |
| QLoRA (NF4) | 33B | ~24 GB | 97.8 % |
| QLoRA (NF4)           | 7B    | ~5 GB        | ~87 %                    |

QLoRA enables fine‑tuning a 65B model on a single 48 GB GPU and a 7B model on a 5 GB GPU, achieving near‑full‑model performance.
DoRA – Weight‑Decomposed Low‑Rank Adaptation
LoRA still lags behind full fine‑tuning, especially at low rank, because its update changes the magnitude and direction of each weight vector in a coupled way rather than adjusting them independently, as full fine‑tuning can.
Core Insight
Any weight matrix W can be decomposed into a magnitude vector m (one scalar per output neuron) and a unit‑direction component V/||V||_c:

W = m × (V / ||V||_c)

Full fine‑tuning can adjust magnitude and direction independently, whereas LoRA ties them together, limiting expressiveness. DoRA decouples them: the magnitude m is learned as an independent parameter per output neuron, while the LoRA matrices modify only the direction:

W' = (m + Δm) × ((V + ΔV_LoRA) / ||V + ΔV_LoRA||_c)

During inference the magnitude and direction are merged back into a single weight matrix, incurring zero extra cost.
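A small numeric sketch (not taken from the DoRA repository) of that merge: once training is done, the learned magnitude and the re‑normalized, adapted direction collapse into one ordinary weight matrix, so inference is a single linear layer.

# Sketch: merging DoRA's magnitude and direction back into one weight matrix.
import torch
import torch.nn.functional as F

W = torch.randn(4, 3)                                # pretrained weight (d_out, d_in)
m = W.norm(p=2, dim=1, keepdim=True)                 # learned magnitude, one per output neuron
delta_V = 0.01 * torch.randn(4, 3)                   # stand-in for the trained LoRA update B·A

V = W + delta_V
W_merged = m * V / V.norm(p=2, dim=1, keepdim=True)  # single (d_out, d_in) matrix

x = torch.randn(3)
y = F.linear(x, W_merged)                            # same inference cost as the original layer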
DoRA Layer Implementation (NVlabs)
# Source: github.com/NVlabs/DoRA
import torch
import torch.nn as nn
import torch.nn.functional as F
class DoRALayer(nn.Module):
    def __init__(self, d_in, d_out, rank, lora_alpha):
        super().__init__()
        # Frozen pretrained weight (random here as a placeholder)
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Learnable magnitude, one scalar per output neuron
        self.m = nn.Parameter(self.weight.norm(p=2, dim=1, keepdim=True))
        # LoRA matrices for directional updates
        std = 1 / torch.sqrt(torch.tensor(rank).float())
        self.lora_A = nn.Parameter(torch.randn(d_in, rank) * std)
        self.lora_B = nn.Parameter(torch.zeros(rank, d_out))
        self.rank = rank
        self.scaling = lora_alpha / rank

    def forward(self, x):
        # Directional update from LoRA: (d_in, rank) @ (rank, d_out), transposed to (d_out, d_in)
        lora_update = (self.lora_A @ self.lora_B).T * self.scaling
        adapted = self.weight + lora_update
        # Normalize each output neuron's weight vector to a unit direction
        column_norms = adapted.norm(p=2, dim=1, keepdim=True)
        V_normalized = adapted / column_norms
        # Scale by the learned magnitude
        effective_weight = self.m * V_normalized
        return F.linear(x, effective_weight)

Enabling DoRA via HuggingFace PEFT
# Source: HuggingFace PEFT documentation
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=True,  # single flag activates DoRA
)
model = get_peft_model(model, lora_config)

Benchmark (DoRA paper)
| Method | BoolQ | PIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg |
|----------------------|-------|------|----------|------------|-------|-------|------|-------|
| Full Fine‑Tuning | 69.4 | 82.3 | 89.7 | 82.4 | 79.8 | 60.7 | 81.6 | 77.99 |
| LoRA (r=32) | 68.9 | 80.9 | 90.0 | 82.1 | 78.2 | 59.8 | 80.0 | 77.13 |
| DoRA (r=32)           | 70.0  | 83.6 | 91.0     | 83.0       | 81.4  | 65.8  | 83.4 | 79.75 |

DoRA consistently outperforms LoRA across these commonsense reasoning datasets, with an especially large gain on the harder ARC‑c set.
Memory Requirements
| Method | LLaMA 7B | LLaMA 13B | LLaMA 33B | LLaMA 65B |
|----------------------|----------|-----------|-----------|-----------|
| Full Fine‑Tuning (fp16) | ~28 GB | ~52 GB | ~130 GB | ~260 GB |
| LoRA (fp16) | ~14 GB | ~26 GB | ~65 GB | ~130 GB |
| QLoRA (NF4) | ~5 GB | ~8 GB | ~20 GB | ~48 GB |
| DoRA (fp16)          | ~14 GB   | ~26 GB    | ~65 GB    | ~130 GB   |

These are rough estimates of base‑weight memory; activations and optimizer state add more in practice. DoRA adds only a negligible magnitude vector, so its memory footprint is comparable to LoRA's.
Performance vs Full Fine‑Tuning
| Method | Commonsense Reasoning | Instruction Tuning | Memory |
|----------|----------------------|-------------------|--------|
| Full FT | Baseline | Baseline | Very High |
| LoRA | -0.86 avg | Comparable | Medium |
| QLoRA | ~Same as LoRA | 99.3 % of ChatGPT | Low |
| DoRA     | +2.62 avg over LoRA  | Better than LoRA  | Medium |

Training Speed (Relative)
| Method | Speed |
|--------|-------|
| LoRA | Fast |
| DoRA | Fast (near identical to LoRA) |
| QLoRA  | Moderate (quantize/dequantize overhead) |

Best Choice per Scenario
When to Use LoRA
GPU memory ≥16 GB, model size ≤13 B, and you need a stable, well‑supported solution. LoRA is the foundation of the HuggingFace PEFT ecosystem and the easiest entry point for newcomers.
When to Use QLoRA
Limited GPU memory but a need to fine‑tune very large models (30 B+). QLoRA makes a 65 B model trainable on a single 48 GB GPU, and even a 20 B+ model becomes feasible on a free Colab T4 (15 GB).
When to Use DoRA
Same parameter budget as LoRA but you want the highest possible accuracy. Enabling use_dora=True yields a drop‑in upgrade with zero inference overhead and noticeable gains on complex reasoning tasks.
When to Use QLoRA + DoRA (QDoRA)
Combine memory savings of QLoRA with DoRA’s accuracy boost; early experiments show QDoRA can match or surpass full fine‑tuning.
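A sketch of what that combination could look like; the assumption here is simply the QLoRA loading recipe from above plus use_dora=True, which requires a recent PEFT release that supports DoRA on bitsandbytes‑quantized layers:

# Sketch (assumed recipe, not from the original sources): QDoRA = 4-bit base model + DoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

qdora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    use_dora=True,  # DoRA decomposition on top of the quantized base model
)
model = get_peft_model(model, qdora_config)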
Conclusion
In the 2025 fine‑tuning landscape, LoRA serves as the base layer (simple, fast, mature ecosystem), QLoRA opens the door to consumer‑grade GPUs for large‑scale models with minimal accuracy loss, and DoRA offers a free upgrade that matches LoRA’s cost while delivering better results. The only decision left is which PEFT method aligns best with your hardware constraints and performance goals.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.