Unlocking Efficient Large Model Fine‑Tuning: LoRA, LoRA+, rsLoRA, DoRA & PiSSA Explained
This article introduces the fundamentals of large‑model fine‑tuning, compares popular parameter‑efficient methods such as LoRA and its variants, presents experimental results on the Qwen2.5‑7B model, and discusses current challenges and future research directions.
1. Basic Concepts of Large Model Fine‑Tuning
Large models are trained in two stages: pre‑training on massive unlabeled data to learn general language knowledge, and fine‑tuning on task‑specific data to adapt the model to particular applications.
1.1 Pre‑training Stage
During pre‑training, the model learns statistical properties of language in an unsupervised manner, building a versatile base model with strong prediction and generation capabilities.
1.2 Fine‑tuning Stage
Fine‑tuning refines the model’s weights on a specific downstream dataset, improving performance for the target task.
1.3 Essence of Fine‑tuning
Fine‑tuning injects domain‑specific knowledge by training on specialized data, enabling the model to adapt to new environments and broaden its applicability.
1.4 Benefits of Fine‑tuning
Task Adaptation : Improves performance on specific downstream tasks.
Data Efficiency : Requires only a small amount of task‑specific data.
Knowledge Transfer : Leverages pre‑trained knowledge for new tasks.
2. Common Fine‑tuning Methods
Full fine‑tuning updates all layers and parameters on task‑specific data, typically with a small learning rate. It adapts the model thoroughly but is expensive in both computation and memory.
Parameter‑Efficient Fine‑Tuning (PEFT) reduces the number of trainable parameters and computational cost. PEFT includes methods such as LoRA, QLoRA, Adapter Tuning, Prefix Tuning, Prompt Tuning, P‑Tuning, and P‑Tuning v2.
3. LoRA and Its Variants
3.1 LoRA (Low‑Rank Adaptation)
LoRA adds trainable low‑rank matrices alongside the weight matrices of a pre‑trained model, allowing task‑specific information to be captured while the original weights stay frozen. This drastically reduces the number of trainable parameters (e.g., roughly 0.01% of GPT‑3’s parameters) and memory usage.
Key steps:
Rank Decomposition : Approximate the weight update ΔW with the product of two low‑rank matrices, ΔW ≈ B·A, where B ∈ ℝ^{d×r} is initialized to zero, A ∈ ℝ^{r×k} is initialized with small Gaussian values, and the rank r ≪ min(d, k).
Inject Low‑Rank Matrices : Combine A and B with the original weights in selected Transformer layers.
Keep Pre‑trained Weights Fixed : Preserve the knowledge of the base model while adapting to new tasks.
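The steps above can be sketched in a few lines of NumPy. Shapes and initialization follow the standard LoRA recipe; the toy dimensions are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small Gaussian init
B = np.zeros((d_out, r))                # trainable, zero init -> delta_W = 0 at start
scaling = alpha / r                     # standard LoRA scaling factor

def lora_forward(x):
    # y = x W^T + (alpha/r) * x (B A)^T; only A and B receive gradients
    return x @ W.T + scaling * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
# at initialization B = 0, so the adapted model matches the frozen base model
assert np.allclose(lora_forward(x), x @ W.T)
```

Because B starts at zero, training begins exactly at the pre‑trained model and drifts away only as the low‑rank pair learns the task.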
3.2 LoRA+
LoRA+ assigns different learning rates to the two low‑rank matrices, setting η_B = λ·η_A with λ > 1. This addresses the observation that, with a single shared learning rate, the zero‑initialized matrix B is updated too slowly relative to A. Empirically, a ratio of λ = 16 improves both accuracy and training speed.
3.3 rsLoRA
rsLoRA stabilizes training at higher ranks by changing LoRA’s scaling factor from α/r to γ = α/√r. Dividing by √r instead of r keeps the magnitude of the low‑rank update roughly constant as the rank grows, so increasing r no longer suppresses the update and can actually improve quality.
3.4 DoRA (Weight‑Decomposed Low‑Rank Adaptation)
DoRA decomposes each weight into magnitude and direction components. The direction is adapted with a low‑rank update (as in LoRA), while the magnitude is tuned separately, offering finer‑grained control over the update while keeping the trainable parameter count low.
3.5 PiSSA (Principal Singular Values and Singular Vectors Adaptation)
PiSSA differs from LoRA in initialization: it performs singular value decomposition (SVD) on the weight matrix W, initializing low‑rank matrices with principal singular values/vectors, while keeping a residual matrix frozen. Only the low‑rank matrices are trained.
4. Fine‑tuning Experiments on the TAI Platform
Experiments were conducted on the Qwen2.5‑7B model using LoRA, LoRA+, rsLoRA, DoRA, and PiSSA with the same dataset and CEval evaluation. The configuration included lora_rank=8, lora_alpha=16, learning_rate=1e‑4, and a cosine scheduler.
<code>model_name_or_path: models/huggingface/Qwen/Qwen2___5-7B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 8
lora_alpha: 16
lora_dropout: 0.05
dataset: ceval-dev-val-01
template: qwen
cutoff_len: 1024
...
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-4
num_train_epochs: 1
lr_scheduler_type: cosine
</code>
Results:
LoRA+ achieved the highest average score (88.41) across the STEM, Social Sciences, Humanities, and Other domains.
rsLoRA and PiSSA obtained intermediate scores (83.51 and 82.47, respectively), both outperforming the baseline LoRA (79.94).
DoRA scored lowest among the variants (81.05) but still above the original LoRA.
5. Challenges and Future Directions
Data Privacy & Security : Protecting user data during large‑scale model training.
Data Quality & Annotation Cost : Obtaining high‑quality labeled data remains expensive.
Automated Fine‑tuning & AutoML : Developing smarter hyper‑parameter selection and model adaptation.
Model Compression & Fine‑tuning Integration : Combining techniques like distillation and pruning with fine‑tuning to reduce model size while preserving performance.
6. Conclusion
Large‑model fine‑tuning is a pivotal research area that balances preserving pre‑trained capabilities with efficient task adaptation. Methods ranging from full‑parameter updates to parameter‑efficient approaches such as LoRA and its variants each have trade‑offs, with LoRA emerging as a popular choice due to its efficiency and strong performance. Ongoing advances will broaden the applicability of fine‑tuning across diverse scenarios.
References
Yu Y, et al. Low‑rank adaptation of large language model rescoring for parameter‑efficient speech recognition. IEEE ASRU, 2023.
Hayou S, et al. LoRA+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354, 2024.
Kalajdzievski D. A rank stabilization scaling factor for fine‑tuning with LoRA. arXiv preprint arXiv:2312.03732, 2023.
Liu S Y, et al. DoRA: Weight‑decomposed low‑rank adaptation. arXiv preprint arXiv:2402.09353, 2024.
Meng F, et al. PiSSA: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948, 2024.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.