LoRA vs QLoRA vs Full Fine‑Tuning: Which Method Wins for Large‑Model Adaptation?
This article provides a practical, data‑driven comparison of Full Fine‑Tuning, LoRA, and QLoRA for adapting 7B‑70B open‑source LLMs, detailing memory requirements, training speed, cost, performance trade‑offs, step‑by‑step workflows, code examples, evaluation metrics, common pitfalls, and optimization tips to help engineers choose the most suitable fine‑tuning approach for their data and budget.
Motivation and Real‑World Incident
An engineering team tried to fine‑tune Qwen‑7B on 2,000 internal QA pairs using Full Fine‑Tuning (FT). The first attempt OOMed on an 8×A100 80GB cluster after three steps. Switching to DeepSpeed ZeRO‑3 allowed training but slowed it down six‑fold. Finally, using LoRA on a single 24GB GPU finished in four hours with comparable performance, costing only one‑eighth of Full FT.
What Fine‑tuning Solves
Fine‑tuning enables a generic model to perform well on a specific vertical domain by updating model weights, unlike Retrieval‑Augmented Generation (RAG) which only supplies external knowledge without changing the model.
When to Use Each Method
Full Fine‑tuning : suitable for large teams with 8+ A100/H100 GPUs, >100k data points, and a need for maximal performance. Not suitable for individual developers, single‑GPU setups, <10k data points, or frequent iteration.
LoRA : the default choice for most mid‑size teams on a single or dual GPU with 1k‑100k data points. Not suitable when the task diverges drastically from pre‑training knowledge.
QLoRA : fits memory‑constrained (24GB GPU) scenarios, allowing fine‑tuning of 13B‑70B models with ~30% slower training. Not suitable for latency‑critical deployments or extremely large batch sizes.
Rough Memory‑Based Heuristics
GPU memory ≥ 4× model parameters (FP16) → Full FT is feasible.
GPU memory ≈ 1.5× model parameters → LoRA.
GPU memory ≈ 0.5× model parameters → QLoRA.
For a 7B model (≈14 GB FP16): Full FT needs ≥56 GB, LoRA runs with ~24 GB, QLoRA fits in ~12 GB.
Architectural Differences
Full FT:
Input → [W: weights] → [∂W: gradients] → [Adam m,v]
Memory ≈ 84 GB (weights+gradients+optimizer) + activations
LoRA:
Input → [W frozen] + low‑rank updates A·B (r≈8‑64)
LoRA params ≈ 30 MB, total ≈ 30 GB (model + LoRA)
QLoRA:
Input → [W quantized to 4‑bit NF4] + LoRA
Quantization cuts weight memory 4×, total ≈ 18 GB (model+LoRA)Key Concepts
Full FT : stores full FP32 weights, gradients, and optimizer states; a 7B model needs ≈84 GB.
LoRA : freezes the base model and trains two small matrices A (in×r) and B (r×out). With r=16 the trainable parameters are only 0.1‑0.5 % of the original.
QLoRA : quantizes weights to NF4 (4‑bit NormalFloat) before applying LoRA, reducing memory by ~4×.
Step‑by‑Step Workflow
Data Preparation
Clean : deduplicate, strip HTML, filter low‑quality rows.
Format : apply the model’s chat template with apply_chat_template (critical step that many skip).
Tokenize : produce records of {input_ids, labels, attention_mask}.
Common issues : length truncation (use max_length=2048 ), data imbalance (down‑sample dominant classes), and mismatched chat templates across models (Qwen, ChatGLM, Llama each have their own).
Model Loading
Full FT : AutoModelForCausalLM.from_pretrained(...) directly.
LoRA : load the model then call get_peft_model with a LoraConfig.
QLoRA : create a BitsAndBytesConfig (4‑bit NF4, double‑quant, compute dtype bf16), load the quantized model, then run prepare_model_for_kbit_training before applying LoRA.
Training Parameter Configuration (LoRA on 7B)
learning_rate: 2e‑4 for LoRA, 2e‑5 for Full FT (LoRA can use a learning rate ten times larger). batch_size: 4‑32 per device (with gradient accumulation). Single‑GPU 4‑8, multi‑GPU 32+. gradient_accumulation_steps: 8‑16 (effective batch 32‑128). num_train_epochs: 3‑5 (LoRA tends to overfit after 5 epochs). warmup_ratio: 0.03‑0.1 (3‑10 % of steps). lr_scheduler_type: cosine (more stable than linear). max_grad_norm: 1.0 (gradient clipping to avoid loss explosion).
Training
Instantiate an SFTTrainer (or Trainer) with the model, arguments, and tokenized dataset, then call trainer.train(). For Full FT, enable DeepSpeed ZeRO‑2 and gradient_checkpointing=True to fit into GPU memory. attn_implementation="flash_attention_2": ~30 % faster, ~20 % less memory. gradient_checkpointing=True: trades compute for 30‑50 % memory savings. deepspeed="ds_config_zero2.json": only for Full FT.
Merging and Export
After training, merge LoRA/QLoRA adapters into a single checkpoint:
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./output_merged", safe_serialization=True)
tokenizer.save_pretrained("./output_merged")For QLoRA, convert the merged model back to torch.bfloat16 before saving.
Empirical Memory & Speed Comparison (Qwen2‑7B, seq‑len 2048)
Full FT (BF16 + ZeRO‑2, 8 GPUs) : 72 GB × 8 (peak 78 GB), 1.4 s/step, 100 % effectiveness.
LoRA (BF16, single GPU) : 22 GB, 0.8 s/step, 99.0‑99.5 % effectiveness.
QLoRA (4‑bit, single GPU) : 11.5 GB, 1.1 s/step, 98.0‑99.0 % effectiveness.
Takeaway : LoRA reaches ~99 % of Full FT performance while using only one‑third of the memory and training time.
Post‑Deployment Evaluation
Task‑Specific Metrics
Classification – Accuracy / F1 (scikit‑learn).
Extraction (NER/QA) – Exact Match / F1 (SQuAD script).
Generation (summarization/translation) – ROUGE / BLEU (rouge‑score / sacrebleu).
JSON Schema output – parse success + field accuracy (jsonschema library).
Free‑form QA – human blind rating + LLM‑as‑Judge (GPT‑4 scoring).
General‑Ability Checks (catastrophic forgetting)
MMLU – EleutherAI/lm‑evaluation‑harness, acceptable drop ≤ 5 %.
HumanEval (code) – same harness, drop ≤ 10 %.
Chinese C‑Eval – drop ≤ 5 %.
GSM8K (math) – drop ≤ 10 %.
Inference‑Cost Metrics
P50 / P99 latency – measured with wrk or locust load test.
Cost per token – GPU‑hours divided by total generated tokens.
Peak GPU memory – measured with vLLM gpu-memory-utilization setting.
Common Pitfalls (12 Practical Tips)
Missing target_modules (e.g., lm_head ) : generation quality drops. Verify trainable layers with model.print_trainable_parameters(). For generation tasks use target_modules="all-linear".
Chat template mismatch : loss may decrease while inference produces garbled text. Set the correct template explicitly: tokenizer.chat_template = "...your template..." Skipping prepare_model_for_kbit_training in QLoRA : leads to instability or OOM. This step must be called before applying LoRA on a quantized model.
Over‑large LoRA rank : r=64 or 128 on <1k data causes severe over‑fitting (train loss 0.3, test loss 1.5). Recommended ranks:
Data < 1k: r=8, alpha=16
1k‑10k: r=16, alpha=32
> 100k: r=32‑64, alpha=64‑128
Confusing learning rates : Full FT typically uses 1e‑5 – 5e‑5, while LoRA/QLoRA need 1e‑4 – 3e‑4 because LoRA updates are smaller.
Improper gradient accumulation : effective batch size should stay between 32 and 128. Example settings – single GPU: batch = 4, accumulation = 8 → 32; 8‑GPU: batch = 2, accumulation = 4 per GPU → 64.
DataLoader deadlock on Windows : set num_workers=0 on Windows; keep a positive value (e.g., 4) on Linux.
Flash‑Attention installation failure : on Linux follow the official README; on Windows fall back to PyTorch’s built‑in sdpa (acceptable ~30 % slowdown) or use xformers if available.
Post‑merge performance drop : ensure dtype consistency after merge_and_unload() and disable LoRA dropout during inference:
model = model.merge_and_unload()
model = model.to(torch.bfloat16)Prompt injection in training data : filter sensitive or adversarial prompts before training, optionally apply RLHF or a reward model, and perform at least 200 manual spot checks.
OOM due to occasional very long samples : hard‑truncate inputs (e.g., max_length=2048) and enable group_by_length=True to batch similar‑length examples together.
vLLM “Unknown quantization” error after LoRA merge : save the merged checkpoint with safe_serialization=True and load it directly with vLLM:
model.save_pretrained("./output", safe_serialization=True, max_shard_size="5GB")
vllm serve ./output --tensor-parallel-size 2Optimization Directions (From "Works" to "Works Great")
Prioritize data quality over quantity; 2k high‑quality examples beat 20k noisy ones.
Start with Supervised Fine‑Tuning (SFT) then apply DPO/PPO for preference alignment.
Merge LoRA adapters and quantize (AWQ 4‑bit) for up to 4× inference cost reduction.
Data augmentation (e.g., GPT‑4 generated paraphrases) can noticeably boost performance.
Multi‑task fine‑tuning improves generalization.
NEFTune (additive noise to embeddings) yields 2‑5 % gains at zero cost.
LongLoRA / QLoRA for extended context windows (4k → 32k).
Model Soup: average several LoRA checkpoints for better results.
Continual pre‑training on large unsupervised corpora before SFT helps when switching domains.
Decision Flow
Data < 1k rows → expand data first, avoid immediate training.
GPU memory ≥ 4× model size → Full FT.
GPU memory ≈ 1.5× model size → LoRA (default recommendation).
GPU memory ≈ 0.5× model size → QLoRA.
High inference cost after training → quantize + vLLM.
Conclusion
Fine‑tuning teaches a model new domain‑specific behavior without erasing its general capabilities. Full FT offers the highest ceiling at the highest cost, LoRA is the sweet spot for ~80 % of scenarios, and QLoRA is the only viable option when GPU memory is tight enough to fit only a fraction of the model. Distributed training (ZeRO, FSDP, pipeline parallelism) is only needed for models larger than 30B or when the dataset is too big for a single GPU.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
