A Beginner's Deep Dive into Large‑Model Training Parameters with LLaMAFactory
This article walks readers through the three major training methods—full‑parameter, LoRA, and QLoRA—explaining their memory costs, data requirements, and trade‑offs, then provides a line‑by‑line breakdown of LLaMAFactory configuration files, hyper‑parameter tuning guidelines, and the process for merging LoRA adapters into a deployable model.
Training Methods Overview
The guide compares three representative training approaches supported by LLaMAFactory:
Full‑parameter training
All model layers are updated, which yields the most thorough fine‑tuning but demands the highest GPU memory: a LLaMA‑7B model in FP16 occupies roughly 14 GB for the weights alone, gradients add another ~14 GB, and the Adam optimizer's FP32 master weights plus two moment buffers add roughly 84 GB more, so the total easily exceeds 100 GB. It also demands a massive, high‑quality dataset: insufficient or low‑quality data can cause catastrophic forgetting, where the model loses its original language and reasoning abilities.
LoRA training
Only a lightweight adapter is trained while the base model remains frozen. This reduces memory consumption dramatically (a single consumer‑grade GPU with 8–12 GB VRAM can fine‑tune a 7 B model) and tolerates smaller or noisier datasets because the underlying knowledge is preserved. The adapter’s expressive power is controlled by lora_rank and lora_alpha.
QLoRA training
QLoRA further compresses the base model to 4‑bit NF4 quantization, cutting memory usage even more while keeping the same data‑tolerance benefits of LoRA. Empirical results show QLoRA reaches >95 % of LoRA’s performance on most NLP tasks, though very precision‑sensitive tasks may see slight degradation.
Choosing a Training Method
For most personal or small‑team projects, LoRA is recommended because it balances resource efficiency and performance. QLoRA is an option when GPU memory is extremely limited, but users should test for potential minor accuracy loss on sensitive tasks.
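These trade‑offs can be made concrete with a back‑of‑envelope memory estimate. The sketch below uses standard per‑parameter byte counts (FP16 = 2 bytes, FP32 = 4 bytes, NF4 ≈ 0.5 bytes) and an assumed ~1 % trainable adapter fraction; it ignores activation memory entirely, so treat the totals as illustrative, not measurements:

```python
# Back-of-envelope weight-memory estimate for a 7B-parameter model under the
# three methods above. Ignores activations; adapter fraction is an assumption.

GB = 1024 ** 3
params = 7e9

def full_fp16_gb(n):
    # FP16 weights + FP16 gradients + Adam (FP32 master copy + two FP32 moments)
    return (2 * n + 2 * n + 4 * n + 8 * n) / GB

def lora_fp16_gb(n, trainable_frac=0.01):
    # Frozen FP16 base; only the small adapter carries gradients/optimizer state
    t = n * trainable_frac
    return (2 * n + (2 + 2 + 12) * t) / GB

def qlora_gb(n, trainable_frac=0.01):
    # 4-bit NF4 base (~0.5 byte/param) plus the same small adapter overhead
    t = n * trainable_frac
    return (0.5 * n + (2 + 2 + 12) * t) / GB

for name, f in [("full", full_fp16_gb), ("lora", lora_fp16_gb), ("qlora", qlora_gb)]:
    print(f"{name:>5}: ~{f(params):.0f} GB")
```

The output puts full‑parameter training well past 100 GB while LoRA and QLoRA land in consumer‑GPU territory, matching the figures above.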
Configuration File Breakdown
The core configuration is divided into sections. Below are the most important fields and their meanings.
Model Section
```yaml
model_name_or_path: /workspace/Qwen2_5_0_5
trust_remote_code: true
```
Specifies the local path of the base model and enables loading of any custom code the model requires.
Method Section
```yaml
stage: sft
finetuning_type: lora
lora_rank: 8
lora_target: all
```
Sets the training stage to supervised fine‑tuning, selects LoRA as the fine‑tuning type, and defines the adapter rank and target layers ("all" is the simplest choice; for finer control, target q_proj and v_proj).
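To get a sense of how small the adapter is: each adapted weight matrix of shape (d_out, d_in) gains two low‑rank factors totalling r × (d_in + d_out) trainable parameters. A minimal sketch (the layer shapes below are hypothetical, not read from the actual Qwen2.5 model):

```python
# Count LoRA trainable parameters: an adapted matrix W (d_out x d_in) gains
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) extra parameters.

def lora_params(shapes, rank):
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)

# Hypothetical q/k/v/o projections across 24 layers, rank 8 (illustrative shapes)
shapes = [(896, 896)] * 4 * 24
print(lora_params(shapes, rank=8))  # → 1376256, about a million parameters
```

Roughly a million trainable parameters versus hundreds of millions in the frozen base is why LoRA fits on consumer GPUs.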
Dataset Section
```yaml
dataset: alpaca_zh_demo
template: qwen
cutoff_len: 2048
max_samples: 1000
preprocessing_num_workers: 16
dataloader_num_workers: 4
```
Points to a registered dataset, selects the Qwen chat template, sets the maximum token length per sample, limits the number of training samples, and configures parallel workers for preprocessing and data loading.
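For reference, alpaca‑style datasets like the demo above are JSON lists of instruction/input/output records; the record content below is invented for illustration:

```python
import json

# One record in the alpaca instruction format (content invented for the example).
record = {
    "instruction": "Translate the sentence into English.",
    "input": "你好，世界",
    "output": "Hello, world",
}

# A dataset file is simply a JSON list of such records.
print(json.dumps([record], ensure_ascii=False, indent=2))
```

Records longer than cutoff_len (2048 tokens here) are truncated during preprocessing.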
Output Section
```yaml
output_dir: /workspace/test_sft/Qwen2_5_sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none
```
Defines where the LoRA adapter is saved, how often loss is logged, the checkpoint interval, whether a loss curve is plotted, whether existing output may be overwritten, and whether optimizer state is saved alongside the model weights (save_only_model: false keeps it, which is what makes resuming possible). report_to: none disables external experiment trackers.
Training Hyper‑parameters
```yaml
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
num_train_epochs: 1.0
learning_rate: 1.0e-4
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
```
The effective batch size is per_device_train_batch_size × gradient_accumulation_steps (1 × 8 = 8). Together these parameters govern gradient estimation, learning‑rate scheduling, mixed‑precision training, and checkpoint recovery.
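The scheduler settings can be sketched numerically. Below is a minimal reimplementation of linear warmup followed by cosine decay, matching learning_rate: 1.0e-4 and warmup_ratio: 0.1 (the total step count is illustrative, not derived from the actual dataset):

```python
import math

# Linear warmup + cosine decay, as configured above (base_lr 1e-4, 10% warmup).

def lr_at(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return base_lr * step / max(1, warmup)               # linear ramp-up
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # decay to 0

total = 1000  # illustrative
for s in (0, 50, 100, 500, 1000):
    print(s, f"{lr_at(s, total):.2e}")
```

The rate climbs to its peak at the end of warmup (step 100 here), then follows the cosine curve down to zero, which is what smooths the final convergence.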
Validation Configuration (optional)
```yaml
# eval_dataset: alpaca_en_demo
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
```
When enabled, a held‑out set is used to monitor generalization and avoid over‑fitting.
LoRA Adapter Merging
After fine‑tuning, the adapter must be merged into the base model for inference. The merge configuration looks like:
```yaml
model_name_or_path: /workspace/Qwen2_5_0_5
adapter_name_or_path: /workspace/test_sft/Qwen2_5_sft
template: qwen
trust_remote_code: true
export_dir: /workspace/test_sft/Qwen2_5_sft_all
export_size: 5
export_device: cpu
export_legacy_format: false
```
Key points: the base‑model path, the adapter path, the chat template, and the export options. export_device can be cpu (safer but slower) or auto (GPU if enough memory is available). With export_legacy_format: false the merged model is written as safetensors; true yields the older pytorch_model.bin format.
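Mathematically, merging folds each adapter into its base weight as W' = W + (lora_alpha / lora_rank) · B · A, after which the adapter files can be discarded. A toy pure‑Python sketch (shapes and values invented for illustration):

```python
# Toy illustration of LoRA merging: W' = W + (alpha / r) * B @ A.
# Real merging applies this per adapted weight matrix inside the model.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def merge(W, A, B, alpha, r):
    delta = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]      # frozen base weight (2x2)
A = [[0.5, -0.5]]                 # rank-1 down-projection (1x2)
B = [[0.0], [0.0]]                # up-projection, zero-initialized

# With B = 0 the merged weight equals W -- the LoRA initialization guarantee.
print(merge(W, A, B, alpha=8, r=1) == W)   # → True

B = [[0.2], [0.4]]                # after training, B is nonzero
print(merge(W, A, B, alpha=8, r=1))        # merged weight now differs from W
```

This is also why lora_alpha matters at inference time: the α/r ratio directly scales how strongly the learned update perturbs the base weights.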
Practical Tuning Guidelines
lora_rank: 4–8 for models up to 7 B, 8–16 for 13 B; go lower (2–4) when memory is tight, higher (16–32) for very complex tasks.
lora_alpha: often set equal to lora_rank as a baseline. Increase toward 2× the rank if loss decreases too slowly (under‑fitting); decrease it if validation loss rises (over‑fitting). Keep the ratio α/r between 1 and 2.
learning_rate: typical range 1e‑4 – 2e‑4 for LoRA. Reduce it if loss oscillates or diverges; increase it slightly if loss declines too slowly.
Batch size & gradient accumulation: aim for an effective batch size ≥ 8. When memory is limited, reduce per_device_train_batch_size first, then compensate with a larger gradient_accumulation_steps.
Warm‑up & scheduler: use a 10 % warm‑up ratio with cosine decay to avoid early instability and to smooth the final convergence.
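These rules of thumb can be folded into a quick sanity‑check helper. The thresholds below are this article's heuristics, not hard constraints:

```python
# Sanity checks encoding the tuning guidelines above (article heuristics only).

def check_config(lora_rank, lora_alpha, batch_size, grad_accum, lr):
    warnings = []
    ratio = lora_alpha / lora_rank
    if not 1 <= ratio <= 2:
        warnings.append(f"alpha/rank ratio {ratio:.1f} outside the 1-2 range")
    if batch_size * grad_accum < 8:
        warnings.append("effective batch size below 8; raise grad accumulation")
    if not 1e-4 <= lr <= 2e-4:
        warnings.append("learning rate outside the typical LoRA range 1e-4..2e-4")
    return warnings

print(check_config(8, 8, 1, 8, 1e-4))   # → [] -- matches the example config
print(check_config(8, 32, 1, 4, 5e-4))  # three warnings
```

Running it against the example configuration above produces no warnings, which is a useful baseline before experimenting.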
Summary
The guide provides a complete, step‑by‑step reference for configuring LLaMAFactory, selecting an appropriate fine‑tuning method, understanding each hyper‑parameter’s impact, and merging the resulting LoRA adapter into a standalone model ready for deployment.
Fun with Large Models
Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
