Master LLaMA Factory Fine‑Tuning: Key Parameter Settings & Memory Optimization
This tutorial walks through LLaMA‑Factory fine‑tuning by explaining how to choose learning rate, epochs, batch size, cutoff length, LoRA rank, and validation split, and shows how to estimate and reduce GPU memory usage with techniques like gradient accumulation, liger_kernel, and DeepSpeed.
Adjust fine‑tuning parameters
In model fine‑tuning, hyper‑parameters act like a teaching plan that determines how intensively and in which direction the model learns. Choosing inappropriate values can lead to poor performance.
Common questions such as “which settings give the best results?” have no universal answer; they depend on the model, dataset, and hardware.
Learning Rate
Core concept : Controls the magnitude of parameter updates per step, typically between 0 and 1.
Plain explanation : A larger learning rate makes bigger jumps (fast progress but risk of overshooting); a smaller rate makes finer adjustments (stable but slower).
Personal experience : Keep it in the 4e‑5 to 5e‑5 range for LoRA; avoid large rates for full‑parameter fine‑tuning, where they can destabilize training.
Memory impact : Almost none.
Chosen value : 5e‑5 (0.00005).
Number of Epochs
Core concept : One epoch means the model has seen the entire training set once.
Plain explanation : Too few epochs may under‑fit; too many can cause over‑fitting.
Personal experience : Usually 3 epochs are enough; stop early if the loss plateaus, and treat a training loss that falls below roughly 0.5‑1.5 as a warning sign of over‑fitting.
Memory impact : Almost none.
Chosen value : 3.
Batch Size
Core concept : Number of samples processed before each parameter update.
Plain explanation : Larger batches speed up training but consume more memory; smaller batches are memory‑friendly but slower.
Memory impact : Significant – larger batch size increases memory linearly.
Practical calculation :
per_device_train_batch_size * gradient_accumulation_steps gives the effective batch size.
Personal experience : For limited GPU memory, start with batch size 1 and increase gradient accumulation steps (e.g., 8).
Chosen values : per‑device batch size = 1, gradient accumulation steps = 8.
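The effective batch size described above can be sketched in a few lines of Python, using the values chosen in this tutorial (the multi‑GPU factor is an assumption for the general case):

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_gpus: int = 1) -> int:
    """Samples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_gpus

# This tutorial's settings: per-device batch size 1, 8 accumulation steps.
print(effective_batch_size(1, 8))  # 8 samples per optimizer step
```

Gradient accumulation trades time for memory: only one sample's activations live on the GPU at a time, but the optimizer still sees an effective batch of 8.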
Cutoff length (Max Length)
Core concept : Maximum number of tokens a model can process per sample.
Memory impact : Increases roughly linearly with length; doubling length roughly doubles memory.
Personal experience : Set to 4096 for the security dataset (99.83% of samples < 4000 tokens); adjust based on hardware.
Chosen value : 4096.
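The "99.83% of samples fit" figure above comes from scanning the dataset's token counts. A minimal sketch of that check, with made‑up token counts standing in for a real tokenizer pass over the dataset:

```python
def coverage(token_counts, cutoff):
    """Fraction of samples whose token count fits within the cutoff."""
    fit = sum(1 for n in token_counts if n <= cutoff)
    return fit / len(token_counts)

# Hypothetical per-sample token counts; in practice, run the model's
# tokenizer over each training sample to collect these.
counts = [512, 900, 1500, 3800, 4200, 2200, 3100, 700, 1100, 2600]
print(f"{coverage(counts, 4096):.0%} of samples fit within 4096 tokens")
```

Samples longer than the cutoff are truncated during training, so pick a cutoff that covers nearly all of your data without paying for tokens you rarely need.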
LoRA rank
Core concept : Determines the capacity of the low‑rank adaptation matrix; higher rank gives more expressive power but uses more memory.
Plain explanation : Rank 4 = few “thinking templates”, stable but limited; rank 64 = many templates, powerful but prone to over‑fitting.
Personal experience : Start with rank 8‑16; avoid < 8 for small datasets.
Memory impact : Minor (≈0.5 GB for 7B model).
Chosen value : 8.
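LoRA's memory footprint stays small because it only trains two thin matrices (A and B) per target weight. A rough illustration, assuming a hypothetical 4096×4096 projection matrix (actual shapes vary by model):

```python
def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out weight:
    A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)

# One 4096x4096 projection at rank 8 (illustrative shape only).
p = lora_params(8, 4096, 4096)
print(p)  # 65536 trainable parameters for that single matrix
```

Doubling the rank doubles the adapter size, which is why even rank 64 adds well under a gigabyte for a 7B model, in line with the ≈0.5 GB figure quoted above for rank 8.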
Validation set proportion
Core concept : Fraction of data reserved for evaluating model performance during training.
Personal experience : Small datasets (<1000 samples) – 0.1‑0.2; large datasets (>10 000) – 0.05‑0.1.
Chosen value : 0.15 (431 samples).
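The 431‑sample figure above implies a dataset of roughly 2,870 samples (431 ÷ 0.15); the exact total is an assumption here. A quick sketch of the split arithmetic:

```python
def split_counts(total_samples: int, val_size: float):
    """Training/validation sample counts for a given split ratio."""
    val = round(total_samples * val_size)
    return total_samples - val, val

# Assuming ~2,873 total samples, a 0.15 split reserves ~431 for validation.
train, val = split_counts(2873, 0.15)
print(train, val)
```

The training-sample count matters later: it determines how many optimizer steps each epoch takes.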
Start fine‑tuning task
Preview command
After configuring the parameters, click “Preview command” to generate the full llamafactory-cli train command.
<code>llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path /root/autodl-tmp/Qwen/Qwen2.5-7B-Instruct \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template qwen \
--flash_attn auto \
--dataset_dir data \
--dataset security \
--cutoff_len 4096 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--max_samples 100000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--packing False \
--report_to none \
--use_swanlab True \
--output_dir /root/autodl-tmp/models/security007 \
--bf16 True \
--plot_loss True \
--trust_remote_code True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--optim adamw_torch \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all \
--swanlab_project security007 \
--swanlab_mode cloud \
--val_size 0.15 \
--eval_strategy steps \
--eval_steps 100 \
--per_device_eval_batch_size 1</code>
Training process
Click “Start training” and monitor the terminal output and the LLaMA Board loss curve.
Total steps = (training samples ÷ (batch size × gradient accumulation steps)) × epochs; any final incomplete accumulation batch is dropped.
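The step formula above can be checked in Python. The 2,442 training samples below are an assumed example (the dataset size after a 0.15 validation split); the batch settings are this tutorial's:

```python
def total_steps(train_samples: int, per_device_batch: int,
                grad_accum: int, epochs: int, num_gpus: int = 1) -> int:
    """Optimizer steps for the whole run; the trailing incomplete
    accumulation batch each epoch is dropped (floor division)."""
    effective = per_device_batch * grad_accum * num_gpus
    return (train_samples // effective) * epochs

# Assumed 2,442 training samples, batch size 1, 8 accumulation steps, 3 epochs.
print(total_steps(2442, 1, 8, 3))  # 915
```

Knowing the total up front helps sanity‑check save_steps and eval_steps: with save_steps = 100, this run would write about 9 checkpoints.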
Memory consumption estimation
The hardware used includes two 48 GB GPUs. Memory usage can be broken down into four parts:
Base model weights : 7 B parameters × 2 bytes (BF16) ≈ 14 GB.
Framework overhead : Roughly 1 GB.
LoRA adapters : Approx. 0.5 GB for rank 8.
Activations : With batch size 1 and a 4096‑token cutoff (≈4 K tokens per step), each 1 K tokens adds roughly 2.5 GB, so ≈10 GB.
<code>Base model weights: 14 GB
Framework overhead: 1 GB
LoRA adapters: 0.5 GB
Activations: 10 GB
---------------------
Total ≈ 25.5 GB</code>
Memory‑optimisation tricks: liger_kernel
Enabling liger_kernel rewrites key Transformer ops in Triton and fuses them, reducing activation memory. Experiments show memory growth drops from 2.5 GB per 1 K tokens to ~0.6 GB.
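Plugging the per‑1K‑token figures quoted above into the linear activation model gives a quick estimate of the savings (the figures are this tutorial's measurements, not universal constants):

```python
def activation_memory_gb(tokens: int, gb_per_1k_tokens: float) -> float:
    """Rough activation footprint: grows linearly with token count."""
    return tokens / 1024 * gb_per_1k_tokens

# Measured in this tutorial: ~2.5 GB per 1K tokens without liger_kernel,
# ~0.6 GB with it, at a 4096-token cutoff.
baseline = activation_memory_gb(4096, 2.5)    # 10.0 GB
with_liger = activation_memory_gb(4096, 0.6)  # ~2.4 GB
print(f"saved ≈ {baseline - with_liger:.1f} GB")
```

At longer cutoffs the gap widens further, since both curves are linear in token count but with very different slopes.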
Distributed memory optimisation: DeepSpeed
Multi‑GPU training still stores the full model on each card unless ZeRO (DeepSpeed) is used. DeepSpeed Stage 3 shards parameters across GPUs, dramatically reducing per‑GPU memory.
With Stage 3 on two RTX 4090 (24 GB) cards, memory per card is ~16.3 GB (total ≈32.6 GB), slightly higher than the theoretical 30.5 GB due to communication overhead.
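As a rough sanity check, a simplified per‑card estimate under ZeRO‑3 can be sketched in Python. It shards only the base weights and ignores communication buffers and kernel optimisations such as liger_kernel, so it will not reproduce the measured figure exactly; the component sizes are the single‑GPU estimates from the previous section:

```python
def zero3_per_gpu_gb(weights_gb: float, overhead_gb: float,
                     lora_gb: float, activ_gb: float, num_gpus: int) -> float:
    """Simplified ZeRO-3 model: base weights are sharded across GPUs,
    while framework overhead, LoRA adapters, and activations are still
    paid on every card. Communication buffers are ignored."""
    return weights_gb / num_gpus + overhead_gb + lora_gb + activ_gb

# This tutorial's single-GPU estimates: 14 GB weights, 1 GB overhead,
# 0.5 GB LoRA, 10 GB activations, spread over 2 GPUs.
print(zero3_per_gpu_gb(14, 1, 0.5, 10, 2))  # 18.5
```

The dominant remaining term is activations, which is why ZeRO sharding pairs well with activation‑memory techniques like liger_kernel rather than replacing them.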
Next episode will cover loss monitoring and model export/deployment.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.