Master LLaMA Factory Fine‑Tuning: Key Parameter Settings & Memory Optimization
This tutorial walks through LLaMA‑Factory fine‑tuning by explaining how to choose learning rate, epochs, batch size, cutoff length, LoRA rank, and validation split, and shows how to estimate and reduce GPU memory usage with techniques like gradient accumulation, liger_kernel, and DeepSpeed.
Adjust fine‑tuning parameters
In model fine‑tuning, hyper‑parameters act like a teaching plan that determines how intensively and in which direction the model learns. Choosing inappropriate values can lead to poor performance.
Common questions such as “which settings give the best results?” have no universal answer; they depend on the model, dataset, and hardware.
Learning Rate
Core concept : Controls the magnitude of parameter updates per step, typically between 0 and 1.
Plain explanation : A larger learning rate makes bigger jumps (fast progress but risk of overshooting); a smaller rate makes finer adjustments (stable but slower).
Personal experience : Keep it in the 4e‑5 to 5e‑5 range for LoRA; avoid large rates for full‑parameter fine‑tuning, where they can destabilize training.
Memory impact : Almost none.
Chosen value : 5e‑5 (0.00005).
Number of Epochs
Core concept : One epoch means the model has seen the entire training set once.
Plain explanation : Too few epochs may under‑fit; too many can cause over‑fitting.
Personal experience : Usually 3 epochs are enough; stop early if the loss plateaus, and treat a training loss that falls below roughly 0.5‑1.5 as a warning sign of over‑fitting.
Memory impact : Almost none.
Chosen value : 3.
Batch Size
Core concept : Number of samples processed before each parameter update.
Plain explanation : Larger batches speed up training but consume more memory; smaller batches are memory‑friendly but slower.
Memory impact : Significant – larger batch size increases memory linearly.
Practical calculation :
per_device_train_batch_size * gradient_accumulation_steps gives the effective batch size.
Personal experience : For limited GPU memory, start with batch size 1 and increase gradient accumulation steps (e.g., 8).
Chosen values : per‑device batch size = 1, gradient accumulation steps = 8.
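The effective batch size described above can be sketched in a few lines of Python, using the values chosen in this tutorial (the multi‑GPU factor is an assumption for the general case):

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_gpus: int = 1) -> int:
    """Samples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_gpus

# This tutorial's settings: per-device batch size 1, 8 accumulation steps.
print(effective_batch_size(1, 8))  # 8 samples per optimizer step
```

Gradient accumulation trades time for memory: only one sample's activations live on the GPU at a time, but the optimizer still sees an effective batch of 8.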
Cutoff length (Max Length)
Core concept : Maximum number of tokens a model can process per sample.
Memory impact : Increases roughly linearly with length; doubling length roughly doubles memory.
Personal experience : Set to 4096 for the security dataset (99.83% of samples < 4000 tokens); adjust based on hardware.
Chosen value : 4096.
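The "99.83% of samples fit" figure above comes from scanning the dataset's token counts. A minimal sketch of that check, with made‑up token counts standing in for a real tokenizer pass over the dataset:

```python
def coverage(token_counts, cutoff):
    """Fraction of samples whose token count fits within the cutoff."""
    fit = sum(1 for n in token_counts if n <= cutoff)
    return fit / len(token_counts)

# Hypothetical per-sample token counts; in practice, run the model's
# tokenizer over each training sample to collect these.
counts = [512, 900, 1500, 3800, 4200, 2200, 3100, 700, 1100, 2600]
print(f"{coverage(counts, 4096):.0%} of samples fit within 4096 tokens")
```

Samples longer than the cutoff are truncated during training, so pick a cutoff that covers nearly all of your data without paying for tokens you rarely need.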
LoRA rank
Core concept : Determines the capacity of the low‑rank adaptation matrix; higher rank gives more expressive power but uses more memory.
Plain explanation : Rank 4 = few “thinking templates”, stable but limited; rank 64 = many templates, powerful but prone to over‑fitting.
Personal experience : Start with rank 8‑16; avoid < 8 for small datasets.
Memory impact : Minor (≈0.5 GB for 7B model).
Chosen value : 8.
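LoRA's memory footprint stays small because it only trains two thin matrices (A and B) per target weight. A rough illustration, assuming a hypothetical 4096×4096 projection matrix (actual shapes vary by model):

```python
def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out weight:
    A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)

# One 4096x4096 projection at rank 8 (illustrative shape only).
p = lora_params(8, 4096, 4096)
print(p)  # 65536 trainable parameters for that single matrix
```

Doubling the rank doubles the adapter size, which is why even rank 64 adds well under a gigabyte for a 7B model, in line with the ≈0.5 GB figure quoted above for rank 8.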
Validation set proportion
Core concept : Fraction of data reserved for evaluating model performance during training.
Personal experience : Small datasets (<1000 samples) – 0.1‑0.2; large datasets (>10 000) – 0.05‑0.1.
Chosen value : 0.15 (431 samples).
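The 431‑sample figure above implies a dataset of roughly 2,870 samples (431 ÷ 0.15); the exact total is an assumption here. A quick sketch of the split arithmetic:

```python
def split_counts(total_samples: int, val_size: float):
    """Training/validation sample counts for a given split ratio."""
    val = round(total_samples * val_size)
    return total_samples - val, val

# Assuming ~2,873 total samples, a 0.15 split reserves ~431 for validation.
train, val = split_counts(2873, 0.15)
print(train, val)
```

The training-sample count matters later: it determines how many optimizer steps each epoch takes.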
Start fine‑tuning task
Preview command
After configuring the parameters, click “Preview command” to generate the full llamafactory-cli train command.
<code>llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path /root/autodl-tmp/Qwen/Qwen2.5-7B-Instruct \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template qwen \
--flash_attn auto \
--dataset_dir data \
--dataset security \
--cutoff_len 4096 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--max_samples 100000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--packing False \
--report_to none \
--use_swanlab True \
--output_dir /root/autodl-tmp/models/security007 \
--bf16 True \
--plot_loss True \
--trust_remote_code True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--optim adamw_torch \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all \
--swanlab_project security007 \
--swanlab_mode cloud \
--val_size 0.15 \
--eval_strategy steps \
--eval_steps 100 \
--per_device_eval_batch_size 1</code>
Training process
Click “Start training” and monitor the terminal output and the LLaMA Board loss curve.
Total steps = (training samples ÷ (batch size × gradient accumulation steps)) × epochs; any final incomplete accumulation batch is dropped.
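The step formula above can be checked in Python. The 2,442 training samples below are an assumed example (the dataset size after a 0.15 validation split); the batch settings are this tutorial's:

```python
def total_steps(train_samples: int, per_device_batch: int,
                grad_accum: int, epochs: int, num_gpus: int = 1) -> int:
    """Optimizer steps for the whole run; the trailing incomplete
    accumulation batch each epoch is dropped (floor division)."""
    effective = per_device_batch * grad_accum * num_gpus
    return (train_samples // effective) * epochs

# Assumed 2,442 training samples, batch size 1, 8 accumulation steps, 3 epochs.
print(total_steps(2442, 1, 8, 3))  # 915
```

Knowing the total up front helps sanity‑check save_steps and eval_steps: with save_steps = 100, this run would write about 9 checkpoints.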
Memory consumption estimation
The hardware used includes two 48 GB GPUs. Memory usage can be broken down into four parts:
Base model weights : 7 B parameters × 2 bytes (BF16) ≈ 14 GB.
Framework overhead : Roughly 1 GB.
LoRA adapters : Approx. 0.5 GB for rank 8.
Activations : With batch size 1 and a 4096‑token cutoff (≈4 K tokens per step), each 1 K tokens adds roughly 2.5 GB, so ≈10 GB.
<code>Base model weights: 14 GB
Framework overhead: 1 GB
LoRA adapters: 0.5 GB
Activations: 10 GB
---------------------
Total ≈ 25.5 GB</code>
Memory‑optimisation tricks: liger_kernel
Enabling liger_kernel rewrites key Transformer ops in Triton and fuses them, reducing activation memory. Experiments show memory growth drops from 2.5 GB per 1 K tokens to ~0.6 GB.
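Plugging the per‑1K‑token figures quoted above into the linear activation model gives a quick estimate of the savings (the figures are this tutorial's measurements, not universal constants):

```python
def activation_memory_gb(tokens: int, gb_per_1k_tokens: float) -> float:
    """Rough activation footprint: grows linearly with token count."""
    return tokens / 1024 * gb_per_1k_tokens

# Measured in this tutorial: ~2.5 GB per 1K tokens without liger_kernel,
# ~0.6 GB with it, at a 4096-token cutoff.
baseline = activation_memory_gb(4096, 2.5)    # 10.0 GB
with_liger = activation_memory_gb(4096, 0.6)  # ~2.4 GB
print(f"saved ≈ {baseline - with_liger:.1f} GB")
```

At longer cutoffs the gap widens further, since both curves are linear in token count but with very different slopes.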
Distributed memory optimisation: DeepSpeed
Multi‑GPU training still stores the full model on each card unless ZeRO (DeepSpeed) is used. DeepSpeed Stage 3 shards parameters across GPUs, dramatically reducing per‑GPU memory.
With Stage 3 on two RTX 4090 (24 GB) cards, memory per card is ~16.3 GB (total ≈32.6 GB), slightly higher than the theoretical 30.5 GB due to communication overhead.
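As a rough sanity check, a simplified per‑card estimate under ZeRO‑3 can be sketched in Python. It shards only the base weights and ignores communication buffers and kernel optimisations such as liger_kernel, so it will not reproduce the measured figure exactly; the component sizes are the single‑GPU estimates from the previous section:

```python
def zero3_per_gpu_gb(weights_gb: float, overhead_gb: float,
                     lora_gb: float, activ_gb: float, num_gpus: int) -> float:
    """Simplified ZeRO-3 model: base weights are sharded across GPUs,
    while framework overhead, LoRA adapters, and activations are still
    paid on every card. Communication buffers are ignored."""
    return weights_gb / num_gpus + overhead_gb + lora_gb + activ_gb

# This tutorial's single-GPU estimates: 14 GB weights, 1 GB overhead,
# 0.5 GB LoRA, 10 GB activations, spread over 2 GPUs.
print(zero3_per_gpu_gb(14, 1, 0.5, 10, 2))  # 18.5
```

The dominant remaining term is activations, which is why ZeRO sharding pairs well with activation‑memory techniques like liger_kernel rather than replacing them.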
Next episode will cover loss monitoring and model export/deployment.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.