A Beginner's Deep Dive into Large‑Model Training Parameters with LLaMAFactory

This article walks readers through the three major training methods—full‑parameter, LoRA, and QLoRA—explaining their memory costs, data requirements, and trade‑offs, then provides a line‑by‑line breakdown of LLaMAFactory configuration files, hyper‑parameter tuning guidelines, and the process for merging LoRA adapters into a deployable model.

Training Methods Overview

The guide compares three representative training approaches supported by LLaMAFactory:

Full‑parameter training

All model layers are updated, which yields the most thorough fine-tuning but demands the highest GPU memory (e.g., a LLaMA-7B model's weights alone occupy ~14 GB in FP16, the Adam optimizer states add roughly another ~28 GB, and once gradients, higher-precision master weights, and activations are counted the total can easily exceed 100 GB) and a massive, high-quality dataset. Insufficient or low-quality data can cause catastrophic forgetting, where the model loses its original language and reasoning abilities.
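
To see where the "over 100 GB" figure comes from, here is a minimal back-of-the-envelope sketch in Python. The byte sizes per value are assumptions (FP16 weights and gradients, FP32 Adam states and master weights in the common mixed-precision setup), and activation memory is ignored, so real usage is higher still.

def full_ft_memory_gb(params_billion, weight_bytes=2, grad_bytes=2,
                      optim_bytes=4, master_bytes=4):
    """Rough GB estimate for full-parameter fine-tuning with Adam (activations ignored)."""
    n = params_billion * 1e9
    weights = n * weight_bytes / 1e9          # model weights
    grads = n * grad_bytes / 1e9              # gradients
    adam = n * optim_bytes * 2 / 1e9          # two Adam state tensors (momentum and variance)
    master = n * master_bytes / 1e9           # FP32 master copy used in mixed precision
    return weights + grads + adam + master

# FP16 weights plus FP16 Adam states (the ~14 GB + ~28 GB framing above):
print(full_ft_memory_gb(7, optim_bytes=2, master_bytes=0))  # ≈ 56 GB
# The common mixed-precision setup with FP32 Adam states and master weights:
print(full_ft_memory_gb(7))                                 # ≈ 112 GB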

LoRA training

Only a lightweight adapter is trained while the base model remains frozen. This reduces memory consumption dramatically (a single consumer-grade GPU with 8–12 GB of VRAM can fine-tune a 7B model) and tolerates smaller or noisier datasets because the underlying knowledge is preserved. The adapter's expressive power is controlled by lora_rank and lora_alpha.
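
Under the hood, LoRA fine-tuning of this kind is typically implemented with the PEFT library; the sketch below is a hand-rolled illustration rather than LLaMAFactory's exact code, and the model path and hyper-parameter values are placeholders taken from the configuration discussed later.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model; its original weights stay frozen.
base = AutoModelForCausalLM.from_pretrained(
    "/workspace/Qwen2_5_0_5", trust_remote_code=True
)

# Attach a small trainable adapter; only these low-rank matrices receive gradients.
lora_cfg = LoraConfig(
    r=8,                                   # lora_rank: adapter dimensionality
    lora_alpha=16,                         # scaling factor, often 1-2x the rank
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # typically well under 1% of all weights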

QLoRA training

QLoRA additionally quantizes the base model to 4-bit NF4, cutting memory usage even further while retaining the same data-tolerance benefits as LoRA. Empirical results show QLoRA reaches >95% of LoRA's performance on most NLP tasks, though very precision-sensitive tasks may see slight degradation.
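
The 4-bit compression step corresponds roughly to loading the base model through bitsandbytes via transformers, as in the sketch below (the model path is a placeholder; the LoRA adapter itself stays in higher precision).

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the frozen base model to 4-bit NF4; compute still runs in bfloat16.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "/workspace/Qwen2_5_0_5",
    quantization_config=bnb_cfg,
    trust_remote_code=True,
)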

Choosing a Training Method

For most personal or small‑team projects, LoRA is recommended because it balances resource efficiency and performance. QLoRA is an option when GPU memory is extremely limited, but users should test for potential minor accuracy loss on sensitive tasks.

Configuration File Breakdown

The core configuration is divided into sections. Below are the most important fields and their meanings.

Model Section

model_name_or_path: /workspace/Qwen2_5_0_5
trust_remote_code: true

Specifies the local path of the base model and enables loading of any custom code required by the model.

Method Section

stage: sft
finetuning_type: lora
lora_rank: 8
lora_target: all

Sets the training stage to supervised fine-tuning, selects LoRA as the fine-tuning type, and defines the adapter rank and target layers ("all" applies LoRA to every linear layer and is the simplest choice; for finer control, list specific modules such as q_proj and v_proj).

Dataset Section

dataset: alpaca_zh_demo
template: qwen
cutoff_len: 2048
max_samples: 1000
preprocessing_num_workers: 16
dataloader_num_workers: 4

Points to a registered dataset, selects the Qwen chat template, sets the maximum token length, limits the number of samples, and configures parallel workers for preprocessing and loading.
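
For orientation, records in an alpaca-style dataset such as alpaca_zh_demo are simple instruction/input/output triples; the sample below is illustrative rather than an actual entry from that dataset.

# One alpaca-format record (made-up content for illustration).
sample = {
    "instruction": "Translate the following sentence into English.",
    "input": "今天天气很好。",
    "output": "The weather is nice today.",
}
# During preprocessing, the prompt and response are rendered with the qwen chat
# template and truncated to at most cutoff_len (2048) tokens.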

Output Section

output_dir: /workspace/test_sft/Qwen2_5_sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none

Defines where the LoRA adapter and checkpoints are saved, how often loss is logged, how often checkpoints are written, whether a loss curve is plotted, whether existing output is overwritten, whether only the model weights (rather than the full optimizer state) are saved, and disables reporting to external experiment trackers (report_to: none).

Training Hyper‑parameters

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
num_train_epochs: 1.0
learning_rate: 1.0e-4
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

Effective batch size = per_device_train_batch_size × gradient_accumulation_steps (1 × 8 = 8). The guide explains how each parameter influences gradient estimation, learning‑rate scheduling, mixed‑precision training, and checkpoint recovery.
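
A quick sanity check of how these values combine, assuming a single GPU and the max_samples limit of 1000 set above (with N GPUs the effective batch size scales by N):

per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 1.0
max_samples = 1000
warmup_ratio = 0.1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps   # 8
optimizer_steps = int(max_samples * num_train_epochs // effective_batch_size)      # 125
warmup_steps = int(warmup_ratio * optimizer_steps)                                 # 12
print(effective_batch_size, optimizer_steps, warmup_steps)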

Validation Configuration (optional)

# eval_dataset: alpaca_en_demo
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

When enabled, a held‑out set is used to monitor generalisation and avoid over‑fitting.

LoRA Adapter Merging

After fine‑tuning, the adapter must be merged into the base model for inference. The merge configuration looks like:

model_name_or_path: /workspace/Qwen2_5_0_5
adapter_name_or_path: /workspace/test_sft/Qwen2_5_sft
template: qwen
trust_remote_code: true

export_dir: /workspace/test_sft/Qwen2_5_sft_all
export_size: 5
export_device: cpu
export_legacy_format: false

Key points: the base model path, the adapter path, the chat template, and the export options. export_size caps each exported weight shard at roughly that many gigabytes; export_device can be cpu (safer but slower) or auto (GPU if enough memory is available); export_legacy_format set to false produces safetensors files, while true yields the older pytorch_model.bin format.
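
LLaMAFactory performs this export for you, but conceptually the merge is equivalent to the PEFT calls sketched below (paths reuse the configuration above; safe_serialization=True mirrors export_legacy_format: false).

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "/workspace/Qwen2_5_0_5", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "/workspace/Qwen2_5_0_5", trust_remote_code=True
)

# Load the trained adapter on top of the frozen base, then fold it into the weights.
model = PeftModel.from_pretrained(base, "/workspace/test_sft/Qwen2_5_sft")
merged = model.merge_and_unload()

# Write safetensors shards; max_shard_size plays the role of export_size.
merged.save_pretrained(
    "/workspace/test_sft/Qwen2_5_sft_all",
    safe_serialization=True,
    max_shard_size="5GB",
)
tokenizer.save_pretrained("/workspace/test_sft/Qwen2_5_sft_all")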

Practical Tuning Guidelines

lora_rank: 4–8 for models up to 7B, 8–16 for 13B models, lower (2–4) when memory is tight, higher (16–32) for very complex tasks.

lora_alpha: Often set equal to lora_rank as a baseline. Increase toward 2× the rank if loss decreases too slowly (under-fitting); decrease if validation loss rises (over-fitting). Keep the ratio α/r between 1 and 2.

learning_rate: A typical range for LoRA is 1e-4 to 2e-4. Reduce it if the loss oscillates or diverges; increase it slightly if the loss declines too slowly.

Batch size & gradient accumulation: Aim for an effective batch size of at least 8. When memory is limited, reduce per_device_train_batch_size first, then compensate with larger gradient_accumulation_steps.

Warm-up & scheduler: Use a 10% warm-up ratio with cosine decay to avoid early instability and smooth the final convergence (a quick sketch of this schedule follows below).
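
The warm-up-then-cosine-decay behaviour can be reproduced with the scheduler helper in transformers; the sketch below uses a dummy optimizer and the step counts computed earlier purely to illustrate the learning-rate curve.

import torch
from transformers import get_cosine_schedule_with_warmup

# Dummy parameter and optimizer, just to drive the scheduler.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-4)

total_steps = 125                       # optimizer steps from the earlier sanity check
warmup_steps = int(0.1 * total_steps)   # warmup_ratio = 0.1
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)

for step in range(total_steps):
    optimizer.step()
    scheduler.step()
    if step in (0, warmup_steps, total_steps - 1):
        print(step, scheduler.get_last_lr()[0])  # ramps up, then decays along a cosine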

Summary

The guide provides a complete, step‑by‑step reference for configuring LLaMAFactory, selecting an appropriate fine‑tuning method, understanding each hyper‑parameter’s impact, and merging the resulting LoRA adapter into a standalone model ready for deployment.

Tags: LoRA, QLoRA, training, hyperparameters, large-model, LLaMAFactory, model-merge
Written by Fun with Large Models

Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
