Mastering Large Model Training: Practical Parameter Tuning from Beginner to Pro
This guide walks you through interpreting training logs and loss curves, diagnosing common issues such as oscillation, under‑fitting, and over‑fitting, and applying concrete adjustments to learning rate, LoRA settings, batch size, and epochs. Scenario‑specific strategies then turn a novice into a tuning expert.
1. Reading training logs and loss curves
The training log trainer_log.jsonl contains the fields loss, learning_rate, epoch and step. Plotting loss versus step reveals the model’s health.
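As a minimal sketch, assuming one JSON object per line in the format described above, the curve can be extracted with a few lines of Python (`read_loss_curve` is an illustrative helper name, not part of any training framework):

```python
import json

def read_loss_curve(path):
    """Parse a trainer_log.jsonl-style file into (step, loss) pairs."""
    steps, losses = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Keep only records that carry a training loss and a step index.
            if "loss" in record and "step" in record:
                steps.append(record["step"])
                losses.append(record["loss"])
    return steps, losses

# Example: write a tiny fake log, then read it back.
with open("trainer_log.jsonl", "w", encoding="utf-8") as f:
    for step, loss in [(10, 2.8), (20, 1.9), (30, 1.4)]:
        f.write(json.dumps({"step": step, "loss": loss,
                            "epoch": step / 30, "learning_rate": 1e-4}) + "\n")

steps, losses = read_loss_curve("trainer_log.jsonl")
print(steps, losses)  # [10, 20, 30] [2.8, 1.9, 1.4]
# To visualise: matplotlib.pyplot.plot(steps, losses)
```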
1.1 Ideal loss curve
Overall descending trend from a high initial value.
Fast decline early, then gradual flattening.
Smooth without large spikes.
Eventually reaches a stable plateau.
1.2 Abnormal loss signals
Training loss should continuously drop then stabilize; a rise or wild spikes indicate a problem.
Validation loss should track training loss; if training loss drops while validation loss rises, the signal is abnormal.
Loss gap (training vs. validation) should stay small; a gap larger than about 0.5, or validation loss climbing well above training loss, usually signals over‑fitting.
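The three rules of thumb above can be encoded as a quick checker. This is a sketch; the function name `diagnose` and the 0.5 gap threshold are illustrative heuristics, not a standard API:

```python
def diagnose(train_losses, val_losses, gap_threshold=0.5):
    """Heuristic health check on per-eval loss histories (oldest first)."""
    train_trend = train_losses[-1] - train_losses[0]
    val_trend = val_losses[-1] - val_losses[0]
    gap = val_losses[-1] - train_losses[-1]
    if train_trend < 0 and val_trend > 0:
        return "over-fitting: training loss falls while validation loss rises"
    if gap > gap_threshold:
        return "abnormal: validation loss far above training loss"
    if train_trend >= 0:
        return "abnormal: training loss is not decreasing"
    return "healthy"

print(diagnose([2.5, 1.8, 1.2], [2.6, 1.9, 1.4]))  # healthy
print(diagnose([2.5, 1.2, 0.6], [2.0, 2.2, 2.6]))  # over-fitting signal
```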
2. Typical training problems and solutions
2.1 Loss oscillation – step too large
Reduce learning rate (e.g., from 1e-4 to 5e-5 or 2e-5).
Increase gradient_accumulation_steps to stabilize gradient estimates.
Raise warmup_ratio so the learning rate ramps up slowly.
# before
learning_rate: 1.0e-4
gradient_accumulation_steps: 8
warmup_ratio: 0.1
# after
learning_rate: 5.0e-5 # reduced
gradient_accumulation_steps: 16 # increased
warmup_ratio: 0.2 # longer warmup
2.2 Under‑fitting – model not learning
Increase learning rate (e.g., to 2.0e-4).
Raise num_train_epochs for more passes.
Increase lora_rank or lora_alpha to give the adapter more capacity.
Verify data quality and correct labeling.
# before
learning_rate: 5.0e-5
num_train_epochs: 1.0
lora_rank: 4
lora_alpha: 8
# after
learning_rate: 2.0e-4 # higher
num_train_epochs: 3.0 # more epochs
lora_rank: 8 # larger rank
lora_alpha: 16 # larger alpha
2.3 Over‑fitting – validation loss rises
Decrease lora_alpha to weaken adapter impact.
Apply early stopping based on validation loss.
Increase validation set size (e.g., val_size from 0.1 to 0.2).
Optionally lower learning rate for finer updates.
# before
lora_alpha: 16
num_train_epochs: 5.0
val_size: 0.1
# after
lora_alpha: 8 # reduced
num_train_epochs: 2.0 # fewer epochs
val_size: 0.2 # larger validation set
2.4 Loss stagnates – training “stuck”
Check data pipeline (template‑model compatibility, correct field mapping).
Temporarily raise learning rate 5‑10× to see if loss moves.
Verify the learning‑rate scheduler is sensible (e.g., cosine or linear).
Run a quick sanity test on a tiny subset (100‑200 samples, 3‑5 epochs).
3. Core parameter tuning methodology
3.1 Learning rate
LoRA training typically works in the 5e-5 ~ 2e-4 range. Start with a moderate value (e.g., 1e-4) and adjust based on loss behaviour:
Severe oscillation → lower LR.
Slow decline or plateau → raise LR.
Otherwise keep or fine‑tune.
Learning‑rate schedules:
Constant – simple but inflexible.
Step decay – multiply by 0.9 every 1k steps.
Linear warmup + cosine decay – recommended for stable start and smooth finish.
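The recommended linear warmup plus cosine decay schedule is simple to write down. Below is a self-contained sketch (the function name `lr_at` and the `min_lr` floor are illustrative; real trainers expose this via a `lr_scheduler_type` setting):

```python
import math

def lr_at(step, total_steps, base_lr=1e-4, warmup_ratio=0.1, min_lr=0.0):
    """Linear warmup to base_lr over warmup_ratio of training,
    then cosine decay from base_lr down to min_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp: avoids the early instability of a full-size LR.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine from base_lr (progress=0) to min_lr (progress=1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at(0, total))     # tiny LR at the very start of warmup
print(lr_at(100, total))   # ≈ base_lr right after warmup ends
print(lr_at(1000, total))  # ≈ min_lr at the end of training
```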
3.2 LoRA parameters
The two key knobs are lora_rank and lora_alpha. Their ratio (alpha/rank) controls adapter strength. Typical scenarios:
Simple task / little data: rank 4‑8, alpha 8‑16, ratio 2:1.
Regular task: rank 8‑16, alpha 16‑32, ratio 2:1.
Complex / large data: rank 16‑32, alpha 16‑32, ratio 1:1.
Memory‑tight: rank 2‑4, alpha 4‑8, ratio 2:1.
Start with rank = 8, alpha = 16 (ratio 2:1) for typical tasks, then adjust up for more capacity or down to curb over‑fitting.
3.3 Batch‑related parameters
Effective batch size = per_device_train_batch_size × gradient_accumulation_steps. Recommended steps:
Maximise per_device_train_batch_size within GPU memory (1, 2, 4, 8 …).
Use gradient_accumulation_steps to reach a target effective batch (≥ 8, ideally 16‑32).
Suggested configurations per GPU memory:
8 GB: batch 1, grad‑acc 8‑16 → effective 8‑16.
16 GB: batch 1‑2, grad‑acc 8‑16 → effective 8‑32.
24 GB: batch 2‑4, grad‑acc 4‑8 → effective 8‑32.
40 GB+: batch 4‑8, grad‑acc 2‑4 → effective 8‑32.
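The table above reduces to one piece of arithmetic: pick the largest per-device batch that fits, then derive the accumulation steps from your target effective batch. A small helper sketch (`grad_accum_steps` is a hypothetical name; the multi-GPU factor is an assumption for the single-node case):

```python
def grad_accum_steps(target_effective_batch, per_device_batch, num_gpus=1):
    """Accumulation steps so that
    per_device_batch * num_gpus * steps >= target_effective_batch,
    rounded up to the nearest whole step."""
    per_step = per_device_batch * num_gpus
    return max(1, -(-target_effective_batch // per_step))  # ceiling division

# 8 GB card: per-device batch 1, target effective batch 16 -> accumulate 16.
print(grad_accum_steps(16, 1))  # 16
# 24 GB card: per-device batch 4, target effective batch 32 -> accumulate 8.
print(grad_accum_steps(32, 4))  # 8
```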
3.4 Training epochs
Use validation loss trend to decide when to stop:
Start with 1‑2 epochs.
If validation loss stops decreasing for several epochs, apply early stopping.
Under‑fitting → increase epochs; over‑fitting → decrease.
Set a high upper bound (e.g., 10) and enable an early‑stopping callback (e.g., stop after 5 consecutive non‑improving epochs).
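The stopping rule above is easy to state precisely. A sketch of the logic, decoupled from any trainer (real frameworks ship this as an early-stopping callback; the function below only simulates the decision on a list of per-epoch validation losses):

```python
def stop_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop: after
    `patience` consecutive epochs with no improvement over the best
    validation loss seen so far, or the last epoch if never triggered."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses)  # ran to the configured upper bound

# Best loss at epoch 3; five non-improving epochs later, stop at epoch 8.
print(stop_epoch([2.0, 1.5, 1.2, 1.3, 1.25, 1.3, 1.4, 1.5, 1.6, 1.7]))  # 8
```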
4. Scenario‑specific tuning strategies
4.1 Incremental pre‑training (knowledge acquisition)
Goal: add domain knowledge while preserving existing abilities. Use conservative settings:
# incremental pre‑training config
stage: pt
lora_rank: 8~16 # moderate rank
lora_alpha: 8~16 # keep 1:1 ratio
learning_rate: 5.0e-5~1.0e-4 # small LR
num_train_epochs: 1~2 # few epochs to avoid forgetting
warmup_ratio: 0.05~0.1 # short warmup
cutoff_len: 2048~4096 # longer context
4.2 Instruction fine‑tuning (format learning)
Goal: teach the model a specific dialogue or task format. More aggressive settings are safe:
# instruction fine‑tuning config
stage: sft
lora_rank: 8~32 # adjust to task complexity
lora_alpha: 8~32 # keep 1:1 or 2:1
learning_rate: 1.0e-4~2.0e-4 # larger LR
num_train_epochs: 2~5 # based on data size
warmup_ratio: 0.1~0.2 # longer warmup for format adaptation
cutoff_len: 1024~2048 # match instruction length
4.3 Choosing a training strategy by data volume
Large domain data (> 1 M) + instruction data → pre‑training then SFT.
Small domain data (< 10 k) but instruction data → direct SFT.
Domain data only → PT first, then fine‑tune on a generic instruction set (e.g., Alpaca) to restore instruction ability.
4.4 Resource‑constrained training (< 12 GB VRAM)
Prioritise feasibility:
# low‑memory config
lora_rank: 4
per_device_train_batch_size: 1
gradient_accumulation_steps: 16 # compensate batch size
cutoff_len: 1024 # shorter sequences
# optional: enable QLoRA to further cut memory
4.5 Pursuing maximum performance
When cost is not a concern, push all knobs:
# high‑performance config
lora_rank: 32~64
lora_alpha: 32~64
learning_rate: 2.0e-4
num_train_epochs: 3~5
lr_scheduler_type: cosine
warmup_ratio: 0.1
Run systematic learning‑rate sweeps and compare multiple lora_target selections (e.g., only q_proj and v_proj) to find the optimal combination.
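A sweep is just a grid search over configurations scored by final validation loss. The sketch below uses a toy `run_trial` stand-in (in practice it would launch a training run and return `eval_loss`; the grid values and the pretend scoring are purely illustrative):

```python
import itertools

def run_trial(learning_rate, lora_target):
    """Toy stand-in for a real training run that returns final eval loss.
    Pretends 1.5e-4 is the best LR and more target modules help slightly."""
    base = {
        ("q_proj", "v_proj"): 1.00,
        ("q_proj", "k_proj", "v_proj", "o_proj"): 0.95,
    }[lora_target]
    return base + abs(learning_rate - 1.5e-4) * 1000

grid_lr = [5e-5, 1e-4, 1.5e-4, 2e-4]
grid_targets = [
    ("q_proj", "v_proj"),
    ("q_proj", "k_proj", "v_proj", "o_proj"),
]

# Pick the (learning_rate, lora_target) pair with the lowest score.
best = min(itertools.product(grid_lr, grid_targets),
           key=lambda cfg: run_trial(*cfg))
print(best)  # (0.00015, ('q_proj', 'k_proj', 'v_proj', 'o_proj'))
```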
5. Final takeaways
Parameter tuning is a hands‑on skill that relies on reading loss curves, recognizing characteristic patterns, and applying targeted adjustments to learning rate, LoRA settings, batch size, and epochs. There is no universal “best” configuration; the optimal settings depend on data scale, hardware limits, and the specific training stage (knowledge acquisition vs. format learning). By following the diagnostic process and scenario‑specific recipes above, practitioners can move from blind trial‑and‑error to systematic, evidence‑based tuning.
Fun with Large Models
A master's graduate of Beijing Institute of Technology with four papers in top journals, and a former developer at ByteDance and Alibaba, now researching large models at a major state‑owned enterprise. Committed to sharing concise, practical experience in AI large‑model development, in the belief that large models will become as essential as the PC. Let's start experimenting now!
