Mastering Large Model Training: Practical Parameter Tuning from Beginner to Pro

This guide shows how to interpret training logs and loss curves, diagnose common issues such as oscillation, under‑fitting, and over‑fitting, and apply concrete adjustments to learning rate, LoRA settings, batch size, and epochs, with scenario‑specific strategies that take you from novice to tuning expert.


1. Reading training logs and loss curves

The training log trainer_log.jsonl contains the fields loss, learning_rate, epoch and step. Plotting loss versus step reveals the model’s health.
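A minimal sketch of reading trainer_log.jsonl for plotting, assuming each line is a JSON object with the fields named above (the plotting call is left as a comment so the parser itself needs only the standard library):

```python
import json

def read_trainer_log(path):
    """Parse trainer_log.jsonl into parallel lists of steps and losses.

    Assumes each line is a JSON object with "loss" and "step" fields;
    lines without a loss entry (e.g. eval-only records) are skipped.
    """
    steps, losses = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if "loss" not in record:
                continue
            steps.append(record["step"])
            losses.append(record["loss"])
    return steps, losses

# To plot, if matplotlib is installed:
# import matplotlib.pyplot as plt
# steps, losses = read_trainer_log("trainer_log.jsonl")
# plt.plot(steps, losses); plt.xlabel("step"); plt.ylabel("loss"); plt.show()
```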

1.1 Ideal loss curve

Overall descending trend from a high initial value.

Fast decline early, then gradual flattening.

Smooth without large spikes.

Eventually reaches a stable plateau.

1.2 Abnormal loss signals

Training loss should continuously drop then stabilize; a rise or wild spikes indicate a problem.

Validation loss should track training loss; if training loss drops while validation loss rises, the signal is abnormal.

Loss gap (training vs. validation) should be small; a gap larger than 0.5 or validation loss higher than training loss is abnormal.
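The three checks above can be automated. A sketch, assuming train and validation losses are collected as per-evaluation lists (the function name and return format are illustrative):

```python
def diagnose(train_loss, val_loss, gap_threshold=0.5):
    """Flag the abnormal loss signals described above.

    The 0.5 gap threshold follows the rule of thumb in the text;
    each check looks only at the two most recent evaluation points.
    """
    problems = []
    # Training loss should keep dropping; a rise is a warning sign
    if len(train_loss) >= 2 and train_loss[-1] > train_loss[-2]:
        problems.append("training loss is rising")
    # Training loss falling while validation loss rises -> over-fitting signal
    if (len(train_loss) >= 2 and len(val_loss) >= 2
            and train_loss[-1] < train_loss[-2] and val_loss[-1] > val_loss[-2]):
        problems.append("validation loss rising while training loss falls")
    # Gap between validation and training loss should stay small
    if val_loss and train_loss and val_loss[-1] - train_loss[-1] > gap_threshold:
        problems.append(f"loss gap exceeds {gap_threshold}")
    return problems
```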

2. Typical training problems and solutions

2.1 Loss oscillation – step too large

Reduce learning rate (e.g., from 1e-4 to 5e-5 or 2e-5).

Increase gradient_accumulation_steps to stabilise gradient estimates.

Raise warmup_ratio so the learning rate ramps up slowly.

# before
learning_rate: 1.0e-4
gradient_accumulation_steps: 8
warmup_ratio: 0.1

# after
learning_rate: 5.0e-5  # reduced
gradient_accumulation_steps: 16  # increased
warmup_ratio: 0.2  # longer warmup
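To decide whether the curve is truly oscillating rather than noisily descending, a simple heuristic is to count how often consecutive loss deltas flip sign. This scoring rule is my own illustration, not a standard metric:

```python
def oscillation_score(losses):
    """Fraction of consecutive loss deltas that flip sign.

    A score near 1.0 means the loss zig-zags at every step (oscillation);
    a smoothly decreasing curve scores near 0.
    """
    deltas = [b - a for a, b in zip(losses, losses[1:])]
    flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
    return flips / max(len(deltas) - 1, 1)
```

A score above roughly 0.5 on a recent window is a reasonable cue to try the learning-rate and gradient-accumulation adjustments above.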

2.2 Under‑fitting – model not learning

Increase learning rate (e.g., to 2.0e-4).

Raise num_train_epochs for more passes.

Increase lora_rank or lora_alpha to give the adapter more capacity.

Verify data quality and correct labeling.

# before
learning_rate: 5.0e-5
num_train_epochs: 1.0
lora_rank: 4
lora_alpha: 8

# after
learning_rate: 2.0e-4  # higher
num_train_epochs: 3.0  # more epochs
lora_rank: 8  # larger rank
lora_alpha: 16  # larger alpha

2.3 Over‑fitting – validation loss rises

Decrease lora_alpha to weaken adapter impact.

Apply early stopping based on validation loss.

Increase validation set size (e.g., val_size from 0.1 to 0.2).

Optionally lower learning rate for finer updates.

# before
lora_alpha: 16
num_train_epochs: 5.0
val_size: 0.1

# after
lora_alpha: 8   # reduced
num_train_epochs: 2.0  # fewer epochs
val_size: 0.2   # larger validation set

2.4 Loss stagnates – training “stuck”

Check data pipeline (template‑model compatibility, correct field mapping).

Temporarily raise learning rate 5‑10× to see if loss moves.

Verify the learning‑rate scheduler is sensible (e.g., cosine or linear).

Run a quick sanity test on a tiny subset (100‑200 samples, 3‑5 epochs).

3. Core parameter tuning methodology

3.1 Learning rate

LoRA training typically works in the 5e-5 ~ 2e-4 range. Start in the middle of that range (e.g., 1e-4) and adjust based on loss behaviour:

Severe oscillation → lower LR.

Slow decline or plateau → raise LR.

Otherwise keep or fine‑tune.

Learning‑rate schedules:

Constant – simple but inflexible.

Step decay – multiply by 0.9 every 1k steps.

Linear warmup + cosine decay – recommended for stable start and smooth finish.
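The recommended schedule can be written down directly. A framework-agnostic sketch; parameter names are illustrative, not a specific library's API:

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(int(total_steps * warmup_ratio), 1)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear ramp
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```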

3.2 LoRA parameters

The two key knobs are lora_rank and lora_alpha. Their ratio (alpha/rank) controls adapter strength. Typical scenarios:

Simple task / little data: rank 4‑8, alpha 8‑16, ratio 2:1.

Regular task: rank 8‑16, alpha 16‑32, ratio 2:1.

Complex / large data: rank 16‑32, alpha 16‑32, ratio 1:1.

Memory‑tight: rank 2‑4, alpha 4‑8, ratio 2:1.

Start with rank = 8, alpha = 16 (ratio 2:1) for typical tasks, then adjust up for more capacity or down to curb over‑fitting.
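Why the ratio matters: the LoRA update applied to a frozen weight is delta_W = (alpha/rank) · B @ A, so alpha/rank directly scales the adapter's contribution. A numeric sketch (shapes and init scale are illustrative; the zero-init of B follows standard LoRA practice):

```python
import numpy as np

rank, alpha = 8, 16
d_out, d_in = 64, 64
rng = np.random.default_rng(0)
A = rng.normal(size=(rank, d_in)) * 0.01   # LoRA "down" matrix, small random init
B = np.zeros((d_out, rank))                # LoRA "up" matrix, zero-init
scale = alpha / rank                       # 2.0 for the 2:1 starting point
delta_W = scale * (B @ A)                  # adapter update added to the frozen W
```

With B zero-initialized, delta_W starts at zero, so training begins from the base model's behaviour and the adapter's influence grows with scale.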

3.3 Batch‑related parameters

Effective batch size = per_device_train_batch_size × gradient_accumulation_steps. Recommended steps:

Maximise per_device_train_batch_size within GPU memory (1, 2, 4, 8 …).

Use gradient_accumulation_steps to reach a target effective batch (≥ 8, ideally 16‑32).

Suggested configurations per GPU memory:

8 GB: batch 1, grad‑acc 8‑16 → effective 8‑16.

16 GB: batch 1‑2, grad‑acc 8‑16 → effective 8‑32.

24 GB: batch 2‑4, grad‑acc 4‑8 → effective 8‑32.

40 GB+: batch 4‑8, grad‑acc 2‑4 → effective 8‑32.
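The two-step recipe above reduces to one line of arithmetic. A single-GPU sketch (with N GPUs, the effective batch also multiplies by N):

```python
def grad_accum_for_target(per_device_batch, target_effective=16):
    """Pick gradient_accumulation_steps so that
    per_device_train_batch_size * gradient_accumulation_steps >= target."""
    return -(-target_effective // per_device_batch)  # ceiling division
```

For example, an 8 GB card that only fits batch 1 needs grad‑acc 16 to reach an effective batch of 16.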

3.4 Training epochs

Use validation loss trend to decide when to stop:

Start with 1‑2 epochs.

If validation loss stops decreasing for several epochs, apply early stopping.

Under‑fitting → increase epochs; over‑fitting → decrease.

Set a high upper bound (e.g., 10) and enable an early‑stopping callback (e.g., stop after 5 consecutive non‑improving epochs).
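The early-stopping callback described above amounts to a patience counter on validation loss. A framework-agnostic sketch (class and method names are illustrative):

```python
class EarlyStopper:
    """Stop when validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss       # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1       # no improvement this epoch
        return self.bad_epochs >= self.patience
```

Call should_stop after each validation pass inside the loop that runs up to the high epoch upper bound.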

4. Scenario‑specific tuning strategies

4.1 Incremental pre‑training (knowledge acquisition)

Goal: add domain knowledge while preserving existing abilities. Use conservative settings:

# incremental pre‑training config
stage: pt
lora_rank: 8~16   # moderate rank
lora_alpha: 8~16   # keep 1:1 ratio
learning_rate: 5.0e-5~1.0e-4   # small LR
num_train_epochs: 1~2   # few epochs to avoid forgetting
warmup_ratio: 0.05~0.1   # short warmup
cutoff_len: 2048~4096   # longer context

4.2 Instruction fine‑tuning (format learning)

Goal: teach the model a specific dialogue or task format. More aggressive settings are safe:

# instruction fine‑tuning config
stage: sft
lora_rank: 8~32   # adjust to task complexity
lora_alpha: 8~32   # keep 1:1 or 2:1
learning_rate: 1.0e-4~2.0e-4   # larger LR
num_train_epochs: 2~5   # based on data size
warmup_ratio: 0.1~0.2   # longer warmup for format adaptation
cutoff_len: 1024~2048   # match instruction length

4.3 Choosing a training strategy by data volume

Large domain data (> 1 M) + instruction data → pre‑training then SFT.

Small domain data (< 10 k) but instruction data → direct SFT.

Domain data only → PT first, then fine‑tune on a generic instruction set (e.g., Alpaca) to restore instruction ability.

4.4 Resource‑constrained training (< 12 GB VRAM)

Prioritise feasibility:

# low‑memory config
lora_rank: 4
per_device_train_batch_size: 1
gradient_accumulation_steps: 16   # compensate batch size
cutoff_len: 1024   # shorter sequences
# optional: enable QLoRA to further cut memory

4.5 Pursuing maximum performance

When cost is not a concern, push all knobs:

# high‑performance config
lora_rank: 32~64
lora_alpha: 32~64
learning_rate: 2.0e-4
num_train_epochs: 3~5
lr_scheduler_type: cosine
warmup_ratio: 0.1

Run systematic learning‑rate sweeps and compare multiple lora_target selections (e.g., only q_proj and v_proj) to find the optimal combination.
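A sweep like that is just a cross product of candidate values. A sketch; the specific candidates are examples, not tested recommendations:

```python
import itertools

learning_rates = [5e-5, 1e-4, 2e-4]
lora_targets = [["q_proj", "v_proj"],
                ["q_proj", "k_proj", "v_proj", "o_proj"]]

# One run per (learning_rate, lora_target) combination
runs = [
    {"learning_rate": lr, "lora_target": ",".join(targets)}
    for lr, targets in itertools.product(learning_rates, lora_targets)
]
# 6 configurations; launch each and compare final validation loss
```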

5. Final takeaways

Parameter tuning is a hands‑on skill that relies on reading loss curves, recognizing characteristic patterns, and applying targeted adjustments to learning rate, LoRA settings, batch size, and epochs. There is no universal “best” configuration; the optimal settings depend on data scale, hardware limits, and the specific training stage (knowledge acquisition vs. format learning). By following the diagnostic process and scenario‑specific recipes above, practitioners can move from blind trial‑and‑error to systematic, evidence‑based tuning.

Tags: LoRA, parameter tuning, large models, AI training, hyperparameters, training loss
Written by Fun with Large Models

Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
