Practical Tips for CPT, SFT, and LoRA in Large Language Model Fine‑Tuning
This article shares hands‑on guidance on using continual pre‑training (CPT), supervised fine‑tuning (SFT), and LoRA adapters for large language models, covering dataset size requirements, learning‑rate scheduling, warm‑up ratios, epoch strategies, and practical routing choices based on real‑world experiments.
About CPT (Continual Pre‑Training)
The knowledge of a pre‑trained LLM originates from its initial PT phase; to inject new knowledge, CPT is a viable option provided you have a sufficiently large dataset (several billion tokens). If the dataset is small (e.g., only tens of thousands of examples), full‑parameter fine‑tuning is usually more effective.
Key hyper‑parameters:
Learning rate (LR) is critical. A too‑large LR hampers convergence and erodes existing capabilities; a too‑small LR may fail to learn new knowledge.
For datasets under ~100 B tokens, use a LR about 10 % of the maximum PT LR (e.g., 3e‑5 for a 7B model whose PT LR is ~3e‑4).
Scale LR with batch size: LR ∝ sqrt(batch size). Increasing the batch size 4× therefore allows roughly a 2× LR increase (see the sketch after this list).
Warm‑up ratio matters. Typical LLM pre‑training uses roughly 1 % of total training steps for warm‑up; for SFT, ~3 % is common, and for CPT a slightly larger ratio (e.g., 2–3 %) can smooth the transition to the new data distribution.
Sometimes an entire first epoch can be dedicated to warm‑up, which has been shown to work in practice (e.g., Qwen‑7B report).
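To make the LR and warm‑up rules above concrete, here is a minimal sketch using PyTorch and the transformers scheduler helper. The batch sizes, total step count, and the 2 % warm‑up ratio are illustrative assumptions, not values from the original experiments, and a tiny linear layer stands in for the actual LLM so the snippet runs as‑is.

```python
import math

import torch.nn as nn
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

model = nn.Linear(8, 8)  # stand-in for the actual LLM so the sketch is runnable

# Illustrative numbers: a 7B model whose pre-training peak LR was ~3e-4.
pt_peak_lr = 3e-4
cpt_lr = 0.1 * pt_peak_lr  # ~10% of the PT peak LR for a <100B-token CPT run

# sqrt batch-size scaling: a 4x larger batch permits roughly a 2x larger LR.
pt_batch, cpt_batch = 1024, 4096  # assumed batch sizes
cpt_lr *= math.sqrt(cpt_batch / pt_batch)  # 3e-5 * 2 = 6e-5

optimizer = AdamW(model.parameters(), lr=cpt_lr)

total_steps = 10_000                    # assumed length of the CPT run
warmup_steps = int(0.02 * total_steps)  # 2-3% warm-up for CPT, per the notes above
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
```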
About SFT (Supervised Fine‑Tuning)
Do not blindly trust the “three‑epoch” rule; a single epoch can already yield usable dialogue performance, though more epochs often improve evaluation scores. When resources are limited, a single pass may suffice, especially if starting from an existing SFT checkpoint (e.g., ChatGLM).
If the dataset is tiny (≈1 k examples), training for more epochs can help the model fit the data, but it also raises the risk of over‑fitting, so the benefit is limited.
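As a rough illustration of how these epoch and warm‑up choices might be expressed, here is a hedged sketch using Hugging Face TrainingArguments. Every value is an assumed placeholder to be tuned per setup, not a recommendation from the original experiments.

```python
from transformers import TrainingArguments

# Illustrative SFT settings reflecting the guidance above; tune for your setup.
args = TrainingArguments(
    output_dir="sft-checkpoint",
    num_train_epochs=1,             # one pass often suffices, especially from an SFT checkpoint
    learning_rate=2e-5,             # assumed value, well below the CPT/PT peak LR
    warmup_ratio=0.03,              # ~3% warm-up is common for SFT
    lr_scheduler_type="cosine",
    per_device_train_batch_size=8,  # assumed; scale LR if you change the batch size
    gradient_accumulation_steps=4,
    bf16=True,
)
```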
Technical Routing Options for Domain‑Specific LLMs
When building a domain‑specific model that diverges significantly from a general‑purpose chatbot, several pipelines are possible:
Option 1: Start from a PT model, perform CPT, then SFT on domain data only. Result: the model loses general dialogue ability; not recommended unless general ability is irrelevant.
Option 2: Start from a PT model, perform CPT, then SFT on a mixture of generic SFT data and domain data. Result: works well when domain tasks are similar to generic tasks (e.g., medical QA).
Option 3: Same as Option 2, but when the domain task format differs substantially from generic dialogue. Result: generic SFT answers may negatively affect the target task.
Option 4: PT → CPT → generic SFT → domain SFT. Result: introduces a gap between knowledge injection and the final fine‑tuning stage, potentially sub‑optimal.
Option 5: Start from an existing SFT model, then CPT, then domain SFT. Result: similar gap issue as Option 4.
Because none of the conventional pipelines fully satisfy all requirements, hybrid approaches are explored:
Combine CPT with a large amount of SFT dialogue data, with the question tokens masked out of the loss, for one epoch (see the loss‑masking sketch below).
Follow with a mixed epoch of high‑quality generic SFT data plus domain data.
Finish with a final epoch of pure domain data.
Adjust epoch counts per stage based on dataset size and importance (e.g., CPT 1 epoch, generic SFT 2 epochs, domain SFT 2 epochs).
These mixed strategies aim to balance knowledge injection, retention of general abilities, and domain specialization without excessive computational cost.
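A minimal sketch of the "masked questions" idea, assuming a Hugging Face tokenizer (gpt2 here is only a runnable stand‑in for your own model's tokenizer): question tokens are assigned the label -100, which PyTorch's cross‑entropy loss ignores, so only the answer tokens contribute gradient.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in; use your own model's tokenizer

def build_masked_example(question: str, answer: str, max_len: int = 1024):
    """Tokenize a QA pair so that only the answer tokens contribute to the loss."""
    q_ids = tokenizer(question, add_special_tokens=False)["input_ids"]
    a_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]
    a_ids = a_ids + [tokenizer.eos_token_id]

    input_ids = (q_ids + a_ids)[:max_len]
    # -100 is ignored by PyTorch's cross-entropy loss, so the "masked question"
    # adds no gradient; the model is trained on the answer tokens only.
    labels = ([-100] * len(q_ids) + a_ids)[:max_len]
    return {"input_ids": torch.tensor(input_ids), "labels": torch.tensor(labels)}

example = build_masked_example("Q: What is CPT?\nA:", " Continual pre-training ...")
```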
About LoRA
LoRA is treated as a capability plug‑in rather than a full‑parameter solution. Observations include:
GPU memory savings are noticeable, but overall training time may not decrease significantly.
LoRA excels at learning new output formats or domain topics but is less effective for injecting entirely new knowledge.
Typical workflow: PT → full‑parameter SFT → LoRA fine‑tuning.
For small datasets, use a low lora_rank (e.g., 8–32); for larger datasets, increase the rank to expand the side‑matrix capacity.
Applying LoRA to all linear projections (q_proj, k_proj, v_proj, up_proj, down_proj, etc.) is recommended. lora_alpha scales the side‑matrix impact (the LoRA update is multiplied by lora_alpha / lora_rank) and is usually set proportional to lora_rank: alpha ≈ 2 × rank is a common baseline, but a 1:1 ratio should also be tested.
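A hedged configuration sketch with the peft library, assuming a LLaMA‑style model (TinyLlama is used only as an example checkpoint); the rank and alpha values simply follow the heuristics above.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Example checkpoint (an assumption); any LLaMA-style model with q_proj/k_proj/... works.
base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

config = LoraConfig(
    r=16,             # 8-32 for small datasets; raise the rank for larger ones
    lora_alpha=32,    # alpha ~= 2 x rank as a baseline; a 1:1 ratio is worth testing
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small side matrices are trainable
```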
Community Additions
Additional observations from other practitioners:
In some models (e.g., ChatGLM), SFT and PT tasks differ, and mixing PT data into SFT does not always prevent catastrophic forgetting.
For very small datasets, a well‑designed prefix can yield significant gains even after >100 epochs.
LoRA’s memory savings are clear, but its interaction with zero‑redundancy optimizers can be counter‑productive.
Empirical results show LoRA can lag behind full‑parameter fine‑tuning by >4 % on certain metrics and is sensitive to layer‑wise learning‑rate schedules.
Rank values between 1–16 produce noticeable differences; 16–32 is generally a safe range, while very high ranks (e.g., 128) mainly affect convergence speed.
DPO and RLHF have different data‑quality requirements; when GPU memory is limited, sharing the frozen backbone between actor and critic via separate LoRA adapters can roughly halve memory usage.
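A minimal sketch of that adapter‑sharing idea using peft's named adapters. The checkpoint and adapter names are assumptions, and a real PPO critic would additionally need a value head; the point here is only that both roles reuse one frozen backbone.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# One frozen base model hosts both roles; only the small adapter weights differ.
base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                      target_modules=["q_proj", "v_proj"])

policy = get_peft_model(base_model, lora_cfg, adapter_name="actor")
policy.add_adapter("critic", lora_cfg)  # second adapter shares the frozen backbone

policy.set_adapter("actor")   # forward passes now use the actor's LoRA weights
# ... compute policy logits ...
policy.set_adapter("critic")  # switch roles without a second full model copy
# ... compute value estimates (a real critic also needs a value head on top) ...
```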