Mastering LLM Supervised Fine‑Tuning: Practical Tips, Data Strategies, and Debugging
This article provides a comprehensive, experience‑driven guide to supervised fine‑tuning (SFT) of large language models, covering special tokens, latency considerations, data diversity and production, training frameworks and hyper‑parameters, over‑/under‑fitting diagnostics, and evaluation metrics such as helpfulness, honesty, and harmlessness.
Background
SFT differs from pre‑training mainly in data composition: it uses variable‑length examples, introduces new special tokens (e.g., <user>, <assistant>, <system>) to label roles, and focuses on teaching instruction‑following rather than pure knowledge acquisition.
Latency is roughly b + k·x where b is the first‑token cost (often ten times larger than per‑token cost k) and x is the number of generated tokens; prompt length therefore has a direct impact on inference speed.
Data Chapter
Data work accounts for about 90 % of an SFT engineer’s effort. The core principles are:
Prioritize data diversity (different task_type and data form ) over sheer quantity.
Ensure each example has a clear task_type label (e.g., "logic‑reasoning – commonsense", "translation").
Vary prompt expression, length, and answer length to prevent the model from memorizing positional patterns.
Include noisy or malformed prompts (robustness data) so the model can handle real‑world user input.
Prompt generation can reuse seed prompts from works like Self‑Instruct or be handcrafted; answers are often produced with strong models such as GPT‑4, then filtered for quality. For cost‑effective pipelines, a small model can be trained on a modest GPT‑4‑generated dataset and later used to generate larger volumes of data.
Training Chapter
Both DeepSpeed and Megatron work for SFT; DeepSpeed is convenient because many alignment libraries already target it and it integrates smoothly with AutoModelForCausalLM. Key hyper‑parameters to monitor include:
epoch, gradient_accumulation_steps, learning_rate, lr_scheduler_type, dropout
zero_stage, max_seq_len, offload, gradient_checkpointing, seq_parallel_size
weight_decay, per_device_train_batch_size, num_warmup_steps
Practical “alchemy” tips:
Small models need higher learning rates; large models need lower rates.
Typical data size: 10 k–100 k examples (≈1 % of pre‑training data).
Warm‑up the learning rate, try several scheduler types, and experiment with gradient accumulation values (16, 32, 64, 128).
Loss monitoring:
Track per‑task_type loss separately.
Special‑token loss drops quickly after a few steps.
Creative tasks often have higher loss than factual ones.
Typical loss values: 7B model ~2, 13B ~2, 72B ~1‑2; final loss around 0.5.
If loss rises steadily, suspect code bugs rather than data difficulty.
Over‑fitting is not always harmful; format over‑fitting (learning to output JSON, correct EOS) is desirable, while content over‑fitting (repeating the same answer) should be mitigated by diversifying data or adjusting task weighting.
Fit‑ness Issues
Under‑fitting manifests as the model failing to answer even training examples; solutions include more epochs, higher learning rates, or fixing data pipelines.
Over‑fitting appears as the model rigidly reproducing patterns; address it by pruning or augmenting the offending task_type data rather than tweaking generic hyper‑parameters.
The “sandwich rule” (learning‑rate bound) can help locate a learning rate that preserves base model capabilities while improving SFT performance.
Evaluation Chapter
Evaluation must use a high‑quality test set mirroring the training task types. Unlike pre‑training, SFT assessment follows the 3H principle: Helpfulness, Honesty, Harmlessness , or any custom metrics relevant to the product.
Two main evaluation modalities:
Automated scoring with LLMs (e.g., GPT‑4) requires carefully crafted prompts to avoid bias toward longer answers.
Human evaluation remains the gold standard; experienced evaluators can quickly spot systematic failures by memorizing common prompts.
Always compare new model versions against previous baselines to pinpoint which task_type or data change caused score shifts.
Conclusion
SFT is conceptually simple but demands disciplined data engineering, systematic debugging, and continuous evaluation. Mastery comes from understanding base model limits, iterating on data pipelines, and developing a feel for training dynamics rather than relying on heavy‑weight model architecture changes.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
