Mastering LLM Supervised Fine‑Tuning: Practical Tips, Data Strategies, and Debugging

This article provides a comprehensive, experience‑driven guide to supervised fine‑tuning (SFT) of large language models, covering special tokens, latency considerations, data diversity and production, training frameworks and hyper‑parameters, over‑/under‑fitting diagnostics, and evaluation metrics such as helpfulness, honesty, and harmlessness.

Baobao Algorithm Notes
Baobao Algorithm Notes
Baobao Algorithm Notes
Mastering LLM Supervised Fine‑Tuning: Practical Tips, Data Strategies, and Debugging

Background

SFT differs from pre‑training mainly in data composition: it uses variable‑length examples, introduces new special tokens (e.g., <user>, <assistant>, <system>) to label roles, and focuses on teaching instruction‑following rather than pure knowledge acquisition.

Latency is roughly b + k·x where b is the first‑token cost (often ten times larger than per‑token cost k) and x is the number of generated tokens; prompt length therefore has a direct impact on inference speed.

Data Chapter

Data work accounts for about 90 % of an SFT engineer’s effort. The core principles are:

Prioritize data diversity (different task_type and data form ) over sheer quantity.

Ensure each example has a clear task_type label (e.g., "logic‑reasoning – commonsense", "translation").

Vary prompt expression, length, and answer length to prevent the model from memorizing positional patterns.

Include noisy or malformed prompts (robustness data) so the model can handle real‑world user input.

Prompt generation can reuse seed prompts from works like Self‑Instruct or be handcrafted; answers are often produced with strong models such as GPT‑4, then filtered for quality. For cost‑effective pipelines, a small model can be trained on a modest GPT‑4‑generated dataset and later used to generate larger volumes of data.

Training Chapter

Both DeepSpeed and Megatron work for SFT; DeepSpeed is convenient because many alignment libraries already target it and it integrates smoothly with AutoModelForCausalLM. Key hyper‑parameters to monitor include:

epoch, gradient_accumulation_steps, learning_rate, lr_scheduler_type, dropout

zero_stage, max_seq_len, offload, gradient_checkpointing, seq_parallel_size

weight_decay, per_device_train_batch_size, num_warmup_steps

Practical “alchemy” tips:

Small models need higher learning rates; large models need lower rates.

Typical data size: 10 k–100 k examples (≈1 % of pre‑training data).

Warm‑up the learning rate, try several scheduler types, and experiment with gradient accumulation values (16, 32, 64, 128).

Loss monitoring:

Track per‑task_type loss separately.

Special‑token loss drops quickly after a few steps.

Creative tasks often have higher loss than factual ones.

Typical loss values: 7B model ~2, 13B ~2, 72B ~1‑2; final loss around 0.5.

If loss rises steadily, suspect code bugs rather than data difficulty.

Over‑fitting is not always harmful; format over‑fitting (learning to output JSON, correct EOS) is desirable, while content over‑fitting (repeating the same answer) should be mitigated by diversifying data or adjusting task weighting.

Fit‑ness Issues

Under‑fitting manifests as the model failing to answer even training examples; solutions include more epochs, higher learning rates, or fixing data pipelines.

Over‑fitting appears as the model rigidly reproducing patterns; address it by pruning or augmenting the offending task_type data rather than tweaking generic hyper‑parameters.

The “sandwich rule” (learning‑rate bound) can help locate a learning rate that preserves base model capabilities while improving SFT performance.

Evaluation Chapter

Evaluation must use a high‑quality test set mirroring the training task types. Unlike pre‑training, SFT assessment follows the 3H principle: Helpfulness, Honesty, Harmlessness , or any custom metrics relevant to the product.

Two main evaluation modalities:

Automated scoring with LLMs (e.g., GPT‑4) requires carefully crafted prompts to avoid bias toward longer answers.

Human evaluation remains the gold standard; experienced evaluators can quickly spot systematic failures by memorizing common prompts.

Always compare new model versions against previous baselines to pinpoint which task_type or data change caused score shifts.

Conclusion

SFT is conceptually simple but demands disciplined data engineering, systematic debugging, and continuous evaluation. Mastery comes from understanding base model limits, iterating on data pipelines, and developing a feel for training dynamics rather than relying on heavy‑weight model architecture changes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

data engineeringAILLMSFTTraining
Baobao Algorithm Notes
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.