What Scaling Laws Reveal About LLM Fine‑Tuning and RLHF Performance

This article reviews recent scaling‑law research on large‑language‑model fine‑tuning and RLHF, explaining how data quantity, model size, PET parameters, reward‑model size and KL‑penalty affect downstream performance and offering practical insights for efficient training.

With the rapid development of large-model technology, scaling laws have become a valuable tool for predicting performance, reducing experimental cost, and monitoring training progress. The article first outlines three main benefits of mastering scaling laws: forecasting final model quality in advance, running low-cost experiments on small models, and checking whether a large-scale pre-training run stays on track.

Fine‑Tuning Scaling Law

The paper "When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method" (Google, ICLR 2024) studies how fine-tuning data volume, model size, pre-training data amount, and the parameter budget of parameter-efficient tuning (PET) methods (prompt-tuning, LoRA) influence translation performance.

The authors model the relationship with a power‑law function, fitting parameters that reflect the importance of each factor.
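As a concrete illustration of this kind of fit, here is a minimal sketch that recovers the exponents of a multiplicative power law with SciPy. The functional form mirrors the spirit of the paper's joint scaling law, but the data, initial values, and exact parameterization are illustrative assumptions rather than the authors' setup.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical multiplicative power law, loosely mirroring the paper's form:
#   loss(X, D) = A / (X**alpha * D**beta) + E
# where X is the scaling factor under study (model size, pre-training data,
# or PET parameter count) and D is the fine-tuning data size.
def joint_power_law(xd, A, alpha, beta, E):
    X, D = xd
    return A / (X**alpha * D**beta) + E

rng = np.random.default_rng(0)
X = np.array([1.0, 1.0, 8.0, 8.0, 16.0, 16.0, 1.0, 8.0, 16.0])           # model size (B params)
D = np.array([10.0, 100.0, 10.0, 100.0, 10.0, 100.0, 30.0, 30.0, 30.0])  # fine-tuning data (k examples)
true_params = (4.0, 0.2, 0.15, 1.2)                                      # synthetic "ground truth"
loss = joint_power_law((X, D), *true_params) + rng.normal(0.0, 0.01, X.size)

# Fit the law to the observed (X, D, loss) points and read off the exponents.
popt, _ = curve_fit(joint_power_law, (X, D), loss, p0=[1.0, 0.1, 0.1, 1.0])
A, alpha, beta, E = popt
print(f"alpha (model-size exponent)       = {alpha:.3f}")
print(f"beta  (fine-tuning data exponent) = {beta:.3f}")
```

A larger fitted exponent means that scaling the corresponding factor buys more loss reduction, which is what "reflecting the importance of each factor" means here.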

Data Volume + Model Size

The fitted curve (solid line) and experimental points (dots) show that test perplexity consistently decreases as both data volume and model size increase, though the fit degrades for 16B models under PET, likely due to pre‑training issues.

Data Volume + Pre‑Training Data

Increasing pre-training data also improves downstream fine-tuning, but under the same compute budget, fine-tuning a larger model yields better results than simply adding more pre-training data, especially for translation tasks, which demand relatively little diversity.

Data Volume + PET Parameter Count

Increasing PET parameters provides only marginal gains. LoRA proves more stable than prompt‑tuning, while prompt‑tuning can exhibit inverse scaling at larger data volumes.
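For context on what "increasing PET parameters" means in practice, here is an illustrative sketch of how the trainable parameter budget is typically counted for the two methods; the hidden size, prompt length, rank, and layer count below are hypothetical, not the paper's settings.

```python
# Rough, illustrative parameter counts for the two PET methods discussed above.
# Assumes a generic transformer; actual counts depend on which weights are adapted.

def prompt_tuning_params(prompt_len: int, d_model: int) -> int:
    # Soft prompt: one trainable embedding vector per virtual token.
    return prompt_len * d_model

def lora_params(rank: int, d_model: int, adapted_matrices: int, n_layers: int) -> int:
    # Each adapted weight matrix gets two low-rank factors (d_model x r and r x d_model).
    return 2 * rank * d_model * adapted_matrices * n_layers

d_model, n_layers = 4096, 32
print(prompt_tuning_params(prompt_len=100, d_model=d_model))                          # ~0.4M params
print(lora_params(rank=8, d_model=d_model, adapted_matrices=2, n_layers=n_layers))    # ~4.2M params
```

Scaling the PET axis means increasing the prompt length or the LoRA rank; the finding above is that doing so yields only marginal gains compared with scaling the model or the fine-tuning data.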

Key Takeaways

Fine‑tuning follows a scaling law; large‑scale high‑quality data remains beneficial when downstream tasks are fixed.

Full-parameter fine-tuning (FMT) needs more data but can outperform PET; when data is limited, PET is preferable, with prompt-tuning working better at small data scales and LoRA at larger ones.

PET’s small parameter changes lead to better generalisation on similar tasks compared to full‑parameter fine‑tuning.

For clearly defined downstream tasks, using a strong base model with a small amount of data and PET fine‑tuning is a wise choice, though the optimal method remains highly task‑ and data‑dependent.

RLHF Scaling Law

OpenAI's 2022 paper (three authors, including John Schulman of PPO fame) investigates scaling behavior in Reinforcement Learning from Human Feedback (RLHF), focusing on reward-model (RM) size and KL divergence as the key factors.

"When a measure becomes a target, it ceases to be a good measure." (Goodhart's law)

RLHF suffers from over‑optimization: the reward model may be biased, and the policy can exploit reward shortcuts, leading to Goodhart’s law effects.

The derived scaling law relates model performance to RM size and KL‑divergence, enabling two practical applications:

Predicting, from how far the policy has drifted in KL, the point in training at which the true (gold) score peaks, which improves early-stopping decisions.

Estimating the best score achievable with a given RM, or conversely inferring the RM size required to reach a target score.
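As a sketch of what such a law looks like (the form below is recalled from the over-optimization literature, so treat its exact shape as an assumption rather than a quotation), the gold score is modelled as a function of the policy's divergence from its initialization:

$$
R_{\mathrm{RL}}(d) \;=\; d\,\bigl(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}}\,\log d\bigr),
\qquad d := \sqrt{\mathrm{KL}\!\left(\pi \,\Vert\, \pi_{\mathrm{init}}\right)}
$$

The fitted coefficients α and β vary with the RM's parameter count, which is exactly what enables the two applications above: the maximizer of R(d) gives the KL (and hence the training point) at which to stop, and its value there gives the best score attainable with that RM.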

The experiments use a synthetic setup with two reward models: a large "gold" RM that stands in for human preferences and provides the true score, and a smaller proxy RM trained against it that the policy actually optimizes, which keeps the study far cheaper than repeated human evaluation. The authors put KL divergence from the initial policy on the X-axis because RL optimizes both the reward and the KL penalty, so raw step or token counts are not a comparable measure of optimization progress across runs.

When the KL penalty coefficient is too large, the policy barely moves away from its initialization and the reward cannot improve; the KL penalty thus effectively acts as an early-stopping mechanism. The authors therefore omit it in the main experiments so the policy can drift freely and the effect of continued optimization can be observed.
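To make the mechanics concrete, here is a minimal sketch of where the KL penalty enters a typical RLHF reward and how the divergence from initialization used as the X-axis can be estimated from log-probabilities; the names and structure are illustrative, not the paper's code.

```python
import numpy as np

def shaped_reward(rm_score: float,
                  logp_policy: np.ndarray,  # per-token log-probs of the sampled response under the current policy
                  logp_init: np.ndarray,    # per-token log-probs of the same response under the frozen initial policy
                  kl_coef: float) -> float:
    # KL-penalized reward typically maximized in RLHF. With kl_coef = 0
    # (as in the setting described above) nothing restrains the policy, and
    # over-optimization shows up as the gold score peaking and then falling
    # while the proxy RM score keeps rising.
    token_kl = logp_policy - logp_init      # sample-based per-token KL estimate
    return rm_score - kl_coef * float(token_kl.sum())

def kl_distance_from_init(logp_policy: np.ndarray, logp_init: np.ndarray) -> float:
    # Divergence of the policy from its initialization, which grows
    # monotonically over training and is therefore a more comparable
    # X-axis than raw step counts (often plotted via its square root).
    kl = float((logp_policy - logp_init).sum())
    return float(np.sqrt(max(kl, 0.0)))
```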

Additional Influencing Factors

Increasing RM training data raises the gold score but does not produce a clear scaling pattern. Larger policy models perform better under the same RM, though over‑optimization can occur earlier for bigger models. The performance gap between proxy and gold RMs remains similar across model sizes.

The work provides concise, elegant conclusions and valuable experimental details for anyone studying RLHF scaling.

Overall Summary

Scaling laws are not only practical tools but also highlight the key factors—data quantity, model size, RM size, and KL‑penalty—that shape model performance across pre‑training, fine‑tuning, and alignment stages. Recent analyses, such as DeepSeek’s study of batch size and learning rate, further enrich our understanding, and future research is expected to expand these insights.
