Beyond Cosine Decay: Fixed LR + Quick Decay Beats Traditional Schedules in LLM Training
The article analyzes why the traditional cosine decay learning-rate schedule hinders continued training of large language models, and shows that fixed-learning-rate strategies such as Warmup-Stable-Decay (WSD), Cooldown, Stochastic Weight Averaging (SWA), and the Schedule-Free Optimizer (SFO) can match or surpass cosine decay while remaining far friendlier to continued training and fine-tuning.
Why Cosine Decay Is Problematic
In pre-training large language models, batch size and learning rate are critical hyper-parameters. Most recent LLMs use cosine decay, but this schedule ties the decay horizon to the total number of training steps, which makes it unfriendly for continued training: by the end of pre-training the learning rate is already very low, so resuming with a large learning rate harms performance, while resuming with a learning rate that is too small slows convergence.
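To make that dependence concrete, here is a minimal sketch of cosine decay with linear warmup; the function and parameter names are illustrative, not taken from any cited codebase. Note that the entire shape is pinned to total_steps, which is exactly what makes extending a finished run awkward.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0, warmup_steps=0):
    """Cosine decay with optional linear warmup.

    The shape depends on total_steps: training longer than planned
    means re-deriving the whole schedule, and a finished run ends
    at min_lr, a poor starting point for continued training.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```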
Warmup‑Stable‑Decay (WSD) – A Simple Alternative
MiniCPM introduced the WSD strategy: a rapid warmup, a long phase at a constant learning rate, and a quick decay to a small learning rate at the end. Experiments on small-scale models show that WSD converges faster than cosine decay and can even outperform it, especially when the final 10% of steps are used for the rapid decay.
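A minimal sketch of the three phases follows. The exact decay shape is a tunable detail (the MiniCPM paper explores exponential forms); a linear decay is shown here for simplicity, and all names are illustrative.

```python
def wsd_lr(step, total_steps, peak_lr, min_lr, warmup_steps, decay_frac=0.1):
    """Warmup-Stable-Decay: warm up, hold peak_lr, then decay fast at the end.

    Checkpoints taken during the long stable phase still sit at peak_lr,
    so continued training can resume from them without a restart penalty.
    """
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:              # 1) rapid linear warmup
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:               # 2) long constant phase
        return peak_lr
    # 3) quick decay over the final decay_frac of steps (linear here)
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress
```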
Cooldown: Fixed LR with Short Decay
A follow-up scaling-law analysis of constant-learning-rate schedules with a short final decay (dubbed Cooldown) found that the optimal constant learning rate is about half of the cosine optimum, and that a decay length of 10-20% of total steps yields performance comparable to or better than cosine decay. When the training budget grows from 5B to 20B tokens, a decay as short as 5% of steps can still match cosine.
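The practical knob here is the cooldown fraction and shape. A minimal sketch, assuming the linear and "1-sqrt" decay shapes studied in the Cooldown paper; names and defaults are illustrative.

```python
import math

def cooldown_lr(step, total_steps, lr, cooldown_frac=0.2, shape="linear"):
    """Constant LR, then a short cooldown over the last cooldown_frac of steps."""
    cooldown_steps = int(total_steps * cooldown_frac)
    start = total_steps - cooldown_steps
    if step < start:
        return lr                            # long constant phase
    progress = (step - start) / max(1, cooldown_steps)
    if shape == "1-sqrt":                    # alternative shape from the paper
        return lr * (1 - math.sqrt(progress))
    return lr * (1 - progress)               # linear cooldown
```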
Stochastic Weight Averaging (SWA)
Proposed in 2018, SWA averages weights from multiple checkpoints, which allows training with a completely fixed learning rate. However, selecting the averaging window is non-trivial, and empirical results show that SWA often underperforms both Cooldown and cosine decay on small models.
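A minimal sketch using PyTorch's built-in torch.optim.swa_utils; the stand-in model, the fixed learning rate, and the window hyper-parameters below are illustrative assumptions, and the window choice (swa_start, swa_every) is exactly the non-trivial selection noted above.

```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(512, 512)       # stand-in for the LLM (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # fixed LR, no schedule

swa_model = AveragedModel(model)        # maintains a running average of weights
swa_start, swa_every = 8_000, 100       # averaging window: the hard-to-tune part

for step in range(10_000):
    loss = model(torch.randn(32, 512)).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step >= swa_start and step % swa_every == 0:
        swa_model.update_parameters(model)

# Evaluate and deploy swa_model (the averaged weights), not model.
```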
Schedule‑Free Optimizer (SFO)
Meta's recent Schedule-Free Optimizer modifies the optimizer itself to work with a fixed learning rate, drawing on Polyak-Ruppert averaging. While SFO can achieve stability comparable to cosine decay, it is sensitive to the AdamW beta hyper-parameters and generally performs slightly worse than Cooldown.
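A minimal usage sketch against Meta's open-source schedulefree package; the call signature and the train/eval convention are taken from its README and treated here as assumptions, so check them against the installed version.

```python
# pip install schedulefree  (Meta's reference implementation)
import torch
import schedulefree

model = torch.nn.Linear(512, 512)       # stand-in for the LLM (assumption)
optimizer = schedulefree.AdamWScheduleFree(
    model.parameters(),
    lr=3e-4,
    warmup_steps=1_000,
    betas=(0.9, 0.999),  # the beta sensitivity the article warns about
)

optimizer.train()        # schedule-free optimizers track train/eval mode
for step in range(1_000):
    loss = model(torch.randn(32, 512)).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

optimizer.eval()         # switch to the averaged weights before evaluation
```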
Practical Takeaways
Switching to WSD or Cooldown requires tuning the decay length (typically 10‑20% of total steps).
Adopting SWA demands careful selection of the averaging window.
Using SFO involves adjusting the optimizer’s beta parameter.
All of these newer strategies have been validated only on small-scale models; their effectiveness on large-scale LLMs, and whether they obey the same scaling laws, remain open questions.
References
DeepSeek LLM: Scaling Open‑Source Language Models with Longtermism – https://arxiv.org/abs/2401.02954
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies – https://arxiv.org/abs/2404.06395
Scaling Laws and Compute‑Optimal Training Beyond Fixed Training Durations – https://arxiv.org/abs/2405.18392
Averaging Weights Leads to Wider Optima and Better Generalization – https://arxiv.org/abs/1803.05407v3
The Road Less Scheduled – https://arxiv.org/abs/2405.15682