Beyond Cosine Decay: Fixed LR + Quick Decay Beats Traditional Schedules in LLM Training
The article analyzes why the traditional cosine decay learning-rate schedule hinders continued training of large language models, and shows that fixed-learning-rate strategies such as Warmup-Stable-Decay (WSD), Cooldown, Stochastic Weight Averaging (SWA), and the Schedule-Free Optimizer (SFO) can match or surpass cosine decay while remaining far friendlier to continued training and fine-tuning.
Why Cosine Decay Is Problematic
In pre-training large language models, batch size and learning rate are critical hyper-parameters. Most recent LLMs use cosine decay, but this schedule ties the decay horizon to the total number of training steps, which makes it unfriendly for continued training: by the end of pre-training the learning rate is already very low, so resuming with a large learning rate harms performance, while resuming with a learning rate that is too small slows convergence.
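To make that dependence concrete, here is a minimal sketch of cosine decay with linear warmup; the function and parameter names are illustrative, not taken from any cited codebase. Note that the entire shape is pinned to total_steps, which is exactly what makes extending a finished run awkward.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0, warmup_steps=0):
    """Cosine decay with optional linear warmup.

    The shape depends on total_steps: training longer than planned
    means re-deriving the whole schedule, and a finished run ends
    at min_lr, a poor starting point for continued training.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```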
Warmup‑Stable‑Decay (WSD) – A Simple Alternative
MiniCPM introduced the WSD strategy: a rapid warmup, a long phase at a constant learning rate, and a quick decay to a small learning rate at the end. Experiments on small-scale models show that WSD converges faster than cosine decay and can even outperform it, especially when the final 10% of steps are used for the rapid decay.
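A minimal sketch of the three phases follows. The exact decay shape is a tunable detail (the MiniCPM paper explores exponential forms); a linear decay is shown here for simplicity, and all names are illustrative.

```python
def wsd_lr(step, total_steps, peak_lr, min_lr, warmup_steps, decay_frac=0.1):
    """Warmup-Stable-Decay: warm up, hold peak_lr, then decay fast at the end.

    Checkpoints taken during the long stable phase still sit at peak_lr,
    so continued training can resume from them without a restart penalty.
    """
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:              # 1) rapid linear warmup
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:               # 2) long constant phase
        return peak_lr
    # 3) quick decay over the final decay_frac of steps (linear here)
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress
```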
Cooldown: Fixed LR with Short Decay
A follow-up scaling-law analysis of constant-learning-rate schedules with a short final decay (dubbed Cooldown) found that the optimal constant learning rate is about half of the cosine optimum, and that a decay length of 10-20% of total steps yields performance comparable to or better than cosine decay. When the training budget grows from 5B to 20B tokens, a decay as short as 5% of steps can still match cosine.
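The practical knob here is the cooldown fraction and shape. A minimal sketch, assuming the linear and "1-sqrt" decay shapes studied in the Cooldown paper; names and defaults are illustrative.

```python
import math

def cooldown_lr(step, total_steps, lr, cooldown_frac=0.2, shape="linear"):
    """Constant LR, then a short cooldown over the last cooldown_frac of steps."""
    cooldown_steps = int(total_steps * cooldown_frac)
    start = total_steps - cooldown_steps
    if step < start:
        return lr                            # long constant phase
    progress = (step - start) / max(1, cooldown_steps)
    if shape == "1-sqrt":                    # alternative shape from the paper
        return lr * (1 - math.sqrt(progress))
    return lr * (1 - progress)               # linear cooldown
```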
Stochastic Weight Averaging (SWA)
Proposed in 2018, SWA averages weights from multiple checkpoints, which allows training with a completely fixed learning rate. However, selecting the averaging window is non-trivial, and empirical results show that SWA often underperforms both Cooldown and cosine decay on small models.
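A minimal sketch using PyTorch's built-in torch.optim.swa_utils; the stand-in model, the fixed learning rate, and the window hyper-parameters below are illustrative assumptions, and the window choice (swa_start, swa_every) is exactly the non-trivial selection noted above.

```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(512, 512)       # stand-in for the LLM (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # fixed LR, no schedule

swa_model = AveragedModel(model)        # maintains a running average of weights
swa_start, swa_every = 8_000, 100       # averaging window: the hard-to-tune part

for step in range(10_000):
    loss = model(torch.randn(32, 512)).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step >= swa_start and step % swa_every == 0:
        swa_model.update_parameters(model)

# Evaluate and deploy swa_model (the averaged weights), not model.
```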
Schedule‑Free Optimizer (SFO)
Meta's recent Schedule-Free Optimizer modifies the optimizer itself to work with a fixed learning rate, drawing on Polyak-Ruppert averaging. While SFO can achieve stability comparable to cosine decay, it is sensitive to the AdamW beta hyper-parameters and generally performs slightly worse than Cooldown.
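A minimal usage sketch against Meta's open-source schedulefree package; the call signature and the train/eval convention are taken from its README and treated here as assumptions, so check them against the installed version.

```python
# pip install schedulefree  (Meta's reference implementation)
import torch
import schedulefree

model = torch.nn.Linear(512, 512)       # stand-in for the LLM (assumption)
optimizer = schedulefree.AdamWScheduleFree(
    model.parameters(),
    lr=3e-4,
    warmup_steps=1_000,
    betas=(0.9, 0.999),  # the beta sensitivity the article warns about
)

optimizer.train()        # schedule-free optimizers track train/eval mode
for step in range(1_000):
    loss = model(torch.randn(32, 512)).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

optimizer.eval()         # switch to the averaged weights before evaluation
```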
Practical Takeaways
Switching to WSD or Cooldown requires tuning the decay length (typically 10‑20% of total steps).
Adopting SWA demands careful selection of the averaging window.
Using SFO involves adjusting the optimizer’s beta parameter.
All of these newer strategies have been validated only on small-scale models; their effectiveness on large-scale LLMs, and whether they obey the same scaling laws, remain open questions.
References
DeepSeek LLM: Scaling Open‑Source Language Models with Longtermism – https://arxiv.org/abs/2401.02954
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies – https://arxiv.org/abs/2404.06395
Scaling Laws and Compute‑Optimal Training Beyond Fixed Training Durations – https://arxiv.org/abs/2405.18392
Averaging Weights Leads to Wider Optima and Better Generalization – https://arxiv.org/abs/1803.05407v3
The Road Less Scheduled – https://arxiv.org/abs/2405.15682