Why Cold Starts, Reward Hacking, and Evaluation Matter in LLM Training
The article analyzes key challenges in large‑language‑model pipelines—including the necessity of cold‑start pretraining, the pitfalls of reward‑model hacking, efficiency‑effectiveness trade‑offs, evaluation difficulties, and downstream fine‑tuning limits—offering practical insights for more reliable LLM development.
1. Importance of Cold Start
Large‑model training proceeds from broad pretraining on internet data toward increasingly human‑like language behavior. Pretraining provides a knowledge base that serves as a cold start for Supervised Fine‑Tuning (SFT), which in turn gives a cold start for Reinforcement Learning from Human Feedback (RLHF). In LLaMA‑2, a bootstrap method for SFT and iterative/rejection‑sampling techniques for RL are highlighted as ways to reduce data consumption and keep the model aligned with human preferences.
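As a concrete illustration of the rejection‑sampling loop mentioned above, here is a minimal Python sketch, not LLaMA‑2's actual training code: for each prompt, sample several candidate responses, keep only the one the reward model scores highest, and reuse the winners as the next round's fine‑tuning data. The `generate` and `score` callables are hypothetical stand‑ins for whatever generation and reward‑model scoring code a given stack provides.

```python
from typing import Callable, List, Tuple

def rejection_sampling_round(
    generate: Callable[[str], str],      # draws one sampled response for a prompt
    score: Callable[[str, str], float],  # reward-model score for (prompt, response)
    prompts: List[str],
    k: int = 4,
) -> List[Tuple[str, str]]:
    """Keep only the highest-scoring of k sampled responses per prompt.

    The returned (prompt, best_response) pairs can then be used as
    SFT-style training data for the next iteration, so the policy only
    imitates outputs the current reward model already prefers.
    """
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda c: score(prompt, c))
        selected.append((prompt, best))
    return selected
```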
2. Reward Model (RM) Pitfalls
RL is intended to further stabilize and align LLM behavior, typically with PPO‑style algorithms, but the reward model (RM) is vulnerable to “reward hacking.” Because an LLM’s action space is the entire vocabulary at every position of a token sequence, the RM must generalize extremely well. If the RM merely learns to assign low scores to known bad cases and high scores to everything else, the policy can earn high RM scores while generating useless or harmful outputs, a classic reward‑hacking scenario.
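The article does not prescribe a fix, but the standard mitigation in PPO‑style RLHF is to penalize divergence from the SFT reference model, so the policy cannot drift arbitrarily far just to please the RM. Below is a minimal sketch of that shaped reward; the variable names are illustrative and not taken from the article.

```python
def penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """
    Shaped reward used in standard PPO-style RLHF: the raw reward-model score
    minus a KL penalty that keeps the policy close to the SFT reference model.
    A policy that drifts far from the reference just to chase RM scores
    (reward hacking) pays a growing penalty.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the same
    sampled response under the current policy and the frozen reference model.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate
```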
3. Efficiency vs. Effectiveness Trade‑off
Beyond hardware, data‑construction tricks, such as formatting multi‑turn dialogues with special tokens (e.g., <eos>), can dramatically improve learning efficiency, as seen in LLaMA‑2. Methods like Direct Preference Optimization (DPO) replace PPO’s online sampling with offline annotated preference data, cutting sampling cost but requiring large, high‑quality preference datasets. LLaMA‑2 ultimately sacrifices some efficiency in the final stage, using rejection sampling to screen out generations that score high under the RM but are actually low quality.
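For reference, the published DPO objective fits in a few lines of PyTorch. This is a generic sketch of that loss, not code from the article: each tensor holds the summed log‑probability of a full response under the policy or the frozen reference model, so no online sampling is needed, and the cost shifts to collecting enough high‑quality preference pairs.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    Direct Preference Optimization loss over a batch of (chosen, rejected) pairs.
    Each argument is a 1-D tensor of summed response log-probabilities under the
    trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Maximize the implicit reward margin between chosen and rejected responses.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```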
4. Challenges in Model Evaluation
Accurate evaluation is critical because poor metrics waste compute and money, effectively reducing a team’s GPU budget. Existing public automated evaluation methods are still unreliable, making systematic assessment of LLM performance a major bottleneck.
5. Downstream Fine‑tuning Difficulties
Fine‑tuning a base LLM on domain‑specific data often leads to over‑fitting and loss of general capabilities. Maintaining the original model’s breadth while injecting new knowledge demands a careful balance of data distribution, which is hard without access to the original training distribution. In practice, domain‑specific LLMs frequently become specialized generators that no longer retain broad competence, and rebuilding a full‑scale model may be more cost‑effective than extensive fine‑tuning.
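One common way to approximate that balance when the original training mixture is unavailable is replay‑style data mixing: dilute the domain corpus with general‑purpose examples at an empirically tuned ratio. The helper below is a hypothetical sketch of that idea, not a recipe from the article.

```python
import random

def mix_finetuning_data(domain_examples, general_examples,
                        general_ratio=0.3, seed=0):
    """
    Replay-style data mixing: pad the domain-specific fine-tuning set with a
    fraction of general-purpose examples so the model keeps seeing a broad
    distribution while learning the new domain. The ratio is an empirical knob;
    without the original pretraining mixture it can only be approximated.
    """
    rng = random.Random(seed)
    # Number of general examples needed so they make up `general_ratio`
    # of the final mixed dataset.
    n_general = int(len(domain_examples) * general_ratio / (1 - general_ratio))
    sampled_general = rng.sample(general_examples,
                                 min(n_general, len(general_examples)))
    mixed = list(domain_examples) + sampled_general
    rng.shuffle(mixed)
    return mixed
```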
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.