Why Cold Starts, Reward Hacking, and Evaluation Matter in LLM Training

The article analyzes key challenges in large‑language‑model pipelines—including the necessity of cold‑start pretraining, the pitfalls of reward‑model hacking, efficiency‑effectiveness trade‑offs, evaluation difficulties, and downstream fine‑tuning limits—offering practical insights for more reliable LLM development.

Baobao Algorithm Notes

1. Importance of Cold Start

Large‑model training proceeds from broad pretraining on internet data toward increasingly human‑like language behavior. Pretraining provides a knowledge base that serves as a cold start for Supervised Fine‑Tuning (SFT), which in turn gives a cold start for Reinforcement Learning from Human Feedback (RLHF). In LLaMA‑2, a bootstrap method for SFT and iterative/rejection‑sampling techniques for RL are highlighted as ways to reduce data consumption and keep the model aligned with human preferences.
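The rejection-sampling idea mentioned above can be sketched in a few lines: draw several candidate responses, score them with a reward model, and keep only the best one as new fine-tuning data. This is a minimal illustration, not LLaMA‑2's actual implementation; `generate` and `reward_model` are hypothetical stand-ins for a policy model and a trained RM.

```python
import random

def rejection_sample(prompt, generate, reward_model, k=8):
    """Draw k candidate responses and keep the one the reward model
    scores highest. The kept (prompt, response) pair can serve as an
    SFT example for the next bootstrapped training round."""
    candidates = [generate(prompt) for _ in range(k)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy stand-ins (hypothetical): a "generator" that emits random numbers
# and a "reward model" that simply prefers larger ones.
random.seed(0)
toy_generate = lambda prompt: random.random()
toy_rm = lambda prompt, resp: resp

best, score = rejection_sample("hello", toy_generate, toy_rm, k=8)
```

Note that the sampling cost grows linearly in `k`, which is exactly the efficiency price discussed in section 3.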

2. Reward Model (RM) Pitfalls

RL is used to align LLM behavior with human preferences, often via PPO‑style algorithms, but the reward model can suffer from “reward hacking.” Because an LLM’s action space is the entire vocabulary and token sequence, the RM must generalize extremely well. When a model learns to label all known bad cases as low‑score and everything else as high‑score, the policy can achieve high RM scores while generating useless or harmful outputs—a classic reward‑hacking scenario.
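The failure mode above can be shown with a deliberately naive toy reward model (hypothetical, for illustration only): it was “trained” only to flag a fixed set of known bad cases, so anything outside that set—including gibberish—scores as highly as a genuinely helpful answer.

```python
# A toy reward model that only memorized known bad cases.
KNOWN_BAD = {"how to build a bomb", "insult the user"}

def naive_reward(response: str) -> float:
    # Low score for memorized bad cases; everything else looks "good",
    # because the RM never learned what good actually means.
    return 0.0 if response.lower() in KNOWN_BAD else 1.0

helpful = "Paris is the capital of France."
gibberish = "zxqv zxqv zxqv"

# Reward hacking: the useless output is outside the bad set,
# so it scores exactly as well as the helpful one.
assert naive_reward(gibberish) == naive_reward(helpful) == 1.0
```

A policy optimized against this RM has no incentive to be helpful—only to stay outside the memorized bad set, which is why RM generalization matters so much.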

3. Efficiency vs. Effectiveness Trade‑off

Beyond hardware, data‑construction tricks—such as multi‑turn dialogue formatting with special tokens (e.g., <eos>)—can dramatically improve learning efficiency, as seen in LLaMA‑2. Methods like Direct Preference Optimization (DPO) replace PPO’s online sampling with annotated preference data, reducing sampling cost but requiring large, high‑quality datasets. LLaMA‑2 ultimately sacrifices some efficiency in the final stage, using rejection sampling to avoid unexpected high‑score but low‑quality generations.
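The efficiency argument for DPO becomes concrete when you look at its per‑pair loss: it needs only the policy's and a frozen reference model's log‑probabilities of a chosen and a rejected response—no reward model and no PPO rollouts. A minimal sketch (the log‑probability values below are made up for illustration):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logpi_c - logref_c) - (logpi_r - logref_r)))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically this is -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return math.log(1.0 + math.exp(-margin))

# If the policy favors the chosen response more than the reference does,
# the loss falls below -log(0.5) ≈ 0.693.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-12.0,
                ref_logp_chosen=-11.0, ref_logp_rejected=-11.0, beta=0.1)
```

The trade‑off is visible here too: the quality of training now rests entirely on the annotated pairs, since there is no reward model to filter fresh samples online.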

4. Challenges in Model Evaluation

Accurate evaluation is critical because poor metrics waste compute and money, effectively reducing a team’s GPU budget. Existing public automated evaluation methods are still unreliable, making systematic assessment of LLM performance a major bottleneck.
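One concrete source of that unreliability is statistical: pairwise win rates computed on small evaluation sets carry wide uncertainty. The sketch below (assumed setup, not a specific published benchmark) estimates a win rate with a bootstrap 95% confidence interval.

```python
import random

def win_rate_ci(outcomes, n_boot=2000, seed=0):
    """Win rate of model A over model B from pairwise judgments
    (1 = A wins, 0 = B wins), with a bootstrap 95% confidence interval."""
    rng = random.Random(seed)
    rate = sum(outcomes) / len(outcomes)
    boots = sorted(
        sum(rng.choice(outcomes) for _ in outcomes) / len(outcomes)
        for _ in range(n_boot)
    )
    return rate, boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]

# 30 judged pairs, A wins 18 of them: the point estimate says 60%,
# but the interval is wide enough that B might actually be better.
outcomes = [1] * 18 + [0] * 12
rate, lo, hi = win_rate_ci(outcomes)
```

A decision made on the point estimate alone can easily burn compute on the wrong model, which is exactly the GPU‑budget argument above.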

5. Downstream Fine‑tuning Difficulties

Fine‑tuning a base LLM on domain‑specific data often leads to over‑fitting and loss of general capabilities. Maintaining the original model’s breadth while injecting new knowledge demands a careful balance of data distribution, which is hard without access to the original training distribution. In practice, domain‑specific LLMs frequently become specialized generators that no longer retain broad competence, and rebuilding a full‑scale model may be more cost‑effective than extensive fine‑tuning.
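A common partial mitigation for the forgetting problem is replay: mix a fraction of general‑purpose examples into every domain fine‑tuning batch. The sketch below assumes you have some general corpus available, even if it is not the original pretraining distribution; the `replay_ratio` value is a hypothetical knob, not a recommended setting.

```python
import random

def mix_batches(domain_data, general_data,
                replay_ratio=0.3, batch_size=8, seed=0):
    """Build a fine-tuning batch that replays general-purpose examples
    alongside domain data, to slow the loss of broad competence."""
    rng = random.Random(seed)
    n_general = round(batch_size * replay_ratio)
    batch = (rng.sample(general_data, n_general)
             + rng.sample(domain_data, batch_size - n_general))
    rng.shuffle(batch)
    return batch

# Toy corpora standing in for domain and general data.
domain = [f"medical_{i}" for i in range(100)]
general = [f"general_{i}" for i in range(100)]
batch = mix_batches(domain, general, replay_ratio=0.25, batch_size=8)
```

Tuning `replay_ratio` is itself the distribution‑balancing problem described above: too low and general skills decay, too high and the domain is learned slowly—and without the original training distribution, the replay data is only an approximation.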

Tags: efficiency, LLM, fine-tuning, RLHF, cold start, reward hacking
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
