Why Scaling, Data, and Infra Matter More Than Reward Design in R1 Replication
This article distills two months of community attempts to reproduce DeepSeek R1, arguing that model scaling, high-quality data, robust training infrastructure, and careful hyper-parameter tuning matter more than reward-design tricks; it also catalogs common pitfalls and sketches future research directions.
Quick Recap
Two months after the R1 technical report, the open‑source community produced a flood of replication attempts; this piece summarizes personal and community insights gained during that period.
Base model and data distribution are all you need
Reproducing R1 showed divergent trends for the Qwen and LLaMA families. RL can only amplify capabilities a base model already holds latently; if the potential is absent, no reward scheme will conjure it. Data distribution matters just as much: well-curated datasets, such as the 57K-sample ORM set, enable stable training, whereas simply enlarging the sample count without quality control does not guarantee success.
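One widely used curation step, shown here as an illustrative sketch rather than the article's specific recipe, is filtering prompts by rollout pass rate so the training set is neither trivial nor impossible:

```python
def filter_by_pass_rate(prompt_stats: dict, low: float = 0.1, high: float = 0.9) -> list:
    """Keep prompts whose rollout pass rate lies strictly inside (low, high):
    always-solved prompts give no gradient signal under a 0/1 reward,
    and never-solved prompts contribute only noise."""
    kept = []
    for prompt, results in prompt_stats.items():
        rate = sum(results) / len(results)
        if low < rate < high:
            kept.append(prompt)
    return kept

stats = {
    "easy": [True] * 8,            # pass rate 1.0 -> dropped
    "hard": [False] * 8,           # pass rate 0.0 -> dropped
    "medium": [True, False] * 4,   # pass rate 0.5 -> kept
}
print(filter_by_pass_rate(stats))  # → ['medium']
```

The thresholds (0.1, 0.9) are placeholders; the point is that difficulty filtering, not raw volume, is what stabilizes training.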
Scaling is all you need
Results indicate that outcome-only ("correct answer") rewards scale better than process-plus-outcome rewards: R1's success does not invalidate PRM (process reward model) approaches, but it demonstrates that 100K ORM (outcome reward model) samples outperform 10K PRM samples. The prevailing development paradigm is "scaling + fine-tuning": first invest heavily in large-scale data, then gradually reduce cost and polish details.
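For concreteness, here is a minimal sketch of what an ORM-style rule-based reward looks like; the \boxed{} answer convention and the function names are illustrative assumptions, not R1's exact implementation:

```python
import re

def extract_final_answer(response: str):
    """Return the content of the last \\boxed{...} span, a common
    convention for final answers in long-COT math responses."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def outcome_reward(response: str, gold: str) -> float:
    """Rule-based outcome reward: 1.0 iff the final answer string matches
    the reference. Intermediate steps earn no credit, unlike a PRM."""
    answer = extract_final_answer(response)
    return 1.0 if answer is not None and answer == gold.strip() else 0.0

print(outcome_reward("... so the result is \\boxed{42}.", "42"))  # → 1.0
```

Because the signal is a single 0/1 per rollout, such rewards are trivial to scale to 100K+ samples, which is exactly the trade-off the paragraph above describes.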
Infra is all you need
RLHF and LLM-reasoning workloads differ mainly in prompt difficulty and response length. Reasoning workloads now generate responses of tens of thousands of tokens, stressing GPU memory and decoding time, which is why inference frameworks such as vLLM and SGLang have become standard for RLHF rollouts. Keeping a critic model the same size as the actor is also crucial for training stability, even though early impressions underestimated its impact.
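To see why tens-of-thousands-of-token responses stress memory, here is a back-of-envelope KV-cache estimate; the model dimensions are illustrative (roughly a 7B dense model without grouped-query attention), not measurements from any specific system:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-sequence KV-cache size: keys plus values (the factor of 2) for
    every layer and KV head, one entry per token, fp16 -> 2 bytes/element."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-like config: 32 layers, 32 KV heads, head_dim 128.
per_seq = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                         seq_len=32_768)
print(per_seq / 2**30)  # → 16.0 (GiB for a single 32K-token sequence)
```

Even a modest rollout batch of such sequences exceeds one GPU's memory, which is why paged-KV inference engines rather than a naive generate loop are used for the rollout phase.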
Hyper‑parameter is all you need
Many teams skipped extensive hyper-parameter searches because they lacked confidence in rule-based rewards. They therefore drew conclusions from a handful of settings (learning rate, KL coefficient, discount factor, sample count, choice of RL algorithm), often misreading noisy results as evidence that a method had failed.
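A search need not be exotic: even a small Cartesian grid over the settings listed above beats drawing conclusions from a single run. The axis values below are placeholders, not recommendations:

```python
from itertools import product

def sweep_configs(grid: dict) -> list:
    """Enumerate every combination of hyper-parameter values, so a negative
    result reflects the method rather than one unlucky configuration."""
    keys = list(grid)
    return [dict(zip(keys, vals)) for vals in product(*grid.values())]

grid = {
    "learning_rate": [1e-6, 5e-6, 1e-5],
    "kl_coef": [0.0, 1e-3, 1e-2],
    "rollouts_per_prompt": [4, 8],
}
configs = sweep_configs(grid)
print(len(configs))  # → 18 (3 * 3 * 2 runs)
```

Eighteen short runs is often affordable where one long run is not, and it separates "the algorithm fails" from "this learning rate fails".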
Common Pitfalls
Pitfall 1: Assuming base‑model reinforcement is the key
Each starting point trades exploration against format compliance:
Base-model RL: maximal exploration, but weak instruction-format following.
Long-COT model RL: strong format compliance, but initial outputs are already very long.
Instruct-model RL: a fixed thinking pattern limits exploration; these runs reported the poorest replication results.
Pitfall 2: Mistaking response‑length growth for progress
Longer responses often correlate with a larger exploration space, but increasing length alone does not guarantee higher quality. Tricks that merely pad length, such as inflating token probabilities or injecting noise, produce repetitive or nonsensical outputs without real improvement.
Pitfall 3: Logging too few metrics
Comprehensive logging costs essentially nothing at training time yet is indispensable for diagnosis. Useful metrics include:
Output homogeneity (entropy, edit distance, N‑gram repetition).
Response length under various conditions (correct vs. incorrect, with/without reflection).
Accuracy broken down by prompt, reflection pattern, and length thresholds.
Model anomalies (format violations, length overruns, repetition, language mixing).
Algorithm anomalies (clip frequency, overflow/underflow ratios).
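Two of these metrics are cheap enough to compute inline during rollout post-processing; a minimal sketch with illustrative function names:

```python
def ngram_repetition(tokens: list, n: int = 3) -> float:
    """Fraction of n-grams that repeat an earlier n-gram: 0.0 means all
    n-grams are distinct, values near 1.0 signal degenerate looping."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def mean_length_by_correctness(samples: list) -> dict:
    """Mean response length, split by whether the rule reward fired.
    Each sample is a dict: {'length': int, 'correct': bool}."""
    out = {}
    for label, flag in (("correct", True), ("incorrect", False)):
        lens = [s["length"] for s in samples if s["correct"] is flag]
        out[label] = sum(lens) / len(lens) if lens else 0.0
    return out

print(ngram_repetition("the cat sat the cat sat".split()))  # → 0.25
```

If incorrect responses are systematically longer than correct ones, length growth is likely rambling rather than deeper reasoning, which feeds directly into Pitfall 2.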
Pitfall 4: Forgetting the experiment goal
Researchers sometimes chase attractive numbers (e.g., high entropy) while losing sight of the ultimate objective: improving reward stability and encouraging high‑quality reasoning patterns, not merely inflating response length.
Future Outlook
Researchers with strong mathematical backgrounds should explore novel RL algorithms or theoretically analyze existing components such as clipping, KL loss, and advantage normalization.
Those focused on combining techniques can experiment with KL-loss inclusion, dynamic temperature scheduling, or curriculum learning over prompts.
Infrastructure-savvy engineers should improve training-framework stability and address subtle precision differences between inference engines and the training forward pass (e.g., vllm vs. model.forward()).
Finally, extending the long-COT capabilities learned on code/math tasks to broader general-purpose reasoning remains a major challenge.
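On the precision point, one lightweight diagnostic (a sketch with illustrative names and tolerances, not a specific framework's API) is to re-score the inference engine's sampled tokens with the trainer's forward pass and compare per-token logprobs; a large gap means the PPO-style importance ratios are silently off-policy:

```python
def logprob_mismatch(train_lp: list, infer_lp: list, tol: float = 1e-2) -> dict:
    """Compare per-token logprobs of the same sampled tokens as scored by
    the training forward pass vs. the inference engine; large diffs flag
    a precision mismatch between the two implementations."""
    diffs = [abs(a - b) for a, b in zip(train_lp, infer_lp)]
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "tokens_over_tol": sum(d > tol for d in diffs),
    }

report = logprob_mismatch([-1.00, -2.00, -0.50], [-1.00, -2.05, -0.50])
print(report["tokens_over_tol"])  # → 1
```

Logging this alongside the metrics from Pitfall 3 makes engine-side numerical drift visible before it corrupts a long training run.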
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.