Why Scaling, Data, and Infra Matter More Than Reward Design in R1 Replication

The article analyses two months of community attempts to reproduce DeepSeek R1, highlighting that model scaling, high‑quality data, robust training infrastructure, and careful hyper‑parameter tuning outweigh pure reward‑based tricks, and it outlines common pitfalls and future research directions.


Quick Recap

Two months after the R1 technical report, the open‑source community produced a flood of replication attempts; this piece summarizes personal and community insights gained during that period.

Base model and data distribution are all you need

Reproducing R1 showed divergent trends for the Qwen and LLaMA model families. RL is meant to unlock a model's latent potential, so a model without that potential cannot be improved by it. Data distribution matters just as much: well-curated datasets such as the 57K-sample ORM set enable stable training, while merely increasing sample count without quality does not guarantee success.
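
As an illustration of the data-quality point, here is a minimal sketch of filtering RL prompts by base-model pass rate so that every kept prompt carries learnable signal; the `sample_fn` and `check_fn` helpers and the thresholds are assumptions made for this sketch, not the article's pipeline.

```python
def filter_prompts_by_pass_rate(prompts, sample_fn, check_fn,
                                n_samples=8, low=0.0, high=1.0):
    """Keep prompts the base model solves sometimes, but not always or never.

    sample_fn(prompt, n) -> list[str]   : draws n responses (assumed helper)
    check_fn(prompt, response) -> bool  : rule-based answer check (assumed helper)
    """
    kept = []
    for prompt in prompts:
        responses = sample_fn(prompt, n_samples)
        pass_rate = sum(check_fn(prompt, r) for r in responses) / n_samples
        # A 0% pass rate gives no learnable signal; 100% leaves no headroom for RL.
        if low < pass_rate < high:
            kept.append(prompt)
    return kept
```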

Scaling is all you need

Results indicate that "correct answer" performance scales better than "process‑correct + answer" performance; R1's success does not invalidate PRM approaches but demonstrates that 100K ORM samples outperform 10K PRM samples. The prevailing development paradigm is "scaling + fine‑tuning": first invest heavily in large‑scale data, then gradually reduce cost and polish details.
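
To make the ORM/PRM distinction concrete, here is a minimal sketch of how rewards attach to a sampled trajectory under each scheme; the `step_scorer` verifier is an assumed stand-in for a learned process reward model, not something described in the article.

```python
def orm_rewards(steps, final_answer_correct):
    """Outcome reward (ORM): one sparse reward on the final step only."""
    if not steps:
        return []
    rewards = [0.0] * len(steps)
    rewards[-1] = 1.0 if final_answer_correct else 0.0
    return rewards

def prm_rewards(steps, step_scorer):
    """Process reward (PRM): a score for every intermediate reasoning step.

    step_scorer(step) -> float in [0, 1] stands in for a learned verifier,
    which is precisely the labeling cost that is harder to scale.
    """
    return [step_scorer(step) for step in steps]
```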

Infra is all you need

RLHF and LLM-reasoning RL differ mainly in prompt difficulty and response length. Modern workloads now generate responses of tens of thousands of tokens, stressing GPU memory and decoding time. Frameworks like vllm and sglang have become standard for RLHF rollout generation. Maintaining a critic model of the same size as the actor is crucial for training stability, even though early impressions underestimated its impact.
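
A hedged sketch of the rollout side of such a loop using vllm as the generation engine; the model name and sampling settings below are placeholders, not the article's configuration.

```python
from vllm import LLM, SamplingParams

# Placeholder model and sampling settings, not the article's setup.
llm = LLM(model="Qwen/Qwen2.5-7B")
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=32768)

prompts = ["Prove that the sum of two odd integers is even. Think step by step."]
for out in llm.generate(prompts, params):
    # Rollouts of tens of thousands of tokens are what stress GPU memory
    # and decoding time in reasoning-style RLHF.
    print(len(out.outputs[0].text))
```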

Hyper‑parameter is all you need

Many teams skipped extensive hyper‑parameter searches because they lacked confidence in rule‑based rewards. Consequently, they relied on a handful of settings (learning rate, KL coefficient, discount factor, sample count, RL algorithm) to draw conclusions, often misinterpreting noisy results as evidence of methodological failure.
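
For concreteness, a sketch of the handful of knobs in question; every value below is an illustrative placeholder rather than a recommendation from the article.

```python
from dataclasses import dataclass

@dataclass
class RLConfig:
    # Illustrative defaults only; the article does not prescribe these values.
    learning_rate: float = 1e-6
    kl_coef: float = 1e-3         # KL coefficient against the reference policy
    gamma: float = 1.0            # discount factor (often 1.0 for single-turn reasoning)
    rollouts_per_prompt: int = 8  # sample count per prompt
    algorithm: str = "grpo"       # e.g. "ppo", "grpo", "rloo"

# Even a tiny grid over two knobs multiplies the number of runs, which is why
# many teams ended up judging the method from only a handful of settings.
sweep = [RLConfig(learning_rate=lr, kl_coef=kl)
         for lr in (1e-6, 3e-6)
         for kl in (0.0, 1e-3)]
```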

Common Pitfalls

Pitfall 1: Assuming base‑model reinforcement is the key

Reinforcing a base model offers maximal exploration but poor format-following ability; reinforcing a long-CoT model improves format compliance but yields overly long initial outputs; reinforcing an instruct model limits exploration because its thinking pattern is already fixed.

Base-model RL: high exploration, weak format following.

Long-CoT model RL: strong format compliance, long initial outputs.

Instruct model RL: limited exploration, poorest replication results.

Pitfall 2: Mistaking response‑length growth for progress

Longer responses often correlate with larger exploration space, but increasing length alone does not guarantee higher quality. Simply inflating token probabilities or adding noise to boost length can produce repetitive or nonsensical outputs without real improvement.

Pitfall 3: Logging too few metrics

Comprehensive logging is essential, and it adds essentially no training overhead, so there is no excuse to skip it. Useful metrics include (a minimal sketch of two of them follows the list):

Output homogeneity (entropy, edit distance, N‑gram repetition).

Response length under various conditions (correct vs. incorrect, with/without reflection).

Accuracy broken down by prompt, reflection pattern, and length thresholds.

Model anomalies (format violations, length overruns, repetition, language mixing).

Algorithm anomalies (clip frequency, overflow/underflow ratios).
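
A minimal sketch of two of these metrics, n-gram repetition and response length split by correctness; the per-sample schema is an assumption made for the sketch, not the article's logging format.

```python
def ngram_repetition(tokens, n=4):
    """Fraction of repeated n-grams; high values flag degenerate, repetitive outputs."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def length_by_correctness(samples):
    """Mean response length split by correctness.

    Each sample is assumed to look like {"tokens": [...], "correct": bool};
    the schema is illustrative only.
    """
    buckets = {"correct": [], "incorrect": []}
    for s in samples:
        buckets["correct" if s["correct"] else "incorrect"].append(len(s["tokens"]))
    return {k: (sum(v) / len(v) if v else 0.0) for k, v in buckets.items()}
```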

Pitfall 4: Forgetting the experiment goal

Researchers sometimes chase attractive numbers (e.g., high entropy) while losing sight of the ultimate objective: improving reward stability and encouraging high‑quality reasoning patterns, not merely inflating response length.

Future Outlook

Researchers with strong mathematical backgrounds should explore novel RL algorithms or theoretically analyze existing components such as clipping, KL loss, and advantage normalization. Those focused on combinatorial aspects can experiment with KL loss inclusion, dynamic temperature scheduling, or curriculum learning for prompts. Infrastructure‑savvy engineers should improve training framework stability and address subtle precision differences between inference engines (e.g., vllm vs. model.forward()). Finally, extending long‑COT capabilities learned from code/math tasks to broader general‑purpose reasoning remains a major challenge.
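
As a starting point for that kind of analysis, here is a sketch of a PPO-style clipped loss with advantage normalization and an optional KL penalty; the tensor shapes and the KL estimator are assumptions for the sketch, not a prescription from the article.

```python
import torch

def clipped_policy_loss(logprobs, old_logprobs, advantages,
                        ref_logprobs=None, clip_eps=0.2, kl_coef=1e-3):
    """PPO-style clipped loss with advantage normalization and an optional KL penalty.

    All tensors are per-token log-probabilities / advantages of identical shape;
    the k1-style KL estimate below is one of several reasonable choices.
    """
    # Advantage normalization: one of the components the article suggests analyzing.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()

    if ref_logprobs is not None:
        # Penalize drift from the frozen reference policy (the "KL loss" knob).
        loss = loss + kl_coef * (logprobs - ref_logprobs).mean()
    return loss
```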

LLM, DeepSeek, RLHF, Infrastructure, Scaling, hyperparameters, reproducibility
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
