Why the Once‑Rejected PPO Algorithm Became a Pillar of Modern LLM Training

The article recounts how Proximal Policy Optimization, initially dismissed by NeurIPS 2017 for limited novelty, later became a cornerstone of RLHF and large‑language‑model training, illustrating how academic evaluation can miss long‑term impact, with parallels to other once‑rejected breakthroughs such as LSTM, SIFT and Dropout.

Machine Heart

PPO (Proximal Policy Optimization) is now a classic algorithm widely used in RLHF and large‑language‑model (LLM) training, but its original paper was rejected by NeurIPS 2017. The author cites John Schulman, one of PPO’s creators, who simply states that PPO was turned away by the conference.

The earliest version of the paper appeared in July 2017. It presented PPO as a simpler, more engineering‑friendly policy‑optimization method that aimed to retain the stability of TRPO while reducing implementation complexity, making reinforcement‑learning training easier to tune and more practical.
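To make the "simpler than TRPO" claim concrete, here is a minimal sketch of the clipped surrogate objective as published in the PPO paper. The article itself does not show the formula, so treat this as an illustration only. The function name, the plain-list inputs, and the default clip range of 0.2 are assumptions chosen for the example, not part of any library's API.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    ratios:     per-sample probability ratios pi_new(a|s) / pi_old(a|s)
    advantages: per-sample advantage estimates
    epsilon:    clip range (0.2 is an assumed default for this sketch)
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        raise ValueError("at least one sample is required")

    total = 0.0
    for r, a in zip(ratios, advantages):
        # Keep the ratio inside [1 - epsilon, 1 + epsilon].
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms,
        # so moving the policy far from the old one earns no extra credit.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)
```

The clip and the `min` stand in for TRPO's explicit trust-region constraint: the update is discouraged from straying far from the old policy, yet the whole objective fits in a few lines that an ordinary gradient-based optimizer can handle.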

Years later, PPO’s prominence grew not through traditional RL benchmarks like Atari or robot control, but through its adoption in the training pipelines of large language models. According to Schulman, PPO experienced a second surge of popularity in the LLM era, for reasons the original paper had not anticipated.

Schulman explains that the paper was initially rejected because reviewers considered its innovation limited and the empirical gains over existing baselines insufficient. An online commenter expands on this view, noting a mismatch between academic evaluation, which values novelty and modest benchmark improvements, and real-world needs, which prioritize scalability, stability in complex systems, and practical usability.

Schulman reflects calmly that the episode happened long ago and hopes that the academic community has since come to appreciate the “simple yet scalable” aesthetic of such methods.

The article uses PPO’s story to illustrate a broader point: an algorithm’s lasting impact cannot always be judged at submission time. It cites several other influential works that were rejected before becoming foundational. LSTM, rejected by NIPS 1996 for complexity and lack of biological plausibility, later became central to speech recognition and machine translation. SIFT, rejected by ICCV 1997 and CVPR 1998 for cumbersome engineering, went on to dominate computer-vision pipelines for over a decade. Dropout, rejected by NIPS 2012 as an engineering hack with weak theory, became a core regularization technique and earned a NeurIPS Test-of-Time award.

These examples reinforce the notion that time is the strictest and fairest reviewer, allowing truly impactful ideas to surface despite early setbacks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
