Baobao Algorithm Notes
Oct 7, 2024 · Artificial Intelligence
Decoding OpenAI’s o1: How RL and Process‑Supervised Reward Models Might Power the Next LLM
The author speculates on OpenAI’s o1 architecture, proposing that it relies on reinforcement learning guided by a generalizable, process‑supervised reward model, and outlines data collection, multi‑model generation, and training tweaks needed to realize such a system.
AI research · LLM · RLHF
8 min read
