A 2026 Survey of LLM‑Focused RL: From PPO to DPO, GRPO, and Multi‑Agent RL
This article reviews five years of LLM‑centric reinforcement learning, tracing its evolution from early Q‑learning through PPO to Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and finally multi‑agent RL. For each method it details the mechanics, strengths, failure modes, and practical considerations, along with the emerging open‑source toolchains.
