Tagged articles

RLVR

12 articles

Jun 28, 2026 · Artificial Intelligence

Can AI Learn on the Job? RLVR, OPSD, and Dreaming for the Next‑Gen Training Paradigm

The article examines Dwarkesh Patel’s view that future AI must move beyond one‑off pre‑training to continual, on‑the‑job learning, discussing Reinforcement Learning with Verifiable Rewards (RLVR), the need for "grindable" tasks, and emerging approaches like on‑policy self‑distillation (OPSD) and "dreaming" to write real‑world experience back into model weights.

AI Training Paradigms · Continual Learning · Dreaming

0 likes · 12 min read

Data Party THU

Jun 5, 2026 · Artificial Intelligence

A 2026 Survey of LLM‑Focused RL: From PPO to DPO, GRPO, and Multi‑Agent RL

This article reviews the five‑year evolution of reinforcement‑learning techniques for large language models, comparing PPO, DPO, GRPO and emerging multi‑agent approaches, analyzing their reward signals, practical trade‑offs, and the open‑source frameworks that support them.

DPO · GRPO · LLM

0 likes · 34 min read

Machine Heart

May 14, 2026 · Artificial Intelligence

Breaking Homogeneous Reasoning: I²B‑LPO Guides RLVR from Repeated Sampling to Effective Exploration

I²B‑LPO is an exploration‑enhancement framework for RLVR that branches rollouts at high‑entropy nodes, injects latent variables via pseudo self‑attention, and filters paths with an information‑bottleneck self‑reward, improving accuracy by up to 5.3% and diversity by up to 7.4% on multiple math reasoning benchmarks.

RLVR · entropy · exploration

0 likes · 14 min read

Lao Guo's Learning Space

Apr 2, 2026 · Artificial Intelligence

Large Model Pretraining and Fine‑Tuning: A 2026 Technical Guide from Scaling Laws to Post‑Training Revolution

This article explains the full lifecycle of large language models in 2026, covering pretraining fundamentals, the limits of classic Scaling Laws, data‑centric advances, fine‑tuning strategies, RLHF, DPO, and the emerging post‑training methods GRPO, DAPO and RLVR, with concrete benchmarks and cost analyses.

DAPO · DPO · GRPO

0 likes · 17 min read

Machine Learning Algorithms & Natural Language Processing

Mar 28, 2026 · Artificial Intelligence

A Comprehensive Guide to LLM Post‑Training: From RLHF and GRPO to Agentic RL

This article systematically explains the post‑training pipeline for large language models, covering supervised fine‑tuning, RLHF, PPO, GRPO, RLVR, DPO and emerging Agentic RL, while illustrating each method with analogies, detailed workflows, tables, and recent research findings.

Agentic RL · DPO · GRPO

0 likes · 24 min read

Amazon Cloud Developers

Jan 20, 2026 · Artificial Intelligence

Boost Model Accuracy by 66% with Amazon Bedrock Reinforcement Fine‑Tuning

Amazon Bedrock’s new reinforcement fine‑tuning feature lets developers create smaller, faster, more accurate models—up to 66% higher accuracy—without deep ML expertise or large labeled datasets, offering automated workflows, two reward‑based learning options (RLVR and RLAIF), and built‑in security for cost‑effective model customization.

AI · Amazon Bedrock · Model Customization

0 likes · 10 min read

Design Hub

Dec 20, 2025 · Artificial Intelligence

Must-Read: K's 2025 AI Review – 6 Paradigm Shifts Reshaping Our World

The article reviews six 2025 paradigm shifts in large language models—from the rise of verifiable‑reward reinforcement learning and the emergence of AI "ghosts" to new "Cursor for X" middle layers, local agents like Claude Code, Vibe Coding that lets users program by conversation, and visual interaction driven by Gemini Nano Banana—highlighting their technical impact and design implications.

AI agents · LLM · RLVR

0 likes · 12 min read

PaperAgent

Dec 20, 2025 · Industry Insights

What 2025 Tells Us About the Future of Large Language Models

The 2025 LLM year‑in‑review highlights paradigm shifts such as RLVR training, uneven “saw‑tooth” intelligence, the rise of Cursor‑style applications, Claude Code agents running locally, Vibe Coding, and the Nano Banana GUI revolution, concluding that current models exploit only about 10% of their potential.

AI agents · Industry Trends · LLM

0 likes · 10 min read

Baobao Algorithm Notes

Nov 11, 2025 · Artificial Intelligence

Why Redesign the Training Stack? Inside Olmo‑Thinking’s Open‑Source RL Journey

This article provides a detailed technical analysis of the Olmo‑Thinking project, covering why a new open‑source LLM was built, the challenges of reinforcement learning at scale, data‑mix optimization, architectural bottlenecks such as missing GQA and QK‑Norm, and the post‑training techniques used to improve reasoning and long‑context capabilities.

RLVR · data selection · open-source models

0 likes · 20 min read

Data Party THU

Oct 9, 2025 · Artificial Intelligence

How Reinforcement Learning Is Transforming the Full Lifecycle of Large Language Models

This survey systematically reviews recent advances in applying reinforcement learning across the entire lifecycle of large language models, detailing methods, datasets, benchmarks, open‑source tools, and future challenges such as scalability, reward design, and evaluation standards.

AI Survey · LLM lifecycle · Large Language Models

0 likes · 9 min read

Data Party THU

Aug 19, 2025 · Artificial Intelligence

Why RL Fine‑Tuning Fails to Extend LLM Reasoning Limits: Entropy Collapse Explained

This article examines how reinforcement learning fine‑tuning influences large language model reasoning, revealing that RL primarily amplifies pre‑trained capabilities, suffers from entropy collapse, and fails to push the model’s reasoning boundary, supported by extensive experiments on scaling laws, entropy analysis, and mitigation techniques.

LLM · RL · RLVR

0 likes · 24 min read

DataFunTalk

Apr 25, 2025 · Artificial Intelligence

Does Reinforcement Learning Really Expand Reasoning Capacity in Large Language Models? Insights from Recent Empirical Study

Recent empirical research by Tsinghua’s LeapLab and Shanghai Jiao Tong University reveals that fine‑tuning with reinforcement learning with verifiable rewards (RLVR) improves sampling efficiency but does not extend the reasoning abilities of large language models beyond those of their base models, as demonstrated across mathematics, code, and visual reasoning benchmarks.

AI research · Large Language Models · RLVR

0 likes · 12 min read