Tagged articles

Credit Assignment

5 articles

Jul 9, 2026 · Artificial Intelligence

Token-Level Credit Assignment Outperforms Broadcast GRPO in LLM Math Reasoning

The paper identifies GRPO's broadcast‑style credit assignment, in which every token of a response receives the same sequence‑level advantage, as a bottleneck for RL‑trained LLMs on math reasoning. It proposes the Outcome‑Grounded Advantage Reshaping (OAR) framework with token‑importance estimation and demonstrates that its two variants, OAR‑P and OAR‑G, consistently improve accuracy, training efficiency, and stability across multiple math benchmarks.

Credit Assignment · GRPO · LLM

15 min read

Machine Learning Algorithms & Natural Language Processing

Jun 5, 2026 · Artificial Intelligence

StepOPSD: Precise Step‑Level Error Detection for Multi‑Turn Agent RL

StepOPSD adds a post‑hoc, step‑aware distillation stage to multi‑turn agent reinforcement learning. It splits rollouts into controllable steps, uses successful trajectories as hindsight teachers to compute token‑level advantage adjustments, and demonstrates significant gains on ALFWorld and Search‑QA tasks where reward misalignment is most severe.

ALFWorld · Advantage Weighting · Agent RL

13 min read

HyperAI Super Neural

May 28, 2026 · Artificial Intelligence

Large-Model RL Advances: Credit Assignment, Complex Reasoning, Agent Learning

HyperAI curates six recent large‑model reinforcement‑learning papers: ECHO’s free world‑model learning, DelTA’s discriminative token credit, GoLongRL’s capability‑oriented long‑context RL, Anti‑SD’s reverse distillation, RubricEM’s rubric‑guided policy decomposition, and Poly‑EPO’s diversity‑driven exploration. The roundup highlights their methods, benchmarks, and performance gains.

Agent Learning · Complex Reasoning · Credit Assignment

10 min read

Machine Heart

May 23, 2026 · Artificial Intelligence

Why Can’t LLMs Directly Copy AlphaGo’s MCTS Success?

The article analyzes why large language models cannot simply adopt AlphaGo’s Monte‑Carlo Tree Search, highlighting credit‑assignment difficulties, gradient‑variance explosion in multi‑step RL, and how AlphaGo’s tight integration of value and policy networks amortizes search in a way LLMs cannot replicate.

AlphaGo · Credit Assignment · LLM

6 min read

Baobao Algorithm Notes

Feb 24, 2026 · Artificial Intelligence

The Bitter Lesson of Building Agentic RL in Terminal Environments

This article recounts the challenges of moving from single‑step RL with verifiable rewards to multi‑step agentic reinforcement learning in terminal environments, detailing infrastructure design, asynchronous pipelines, data quality checks, masking strategies, curriculum training, chunk‑based optimization, and practical lessons learned from large‑scale experiments.

Agentic RL · Asynchronous Training · Credit Assignment

33 min read
