Tagged articles

RL

16 articles · Page 1 of 1

Machine Learning Algorithms & Natural Language Processing

Jun 7, 2026 · Artificial Intelligence

AgentDoG 1.5: A Lightweight, Extensible Framework for Trajectory‑Level Agent Safety

AgentDoG 1.5 expands AI‑agent safety from final replies to complete execution trajectories, introducing the ATBench family for fine‑grained evaluation, a taxonomy‑guided DataEngine for high‑quality data generation, and demonstrating substantial safety gains in both SFT/RL training and online guardrail deployment with lightweight models.

AI safety · ATBench · AgentDoG

0 likes · 14 min read

Machine Learning Algorithms & Natural Language Processing

May 25, 2026 · Artificial Intelligence

VeRL-Omni: A Universal RL Post‑Training Framework for Diffusion and Multimodal Generation Models

VeRL-Omni introduces a universal reinforcement‑learning post‑training framework that extends the verl and vLLM‑Omni stacks to support diffusion transformers, hybrid AR‑DiT, and unified understanding‑generation models, offering high‑throughput multimodal rollout, flexible reward engines, modular trainers, and broad hardware compatibility.

FlowGRPO · Multimodal Generation · RL

0 likes · 9 min read

Machine Heart

May 25, 2026 · Artificial Intelligence

VeRL-Omni: Universal RL Post‑Training for Diffusion and Multimodal Models

VeRL-Omni is an open‑source RL post‑training framework built on verl and vLLM‑Omni that enables efficient, high‑throughput rollout and flexible reward computation for diffusion, AR‑DiT, and unified multimodal generation models, supporting diverse hardware, modular trainers, and demonstrating up to 14% latency reduction and high training throughput in benchmark experiments.

Diffusion Models · FlowGRPO · Multimodal Generation

0 likes · 9 min read

AI Step-by-Step

Apr 30, 2026 · Artificial Intelligence

How Hermes Turns Runtime Agent Executions into a Closed‑Loop Training Pipeline

The article explains how Hermes structures the runtime execution of agents, capturing tool calls, context changes, results, and rewards, so that these trajectories can be evaluated, used for fine‑tuning, and fed into reinforcement‑learning loops, creating a continuous improvement cycle.

Agent Runtime · Atropos · Hermes

0 likes · 16 min read

Machine Heart

Apr 4, 2026 · Artificial Intelligence

SFT Scores Don’t Predict RL Potential: Adaptive Early‑Stop Loss for LLMs

The authors show that high SFT accuracy does not guarantee strong RL performance because over‑fitting reduces output diversity, and they propose Adaptive Early‑Stop Loss (AESL), a diversity‑aware early‑stopping objective that dynamically weights token and subsequence losses, yielding consistently better RL results on multiple LLMs and math benchmarks.

AESL · Diversity · LLM

0 likes · 11 min read

Baobao Algorithm Notes

Mar 3, 2026 · Artificial Intelligence

Boosting LLM Post-Training with RL: Tips for Efficiency and Stability

This article shares practical insights and pitfalls from six months of applying reinforcement learning to fine‑tune large language models, covering exploration efficiency, training stability, model selection, and special considerations for thinking‑oriented agents.

AI · Efficiency · LLM

0 likes · 12 min read

PaperAgent

Jan 19, 2026 · Artificial Intelligence

How Reinforcement Learning Can Boost LLM Reasoning by Shaping Token Distributions

Recent research shows that applying reinforcement learning to large language models can substantially improve reasoning performance, but its effectiveness depends on the token distribution produced during pre‑training. This motivates a novel rewrite of cross‑entropy as a single‑step policy gradient with controllable entropy parameters.

LLM · Model Optimization · RL

0 likes · 6 min read

Baobao Algorithm Notes

Jan 16, 2026 · Artificial Intelligence

From PPO to SAPO: Evolution of Large‑Model Reinforcement Learning Algorithms

This article systematically reviews the main reinforcement‑learning algorithms—PPO, GRPO, DAPO, GSPO, and SAPO—used for fine‑tuning large language models, explaining why supervised fine‑tuning precedes RL, how each method improves training efficiency and stability, and what trade‑offs they entail.

GRPO · Large Language Models · PPO

0 likes · 15 min read

Data Party THU

Sep 15, 2025 · Artificial Intelligence

Why Merge SFT and RL? Exploring Unified Fine‑Tuning Strategies for LLMs

This article examines the necessity of integrating Supervised Fine‑Tuning (SFT) with Reinforcement Learning (RL) for large language models, surveys alternating, sample‑reuse, simultaneous, and hint‑guided fusion methods, presents the underlying loss functions, and discusses practical trade‑offs such as entropy collapse and importance‑sampling corrections.

AI · LLM · RL

0 likes · 14 min read

AI2ML AI to Machine Learning

Aug 25, 2025 · Artificial Intelligence

Decoding OpenAI’s Multi‑Level AGI Roadmap

The article analyzes OpenAI’s five‑level AGI roadmap, compares it with DeepMind’s ECEVS framework, and examines the technical progress from L1 to L5, including RL‑enhanced chain‑of‑thought, ReAct agents, deep research, and upcoming innovations, while highlighting the commercial implications of each stage.

AGI · Artificial Intelligence · Chain-of-Thought

0 likes · 7 min read

Data Party THU

Aug 19, 2025 · Artificial Intelligence

Why RL Fine‑Tuning Fails to Extend LLM Reasoning Limits: Entropy Collapse Explained

This article examines how reinforcement learning fine‑tuning influences large language model reasoning. Drawing on extensive experiments on scaling laws and entropy analysis, it argues that RL primarily amplifies pre‑trained capabilities, suffers from entropy collapse, and fails to push the model’s reasoning boundary, and it reviews mitigation techniques.

LLM · RL · RLVR

0 likes · 24 min read

Data Party THU

Aug 7, 2025 · Artificial Intelligence

Why GRPO Fails on Large LLMs and How GSPO Restores Training Stability

The paper shows that GRPO’s token‑level importance weighting introduces high‑variance noise that destabilizes large‑scale language model RL training, and proposes GSPO, a sequence‑level importance sampling method that aligns with reward definitions, improves gradient stability, and yields higher training efficiency and better performance across benchmarks.

GRPO · GSPO · RL

0 likes · 8 min read

Baobao Algorithm Notes

Mar 27, 2025 · Artificial Intelligence

Why a Robust Training Pipeline Beats Fancy LLM Tricks – Lessons from DAPO

The article analyzes the DAPO technical report, showing how dynamic‑sampling pipelines and token‑level loss handling in SFT and RL training outperform ad‑hoc algorithm tricks, and compares the training dynamics of reinforce_baseline and GRPO with concrete code examples.

Dynamic Sampling · GRPO · LLM

0 likes · 8 min read

Bighead's Algorithm Notes

Feb 15, 2025 · Artificial Intelligence

FinRL‑DeepSeek: How Integrating DeepSeek with RL Improves Portfolio Returns (Code Open‑Source)

This article reviews a new risk‑sensitive trading agent that combines reinforcement learning with large language models to extract stock recommendations and news‑based risk scores, describes the extended CVaR‑PPO algorithm, presents extensive experiments on the FNSPID dataset, and discusses the resulting performance gains and future work.

Algorithmic Trading · CVaR · DeepSeek

0 likes · 10 min read

Baobao Algorithm Notes

Sep 25, 2024 · Industry Insights

Decoding OpenAI o1: How RL and LLM Fuse to Power Hidden Chain‑of‑Thought

This article analytically reconstructs OpenAI o1’s architecture, training pipeline, and inference workflow, exploring its reinforcement‑learning‑enhanced hidden chain‑of‑thought, multi‑model composition, scaling laws, reward modeling, and potential implications for future AI safety and small‑model strategies.

AI safety · Hidden COT · LLM

0 likes · 43 min read

Ctrip Technology

Jun 19, 2019 · Artificial Intelligence

Applying Reinforcement Learning to Hotel Ranking at Ctrip: Challenges, Solutions, and Preliminary Results

This article examines the limitations of traditional learning‑to‑rank for Ctrip hotel sorting, introduces reinforcement learning as a remedy, outlines three progressive implementation plans (A, B, C) with algorithm choices and engineering trade‑offs, and presents early experimental findings that demonstrate RL's potential to improve conversion rates.

Ctrip · RL · Ranking

0 likes · 15 min read
