Tagged articles

43 articles

May 1, 2026 · Artificial Intelligence

From PPO to MaxRL: The Evolution of Reinforcement Learning for LLM Inference

This article surveys the rapid evolution of reinforcement‑learning algorithms for large‑language‑model inference from early REINFORCE and PPO to newer approaches such as GRPO, RLOO, DAPO, CISPO, DPPO, ScaleRL and MaxRL, highlighting their design motivations, mathematical formulations, empirical trade‑offs and open research challenges.

GRPO, LLM, MaxRL

27 min read

HyperAI Super Neural

Apr 23, 2026 · Artificial Intelligence

Task Tokens Cut Per-Task Trainable Parameters 125× and Boost Convergence 6× for Embodied AI

The Task Tokens method introduced by an Israeli research team reduces the number of trainable parameters per task by up to 125‑fold and speeds up convergence by six times, while preserving the flexibility of Behavior Foundation Models and demonstrating strong performance, robustness, and compatibility across a suite of embodied control tasks.

Behavior Foundation Models, Embodied AI, Multi-Modal Prompting

13 min read

Bighead's Algorithm Notes

Apr 22, 2026 · Artificial Intelligence

How DeepAries’s Adaptive Rebalancing Timing Boosts Portfolio Returns

DeepAries is a novel deep reinforcement‑learning framework that jointly learns when to rebalance a portfolio and how to allocate assets by combining a Transformer‑based state encoder with PPO, and extensive experiments on four major markets show it significantly outperforms fixed‑frequency baselines in risk‑adjusted return, transaction cost, and drawdown.

DeepAries, PPO, Portfolio Management

15 min read

Wu Shixiong's Large Model Academy

Apr 16, 2026 · Interview Experience

Turn Memorized Answers into Deep Understanding for Tech Interviews

This article explains why interviewers use seemingly rote questions to probe a candidate's true grasp of concepts, contrasts memorization with genuine understanding using PPO vs GRPO, and provides a practical three‑question framework and dialogue examples to help candidates answer technical principle questions confidently.

Answering Techniques, GRPO, PPO

12 min read

Machine Learning Algorithms & Natural Language Processing

Mar 17, 2026 · Artificial Intelligence

MIT Study Shows Adding Noise to Large Models Can Replace GRPO/PPO Tuning

A new MIT paper reveals that pretrained large models already contain many hidden expert submodels, and that a simple one‑step Gaussian perturbation (RandOpt) can locate and ensemble these experts to achieve performance comparable to or better than traditional GRPO/PPO tuning, especially as model size grows.

GRPO, Model Scaling, PPO

9 min read

Baobao Algorithm Notes

Jan 16, 2026 · Artificial Intelligence

From PPO to SAPO: Evolution of Large‑Model Reinforcement Learning Algorithms

This article systematically reviews the main reinforcement‑learning algorithms—PPO, GRPO, DAPO, GSPO, and SAPO—used for fine‑tuning large language models, explaining why supervised fine‑tuning precedes RL, how each method improves training efficiency and stability, and what trade‑offs they entail.

GRPO, PPO, RL

15 min read

360 Smart Cloud

Nov 14, 2025 · Artificial Intelligence

How TLM Platform Powers LLM Ops with PPO, GRPO and Reinforcement Evaluators

The article introduces the TLM large‑model development platform, details its fine‑tuning options, explains reinforcement learning fundamentals and key algorithms such as PPO and the newer GRPO, describes the architecture of a reinforcement evaluator, and shows how to configure RL training on the platform.

AI Platform, GRPO, LLMOps

10 min read

Data Party THU

Nov 10, 2025 · Artificial Intelligence

Which Neural Network Method Best Estimates Uncertainty in Regression? A Comparative Study

This article examines why regression models need uncertainty estimates, explains aleatoric and epistemic uncertainty, compares four neural‑network approaches (Mean + LogStd, Mean + LogVariance, MC Dropout, simplified PPO) on a concrete‑strength dataset, and analyzes their experimental performance and limitations.

Monte Carlo Dropout, PPO, regression

10 min read

Zhuanzhuan Tech

Oct 29, 2025 · Artificial Intelligence

How Reinforcement Learning Boosts Stability and Speed in LLM QA Systems

This article examines how reinforcement‑learning techniques such as PPO, DPO, and GRPO are integrated into the Baixiaosheng QA system to improve answer stability, deepen domain knowledge understanding, and accelerate response generation, and it evaluates the impact of Reinforcement Fine‑Tuning (RFT) on real‑world performance.

AI, DPO, GRPO

16 min read

Fun with Large Models

Sep 24, 2025 · Artificial Intelligence

Interview Guide: Core Differences Between PPO and GRPO Algorithms for Large Model Fine‑Tuning

The article explains the fundamental principles of PPO and GRPO reinforcement‑learning algorithms, compares their architectures and training workflows, highlights why GRPO is gaining traction in large‑model fine‑tuning, discusses associated risks, and offers practical guidance on group size selection for engineers preparing for interviews.

GRPO, PPO, RLHF

9 min read

Sohu Tech Products

Sep 10, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

This article explains the GRPO algorithm, an improvement over PPO for large language model training that eliminates the value network, uses group‑relative advantage estimation, and offers flexible supervision, resulting in higher efficiency, stability, and performance on tasks such as mathematical reasoning.

AI Optimization, GRPO, LLM training

16 min read

Data Party THU

Sep 4, 2025 · Artificial Intelligence

Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive

This article provides a comprehensive technical analysis of PPO‑based reinforcement learning methods for large language models, detailing the evolution from the original PPO algorithm through GRPO, DAPO, and GSPO, and explaining their motivations, mathematical formulations, advantages, and practical challenges such as entropy collapse and importance‑sampling variance.

DAPO, GRPO, GSPO

30 min read

Sohu Tech Products

Sep 3, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF for Large Language Models

This article explains the motivation, mathematical foundations, implementation details, advantages, experimental results, and future directions of Group Relative Policy Optimization (GRPO), a novel reinforcement‑learning algorithm that replaces PPO’s value network with efficient group‑wise relative evaluation for large language models.

GRPO, LLM, PPO

17 min read

Baobao Algorithm Notes

Aug 15, 2025 · Artificial Intelligence

Unlocking LLM Performance: Classic Deep RL Tricks Reimagined for Modern Training

This article systematically adapts classic deep reinforcement‑learning techniques—such as multi‑step returns, TD(λ)/GAE, V‑trace corrections, uncertainty‑aware weighting, safety constraints, distribution‑robust optimization, and value‑guided decoding—to improve large language model training and inference, providing concrete formulas, implementation tips, and empirical results.

Deep RL, GAE, LLM

17 min read

AI Algorithm Path

Jul 27, 2025 · Artificial Intelligence

Understanding RLHF: How Human Feedback Trains Modern LLMs

This article explains the RLHF (Reinforcement Learning from Human Feedback) pipeline that powers ChatGPT and other large language models, covering the limitations of traditional fine‑tuning, the creation of human‑feedback datasets, reward‑model training, loss design, and the final PPO‑based fine‑tuning step.

ChatGPT, Human Feedback, PPO

8 min read

AI Algorithm Path

Apr 13, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization for LLM Training

The article explains GRPO, a reinforcement‑learning algorithm that extends PPO with group sampling, no value network, dual penalties and KL regularisation, showing how it improves efficiency and stability when fine‑tuning large language models such as DeepSeek‑Math and DeepSeek‑R1.

DeepSeek, GRPO, PPO

6 min read

Baobao Algorithm Notes

Mar 20, 2025 · Artificial Intelligence

Unlocking Large‑Scale Deep Reinforcement Learning: PPO, GAE, and PPG Deep Dive

This comprehensive guide examines large‑scale deep reinforcement learning, detailing policy‑gradient fundamentals, the mathematics of PPO and GAE, practical implementation tricks, reward and observation normalization, network initialization, and the newer Phasic Policy Gradient method, all supported by code snippets and key research references.

Algorithm Optimization, Deep RL, GAE

19 min read

Data Thinking Notes

Mar 16, 2025 · Artificial Intelligence

Why DeepSeek R1 Swaps PPO for GRPO: A Deep Dive into RLHF Alternatives

DeepSeek‑R1 replaces the traditional PPO‑based RLHF approach with GRPO, reducing reliance on human‑labeled data by using pure reinforcement learning environments and carefully designed reward mechanisms; the article explains reinforcement learning fundamentals, compares PPO, DPO and GRPO, and offers practical application recommendations.

AI Alignment, DPO, GRPO

14 min read

Tencent Technical Engineering

Feb 24, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization in Reinforcement Learning and Large Language Models

The article reviews reinforcement-learning fundamentals and the progression from policy-gradient to PPO, then introduces Group Relative Policy Optimization (GRPO)—a critic-free method that normalizes rewards across multiple sampled outputs to compute group-relative advantages—and shows how DeepSeek-R1 leverages GRPO with rule-based rewards to achieve strong reasoning performance.

GRPO, PPO, Policy Optimization

16 min read

DevOps

Feb 23, 2025 · Artificial Intelligence

Understanding Reinforcement Learning, RLHF, PPO and GRPO for AI Applications

This article explains how DeepSeek‑R1‑Zero uses group‑relative policy optimization (GRPO) to enhance inference without labeled data, introduces reinforcement learning with human feedback (RLHF) and its components, and compares the PPO and GRPO algorithms, highlighting their suitable engineering scenarios and practical implications for AI applications.

AI model training, Deep Learning, GRPO

15 min read

Bighead's Algorithm Notes

Feb 15, 2025 · Artificial Intelligence

FinRL‑DeepSeek: How Integrating DeepSeek with RL Improves Portfolio Returns (Code Open‑Source)

This article reviews a new risk‑sensitive trading agent that combines reinforcement learning with large language models to extract stock recommendations and news‑based risk scores, describes the extended CVaR‑PPO algorithm, presents extensive experiments on the FNSPID dataset, and discusses the resulting performance gains and future work.

Algorithmic Trading, CVaR, DeepSeek

10 min read

Xiaohongshu Tech REDtech

Jan 2, 2025 · Artificial Intelligence

Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

Xiaohongshu’s team unveiled a self‑developed RLHF system that trains multimodal large language models using heterogeneous and homogeneous network architectures, extensive PPO optimizations, and Medusa speculative sampling, achieving over 50% throughput gains, reduced hardware needs, and 5‑20% performance improvements on zero‑shot benchmarks.

Distributed Training, PPO, PRM

21 min read

Baobao Algorithm Notes

Nov 18, 2024 · Artificial Intelligence

Demystifying Actor‑Critic and PPO: From Policy Gradients to Practical RL

This article provides a thorough, step‑by‑step explanation of reinforcement‑learning theory—covering policy‑based objectives, value‑function definitions, the derivation of policy gradients, actor‑critic architecture, advantage estimation, importance sampling, GAE, and the PPO algorithm—aimed at readers with little prior RL knowledge.

PPO, actor-critic, advantage estimation

31 min read

Baobao Algorithm Notes

Oct 22, 2024 · Artificial Intelligence

Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls

This article analytically explores the implicit assumptions behind the RLHF optimization objective, examines how they limit DPO and PPO methods, and proposes practical improvements such as rejection sampling and online on‑policy data selection to narrow the gap between theory and practice.

AI Alignment, DPO, PPO

22 min read

Baobao Algorithm Notes

Oct 21, 2024 · Artificial Intelligence

Unraveling RLHF: From PPO to DPO and Beyond – A Comprehensive Guide

This article provides a thorough, four‑part overview of RLHF for large language models, covering preference‑optimization algorithms (PPO‑based and offline RL approaches), reward‑model training techniques, inference‑time exploration strategies, and practical implementation details including the OpenRLHF framework and resource‑allocation tricks.

DPO, LLM optimization, OpenRLHF

27 min read

NewBeeNLP

Sep 23, 2024 · Artificial Intelligence

Why Post‑Training Is Redefining LLMs: DPO vs PPO, Synthetic Data, and Scaling Strategies

This article analyzes recent post‑training trends in large language models, comparing DPO and PPO, examining the scarcity of open‑source preference data, the iterative training process, the rise of synthetic data pipelines, and emerging methods for improving math and reasoning capabilities.

DPO, LLM, PPO

12 min read

Python Programming Learning Circle

Sep 10, 2024 · Artificial Intelligence

Using TorchRL to Implement Multi‑Agent PPO for MARL

This tutorial explains how to set up a multi‑agent reinforcement learning (MARL) environment with VMAS, install required dependencies, configure PPO hyper‑parameters, build policy and critic networks, collect data with TorchRL, and run a training loop to train agents for coordinated navigation tasks.

Deep Learning, PPO, TorchRL

10 min read

Baobao Algorithm Notes

May 30, 2024 · Artificial Intelligence

What’s the Latest RLHF Landscape? From PPO to ORPO Explained

This article surveys the current RLHF ecosystem, comparing on‑policy methods like PPO with off‑policy approaches such as DPO, and examines recent variants—including ReMax, GRPO, DPOP, TDPO, and ORPO—highlighting their algorithmic differences, resource trade‑offs, and practical performance insights.

Alignment, DPO, LLM

23 min read

NewBeeNLP

Apr 1, 2024 · Artificial Intelligence

How Llama 2 Uses RLHF, PPO, Rejection Sampling, and Ghost Attention

This article provides a detailed technical walkthrough of Llama 2's Reinforcement Learning with Human Feedback pipeline, covering human preference data collection, reward‑model design and training, iterative fine‑tuning with PPO and rejection sampling, the Ghost Attention technique for multi‑turn consistency, and the resulting experimental evaluations.

Ghost Attention, Llama-2, PPO

18 min read

Network Intelligence Research Center (NIRC)

Nov 27, 2023 · Artificial Intelligence

How Deep Reinforcement Learning Shapes 15‑Minute City Community Planning

This article explains how a deep reinforcement learning model, built on a graph‑based representation of urban elements and trained with PPO, can automate land‑use and road planning to achieve Service, Ecology, and Traffic objectives for 15‑minute city neighborhoods.

15-minute city, Graph Neural Network, PPO

9 min read

Baobao Algorithm Notes

Oct 9, 2023 · Artificial Intelligence

Demystifying RLHF and PPO for Large Language Models: Theory and Practice

This article explains why Reinforcement Learning from Human Feedback (RLHF) is crucial for LLM intelligence, outlines the three-stage training pipeline, details InstructGPT's reward model and PPO optimization, and provides a practical guide to implementing RLHF with deep‑learning frameworks.

PPO, RLHF, Reward Modeling

17 min read

Baidu Geek Talk

Aug 16, 2023 · Artificial Intelligence

Understanding Reinforcement Learning: From Basics to PPO and Policy Gradient

This article provides a comprehensive overview of reinforcement learning, covering fundamental concepts, differences from supervised learning, algorithm families, policy gradient methods, practical tricks like baselines and reward‑to‑go, and detailed explanations of TRPO and PPO variants with illustrative diagrams.

PPO, actor-critic, machine learning

19 min read

Baobao Algorithm Notes

Jul 16, 2023 · Artificial Intelligence

Why High RM Scores Don't Guarantee Better LLMs: 7 RLHF Tricks for Stable PPO Training

The article examines why rising RM scores in large‑model training don't ensure superior LLM performance and presents seven practical RLHF tricks—ranging from KL‑penalty to global gradient clipping—that improve PPO stability and reduce resource overhead.

LLM training, PPO, RLHF

7 min read

IT Architects Alliance

Feb 23, 2023 · Artificial Intelligence

Training a Positive Review Generator with RLHF and PPO

This article demonstrates how to use Reinforcement Learning from Human Feedback (RLHF) with a PPO algorithm and a sentiment‑analysis model to train a language model that generates positive product reviews, covering task definition, data sampling, reward evaluation, model optimization, and experimental results.

GPT, Language Model, PPO

11 min read

Architect

Feb 19, 2023 · Artificial Intelligence

Training a Positive Review Generator with RLHF and PPO

This article demonstrates how to apply Reinforcement Learning from Human Feedback (RLHF) using a sentiment‑analysis model as a reward function and Proximal Policy Optimization (PPO) to fine‑tune a language model that generates positive product reviews, complete with code snippets and experimental results.

Language Model, PPO, RLHF

10 min read

dbaplus Community

Feb 18, 2023 · Artificial Intelligence

Why ChatGPT Still Gets It Wrong: Inside RLHF and Model Consistency

ChatGPT, OpenAI’s latest language model, builds on GPT‑3 but uses supervised fine‑tuning and Reinforcement Learning from Human Feedback (RLHF) to improve alignment, yet its training methods still leave consistency issues such as unhelpful responses, hallucinations, bias, and limited explainability.

ChatGPT, Model Alignment, PPO

17 min read

IT Architects Alliance

Feb 9, 2023 · Artificial Intelligence

How ChatGPT Works: Model Architecture, Training Strategies, and RLHF

ChatGPT, OpenAI’s latest language model, builds on GPT‑3 using supervised fine‑tuning and Reinforcement Learning from Human Feedback (RLHF) with PPO, addressing consistency issues by aligning model outputs with human preferences; the article also discusses training methods, limitations, and evaluation metrics.

AI Alignment, ChatGPT, PPO

15 min read

Architects' Tech Alliance

Feb 7, 2023 · Artificial Intelligence

ChatGPT: Technical Principles, Architecture, and the Role of Human‑Feedback Reinforcement Learning

This article explains how ChatGPT builds on GPT‑3 with improved accuracy and coherence, details its training pipeline that combines supervised fine‑tuning and Reinforcement Learning from Human Feedback (RLHF), discusses consistency challenges, evaluation metrics, and the limitations of the RLHF approach.

AI Alignment, ChatGPT, PPO

15 min read

Architect

Feb 6, 2023 · Artificial Intelligence

Understanding How ChatGPT Works: RLHF, PPO, and Consistency Challenges

This article explains the underlying mechanisms of ChatGPT, including its GPT‑3 foundation, the role of supervised fine‑tuning, human‑feedback reinforcement learning (RLHF), PPO optimization, consistency issues, evaluation metrics, and the limitations of these training strategies, with references to key research papers.

AI Alignment, ChatGPT, PPO

16 min read

Tencent Cloud Developer

Dec 9, 2022 · Artificial Intelligence

An Overview of ChatGPT: Technology, Training Process, and Applications

The article outlines ChatGPT’s conversational capabilities, its InstructGPT‑based architecture, a three‑stage RLHF training pipeline involving supervised fine‑tuning, human‑ranked response generation, and PPO optimization, and discusses its strengths, limitations, diverse applications, and future directions for multimodal, up‑to‑date assistants.

AI applications, ChatGPT, PPO

18 min read

Huawei Cloud Developer Alliance

Jun 1, 2022 · Artificial Intelligence

How AI Beats Super Mario with PPO in 5 Minutes

This tutorial demonstrates how to use Huawei Cloud ModelArts and the Proximal Policy Optimization (PPO) reinforcement‑learning algorithm to train an AI agent that can clear most Super Mario levels within about 1500 episodes, even for users with no coding experience.

AI, ModelArts, PPO

6 min read

IEG Growth Platform Technology Team

Dec 6, 2021 · Artificial Intelligence

Model-Free Reinforcement Learning for ROI Optimization: Methods, Advertising Applications, and Tencent Game Advertising Practice

This article introduces model‑free reinforcement learning fundamentals, reviews mainstream solution methods such as Monte‑Carlo, Temporal‑Difference, and n‑step TD with eligibility traces, discusses their application in online advertising and presents Tencent's game advertising practice, including algorithm choices, reward design, and experimental results.

A3C, Advertising, PPO

17 min read

DataFunTalk

Oct 4, 2020 · Artificial Intelligence

Reinforcement Learning for Product Ranking: Model Design, Experiments, and Online Deployment

This article presents a comprehensive study of using reinforcement learning to improve e‑commerce product ranking, covering the limitations of traditional scoring, the design of context‑aware models, a pointer‑network based sequence generator, various RL algorithms, extensive offline evaluations, and successful online deployment with future research directions.

Deep Learning, PPO, e‑commerce

28 min read