Tag

PPO


Tencent Technical Engineering
Feb 24, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization in Reinforcement Learning and Large Language Models

The article reviews reinforcement-learning fundamentals and the progression from policy-gradient methods to PPO, then introduces Group Relative Policy Optimization (GRPO), a critic-free method that normalizes rewards across multiple outputs sampled for the same prompt to compute group-relative advantages, and shows how DeepSeek-R1 uses GRPO with rule-based rewards to achieve strong reasoning performance.

GRPO · PPO · RLHF
0 likes · 16 min read
DevOps
Feb 23, 2025 · Artificial Intelligence

Understanding Reinforcement Learning, RLHF, PPO and GRPO for AI Applications

This article explains how DeepSeek‑R1‑Zero uses Group Relative Policy Optimization (GRPO) to improve reasoning without labeled data, introduces Reinforcement Learning from Human Feedback (RLHF) and its components, and compares the PPO and GRPO algorithms, highlighting the engineering scenarios each one suits and their practical implications for AI applications.

AI model training · GRPO · PPO
0 likes · 15 min read
Xiaohongshu Tech REDtech
Jan 2, 2025 · Artificial Intelligence

Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

Xiaohongshu’s team unveiled a self‑developed RLHF system that trains multimodal large language models using heterogeneous and homogeneous network architectures, extensive PPO optimizations, and Medusa speculative sampling, achieving over 50% throughput gains, reduced hardware needs, and 5‑20% performance improvements on zero‑shot benchmarks.

Medusa · PPO · PRM
0 likes · 21 min read
Python Programming Learning Circle
Sep 10, 2024 · Artificial Intelligence

Using TorchRL to Implement Multi‑Agent PPO for MARL

This tutorial explains how to set up a multi‑agent reinforcement learning (MARL) environment with VMAS, install required dependencies, configure PPO hyper‑parameters, build policy and critic networks, collect data with TorchRL, and run a training loop to train agents for coordinated navigation tasks.

Multi-Agent Reinforcement Learning · PPO · Python
0 likes · 10 min read
IT Architects Alliance
Feb 23, 2023 · Artificial Intelligence

Training a Positive Review Generator with RLHF and PPO

This article demonstrates how to use Reinforcement Learning from Human Feedback (RLHF) with the PPO algorithm and a sentiment‑analysis model to train a language model that generates positive product reviews, covering task definition, data sampling, reward evaluation, model optimization, and experimental results.

GPT · PPO · RLHF
0 likes · 11 min read
Architect
Feb 19, 2023 · Artificial Intelligence

Training a Positive Review Generator with RLHF and PPO

This article demonstrates how to apply Reinforcement Learning from Human Feedback (RLHF) using a sentiment‑analysis model as a reward function and Proximal Policy Optimization (PPO) to fine‑tune a language model that generates positive product reviews, complete with code snippets and experimental results.

PPO · RLHF · language model
0 likes · 10 min read
IT Architects Alliance
Feb 9, 2023 · Artificial Intelligence

How ChatGPT Works: Model Architecture, Training Strategies, and RLHF

ChatGPT, OpenAI’s latest language model, builds on GPT‑3 using supervised fine‑tuning and Reinforcement Learning from Human Feedback (RLHF) with PPO, which addresses consistency issues by aligning model outputs with human preferences. The article also discusses training methods, limitations, and evaluation metrics.

AI alignment · ChatGPT · PPO
0 likes · 15 min read
Architects' Tech Alliance
Feb 7, 2023 · Artificial Intelligence

ChatGPT: Technical Principles, Architecture, and the Role of Human‑Feedback Reinforcement Learning

This article explains how ChatGPT builds on GPT‑3 with improved accuracy and coherence, details its training pipeline that combines supervised fine‑tuning and Reinforcement Learning from Human Feedback (RLHF), and discusses consistency challenges, evaluation metrics, and the limitations of the RLHF approach.

AI alignment · ChatGPT · PPO
0 likes · 15 min read
Architect
Feb 6, 2023 · Artificial Intelligence

Understanding How ChatGPT Works: RLHF, PPO, and Consistency Challenges

This article explains the underlying mechanisms of ChatGPT, including its GPT‑3 foundation, the role of supervised fine‑tuning, Reinforcement Learning from Human Feedback (RLHF), PPO optimization, consistency issues, evaluation metrics, and the limitations of these training strategies, with references to key research papers.

AI alignment · ChatGPT · PPO
0 likes · 16 min read
Tencent Cloud Developer
Dec 9, 2022 · Artificial Intelligence

An Overview of ChatGPT: Technology, Training Process, and Applications

The article outlines ChatGPT’s conversational capabilities, its InstructGPT‑based architecture, and a three‑stage RLHF training pipeline involving supervised fine‑tuning, reward modeling from human‑ranked responses, and PPO‑based policy optimization. It also discusses ChatGPT’s strengths, limitations, diverse applications, and future directions for multimodal, up‑to‑date assistants.

AI applications · ChatGPT · PPO
0 likes · 18 min read
IEG Growth Platform Technology Team
Dec 6, 2021 · Artificial Intelligence

Model-Free Reinforcement Learning for ROI Optimization: Methods, Advertising Applications, and Tencent Game Advertising Practice

This article introduces model‑free reinforcement learning fundamentals, reviews mainstream solution methods such as Monte‑Carlo, Temporal‑Difference, and n‑step TD with eligibility traces, discusses their application in online advertising, and presents Tencent's game advertising practice, including algorithm choices, reward design, and experimental results.

A3C · PPO · ROI optimization
0 likes · 17 min read
DataFunTalk
Oct 4, 2020 · Artificial Intelligence

Reinforcement Learning for Product Ranking: Model Design, Experiments, and Online Deployment

This article presents a comprehensive study of using reinforcement learning to improve e‑commerce product ranking, covering the limitations of traditional scoring, the design of context‑aware models, a pointer‑network‑based sequence generator, various RL algorithms, extensive offline evaluations, and successful online deployment, and closes with future research directions.

PPO · deep learning · e-commerce
0 likes · 28 min read