Tagged articles
49 articles
DataFunTalk
May 10, 2026 · Artificial Intelligence

DeepSeek vs MCTS: Decoding the ‘Chicken & Liquor’ Dilemma in LLM Training

The article analyzes why DeepSeek’s large‑model training struggles with Monte‑Carlo Tree Search, explains its use of Chain‑of‑Thought prompting, GRPO entropy‑boosting and rejection‑sampling fine‑tuning, compares these methods with Google’s OmegaPRM and PRM approaches, and proposes a concrete MCTS‑driven data‑generation pipeline to overcome the “chicken and liquor” trade‑off.

DeepSeek · GRPO · Monte Carlo Tree Search
14 min read
DataFunSummit
May 4, 2026 · Artificial Intelligence

DeepSeek’s MCTS Failure: The ‘Roast Chicken and Baijiu’ Dilemma in LLM Training

The article examines why DeepSeek’s large‑model training cannot yet leverage Monte‑Carlo Tree Search, detailing its reliance on SFT, GRPO‑driven CoT activation and rejection‑sampling, contrasting this with Google’s PRM‑based approaches, and proposing an MCTS‑powered data‑generation pipeline to overcome the “roast chicken and baijiu” training dilemma.

GRPO · Monte Carlo Tree Search · Process Reward Model
14 min read
Machine Heart
May 1, 2026 · Artificial Intelligence

From PPO to MaxRL: The Evolution of Reinforcement Learning for LLM Reasoning

This article surveys the rapid evolution of reinforcement‑learning algorithms for large‑language‑model reasoning, from early REINFORCE and PPO to newer approaches such as GRPO, RLOO, DAPO, CISPO, DPPO, ScaleRL and MaxRL, highlighting their design motivations, mathematical formulations, empirical trade‑offs and open research challenges.

GRPO · LLM · MaxRL
27 min read
Wu Shixiong's Large Model Academy
Apr 16, 2026 · Interview Experience

Turn Memorized Answers into Deep Understanding for Tech Interviews

This article explains why interviewers use seemingly rote questions to probe a candidate's true grasp of concepts, contrasts memorization with genuine understanding using PPO vs GRPO, and provides a practical three‑question framework and dialogue examples to help candidates answer technical principle questions confidently.

Answering Techniques · GRPO · PPO
12 min read
Data Party THU
Apr 12, 2026 · Artificial Intelligence

What’s Driving the Next Wave of LLM Post‑Training? A Deep Dive into SFT, RLHF, GRPO and Emerging Trends

This article systematically reviews the core post‑training techniques for large language models—including supervised fine‑tuning, RLHF, PPO, GRPO, DPO, RLVR and Agentic RL—explains their evolution, compares their trade‑offs, and highlights the most promising research directions for 2025‑2026.

AI Alignment · GRPO · LLM
20 min read
Lao Guo's Learning Space
Apr 2, 2026 · Artificial Intelligence

Large Model Pretraining and Fine‑Tuning: A 2026 Technical Guide from Scaling Laws to Post‑Training Revolution

This article explains the full lifecycle of large language models in 2026, covering pretraining fundamentals, the limits of classic Scaling Laws, data‑centric advances, fine‑tuning strategies, RLHF, DPO, and the emerging post‑training methods GRPO, DAPO and RLVR, with concrete benchmarks and cost analyses.

DAPO · DPO · Fine-tuning
17 min read
Machine Learning Algorithms & Natural Language Processing
Mar 17, 2026 · Artificial Intelligence

MIT Study Shows Adding Noise to Large Models Can Replace GRPO/PPO Tuning

A new MIT paper reveals that pretrained large models already contain many hidden expert submodels, and that a simple one‑step Gaussian perturbation (RandOpt) can locate and ensemble these experts to achieve performance comparable to or better than traditional GRPO/PPO tuning, especially as model size grows.

GRPO · Model Scaling · PPO
9 min read
Machine Learning Algorithms & Natural Language Processing
Mar 1, 2026 · Artificial Intelligence

From Traditional RL to LLM RL: Theory Derivation and Practical Engineering Improvements

This article walks through the fundamental derivation of policy‑based reinforcement learning, explains how traditional RL concepts extend to large‑language‑model RL, and details engineering enhancements such as GRPO memory reduction, asynchronous rollout, importance‑sampling corrections, and token‑flow management for stable industrial‑scale training.

Asynchronous Rollout · GRPO · Importance Sampling
11 min read
Machine Learning Algorithms & Natural Language Processing
Feb 24, 2026 · Artificial Intelligence

From Traditional RL to LLM‑RL: Theory Derivation and Engineering Improvements

The article walks through the fundamentals of traditional policy‑gradient reinforcement learning, derives the REINFORCE objective, maps its concepts to large‑language‑model RL, and then discusses practical engineering solutions such as GRPO, async rollout, importance‑sampling corrections, and token‑flow management for industrial‑scale training.

Async Rollout · GRPO · Importance Sampling
10 min read
Baobao Algorithm Notes
Jan 24, 2026 · Artificial Intelligence

What Advances Do GRPO, DAPO, GSPO, and SAPO Bring Over PPO?

After DPO, the typical research trajectory moves through GRPO, DAPO, GSPO, and SAPO, each introducing new optimization objectives, sampling strategies, and reward‑shaping techniques that aim to reduce memory usage, improve gradient stability, and enhance the efficiency of large‑model reinforcement learning.

DAPO · GRPO · GSPO
6 min read
AI Engineering
Jan 10, 2026 · Artificial Intelligence

Teaching LLMs to Manage Memory Autonomously, Dropping Manual Rules

Alibaba's new AgeMem framework turns long‑term and short‑term memory management for large language model agents into a learnable reinforcement‑learning task, replacing handcrafted rules with a three‑stage training process and achieving significant benchmark gains.

AgeMem · GRPO · LLM
9 min read
Fun with Large Models
Dec 5, 2025 · Artificial Intelligence

DeepSeek Math V2 & V3.2: A Plain‑Language Deep Dive into Core Innovations

This article provides a detailed, easy‑to‑understand analysis of DeepSeek‑Math‑V2’s self‑verification training method and of DeepSeek‑V3.2’s GRPO framework, DeepSeek Sparse Attention (DSA) mechanism, massive agent data pipeline, and benchmark results that place both models among the world’s top open‑source large language models.

DeepSeek · GRPO · LLM
19 min read
Bighead's Algorithm Notes
Dec 4, 2025 · Artificial Intelligence

Paper Review: RETuning Boosts Large‑Model Stock Trend Prediction Reasoning

This article analyzes the RETuning framework, which addresses LLMs' bias toward analyst opinions and lack of evidence weighting in stock movement prediction by introducing a two‑stage cold‑start fine‑tuning and reinforcement learning pipeline, evaluating it on the large Fin‑2024 dataset and demonstrating significant F1 gains, inference‑time scaling, and out‑of‑distribution robustness.

Fin-2024 · GRPO · Inference Scaling
12 min read
ShiZhen AI
Nov 28, 2025 · Artificial Intelligence

DeepSeekMath‑V2 Scores 118/120 on Putnam and Achieves Gold‑Level IMO Performance

DeepSeekMath‑V2, released open‑source on 27 Nov 2025, attains gold‑level results on IMO 2025, scores 118 out of 120 on the Putnam 2024 competition, introduces a generator‑verifier self‑verification architecture, uses GRPO training, and outperforms leading closed‑source models on IMO‑ProofBench.

DeepSeekMath-V2 · GRPO · LLM
7 min read
360 Smart Cloud
Nov 14, 2025 · Artificial Intelligence

How TLM Platform Powers LLM Ops with PPO, GRPO and Reinforcement Evaluators

The article introduces the TLM large‑model development platform, details its fine‑tuning options, explains reinforcement learning fundamentals and key algorithms such as PPO and the newer GRPO, describes the architecture of a reinforcement evaluator, and shows how to configure RL training on the platform.

AI Platform · GRPO · LLMOps
10 min read
Zhuanzhuan Tech
Oct 29, 2025 · Artificial Intelligence

How Reinforcement Learning Boosts Stability and Speed in LLM QA Systems

This article examines how reinforcement‑learning techniques such as PPO, DPO, and GRPO are integrated into the Baixiaosheng QA system to improve answer stability, deepen domain knowledge understanding, and accelerate response generation, and it evaluates the impact of Reinforcement Fine‑Tuning (RFT) on real‑world performance.

AI · DPO · GRPO
16 min read
Data Thinking Notes
Oct 19, 2025 · Artificial Intelligence

How GSPO Improves Stability in Large Language Model Training

GSPO (Group Sequence Policy Optimization) is a reinforcement‑learning algorithm for LLMs that replaces token‑level GRPO with sequence‑level optimization, addressing instability in ultra‑large model training, especially for long‑sequence and MoE architectures, by aligning reward granularity and reducing variance.

GRPO · GSPO · large language models
11 min read
Fun with Large Models
Sep 24, 2025 · Artificial Intelligence

Interview Guide: Core Differences Between PPO and GRPO Algorithms for Large Model Fine‑Tuning

The article explains the fundamental principles of PPO and GRPO reinforcement‑learning algorithms, compares their architectures and training workflows, highlights why GRPO is gaining traction in large‑model fine‑tuning, discusses associated risks, and offers practical guidance on group size selection for engineers preparing for interviews.

GRPO · PPO · RLHF
9 min read
Data Party THU
Sep 19, 2025 · Artificial Intelligence

How DeepSeek R1 Redefines AI Reasoning with Pure Reinforcement Learning

DeepSeek R1 replaces traditional supervised fine‑tuning with a pure reinforcement‑learning pipeline, using the GRPO algorithm and a four‑stage training regime that dramatically lowers cost, boosts reasoning and code‑generation performance, and raises important ethical, privacy, and societal considerations for large language models.

AI reasoning · DeepSeek · GRPO
14 min read
DataFunTalk
Sep 18, 2025 · Artificial Intelligence

How DeepSeek‑R1’s Reinforcement Learning Earned a Nature Cover

DeepSeek‑R1, the first peer‑reviewed large language model, leveraged a pure reinforcement‑learning framework and the novel GRPO algorithm to achieve breakthrough reasoning performance, low training cost, and widespread acclaim, culminating in a Nature magazine cover story.

AI reasoning · DeepSeek · GRPO
14 min read
Sohu Tech Products
Sep 10, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

This article explains the GRPO algorithm, an improvement over PPO for large language model training that eliminates the value network, uses group‑relative advantage estimation, and offers flexible supervision, resulting in higher efficiency, stability, and performance on tasks such as mathematical reasoning.

AI Optimization · GRPO · LLM training
16 min read
Data Party THU
Sep 4, 2025 · Artificial Intelligence

Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive

This article provides a comprehensive technical analysis of PPO‑based reinforcement learning methods for large language models, detailing the evolution from the original PPO algorithm through GRPO, DAPO, and GSPO, and explaining their motivations, mathematical formulations, advantages, and practical challenges such as entropy collapse and importance‑sampling variance.

DAPO · GRPO · GSPO
30 min read
Sohu Tech Products
Sep 3, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF for Large Language Models

This article explains the motivation, mathematical foundations, implementation details, advantages, experimental results, and future directions of Group Relative Policy Optimization (GRPO), a novel reinforcement‑learning algorithm that replaces PPO’s value network with efficient group‑wise relative evaluation for large language models.

GRPO · LLM · PPO
17 min read
Network Intelligence Research Center (NIRC)
Aug 27, 2025 · Artificial Intelligence

Perception‑R1: A Rule‑Based RL Method that Elevates Multimodal Model Vision

Perception‑R1, a post‑training framework that applies rule‑based reinforcement learning to existing multimodal LLMs, dramatically improves visual perception tasks such as grounding, OCR, counting and object detection, as demonstrated by extensive benchmarks and ablation studies.

GRPO · Perception Policy · Reward Modeling
10 min read
Data Party THU
Aug 7, 2025 · Artificial Intelligence

Why GRPO Fails on Large LLMs and How GSPO Restores Training Stability

The paper identifies that GRPO’s token‑level importance weighting introduces high‑variance noise causing instability in large‑scale language model RL training, and proposes GSPO, a sequence‑level importance sampling method that aligns with reward definitions, improves gradient stability, and yields higher training efficiency and better performance across benchmarks.

GRPO · GSPO · RL
8 min read
Fun with Large Models
Jun 12, 2025 · Artificial Intelligence

Implement GRPO to Give LLMs Reasoning Ability with Qwen2.5‑0.5B

This article explains the GRPO reinforcement‑learning algorithm, shows its core idea of internal group competition without a separate evaluator model, and provides a complete, step‑by‑step code walkthrough—including environment setup, dataset preparation, reward‑function design, training configuration, and evaluation—using the Qwen2.5‑0.5B‑Instruct model on the GSM8K math dataset.

GRPO · GSM8K · Qwen2.5
23 min read
AI Algorithm Path
Apr 13, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization for LLM Training

The article explains GRPO, a reinforcement‑learning algorithm that extends PPO with group sampling, no value network, dual penalties and KL regularisation, showing how it improves efficiency and stability when fine‑tuning large language models such as DeepSeek‑Math and DeepSeek‑R1.

DeepSeek · GRPO · PPO
6 min read
Baobao Algorithm Notes
Mar 28, 2025 · Artificial Intelligence

Can Small 7B Models Beat the State‑of‑the‑Art? A Critical Analysis of R1‑Zero Training and Unbiased GRPO

This article critically examines R1‑Zero‑style training by analyzing foundation models and reinforcement learning, uncovering pre‑training and optimization biases, proposing an unbiased Dr. GRPO method, and demonstrating a minimalist 7B‑model recipe that achieves new state‑of‑the‑art performance on AIME 2024.

GRPO · LLM evaluation · R1-Zero
20 min read
Baobao Algorithm Notes
Mar 19, 2025 · Artificial Intelligence

Why Does GRPO Loss Start at Zero and Grow During OpenR1 Training?

The article explains why the GRPO loss in OpenR1 and trl starts at zero and then rises, detailing the underlying KL‑divergence formulation, the single‑step update mechanism, and how gradients are preserved despite a zero scalar loss, with code examples from the trl implementation.

GRPO · Loss Initialization · OpenR1
5 min read
Data Thinking Notes
Mar 16, 2025 · Artificial Intelligence

Why DeepSeek R1 Swaps PPO for GRPO: A Deep Dive into RLHF Alternatives

DeepSeek‑R1 replaces the traditional PPO‑based RLHF approach with GRPO, reducing reliance on human‑labeled data by using pure reinforcement learning environments and carefully designed reward mechanisms; the article explains reinforcement learning fundamentals, compares PPO, DPO and GRPO, and offers practical application recommendations.

AI Alignment · DPO · GRPO
14 min read
Architect
Mar 16, 2025 · Artificial Intelligence

Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning

This article walks through the complete lifecycle of building a small language model, covering token‑level inference, pre‑training, post‑training steps such as supervised fine‑tuning, reward‑model creation, and reinforcement‑learning methods like DPO, PPO and GRPO, culminating in a practical 0.5B model fine‑tuned for chain‑of‑thought reasoning.

GRPO · LLM training · Reward Modeling
22 min read
Baobao Algorithm Notes
Mar 16, 2025 · Artificial Intelligence

Can a 7B LLM Master Sudoku From Scratch Using Reinforcement Learning?

This article details how a 7B parameter language model, fine‑tuned with DeepSeek's GRPO reinforcement‑learning algorithm and a carefully crafted multi‑component reward system, learned to solve Sudoku puzzles without any cold‑start data, outperforming a comparable 3B model and revealing key insights for structured reasoning tasks.

AI training · GRPO · Qwen
15 min read
Architect
Mar 10, 2025 · Artificial Intelligence

What Makes DeepSeek’s New Architecture a Game‑Changer? Inside MLA, GRPO, and MoE Innovations

This article analyzes DeepSeek’s latest large‑model breakthroughs, covering the MLA attention compression, GRPO alignment algorithm, MoE load‑balancing redesign, multi‑stage training pipelines, reinforcement‑learning tricks, and performance comparisons with GPT‑4o‑Mini and Llama 3.1, highlighting both strengths and remaining challenges.

AI training · DeepSeek · GRPO
19 min read
DataFunTalk
Mar 2, 2025 · Artificial Intelligence

Implementing GRPO from Scratch with Distributed Reinforcement Learning on Qwen2.5-1.5B-Instruct

This tutorial explains how to build a distributed reinforcement‑learning pipeline using the GRPO algorithm, covering data preparation, evaluation and reward functions, multi‑GPU DataParallel implementation, and full fine‑tuning of the Qwen2.5‑1.5B‑Instruct model with PyTorch, FlashAttention2 and Weights & Biases.

AI · Distributed Training · GRPO
10 min read
Tencent Technical Engineering
Feb 24, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization in Reinforcement Learning and Large Language Models

The article reviews reinforcement-learning fundamentals and the progression from policy-gradient to PPO, then introduces Group Relative Policy Optimization (GRPO)—a critic-free method that normalizes rewards across multiple sampled outputs to compute group-relative advantages—and shows how DeepSeek-R1 leverages GRPO with rule-based rewards to achieve strong reasoning performance.

GRPO · PPO · Policy Optimization
16 min read
DevOps
Feb 23, 2025 · Artificial Intelligence

Understanding Reinforcement Learning, RLHF, PPO and GRPO for AI Applications

This article explains how DeepSeek‑R1‑Zero uses Group Relative Policy Optimization (GRPO) to improve reasoning without labeled data, introduces reinforcement learning from human feedback (RLHF) and its components, and compares the PPO and GRPO algorithms, highlighting the engineering scenarios each suits and their practical implications for AI applications.

AI model training · Deep Learning · GRPO
15 min read
Tencent Technical Engineering
Feb 21, 2025 · Artificial Intelligence

DeepSeek-R1: Enhancing Reasoning Capabilities in LLMs via Reinforcement Learning

DeepSeek‑R1 demonstrates that large‑scale reinforcement learning, especially with the novel Group Relative Policy Optimization and a rule‑based reward scheme, can markedly boost reasoning in LLMs without heavy supervised fine‑tuning, while a brief cold‑start SFT phase, two‑stage alignment, and knowledge distillation further improve performance and efficiency, despite remaining challenges such as language mixing.

DeepSeek-R1 · GRPO · LLM Reasoning
21 min read
Cognitive Technology Team
Feb 3, 2025 · Artificial Intelligence

DeepSeek R1 Introduces Group Relative Policy Optimization for Advanced Reasoning in Large Language Models

DeepSeek AI’s new open‑source model DeepSeek‑R1 leverages the Group Relative Policy Optimization (GRPO) reinforcement‑learning framework and multi‑stage training to dramatically boost complex reasoning performance, achieving AIME 2024 Pass@1 scores comparable to OpenAI’s o1 model.

AI · DeepSeek · GRPO
4 min read