Tagged articles

49 articles

May 10, 2026 · Artificial Intelligence

DeepSeek vs MCTS: Decoding the ‘Chicken & Liquor’ Dilemma in LLM Training

The article analyzes why DeepSeek’s large‑model training struggles with Monte‑Carlo Tree Search, explains its use of Chain‑of‑Thought prompting, GRPO entropy‑boosting and rejection‑sampling fine‑tuning, compares these methods with Google’s OmegaPRM and PRM approaches, and proposes a concrete MCTS‑driven data‑generation pipeline to overcome the “chicken and liquor” trade‑off.

DeepSeek · GRPO · Monte Carlo Tree Search

0 likes · 14 min read

DataFunSummit

May 4, 2026 · Artificial Intelligence

DeepSeek’s MCTS Failure: The ‘Roast Chicken and Baijiu’ Dilemma in LLM Training

The article examines why DeepSeek’s large‑model training cannot yet leverage Monte‑Carlo Tree Search, detailing its reliance on SFT, GRPO‑driven CoT activation and rejection‑sampling, contrasting this with Google’s PRM‑based approaches, and proposing an MCTS‑powered data‑generation pipeline to overcome the “roast chicken and baijiu” training dilemma.

GRPO · Monte Carlo Tree Search · Process Reward Model

0 likes · 14 min read

Machine Heart

May 1, 2026 · Artificial Intelligence

From PPO to MaxRL: The Evolution of Reinforcement Learning for LLM Inference

This article surveys the rapid evolution of reinforcement‑learning algorithms for large‑language‑model inference from early REINFORCE and PPO to newer approaches such as GRPO, RLOO, DAPO, CISPO, DPPO, ScaleRL and MaxRL, highlighting their design motivations, mathematical formulations, empirical trade‑offs and open research challenges.

GRPO · LLM · MaxRL

0 likes · 27 min read

Wu Shixiong's Large Model Academy

Apr 16, 2026 · Interview Experience

Turn Memorized Answers into Deep Understanding for Tech Interviews

This article explains why interviewers use seemingly rote questions to probe a candidate's true grasp of concepts, contrasts memorization with genuine understanding using PPO vs GRPO, and provides a practical three‑question framework and dialogue examples to help candidates answer technical principle questions confidently.

Answering Techniques · GRPO · PPO

0 likes · 12 min read

Data Party THU

Apr 12, 2026 · Artificial Intelligence

What’s Driving the Next Wave of LLM Post‑Training? A Deep Dive into SFT, RLHF, GRPO and Emerging Trends

This article systematically reviews the core post‑training techniques for large language models—including supervised fine‑tuning, RLHF, PPO, GRPO, DPO, RLVR and Agentic RL—explains their evolution, compares their trade‑offs, and highlights the most promising research directions for 2025‑2026.

AI Alignment · GRPO · LLM

0 likes · 20 min read

Lao Guo's Learning Space

Apr 2, 2026 · Artificial Intelligence

Large Model Pretraining and Fine‑Tuning: A 2026 Technical Guide from Scaling Laws to Post‑Training Revolution

This article explains the full lifecycle of large language models in 2026, covering pretraining fundamentals, the limits of classic Scaling Laws, data‑centric advances, fine‑tuning strategies, RLHF, DPO, and the emerging post‑training methods GRPO, DAPO and RLVR, with concrete benchmarks and cost analyses.

DAPO · DPO · Fine-tuning

0 likes · 17 min read

Machine Learning Algorithms & Natural Language Processing

Mar 28, 2026 · Artificial Intelligence

A Comprehensive Guide to LLM Post‑Training: From RLHF and GRPO to Agentic RL

This article systematically explains the post‑training pipeline for large language models, covering supervised fine‑tuning, RLHF, PPO, GRPO, RLVR, DPO and emerging Agentic RL, while illustrating each method with analogies, detailed workflows, tables, and recent research findings.

DPO · GRPO · LLM

0 likes · 24 min read

Machine Learning Algorithms & Natural Language Processing

Mar 17, 2026 · Artificial Intelligence

MIT Study Shows Adding Noise to Large Models Can Replace GRPO/PPO Tuning

A new MIT paper reveals that pretrained large models already contain many hidden expert submodels, and that a simple one‑step Gaussian perturbation (RandOpt) can locate and ensemble these experts to achieve performance comparable to or better than traditional GRPO/PPO tuning, especially as model size grows.

GRPO · Model Scaling · PPO

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

Mar 1, 2026 · Artificial Intelligence

From Traditional RL to LLM RL: Theory Derivation and Practical Engineering Improvements

This article walks through the fundamental derivation of policy‑based reinforcement learning, explains how traditional RL concepts extend to large‑language‑model RL, and details engineering enhancements such as GRPO memory reduction, asynchronous rollout, importance‑sampling corrections, and token‑flow management for stable industrial‑scale training.

Asynchronous Rollout · GRPO · Importance Sampling

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

Feb 24, 2026 · Artificial Intelligence

From Traditional RL to LLM‑RL: Theory Derivation and Engineering Improvements

The article walks through the fundamentals of traditional policy‑gradient reinforcement learning, derives the Reinforce objective, maps its concepts to large‑language‑model RL, and then discusses practical engineering solutions such as GRPO, async rollout, importance‑sampling corrections, and token‑flow management for industrial‑scale training.

Async Rollout · GRPO · Importance Sampling

0 likes · 10 min read

Baobao Algorithm Notes

Jan 24, 2026 · Artificial Intelligence

What Advances Do GRPO, DAPO, GSPO, and SAPO Bring Over PPO?

After DPO, the typical research trajectory moves through GRPO, DAPO, GSPO, and SAPO, each introducing new optimization objectives, sampling strategies, and reward‑shaping techniques that aim to reduce memory usage, improve gradient stability, and enhance the efficiency of large‑model reinforcement learning.

DAPO · GRPO · GSPO

0 likes · 6 min read

Baobao Algorithm Notes

Jan 16, 2026 · Artificial Intelligence

From PPO to SAPO: Evolution of Large‑Model Reinforcement Learning Algorithms

This article systematically reviews the main reinforcement‑learning algorithms—PPO, GRPO, DAPO, GSPO, and SAPO—used for fine‑tuning large language models, explaining why supervised fine‑tuning precedes RL, how each method improves training efficiency and stability, and what trade‑offs they entail.

GRPO · PPO · RL

0 likes · 15 min read

AI Engineering

Jan 10, 2026 · Artificial Intelligence

Teaching LLMs to Manage Memory Autonomously, Dropping Manual Rules

Alibaba's new AgeMem framework turns long‑term and short‑term memory management for large language model agents into a learnable reinforcement‑learning task, replacing handcrafted rules with a three‑stage training process and achieving significant benchmark gains.

AgeMem · GRPO · LLM

0 likes · 9 min read

Wu Shixiong's Large Model Academy

Dec 10, 2025 · Artificial Intelligence

Why RLHF Success Relies on Data Engineering, Not Just Model Tricks

The article explains that the real difficulty of RLHF lies in designing and curating high‑quality preference data, building robust reward models through bad‑case rewriting, human‑in‑the‑loop labeling, and inference‑based reward modeling, while algorithmic details like PPO are secondary concerns.

GRPO · RLHF · RM-R1

0 likes · 9 min read

Fun with Large Models

Dec 5, 2025 · Artificial Intelligence

DeepSeek Math V2 & V3.2: A Plain‑Language Deep Dive into Core Innovations

This article provides a detailed, easy‑to‑understand analysis of DeepSeek‑Math‑V2’s self‑verification training method and DeepSeek‑V3.2’s GRPO framework, sparse‑attention DSA mechanism, massive agent data pipeline, and benchmark results that place both models among the world’s top open‑source large language models.

DeepSeek · GRPO · LLM

0 likes · 19 min read

Bighead's Algorithm Notes

Dec 4, 2025 · Artificial Intelligence

Paper Review: RETuning Boosts Large‑Model Stock Trend Prediction Reasoning

This article analyzes the RETuning framework, which addresses LLMs' bias toward analyst opinions and lack of evidence weighting in stock movement prediction by introducing a two‑stage cold‑start fine‑tuning and reinforcement learning pipeline, evaluating it on the large Fin‑2024 dataset and demonstrating significant F1 gains, inference‑time scaling, and out‑of‑distribution robustness.

Fin-2024 · GRPO · Inference Scaling

0 likes · 12 min read

ShiZhen AI

Nov 28, 2025 · Artificial Intelligence

DeepSeekMath‑V2 Scores 118/120 on Putnam and Achieves Gold‑Level IMO Performance

DeepSeekMath‑V2, released open‑source on 27 Nov 2025, attains gold‑level results on IMO 2025, scores 118 out of 120 on the Putnam 2024 competition, introduces a generator‑verifier self‑verification architecture, uses GRPO training, and outperforms leading closed‑source models on IMO‑ProofBench.

DeepSeekMath-V2 · GRPO · LLM

0 likes · 7 min read

360 Smart Cloud

Nov 14, 2025 · Artificial Intelligence

How TLM Platform Powers LLM Ops with PPO, GRPO and Reinforcement Evaluators

The article introduces the TLM large‑model development platform, details its fine‑tuning options, explains reinforcement learning fundamentals and key algorithms such as PPO and the newer GRPO, describes the architecture of a reinforcement evaluator, and shows how to configure RL training on the platform.

AI Platform · GRPO · LLMOps

0 likes · 10 min read

Zhuanzhuan Tech

Oct 29, 2025 · Artificial Intelligence

How Reinforcement Learning Boosts Stability and Speed in LLM QA Systems

This article examines how reinforcement‑learning techniques such as PPO, DPO, and GRPO are integrated into the Baixiaosheng QA system to improve answer stability, deepen domain knowledge understanding, and accelerate response generation, and it evaluates the impact of Reinforcement Fine‑Tuning (RFT) on real‑world performance.

AI · DPO · GRPO

0 likes · 16 min read

Data Thinking Notes

Oct 19, 2025 · Artificial Intelligence

How GSPO Improves Stability in Large Language Model Training

GSPO (Group Sequence Policy Optimization) is a reinforcement‑learning algorithm for LLMs that replaces token‑level GRPO with sequence‑level optimization, addressing instability in ultra‑large model training, especially for long‑sequence and MoE architectures, by aligning reward granularity and reducing variance.

GRPO · GSPO · large language models

0 likes · 11 min read

Bilibili Tech

Oct 17, 2025 · Artificial Intelligence

How Bilibili’s Multimodal Team Won 2nd Place at ICCV MIPI with a Novel SFT+GRPO Strategy

This article details how Bilibili’s multimedia lab leveraged a multimodal training pipeline combining data‑compressed SFT and the GRPO reinforcement‑learning algorithm to achieve a 13.5% metric boost and secure second place in the ICCV MIPI Detailed Image Quality Assessment competition.

GRPO · MIPI competition · SFT

0 likes · 15 min read

DataFunTalk

Oct 5, 2025 · Artificial Intelligence

How Shopify Built a Production‑Ready AI Agent Platform and Avoided Common Pitfalls

Shopify’s engineering team explains how they transformed the Sidekick AI assistant from a simple tool‑calling system into a robust, production‑grade AI agent platform, sharing architectural, evaluation and training lessons to help others avoid common pitfalls.

AI agents · GRPO · Just-in-Time instructions

0 likes · 12 min read

DataFunTalk

Sep 30, 2025 · Artificial Intelligence

DeepSeek‑V3.2‑Exp Unveiled: Million‑Token Context, Sparse Attention, and Cost‑Effective Inference

DeepSeek‑V3.2‑Exp, the latest experimental large‑language model, is open‑sourced with a paper, featuring a million‑token context window, a new sparse attention mechanism, GRPO‑enhanced reasoning, and detailed cost‑analysis showing up to ten‑fold inference savings.

DeepSeek · GRPO · Inference Optimization

0 likes · 5 min read

Fun with Large Models

Sep 24, 2025 · Artificial Intelligence

Interview Guide: Core Differences Between PPO and GRPO Algorithms for Large Model Fine‑Tuning

The article explains the fundamental principles of PPO and GRPO reinforcement‑learning algorithms, compares their architectures and training workflows, highlights why GRPO is gaining traction in large‑model fine‑tuning, discusses associated risks, and offers practical guidance on group size selection for engineers preparing for interviews.

GRPO · PPO · RLHF

0 likes · 9 min read

Data Party THU

Sep 19, 2025 · Artificial Intelligence

How DeepSeek R1 Redefines AI Reasoning with Pure Reinforcement Learning

DeepSeek R1 replaces traditional supervised fine‑tuning with a pure reinforcement‑learning pipeline, introducing the GRPO algorithm and a four‑stage training regime that dramatically lowers cost, boosts reasoning and code‑generation performance, and raises important ethical, privacy, and societal considerations for large language models.

AI reasoning · DeepSeek · GRPO

0 likes · 14 min read

Data Party THU

Sep 18, 2025 · Artificial Intelligence

How DeepSeek‑R1’s Reinforcement Learning Redefined LLM Reasoning (Nature Cover Story)

DeepSeek‑R1, the first peer‑reviewed large language model, landed on Nature’s cover after a novel reinforcement‑learning‑only training pipeline that dramatically boosted reasoning performance while keeping training costs surprisingly low.

DeepSeek · GRPO · Model Training

0 likes · 14 min read

DataFunTalk

Sep 18, 2025 · Artificial Intelligence

How DeepSeek‑R1’s Reinforcement Learning Earned a Nature Cover

DeepSeek‑R1, the first peer‑reviewed large language model, leveraged a pure reinforcement‑learning framework and the novel GRPO algorithm to achieve breakthrough reasoning performance, low training cost, and widespread acclaim, culminating in a Nature magazine cover story.

AI reasoning · DeepSeek · GRPO

0 likes · 14 min read

Sohu Tech Products

Sep 10, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

This article explains the GRPO algorithm, an improvement over PPO for large language model training that eliminates the value network, uses group‑relative advantage estimation, and offers flexible supervision, resulting in higher efficiency, stability, and performance on tasks such as mathematical reasoning.

AI Optimization · GRPO · LLM training

0 likes · 16 min read

Data Party THU

Sep 4, 2025 · Artificial Intelligence

Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive

This article provides a comprehensive technical analysis of PPO‑based reinforcement learning methods for large language models, detailing the evolution from the original PPO algorithm through GRPO, DAPO, and GSPO, and explaining their motivations, mathematical formulations, advantages, and practical challenges such as entropy collapse and importance‑sampling variance.

DAPO · GRPO · GSPO

0 likes · 30 min read

Sohu Tech Products

Sep 3, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF for Large Language Models

This article explains the motivation, mathematical foundations, implementation details, advantages, experimental results, and future directions of Group Relative Policy Optimization (GRPO), a novel reinforcement‑learning algorithm that replaces PPO’s value network with efficient group‑wise relative evaluation for large language models.

GRPO · LLM · PPO

0 likes · 17 min read

Network Intelligence Research Center (NIRC)

Aug 27, 2025 · Artificial Intelligence

Perception‑R1: A Rule‑Based RL Method that Elevates Multimodal Model Vision

Perception‑R1, a post‑training framework that applies rule‑based reinforcement learning to existing multimodal LLMs, dramatically improves visual perception tasks such as grounding, OCR, counting and object detection, as demonstrated by extensive benchmarks and ablation studies.

GRPO · Perception Policy · Reward Modeling

0 likes · 10 min read

Data Party THU

Aug 7, 2025 · Artificial Intelligence

Why GRPO Fails on Large LLMs and How GSPO Restores Training Stability

The paper identifies that GRPO’s token‑level importance weighting introduces high‑variance noise causing instability in large‑scale language model RL training, and proposes GSPO, a sequence‑level importance sampling method that aligns with reward definitions, improves gradient stability, and yields higher training efficiency and better performance across benchmarks.

GRPO · GSPO · RL

0 likes · 8 min read

Alibaba Cloud Big Data AI Platform

Jul 16, 2025 · Artificial Intelligence

Master Post-Training: Fine-Tune LLMs with SFT, DPO, and GRPO on Alibaba PAI

This article explains post‑training concepts, compares SFT, DPO, and GRPO fine‑tuning methods, and provides step‑by‑step guidance for using Alibaba Cloud's PAI platform—including Model Gallery and DSW—to fine‑tune large language models with code examples and practical tips.

DPO · Fine-tuning · GRPO

0 likes · 14 min read

Fun with Large Models

Jun 12, 2025 · Artificial Intelligence

Implement GRPO to Give LLMs Reasoning Ability with Qwen2.5‑0.5B

This article explains the GRPO reinforcement‑learning algorithm, shows its core idea of internal group competition without a separate evaluator model, and provides a complete, step‑by‑step code walkthrough—including environment setup, dataset preparation, reward‑function design, training configuration, and evaluation—using the Qwen2.5‑0.5B‑Instruct model on the GSM8K math dataset.

GRPO · GSM8K · Qwen2.5

0 likes · 23 min read

AI Algorithm Path

Apr 13, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization for LLM Training

The article explains GRPO, a reinforcement‑learning algorithm that extends PPO with group sampling, no value network, dual penalties and KL regularisation, showing how it improves efficiency and stability when fine‑tuning large language models such as DeepSeek‑Math and DeepSeek‑R1.

DeepSeek · GRPO · PPO

0 likes · 6 min read

Baobao Algorithm Notes

Mar 28, 2025 · Artificial Intelligence

Can Small 7B Models Beat the State‑of‑the‑Art? A Critical Analysis of R1‑Zero Training and Unbiased GRPO

This article critically examines R1‑Zero‑style training by analyzing foundation models and reinforcement learning, uncovering pre‑training and optimization biases, proposing an unbiased Dr. GRPO method, and demonstrating a minimalist 7B‑model recipe that achieves new state‑of‑the‑art performance on AIME 2024.

GRPO · LLM evaluation · R1-Zero

0 likes · 20 min read

Baobao Algorithm Notes

Mar 27, 2025 · Artificial Intelligence

Why a Robust Training Pipeline Beats Fancy LLM Tricks – Lessons from DAPO

The article analyzes the DAPO technical report, showing how dynamic‑sampling pipelines and token‑level loss handling in SFT and RL training outperform ad‑hoc algorithm tricks, and compares the training dynamics of reinforce_baseline and GRPO with concrete code examples.

Dynamic Sampling · GRPO · LLM

0 likes · 8 min read

Baobao Algorithm Notes

Mar 19, 2025 · Artificial Intelligence

Why Does GRPO Loss Start at Zero and Grow During OpenR1 Training?

The article explains why the GRPO loss in OpenR1 and trl starts at zero and then rises, detailing the underlying KL‑divergence formulation, the single‑step update mechanism, and how gradients are preserved despite a zero scalar loss, with code examples from the trl implementation.

GRPO · Loss Initialization · OpenR1

0 likes · 5 min read

Architect

Mar 17, 2025 · Artificial Intelligence

Can a 7B Language Model Solve Sudoku with Reinforcement Learning? Findings and Lessons

This article details a reinforcement‑learning experiment that teaches 7B‑ and 3B‑parameter language models to solve Sudoku, covering data preparation, GRPO‑based reward design, training configurations, performance comparisons, key insights, and future research directions.

GRPO · Model Scaling · language models

0 likes · 15 min read

Data Thinking Notes

Mar 16, 2025 · Artificial Intelligence

Why DeepSeek R1 Swaps PPO for GRPO: A Deep Dive into RLHF Alternatives

DeepSeek‑R1 replaces the traditional PPO‑based RLHF approach with GRPO, reducing reliance on human‑labeled data by using pure reinforcement learning environments and carefully designed reward mechanisms; the article explains reinforcement learning fundamentals, compares PPO, DPO and GRPO, and offers practical application recommendations.

AI Alignment · DPO · GRPO

0 likes · 14 min read

Architect

Mar 16, 2025 · Artificial Intelligence

Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning

This article walks through the complete lifecycle of building a small large‑language model, covering token‑level inference, pre‑training, post‑training steps such as supervised fine‑tuning, reward‑model creation, and reinforcement‑learning methods like DPO, PPO and GRPO, culminating in a practical 0.5B model fine‑tuned for chain‑of‑thought reasoning.

GRPO · LLM training · Reward Modeling

0 likes · 22 min read

Baobao Algorithm Notes

Mar 16, 2025 · Artificial Intelligence

Can a 7B LLM Master Sudoku From Scratch Using Reinforcement Learning?

This article details how a 7B parameter language model, fine‑tuned with DeepSeek's GRPO reinforcement‑learning algorithm and a carefully crafted multi‑component reward system, learned to solve Sudoku puzzles without any cold‑start data, outperforming a comparable 3B model and revealing key insights for structured reasoning tasks.

AI training · GRPO · Qwen

0 likes · 15 min read

Architect

Mar 10, 2025 · Artificial Intelligence

What Makes DeepSeek’s New Architecture a Game‑Changer? Inside MLA, GRPO, and MoE Innovations

This article analyzes DeepSeek’s latest large‑model breakthroughs, covering the MLA attention compression, GRPO alignment algorithm, MoE load‑balancing redesign, multi‑stage training pipelines, reinforcement‑learning tricks, and performance comparisons with GPT‑4o‑Mini and Llama 3.1, highlighting both strengths and remaining challenges.

AI training · DeepSeek · GRPO

0 likes · 19 min read

DataFunTalk

Mar 2, 2025 · Artificial Intelligence

Implementing GRPO from Scratch with Distributed Reinforcement Learning on Qwen2.5-1.5B-Instruct

This tutorial explains how to build a distributed reinforcement‑learning pipeline using the GRPO algorithm, covering data preparation, evaluation and reward functions, multi‑GPU DataParallel implementation, and full fine‑tuning of the Qwen2.5‑1.5B‑Instruct model with PyTorch, FlashAttention2 and Weights & Biases.

AI · Distributed Training · GRPO

0 likes · 10 min read

Tencent Technical Engineering

Feb 24, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization in Reinforcement Learning and Large Language Models

The article reviews reinforcement-learning fundamentals and the progression from policy-gradient to PPO, then introduces Group Relative Policy Optimization (GRPO)—a critic-free method that normalizes rewards across multiple sampled outputs to compute group-relative advantages—and shows how DeepSeek-R1 leverages GRPO with rule-based rewards to achieve strong reasoning performance.

GRPO · PPO · Policy Optimization

0 likes · 16 min read

DevOps

Feb 23, 2025 · Artificial Intelligence

Understanding Reinforcement Learning, RLHF, PPO and GRPO for AI Applications

This article explains how DeepSeek‑R1‑Zero uses group‑relative policy optimization (GRPO) to enhance reasoning without labeled data, introduces reinforcement learning with human feedback (RLHF) and its components, and compares the PPO and GRPO algorithms, highlighting their suitable engineering scenarios and practical implications for AI applications.

AI model training · Deep Learning · GRPO

0 likes · 15 min read

Tencent Technical Engineering

Feb 21, 2025 · Artificial Intelligence

DeepSeek-R1: Enhancing Reasoning Capabilities in LLMs via Reinforcement Learning

DeepSeek‑R1 demonstrates that large‑scale reinforcement learning, especially with the novel Group Relative Policy Optimization and a rule‑based reward scheme, can markedly boost reasoning in LLMs without heavy supervised fine‑tuning, while a brief cold‑start SFT phase, two‑stage alignment, and knowledge distillation further improve performance and efficiency, despite remaining challenges such as language mixing.

DeepSeek-R1 · GRPO · LLM Reasoning

0 likes · 21 min read

Big Data Technology Architecture

Feb 9, 2025 · Artificial Intelligence

Reproducing DeepSeek R1 Reasoning Ability with GRPO on Qwen2.5‑7B in Colab

This article explains how to replicate DeepSeek R1's slow‑thinking inference using the GRPO reinforcement‑learning algorithm on the Qwen2.5‑7B model in a free Colab notebook, covering the underlying CoT concept, reward‑function design, data preparation, training configuration, and observed results.

Fine-tuning · GRPO · LLM

0 likes · 14 min read

Cognitive Technology Team

Feb 3, 2025 · Artificial Intelligence

DeepSeek R1 Introduces Group Relative Policy Optimization for Advanced Reasoning in Large Language Models

DeepSeek AI’s new open‑source model DeepSeek‑R1 leverages a novel Group Relative Policy Optimization (GRPO) reinforcement‑learning framework and multi‑stage training to dramatically boost complex reasoning performance, achieving AIME 2024 Pass@1 scores comparable to OpenAI’s o1 model.

AI · DeepSeek · GRPO

0 likes · 4 min read
