Tagged articles

Reward Modeling

29 articles · Page 1 of 1

Jun 27, 2026 · Artificial Intelligence

Defining a Good Answer in the Agent Era: A Rubrics Survey

This survey examines how rubrics—structured, multi‑dimensional evaluation criteria—are defined, constructed, and applied to train and evaluate large language models, especially for open‑ended, high‑risk and agentic tasks, while highlighting current challenges such as reward hacking and bias.

AI safety · Agent · Evaluation

0 likes · 15 min read

PaperAgent

Jun 11, 2026 · Artificial Intelligence

Skill‑RM Shows More Resources Can Harm LLM Scoring – A Deep Dive into Alibaba’s New Evaluation Framework

The Skill‑RM paper reveals that simply appending evaluation resources can degrade large‑model scoring, while structuring those resources into a Reward‑Evaluation Skill boosts performance across benchmarks, best‑of‑N selection, and RL‑based instruction following.

Alibaba Qwen · Large Language Models · RLHF

0 likes · 7 min read

Data Party THU

Jun 5, 2026 · Artificial Intelligence

A 2026 Survey of LLM‑Focused RL: From PPO to DPO, GRPO, and Multi‑Agent RL

This article reviews the five‑year evolution of reinforcement‑learning techniques for large language models, comparing PPO, DPO, GRPO and emerging multi‑agent approaches, analyzing their reward signals, practical trade‑offs, and the open‑source frameworks that support them.

DPO · GRPO · LLM

0 likes · 34 min read

Machine Heart

May 6, 2026 · Artificial Intelligence

PromptEcho: Leveraging Frozen Multimodal Models for High‑Quality Text‑to‑Image Rewards Without Labels

PromptEcho computes a continuous reward for text‑to‑image generation by measuring how well a frozen vision‑language model can reconstruct the original prompt from the generated image, eliminating the need for annotated data or a trained reward model and outperforming prior methods across multiple benchmarks.

PromptEcho · Reward Modeling · benchmark

0 likes · 10 min read

PaperAgent

Apr 30, 2026 · Artificial Intelligence

Why Reinforcement Learning Is the Future: 2026 Top‑Conference RL Paper Collection

The article highlights the rapid rise of reinforcement learning across major 2026 conferences, curates 181 RL papers from eight top venues, and provides detailed summaries of innovative works such as MSRL and MedVR, offering free access to the papers and code.

Agentic RL · Multimodal AI · Reward Modeling

0 likes · 6 min read

Machine Heart

Mar 31, 2026 · Artificial Intelligence

Can LLM Judges Be Trusted? TrustJudge Leverages Full Probability Distributions

LLM judges often produce contradictory scores and non‑transitive preferences; the TrustJudge framework replaces discrete scoring with distribution‑sensitive scoring and likelihood‑aware aggregation, dramatically reducing both score‑comparison and pairwise‑transitivity inconsistencies across multiple model families, improving accuracy and even serving as a reward signal for RL training.

LLM evaluation · Reward Modeling · TrustJudge

0 likes · 12 min read

PaperAgent

Mar 3, 2026 · Artificial Intelligence

How CharacterFlywheel Scales Engaging LLMs: 15 Iterations of Production Optimization

The article presents CharacterFlywheel, a 15‑generation flywheel methodology that iteratively improves social‑dialogue LLMs in production using data‑driven reward models, rejection sampling, and a mix of SFT, DPO, and RL, with detailed experiments and best‑practice insights.

AI safety · LLM Optimization · Reward Modeling

0 likes · 12 min read

DaTaobao Tech

Jan 30, 2026 · Artificial Intelligence

Human‑like LLM Replies for Live Digital Hosts: ASR‑Based Style Transfer and Reward Modeling

This article proposes an ASR‑driven pipeline that creates high‑quality AI‑reply vs. human‑like reply pairs, trains a rewrite model and a reward model, and uses GRPO reinforcement learning to generate natural, helpful, and less AI‑sounding responses in digital‑human live streaming, achieving 92% accuracy and 97% helpfulness while improving user experience.

ASR data · LLM · Qwen

0 likes · 20 min read

Wu Shixiong's Large Model Academy

Dec 11, 2025 · Artificial Intelligence

Why Reward Models Need Reasoning: From Scalar Scores to RM‑R1

Interviewers increasingly ask why modern reward models must go beyond scalar scores to incorporate reasoning, and this article explains the limitations of traditional scalar reward models, the benefits of the RM‑R1 framework, and how reasoning‑based rewards improve alignment, stability, and task performance in large language model training.

AI alignment · LLM · RLHF

0 likes · 11 min read

Wu Shixiong's Large Model Academy

Dec 10, 2025 · Artificial Intelligence

Why RLHF Success Relies on Data Engineering, Not Just Model Tricks

The article explains that the real difficulty of RLHF lies in designing and curating high‑quality preference data, building robust reward models through bad‑case rewriting, human‑in‑the‑loop labeling, and inference‑based reward modeling, while algorithmic details like PPO are secondary concerns.

Data Engineering · GRPO · Large Language Models

0 likes · 9 min read

Kuaishou Tech

Nov 24, 2025 · Artificial Intelligence

How Human Feedback Supercharges Video Generation – The VideoAlign Pipeline Explained

This article details a new research pipeline that leverages large‑scale human preference data, a multi‑dimensional video reward model, and specialized alignment algorithms to dramatically improve video generation quality, motion fidelity, and text‑video consistency, with open‑source code and benchmarks for reproducibility.

AI alignment · Human Feedback · RLHF

0 likes · 10 min read

Bilibili Tech

Oct 31, 2025 · Artificial Intelligence

RIVAL: Adversarial RL Framework Elevates Conversational Subtitle Translation

RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization) introduces an adversarial game between a reward model and a translation LLM, combining qualitative preference rewards with quantitative metrics like BLEU, to overcome distribution shift in RLHF and achieve superior performance on conversational subtitle and WMT translation tasks.

BLEU · LLM · Machine Translation

0 likes · 13 min read

Baobao Algorithm Notes

Oct 30, 2025 · Artificial Intelligence

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.

AI · LLM · Reward Modeling

0 likes · 11 min read

Baobao Algorithm Notes

Sep 3, 2025 · Artificial Intelligence

How Atom-Searcher Boosts LLM Reasoning with Atomic Thought Rewards

Atom-Searcher introduces an atomic‑thought reinforcement‑learning framework that decomposes complex reasoning into fine‑grained units, uses a Reasoning Reward Model to assign step‑wise rewards, dynamically balances process and result incentives, and achieves state‑of‑the‑art performance on multiple LLM benchmarks.

Agentic Research · Atomic Thought · LLM

0 likes · 12 min read

Network Intelligence Research Center (NIRC)

Aug 27, 2025 · Artificial Intelligence

Perception‑R1: A Rule‑Based RL Method that Elevates Multimodal Model Vision

Perception‑R1, a post‑training framework that applies rule‑based reinforcement learning to existing multimodal LLMs, dramatically improves visual perception tasks such as grounding, OCR, counting and object detection, as demonstrated by extensive benchmarks and ablation studies.

GRPO · Perception Policy · Reward Modeling

0 likes · 10 min read

Java Tech Enthusiast

Jul 17, 2025 · Artificial Intelligence

How a Simple Colon Can Fool Top LLMs – The ‘Universal Key’ Vulnerability Exposed

Researchers discovered that trivial symbols such as a colon or the word “Solution” can trigger false‑positive rewards in LLM judge models, causing GPT‑4o, Claude‑4 and LLaMA‑3‑70B to fail, and proposed a robust “Master‑RM” model that eliminates these attacks.

AI robustness · LLM security · Reward Modeling

0 likes · 10 min read

AIWalker

Jun 18, 2025 · Artificial Intelligence

Six New Directions for Large Language Models

Large language models are booming, and this article highlights six cutting‑edge research directions—LLM‑plus synthetic data, reward modeling, inference techniques, LLM‑as‑a‑Judge, safety alignment, and long‑context handling—each illustrated with recent papers, experimental results, and links to code repositories.

LLM · Long Context · Reward Modeling

0 likes · 9 min read

Fighter's World

Jun 14, 2025 · Artificial Intelligence

How Can LLMs Learn to “Think” in Complex Industry Scenarios?

The article analyzes how large language models can acquire true reasoning abilities for hard‑to‑score industry tasks by combining Chain‑of‑Thought prompting with reinforcement learning, addressing vague reward signals, reward hacking, and loyalty, and proposing a toolbox of reward engineering, synthetic data, hierarchical RL and multi‑agent collaboration.

Chain-of-Thought · LLM · Multi-Agent Systems

0 likes · 22 min read

JD Cloud Developers

May 27, 2025 · Artificial Intelligence

How JD’s Young AI Engineers Tackle Real-World Model Challenges

Young JD algorithm engineers share how they solve tough AI problems—from optimizing large‑model training and reward‑model design for ad image generation, to building LLM‑based query expansion, agent evaluation, and model pruning with FFT and RDP—illustrating practical breakthroughs and personal growth in cutting‑edge AI research.

AI · Large Language Models · Model Pruning

0 likes · 15 min read

Bilibili Tech

May 20, 2025 · Artificial Intelligence

How AnimeReward and GAPO Transform Anime Video Generation with Human Feedback

Researchers at Bilibili present Index‑Anisora, an open‑source anime video generation framework that builds a 30k‑sample reward dataset, introduces the multi‑dimensional AnimeReward model and a Gap‑Aware Preference Optimization (GAPO) method, and demonstrate through extensive automatic and human evaluations that their approach significantly outperforms baseline video generators.

AI · GAPO · Human Feedback

0 likes · 20 min read

JD Retail Technology

May 7, 2025 · Artificial Intelligence

Solving Technical Challenges with Large AI Models at JD Retail: Reward Modeling, Query Expansion, and Model Pruning

JD Retail’s engineering team tackles hard AI problems by replacing a monolithic reward model with specialized small models for ad‑image generation, deploying an LLM‑driven query‑expansion pipeline that lifts conversion rates, and pruning text‑to‑image transformers using FFT and RDP to boost throughput by 40% without loss, while building comprehensive evaluation tools and a semantic smart‑assistant.

AI · Model Pruning · Query Expansion

0 likes · 14 min read

Architect

Mar 16, 2025 · Artificial Intelligence

Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning

This article walks through the complete lifecycle of building a small LLM, covering token‑level inference, pre‑training, post‑training steps such as supervised fine‑tuning, reward‑model creation, and reinforcement‑learning methods like DPO, PPO and GRPO, culminating in a practical 0.5B model fine‑tuned for chain‑of‑thought reasoning.

Chain-of-Thought · GRPO · LLM training

0 likes · 22 min read

Architect

Feb 25, 2025 · Artificial Intelligence

DeepSeek R1: Multi‑Stage Reinforcement Learning, Reward Modeling, and Distillation for a High‑Performance LLM

DeepSeek R1 builds on the DeepSeek V3 base model using a multi‑stage reinforcement learning pipeline—including GRPO optimization, rule‑based reward modeling, supervised fine‑tuning, language‑consistency rewards, rejection sampling, and distillation—to produce a high‑performing, aligned LLM capable of accurate reasoning.

DeepSeek · LLM training · Reward Modeling

0 likes · 24 min read

Baobao Algorithm Notes

Jan 8, 2025 · Artificial Intelligence

Inside Llama 3.1, DeepSeek‑V3, TÜLU 3 & Qwen 2.5: A Deep Dive into Post‑Training Techniques

This article compiles and analyzes the post‑training pipelines of Llama 3.1, DeepSeek‑V3, TÜLU 3 and Qwen 2.5, detailing their data compositions, SFT, reward modeling, DPO, GRPO, RLVR methods, hyper‑parameters, and practical tricks for large‑language‑model alignment.

DPO · DeepSeek-V3 · Llama3.1

0 likes · 22 min read

Baobao Algorithm Notes

Oct 22, 2024 · Artificial Intelligence

Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls

This article analytically explores the implicit assumptions behind the RLHF optimization objective, examines how they limit DPO and PPO methods, and proposes practical improvements such as rejection sampling and online on‑policy data selection to narrow the gap between theory and practice.

AI alignment · DPO · PPO

0 likes · 22 min read

Baobao Algorithm Notes

Oct 21, 2024 · Artificial Intelligence

Unraveling RLHF: From PPO to DPO and Beyond – A Comprehensive Guide

This article provides a thorough, four‑part overview of RLHF for large language models, covering preference‑optimization algorithms (PPO‑based and offline RL approaches), reward‑model training techniques, inference‑time exploration strategies, and practical implementation details including the OpenRLHF framework and resource‑allocation tricks.

DPO · LLM Optimization · OpenRLHF

0 likes · 27 min read

NewBeeNLP

Sep 5, 2024 · Artificial Intelligence

Why RLHF Is Irreplaceable: Uncovering the Limits of SFT

The article analyzes why supervised fine‑tuning (SFT) cannot replace reinforcement learning from human feedback (RLHF), highlighting SFT's lack of negative feedback and backward‑looking capability, and explains how RLHF’s reward model addresses these fundamental shortcomings.

Language Models · RLHF · Reward Modeling

0 likes · 7 min read

Baobao Algorithm Notes

Oct 9, 2023 · Artificial Intelligence

Demystifying RLHF and PPO for Large Language Models: Theory and Practice

This article explains why Reinforcement Learning from Human Feedback (RLHF) is crucial for LLM intelligence, outlines the three-stage training pipeline, details InstructGPT's reward model and PPO optimization, and provides a practical guide to implementing RLHF with deep‑learning frameworks.

Artificial Intelligence · Large Language Models · PPO

0 likes · 17 min read

Python Crawling & Data Mining

Aug 20, 2023 · Artificial Intelligence

What Is RLHF? Benefits, Limits, and Design Tips for Human‑Feedback Reinforcement Learning

This article explains Reinforcement Learning from Human Feedback (RLHF), outlining its definition, suitable tasks, advantages over other reward‑model methods, types of algorithms, challenges of human feedback, and practical strategies to mitigate its limitations for building robust AI systems.

AI alignment · Human Feedback · Reward Modeling

0 likes · 14 min read
