Tagged articles
26 articles
Page 1 of 1
Machine Heart
Machine Heart
May 6, 2026 · Artificial Intelligence

PromptEcho: Leveraging Frozen Multimodal Models for High‑Quality Text‑to‑Image Rewards Without Labels

PromptEcho computes a continuous reward for text‑to‑image generation by measuring how well a frozen vision‑language model can reconstruct the original prompt from the generated image, eliminating the need for annotated data or a trained reward model and outperforming prior methods across multiple benchmarks.

BenchmarkPromptEchoReward Modeling
0 likes · 10 min read
PromptEcho: Leveraging Frozen Multimodal Models for High‑Quality Text‑to‑Image Rewards Without Labels
Machine Heart
Machine Heart
Mar 31, 2026 · Artificial Intelligence

Can LLM Judges Be Trusted? TrustJudge Leverages Full Probability Distributions

LLM judges often produce contradictory scores and non‑transitive preferences; the TrustJudge framework replaces discrete scoring with distribution‑sensitive scoring and likelihood‑aware aggregation, dramatically reducing both score‑comparison and pairwise‑transitivity inconsistencies across multiple model families, improving accuracy and even serving as a reward signal for RL training.

LLM evaluationReward ModelingTrustJudge
0 likes · 12 min read
Can LLM Judges Be Trusted? TrustJudge Leverages Full Probability Distributions
PaperAgent
PaperAgent
Mar 3, 2026 · Artificial Intelligence

How CharacterFlywheel Scales Engaging LLMs: 15 Iterations of Production Optimization

The article presents CharacterFlywheel, a 15‑generation flywheel methodology that iteratively improves social‑dialogue LLMs in production using data‑driven reward models, rejection sampling, and a mix of SFT, DPO, and RL, with detailed experiments and best‑practice insights.

AI SafetyLLM optimizationReward Modeling
0 likes · 12 min read
How CharacterFlywheel Scales Engaging LLMs: 15 Iterations of Production Optimization
DaTaobao Tech
DaTaobao Tech
Jan 30, 2026 · Artificial Intelligence

Human‑like LLM Replies for Live Digital Hosts: ASR‑Based Style Transfer and Reward Modeling

This article proposes an ASR‑driven pipeline that creates high‑quality AI‑reply vs. human‑like reply pairs, trains a rewrite model and a reward model, and uses GRPO reinforcement learning to generate natural, helpful, and less AI‑sounding responses in digital‑human live streaming, achieving 92% accuracy and 97% helpfulness while improving user experience.

ASR dataLLMQwen
0 likes · 20 min read
Human‑like LLM Replies for Live Digital Hosts: ASR‑Based Style Transfer and Reward Modeling
Wu Shixiong's Large Model Academy
Wu Shixiong's Large Model Academy
Dec 11, 2025 · Artificial Intelligence

Why Reward Models Need Reasoning: From Scalar Scores to RM‑R1

Interviewers increasingly ask why modern reward models must go beyond scalar scores to incorporate reasoning, and this article explains the limitations of traditional scalar reward models, the benefits of the RM‑R1 framework, and how reasoning‑based rewards improve alignment, stability, and task performance in large language model training.

AI AlignmentLLMRLHF
0 likes · 11 min read
Why Reward Models Need Reasoning: From Scalar Scores to RM‑R1
Kuaishou Tech
Kuaishou Tech
Nov 24, 2025 · Artificial Intelligence

How Human Feedback Supercharges Video Generation – The VideoAlign Pipeline Explained

This article details a new research pipeline that leverages large‑scale human preference data, a multi‑dimensional video reward model, and specialized alignment algorithms to dramatically improve video generation quality, motion fidelity, and text‑video consistency, with open‑source code and benchmarks for reproducibility.

AI AlignmentBenchmarkHuman Feedback
0 likes · 10 min read
How Human Feedback Supercharges Video Generation – The VideoAlign Pipeline Explained
Bilibili Tech
Bilibili Tech
Oct 31, 2025 · Artificial Intelligence

RIVAL: Adversarial RL Framework Elevates Conversational Subtitle Translation

RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization) introduces an adversarial game between a reward model and a translation LLM, combining qualitative preference rewards with quantitative metrics like BLEU, to overcome distribution shift in RLHF and achieve superior performance on conversational subtitle and WMT translation tasks.

BLEULLMReward Modeling
0 likes · 13 min read
RIVAL: Adversarial RL Framework Elevates Conversational Subtitle Translation
Baobao Algorithm Notes
Baobao Algorithm Notes
Oct 30, 2025 · Artificial Intelligence

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.

AILLMReward Modeling
0 likes · 11 min read
Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks
Baobao Algorithm Notes
Baobao Algorithm Notes
Sep 3, 2025 · Artificial Intelligence

How Atom-Searcher Boosts LLM Reasoning with Atomic Thought Rewards

Atom-Searcher introduces an atomic‑thought reinforcement‑learning framework that decomposes complex reasoning into fine‑grained units, uses a Reasoning Reward Model to assign step‑wise rewards, dynamically balances process and result incentives, and achieves state‑of‑the‑art performance on multiple LLM benchmarks.

Agentic ResearchAtomic ThoughtLLM
0 likes · 12 min read
How Atom-Searcher Boosts LLM Reasoning with Atomic Thought Rewards
Network Intelligence Research Center (NIRC)
Network Intelligence Research Center (NIRC)
Aug 27, 2025 · Artificial Intelligence

Perception‑R1: A Rule‑Based RL Method that Elevates Multimodal Model Vision

Perception‑R1, a post‑training framework that applies rule‑based reinforcement learning to existing multimodal LLMs, dramatically improves visual perception tasks such as grounding, OCR, counting and object detection, as demonstrated by extensive benchmarks and ablation studies.

GRPOPerception PolicyReward Modeling
0 likes · 10 min read
Perception‑R1: A Rule‑Based RL Method that Elevates Multimodal Model Vision
AIWalker
AIWalker
Jun 18, 2025 · Artificial Intelligence

Six New Directions for Large Language Models

Large language models are booming, and this article highlights six cutting‑edge research directions—LLM‑plus synthetic data, reward modeling, inference techniques, LLM‑as‑a‑Judge, safety alignment, and long‑context handling—each illustrated with recent papers, experimental results, and links to code repositories.

InferenceLLMReward Modeling
0 likes · 9 min read
Six New Directions for Large Language Models
Fighter's World
Fighter's World
Jun 14, 2025 · Artificial Intelligence

How Can LLMs Learn to “Think” in Complex Industry Scenarios?

The article analyzes how large language models can acquire true reasoning abilities for hard‑to‑score industry tasks by combining Chain‑of‑Thought prompting with reinforcement learning, addressing vague reward signals, reward hacking, and loyalty, and proposing a toolbox of reward engineering, synthetic data, hierarchical RL and multi‑agent collaboration.

LLMReward Modelingchain-of-thought
0 likes · 22 min read
How Can LLMs Learn to “Think” in Complex Industry Scenarios?
JD Cloud Developers
JD Cloud Developers
May 27, 2025 · Artificial Intelligence

How JD’s Young AI Engineers Tackle Real-World Model Challenges

Young JD algorithm engineers share how they solve tough AI problems—from optimizing large‑model training and reward‑model design for ad image generation, to building LLM‑based query expansion, agent evaluation, and model pruning with FFT and RDP—illustrating practical breakthroughs and personal growth in cutting‑edge AI research.

AIModel PruningReward Modeling
0 likes · 15 min read
How JD’s Young AI Engineers Tackle Real-World Model Challenges
Bilibili Tech
Bilibili Tech
May 20, 2025 · Artificial Intelligence

How AnimeReward and GAPO Transform Anime Video Generation with Human Feedback

Researchers at Bilibili present Index‑Anisora, an open‑source anime video generation framework that builds a 30k‑sample reward dataset, introduces the multi‑dimensional AnimeReward model and a Gap‑Aware Preference Optimization (GAPO) method, and demonstrate through extensive automatic and human evaluations that their approach significantly outperforms baseline video generators.

AIAlignmentGAPO
0 likes · 20 min read
How AnimeReward and GAPO Transform Anime Video Generation with Human Feedback
JD Retail Technology
JD Retail Technology
May 7, 2025 · Artificial Intelligence

Solving Technical Challenges with Large AI Models at JD Retail: Reward Modeling, Query Expansion, and Model Pruning

JD Retail’s engineering team tackles hard AI problems by replacing a monolithic reward model with specialized small models for ad‑image generation, deploying an LLM‑driven query‑expansion pipeline that lifts conversion rates, and pruning text‑to‑image transformers using FFT and RDP to boost throughput 40% without loss, while building comprehensive evaluation tools and a semantic smart‑assistant.

AIModel PruningReward Modeling
0 likes · 14 min read
Solving Technical Challenges with Large AI Models at JD Retail: Reward Modeling, Query Expansion, and Model Pruning
Architect
Architect
Mar 16, 2025 · Artificial Intelligence

Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning

This article walks through the complete lifecycle of building a small large‑language model, covering token‑level inference, pre‑training, post‑training steps such as supervised fine‑tuning, reward‑model creation, and reinforcement‑learning methods like DPO, PPO and GRPO, culminating in a practical 0.5B model fine‑tuned for chain‑of‑thought reasoning.

GRPOLLM trainingReward Modeling
0 likes · 22 min read
Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning
Architect
Architect
Feb 25, 2025 · Artificial Intelligence

DeepSeek R1: Multi‑Stage Reinforcement Learning, Reward Modeling, and Distillation for a High‑Performance LLM

DeepSeek R1 builds on the DeepSeek V3 base model using a multi‑stage reinforcement learning pipeline—including GRPO optimization, rule‑based reward modeling, supervised fine‑tuning, language‑consistency rewards, rejection sampling, and distillation—to produce a high‑performing, aligned LLM capable of accurate reasoning.

DeepSeekLLM trainingReward Modeling
0 likes · 24 min read
DeepSeek R1: Multi‑Stage Reinforcement Learning, Reward Modeling, and Distillation for a High‑Performance LLM
Baobao Algorithm Notes
Baobao Algorithm Notes
Oct 22, 2024 · Artificial Intelligence

Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls

This article analytically explores the implicit assumptions behind the RLHF optimization objective, examines how they limit DPO and PPO methods, and proposes practical improvements such as rejection sampling and online on‑policy data selection to narrow the gap between theory and practice.

AI AlignmentDPOPPO
0 likes · 22 min read
Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls
Baobao Algorithm Notes
Baobao Algorithm Notes
Oct 21, 2024 · Artificial Intelligence

Unraveling RLHF: From PPO to DPO and Beyond – A Comprehensive Guide

This article provides a thorough, four‑part overview of RLHF for large language models, covering preference‑optimization algorithms (PPO‑based and offline RL approaches), reward‑model training techniques, inference‑time exploration strategies, and practical implementation details including the OpenRLHF framework and resource‑allocation tricks.

DPOLLM optimizationOpenRLHF
0 likes · 27 min read
Unraveling RLHF: From PPO to DPO and Beyond – A Comprehensive Guide
NewBeeNLP
NewBeeNLP
Sep 5, 2024 · Artificial Intelligence

Why RLHF Is Irreplaceable: Uncovering the Limits of SFT

The article analyzes why supervised fine‑tuning (SFT) cannot replace reinforcement learning from human feedback (RLHF), highlighting SFT's lack of negative feedback and backward‑looking capability, and explains how RLHF’s reward model addresses these fundamental shortcomings.

RLHFReward ModelingSFT
0 likes · 7 min read
Why RLHF Is Irreplaceable: Uncovering the Limits of SFT
Baobao Algorithm Notes
Baobao Algorithm Notes
Oct 9, 2023 · Artificial Intelligence

Demystifying RLHF and PPO for Large Language Models: Theory and Practice

This article explains why Reinforcement Learning from Human Feedback (RLHF) is crucial for LLM intelligence, outlines the three-stage training pipeline, details InstructGPT's reward model and PPO optimization, and provides a practical guide to implementing RLHF with deep‑learning frameworks.

PPORLHFReward Modeling
0 likes · 17 min read
Demystifying RLHF and PPO for Large Language Models: Theory and Practice
Python Crawling & Data Mining
Python Crawling & Data Mining
Aug 20, 2023 · Artificial Intelligence

What Is RLHF? Benefits, Limits, and Design Tips for Human‑Feedback Reinforcement Learning

This article explains Reinforcement Learning with Human Feedback (RLHF), outlining its definition, suitable tasks, advantages over other reward‑model methods, types of algorithms, challenges of human feedback, and practical strategies to mitigate its limitations for building robust AI systems.

AI AlignmentHuman FeedbackReward Modeling
0 likes · 14 min read
What Is RLHF? Benefits, Limits, and Design Tips for Human‑Feedback Reinforcement Learning