Tagged articles

preference learning

3 articles

Jun 20, 2026 · Artificial Intelligence

DrPO: Ranking‑Only Rewards Boost One‑Step Text‑to‑Image Preference Optimization by 3.51×

DrPO introduces a ranking‑only reward that builds a drift field from on‑policy image samples to fine‑tune one‑step text‑to‑image models, achieving up to 3.51× faster training on large multimodal rewards, supporting non‑differentiable signals, and demonstrating superior quality across multiple benchmarks.

Drifting Preference Optimization · drift field · non-differentiable reward

14 min read


DataFunSummit

Mar 30, 2025 · Artificial Intelligence

RLHF Techniques and Challenges in Large Language Models and Multimodal Applications

This article reviews reinforcement learning, RLHF, and related alignment techniques for large language models and multimodal systems. It covers the fundamentals; recent advances such as InstructGPT, Constitutional AI, RLAIF, Super Alignment, GPT‑4o, and video LLMs; and experimental evaluations of the proposed methods.

RLHF · multimodal alignment · preference learning

26 min read


Baobao Algorithm Notes

Jul 9, 2024 · Artificial Intelligence

Why Step-Level DPO Is Revolutionizing LLM Math Reasoning

This article reviews recent step‑level DPO research, compares it with instance‑level DPO, explains the underlying Monte Carlo Tree Search formulation, and presents the author’s own replication experiments that demonstrate consistent performance gains across multiple LLM sizes on GSM8K and MATH benchmarks.

AI research · LLM alignment · MCTS

10 min read
