Machine Heart
Apr 8, 2026 · Artificial Intelligence

From a Single Image to a Physically Realistic 4D Video in One Minute

PhysGM, a method from Beijing Institute of Technology and Li Auto presented at CVPR 2026, turns a single static image into a high‑fidelity 4D video that obeys real‑world physics in under a minute. It combines a dual‑decoder transformer, DPO alignment, and the newly built 50k‑item PhysAssets dataset, outperforming prior methods in both speed and quality.

3D Gaussian Splatting · CVPR 2026 · Direct Preference Optimization
7 min read
Machine Learning Algorithms & Natural Language Processing
Feb 11, 2026 · Artificial Intelligence

Can TI‑DPO Fix DPO’s Blind Spot? Token‑Importance Guided Direct Preference Optimization for Better LLM Alignment

TI‑DPO introduces a hybrid token‑weighting scheme, combining gradient attribution with a Gaussian prior, together with a triplet‑loss objective. By pinpointing the tokens most critical to alignment, it delivers consistent gains over DPO, SimPO, and GRPO on Llama‑3 and Mistral‑7B across benchmarks such as IFEval, TruthfulQA, and HumanEval.

Direct Preference Optimization · Model Alignment · RLHF
8 min read
AIWalker
Mar 17, 2025 · Artificial Intelligence

How UNIFIEDREWARD Breaks Task Boundaries to Boost Image and Video Performance

The paper introduces UNIFIEDREWARD, the first unified reward model for multimodal understanding and generation. It supports both pairwise ranking and pointwise scoring, is trained on a newly built 236K human‑preference dataset spanning image and video tasks, and uses DPO to align VLMs and diffusion models, yielding significant gains on both image and video benchmarks.

Direct Preference Optimization · Image Generation · Preference Modeling
19 min read
AIWalker
Feb 4, 2025 · Artificial Intelligence

How Chain‑of‑Thought Boosts Text‑to‑Image Generation: The New o1 Inference Scheme

This article reviews a comprehensive study that applies Chain‑of‑Thought reasoning to autoregressive text‑to‑image generation, introducing extended test‑time computation, direct preference optimization, and two custom reward models (PARM and PARM++) that together improve generation quality by up to 15% over Stable Diffusion 3.

Direct Preference Optimization · Image Generation · Inference
13 min read
Baobao Algorithm Notes
Oct 15, 2024 · Artificial Intelligence

How DPO Simplifies RLHF: A Deep Dive into Direct Preference Optimization

This article breaks down how Direct Preference Optimization (DPO) mathematically reduces the two‑stage RLHF pipeline into a single‑stage SFT process, explains the underlying loss transformations, and discusses DPO's practical limitations and trade‑offs for large language model alignment.
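The single‑stage reduction described here boils down to one loss over preference pairs: the negative log‑sigmoid of the β‑scaled gap between the policy's and a frozen reference model's log‑probability margins. A minimal sketch (function name and toy log‑probabilities are illustrative, not from the article):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid of the beta-scaled difference between
    the policy's and the frozen reference model's log-prob margins."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy favors the chosen response more than the reference does,
# the margin is positive and the loss falls below log 2 (the zero-margin value).
loss = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

Because the loss depends only on log‑probabilities from the policy and a frozen reference, no separate reward model or RL loop is needed, which is exactly the simplification the article walks through.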

DPO · Direct Preference Optimization · RLHF
9 min read