Tagged articles

Preference Optimization

14 articles · Page 1 of 1

Machine Learning Algorithms & Natural Language Processing

Jun 21, 2026 · Artificial Intelligence

Rank‑Only Rewards Accelerate One‑Step Text‑to‑Image Preference Optimization 3.5×

DrPO introduces a drifting‑field‑based, rank‑only reward mechanism for one‑step text‑to‑image models that enables reinforcement‑learning post‑training without back‑propagating reward gradients. It trains 3.51× faster than DRaFT, works with non‑differentiable rewards, and improves generation quality on SD‑Turbo and SDXL‑Turbo.

DrPO · Drifting Model · HPSv3

0 likes · 11 min read

Data Party THU

May 30, 2026 · Artificial Intelligence

How USTC’s Tiny LCPO Training Cuts Large Model Overthinking in Half

The paper introduces LCPO, a lightweight preference‑optimization technique that uses only 800 training examples and 50 steps to teach large language models to produce concise, accurate answers, halving inference length while often improving accuracy and reducing training cost by up to two orders of magnitude.

Efficient Inference · LCPO · Large Language Models

0 likes · 8 min read

Machine Learning Algorithms & Natural Language Processing

May 20, 2026 · Artificial Intelligence

How 800 Data Points Halve LLM Chain‑of‑Thought Length and Boost Accuracy

The ICLR‑2026 paper introduces LCPO, a lightweight preference‑optimization technique that uses only 800 curated examples and 50 training steps to cut large‑model chain‑of‑thought generation length by about 50% while maintaining or even improving answer accuracy, dramatically reducing training and inference costs.

Chain-of-Thought · Efficient Inference · LCPO

0 likes · 8 min read

Machine Learning Algorithms & Natural Language Processing

May 14, 2026 · Artificial Intelligence

Turning Multi‑Teacher Conflict into Dynamic Constraints: Robust Reasoning Alignment for Multimodal LLMs (ICML 2026)

APO (Autonomous Preference Optimization) converts the drift and conflict among multiple teacher multimodal LLMs into dynamic negative constraints while treating consensus as a positive preference, enabling robust concept alignment and superior diagnostic accuracy on the CXR‑MAX benchmark, as demonstrated by extensive ICML‑2026 experiments.

APO · ICML 2026 · Knowledge Distillation

0 likes · 11 min read

AI Frontier Lectures

Jan 21, 2026 · Artificial Intelligence

How AP2O‑Coder Cuts LLM Code Errors by Up to 3% with Adaptive Preference Optimization

The paper introduces AP2O‑Coder, an adaptive progressive preference optimization framework that systematically captures error types, progressively refines LLM code generation, and dynamically adapts training data, achieving up to a 3% pass@k improvement across multiple open‑source models while reducing data requirements.

AP2O-Coder · LLM · Preference Optimization

0 likes · 11 min read

Tencent Advertising Technology

Dec 4, 2025 · Artificial Intelligence

How POPEN Boosts LVLM Reasoning Segmentation with Preference Optimization and Ensemble

The paper introduces POPEN, a framework that combines preference‑based optimization with ensemble methods to reduce hallucinations and improve segmentation accuracy in large vision‑language models (LVLMs), achieving state‑of‑the‑art results on multiple benchmarks.

LVLM · Preference Optimization · Segmentation

0 likes · 14 min read

Kuaishou Tech

Dec 3, 2025 · Artificial Intelligence

Can Diffusion Models Be Their Own Reward Model? Latent Reward Modeling & Step-Level Preference Optimization

This article presents a new paradigm, the Latent Reward Model (LRM) with Latent Preference Optimization (LPO), that repurposes diffusion models as noise‑aware latent reward models for step‑level preference optimization. It addresses the shortcomings of pixel‑level reward models, introduces multi‑preference consistent filtering, and demonstrates significant performance and efficiency gains on benchmarks such as PickScore and T2I‑CompBench++.

AI alignment · Diffusion Models · Preference Optimization

0 likes · 9 min read

Meituan Technology Team

Jul 31, 2025 · Artificial Intelligence

8 Must-Read ACL 2025 Papers from Meituan: Generative Retrieval, Multimodal LLMs & More

Meituan’s research team showcases eight ACL 2025 papers spanning generative retrieval, multi‑objective preference alignment, rich‑text image understanding, cross‑language transfer, multimodal math reasoning, and more, offering insights and breakthroughs that can inspire and aid fellow researchers.

ACL 2025 · Code-Switching · Generative Retrieval

0 likes · 15 min read

JD Tech Talk

Mar 13, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation with Multimodal Large Language Models

This paper proposes CAIG, a novel method for generating high-CTR advertising images using multimodal large language models, combining reinforcement learning and preference optimization to align generated content with product features.

CTR Prediction · Multimodal Large Language Models · Preference Optimization

0 likes · 10 min read

DataFunSummit

Nov 28, 2024 · Artificial Intelligence

Generative Retrieval for E‑commerce Search: Lexical and SemanticID Approaches

This article presents a comprehensive study of generative retrieval for large‑scale e‑commerce search, detailing background challenges, the advantages of generative methods, two concrete strategies—Lexical‑based and SemanticID‑based—along with task redesign, preference optimization, constrained beam search, extensive experiments, and future research directions.

E-commerce Search · Generative Retrieval · Preference Optimization

0 likes · 21 min read

Bilibili Tech

Nov 5, 2024 · Artificial Intelligence

Bilibili's In-House Role-Playing Large Language Model: Architecture, Training Stages, Evaluation, and Demonstrations

Bilibili’s in‑house role‑playing large language model, built on the Index architecture and refined through pre‑training, supervised fine‑tuning, and preference optimization (PPO and DPO), achieved top scores on the Chinese CharacterEval benchmark, surpassing rivals while incorporating safety alignment and showcasing consistent, personality‑driven dialogue examples.

Content Safety · Preference Optimization · Supervised Fine‑Tuning

0 likes · 13 min read

Baobao Algorithm Notes

Sep 10, 2024 · Artificial Intelligence

How Direct Preference Optimization Simplifies LLM Alignment Without Reward Models

This article breaks down the mathematical derivation of Direct Preference Optimization (DPO), showing how it replaces the traditional RLHF‑PPO pipeline by directly training an alignment model from human preference data, eliminating the need for a separate reward model and simplifying the overall training process.

DPO · LLM alignment · Preference Optimization

0 likes · 17 min read
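
As a rough illustration of the objective that summary refers to, the sketch below computes the standard per-pair DPO loss, -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), from four sequence log-probabilities. It is a minimal sketch, not code from the article: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    All names and the default beta are illustrative assumptions.
    """
    # Log-ratio of the trained policy to the frozen reference model
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Implicit reward margin between preferred and rejected responses
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

No reward model appears anywhere in the computation: the preference signal enters only through which response is labeled chosen and which rejected. When the policy equals the reference, the margin is zero and the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.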

NewBeeNLP

Aug 7, 2024 · Artificial Intelligence

Can Intuitive Fine‑Tuning Replace Expensive RLHF and DPO for LLM Alignment?

This article analyses the shortcomings of current large language model training methods such as SFT, RLHF and DPO, explains why they incur high data and compute costs, and introduces Intuitive Fine‑Tuning (IFT) with temporal residual connections as a cheaper yet effective alternative that better aligns training objectives with real generation tasks.

DPO · Intuitive Fine-Tuning · LLM

0 likes · 15 min read

NewBeeNLP

May 13, 2024 · Artificial Intelligence

Why DPO Treats LLMs as Q‑Functions: A Deep Theoretical Dive

This article offers a detailed theoretical interpretation of the DPO algorithm, showing how large language models can be viewed as Q‑functions, unifying sequence‑wise and step‑wise decision perspectives, and discussing the resulting implications for reinforcement‑learning‑based alignment research.

DPO · LLM · Preference Optimization

0 likes · 14 min read
