Tagged articles
12 articles
Machine Learning Algorithms & Natural Language Processing
May 20, 2026 · Artificial Intelligence

How 800 Data Points Halve LLM Chain‑of‑Thought Length and Boost Accuracy

The ICLR‑2026 paper introduces LCPO, a lightweight preference‑optimization technique that uses only 800 curated examples and 50 training steps to cut large‑model chain‑of‑thought generation length by about 50% while maintaining or even improving answer accuracy, dramatically reducing training and inference costs.

Chain-of-Thought · Efficient Inference · LCPO
0 likes · 8 min read
Machine Learning Algorithms & Natural Language Processing
May 14, 2026 · Artificial Intelligence

Turning Multi‑Teacher Conflict into Dynamic Constraints: Robust Reasoning Alignment for Multimodal LLMs (ICML 2026)

APO (Autonomous Preference Optimization) converts the drift and conflict among multiple teacher multimodal LLMs into dynamic negative constraints while treating consensus as a positive preference, enabling robust concept alignment and superior diagnostic accuracy on the CXR‑MAX benchmark, as demonstrated by extensive ICML‑2026 experiments.

APO · ICML 2026 · Preference Optimization
0 likes · 11 min read
AI Frontier Lectures
Jan 21, 2026 · Artificial Intelligence

How AP2O‑Coder Cuts LLM Code Errors by Up to 3% with Adaptive Preference Optimization

The paper introduces AP2O‑Coder, an adaptive progressive preference optimization framework that systematically captures error types, progressively refines LLM code generation, and dynamically adapts training data, achieving up to a 3% pass@k improvement across multiple open‑source models while reducing data requirements.

AP2O-Coder · LLM · Preference Optimization
0 likes · 11 min read
Kuaishou Tech
Dec 3, 2025 · Artificial Intelligence

Can Diffusion Models Be Their Own Reward Model? Latent Reward Modeling & Step-Level Preference Optimization

This article presents the Latent Reward Model (LRM) and Latent Preference Optimization (LPO), a paradigm that repurposes diffusion models as noise‑aware latent reward models for step‑level preference optimization. The approach addresses the shortcomings of pixel‑level reward models, introduces multi‑preference consistent filtering, and reports significant performance and efficiency gains on benchmarks such as PickScore and T2I‑CompBench++.

AI Alignment · Diffusion Models · Image Generation
0 likes · 9 min read
Meituan Technology Team
Jul 31, 2025 · Artificial Intelligence

8 Must-Read ACL 2025 Papers from Meituan: Generative Retrieval, Multimodal LLMs & More

Meituan’s research team showcases eight ACL 2025 papers spanning generative retrieval, multi‑objective preference alignment, rich‑text image understanding, cross‑language transfer, multimodal math reasoning, and more, offering insights and breakthroughs that can inspire and aid fellow researchers.

ACL 2025 · Code-Switching · Generative Retrieval
0 likes · 15 min read
DataFunSummit
Nov 28, 2024 · Artificial Intelligence

Generative Retrieval for E‑commerce Search: Lexical and SemanticID Approaches

This article presents a comprehensive study of generative retrieval for large‑scale e‑commerce search. It covers background challenges, the advantages of generative methods, two concrete strategies (Lexical‑based and SemanticID‑based), task redesign, preference optimization, constrained beam search, extensive experiments, and future research directions.

E-commerce Search · Generative Retrieval · Preference Optimization
0 likes · 21 min read
Bilibili Tech
Nov 5, 2024 · Artificial Intelligence

Bilibili's In-House Role-Playing Large Language Model: Architecture, Training Stages, Evaluation, and Demonstrations

Bilibili’s in‑house role‑playing large language model, built on the Index architecture and refined through pre‑training, supervised fine‑tuning, and preference optimization (PPO and DPO), achieved top scores on the Chinese CharacterEval benchmark, surpassing rivals while incorporating safety alignment and showcasing consistent, personality‑driven dialogue examples.

Content Safety · Preference Optimization · Supervised Fine‑Tuning
0 likes · 13 min read
Baobao Algorithm Notes
Sep 10, 2024 · Artificial Intelligence

How Direct Preference Optimization Simplifies LLM Alignment Without Reward Models

This article breaks down the mathematical derivation of Direct Preference Optimization (DPO), showing how it replaces the traditional RLHF‑PPO pipeline by directly training an alignment model from human preference data, eliminating the need for a separate reward model and simplifying the overall training process.

DPO · LLM alignment · Preference Optimization
0 likes · 17 min read
NewBeeNLP
Aug 7, 2024 · Artificial Intelligence

Can Intuitive Fine‑Tuning Replace Expensive RLHF and DPO for LLM Alignment?

This article analyses the shortcomings of current large language model training methods such as SFT, RLHF and DPO, explains why they incur high data and compute costs, and introduces Intuitive Fine‑Tuning (IFT) with temporal residual connections as a cheaper yet effective alternative that better aligns training objectives with real generation tasks.

DPO · Intuitive Fine-Tuning · LLM
0 likes · 15 min read
NewBeeNLP
May 13, 2024 · Artificial Intelligence

Why DPO Treats LLMs as Q‑Functions: A Deep Theoretical Dive

This article offers a detailed theoretical interpretation of the DPO algorithm, showing how large language models can be viewed as Q‑functions, unifying sequence‑wise and step‑wise decision perspectives, and discussing the resulting implications for reinforcement‑learning‑based alignment research.

DPO · LLM · Preference Optimization
0 likes · 14 min read