Tagged articles

DPO

35 articles

Jun 23, 2026 · Artificial Intelligence

How User Memory Skews LLM Emotional Reasoning: Insights from Amazon’s ACL Paper

A recent ACL paper from Amazon reveals that injecting user memory into large language models causes significant performance drops and fairness biases, favoring privileged personas across demographics, but shows that targeted DPO fine‑tuning can mitigate these effects.

Amazon · DPO · LLM

0 likes · 10 min read

Data Party THU

Jun 14, 2026 · Artificial Intelligence

Understanding Large‑Model Reinforcement Learning: Algorithms, Frameworks, and Emerging Trends

This article surveys five years of large‑model reinforcement learning, detailing the evolution from PPO + RLHF to DPO and GRPO, comparing reward‑model‑based and verifiable‑reward approaches, discussing multi‑agent extensions, and evaluating open‑source frameworks for training LLM‑driven agents.

AI alignment · DPO · GRPO

0 likes · 34 min read

Data Party THU

Jun 5, 2026 · Artificial Intelligence

A 2026 Survey of LLM‑Focused RL: From PPO to DPO, GRPO, and Multi‑Agent RL

This article reviews the five‑year evolution of reinforcement‑learning techniques for large language models, comparing PPO, DPO, GRPO and emerging multi‑agent approaches, analyzing their reward signals, practical trade‑offs, and the open‑source frameworks that support them.

DPO · GRPO · LLM

0 likes · 34 min read

DeepHub IMBA

May 19, 2026 · Artificial Intelligence

A 2026 Survey of LLM‑Focused RL: From PPO to DPO, GRPO, and Multi‑Agent RL

The article reviews five years of LLM‑centric reinforcement learning, tracing the evolution from early Q‑learning to PPO, then to Direct Preference Optimization, Group Relative Policy Optimization, and finally multi‑agent RL, detailing each method’s mechanics, strengths, failure modes, practical considerations, and emerging open‑source toolchains.

DPO · GRPO · LLM alignment

0 likes · 33 min read

PaperAgent

May 6, 2026 · Artificial Intelligence

How to Detect Introspective Awareness in LLMs – Boosting Detection Rates by 53% and 75%

Anthropic and MIT researchers reveal that large language models can sense injected steering vectors, a capability that emerges during post‑training (especially DPO), and they present a two‑stage detection circuit whose performance improves by up to 75% when reject directions are ablated or bias vectors are trained.

Circuit Analysis · DPO · Introspective Awareness

0 likes · 15 min read

Weekly Large Model Application

May 5, 2026 · Artificial Intelligence

Understanding Preference Alignment: Why Voice Output Needs an Extra Layer

The article explains that after task alignment, teams can produce functional demos, but true competitiveness requires preference alignment—optimizing for human comfort across dimensions like brevity, tone, and safety—and discusses how RLHF and DPO address this, especially the additional challenges of generating natural, responsive voice output.

AI alignment · DPO · Human Feedback

0 likes · 7 min read

Lao Guo's Learning Space

Apr 2, 2026 · Artificial Intelligence

Large Model Pretraining and Fine‑Tuning: A 2026 Technical Guide from Scaling Laws to Post‑Training Revolution

This article explains the full lifecycle of large language models in 2026, covering pretraining fundamentals, the limits of classic Scaling Laws, data‑centric advances, fine‑tuning strategies, RLHF, DPO, and the emerging post‑training methods GRPO, DAPO and RLVR, with concrete benchmarks and cost analyses.

DAPO · DPO · GRPO

0 likes · 17 min read

Machine Learning Algorithms & Natural Language Processing

Mar 28, 2026 · Artificial Intelligence

A Comprehensive Guide to LLM Post‑Training: From RLHF and GRPO to Agentic RL

This article systematically explains the post‑training pipeline for large language models, covering supervised fine‑tuning, RLHF, PPO, GRPO, RLVR, DPO and emerging Agentic RL, while illustrating each method with analogies, detailed workflows, tables, and recent research findings.

Agentic RL · DPO · GRPO

0 likes · 24 min read

Baidu Intelligent Cloud Tech Hub

Jan 27, 2026 · Artificial Intelligence

Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide

This guide walks through setting up a Kunlun P800 XPU host, preparing Docker containers, deploying Qwen3‑8B/‑32B/‑VL models with vLLM‑Kunlun, benchmarking performance, and running full‑parameter DPO training using LLaMA‑Factory, providing scripts, configuration files, and troubleshooting tips for AI engineers.

DPO · Kunlun P800 · LLaMA-Factory

0 likes · 32 min read

Amap Tech

Nov 19, 2025 · Artificial Intelligence

How Gaode’s Spacetime‑GR Model Boosts POI Recommendation with AI‑Powered SFT and DPO

Gaode transforms its map app into a dynamic, AI‑driven “living map” by fine‑tuning the large Spacetime‑GR model through embedding‑based and generative ranking SFT, DPO alignment, and multimodal augmentation, achieving significant offline CTR‑AUC improvements and online CTR gains in POI recommendation.

AI recommendation · DPO · Multimodal

0 likes · 12 min read

DataFunSummit

Nov 3, 2025 · Artificial Intelligence

Boosting Private Agentic AI: LLM Post‑Training, DPO, and End‑to‑End Evaluation

This article shares practical experience on deploying private Agentic AI, covering background, architecture design, challenges, data generation, reinforcement learning with DPO, automated multi‑dimensional evaluation, and future plans for open‑source models and richer tool integration.

Agentic AI · DPO · LLM fine-tuning

0 likes · 16 min read

Zhuanzhuan Tech

Oct 29, 2025 · Artificial Intelligence

How Reinforcement Learning Boosts Stability and Speed in LLM QA Systems

This article examines how reinforcement‑learning techniques such as PPO, DPO, and GRPO are integrated into the Baixiaosheng QA system to improve answer stability, deepen domain knowledge understanding, and accelerate response generation, and it evaluates the impact of Reinforcement Fine‑Tuning (RFT) on real‑world performance.

AI · DPO · GRPO

0 likes · 16 min read

Wu Shixiong's Large Model Academy

Aug 26, 2025 · Artificial Intelligence

Mastering RLHF, DPO, and KTO: A Complete Guide to Human‑Feedback Alignment Techniques

This comprehensive guide explains the full RLHF training pipeline, the mathematical foundations of reward modeling and PPO, and introduces DPO and KTO algorithms—including their implementations, advantages, limitations, and practical tuning strategies—for building aligned large language models.

DPO · Human Feedback · KTO

0 likes · 32 min read

Alibaba Cloud Big Data AI Platform

Jul 23, 2025 · Artificial Intelligence

How to Distill Large Language Models for Efficient Text Generation with EasyDistill

This guide explains how to use the EasyDistill framework and Alibaba Cloud PAI to distill large language models for high‑quality text generation, covering model deployment, SFT and DPO training data construction, code examples, configuration files, and best practices for achieving resource‑efficient, high‑performance student models.

DPO · EasyDistill · PAI

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Jul 16, 2025 · Artificial Intelligence

Master Post-Training: Fine-Tune LLMs with SFT, DPO, and GRPO on Alibaba PAI

This article explains post‑training concepts, compares SFT, DPO, and GRPO fine‑tuning methods, and provides step‑by‑step guidance for using Alibaba Cloud's PAI platform—including Model Gallery and DSW—to fine‑tune large language models with code examples and practical tips.

DPO · GRPO · LLM

0 likes · 14 min read

DataFunTalk

Jun 29, 2025 · Artificial Intelligence

Large Models Boost Douyin User Experience: Expert Insights

In an interview at the DA Digital Intelligence Conference, ByteDance AI specialist Cai Conghuai explains how large language models, combined with techniques like SFT, DPO, and RAG, are reshaping Douyin's user‑experience signal detection, root‑cause analysis, and evaluation, while outlining future AI‑agent breakthroughs.

AI · DPO · Multimodal

0 likes · 12 min read

DataFunSummit

Jun 22, 2025 · Artificial Intelligence

How Vivo’s BlueHeart AI Assistant Optimizes Post‑Conversation Recommendations with LLMs

In a detailed interview, Vivo AI engineer Liang Tianan explains how the BlueHeart Small V assistant leverages large language models, multi‑stage recall, ranking, and reward‑model fine‑tuning (SFT/DPO) to generate high‑quality, diverse post‑dialogue recommendation items while balancing latency, cost, and evaluation challenges.

DPO · LLM · SFT

0 likes · 15 min read

DaTaobao Tech

Jun 4, 2025 · Artificial Intelligence

Understanding Large Language Model Architecture, Parameters, Memory, Storage, and Fine‑Tuning Techniques

This article provides a comprehensive overview of large language models (LLMs), covering their transformer architecture, parameter counts, GPU memory and storage requirements, and detailed fine‑tuning methods such as prompt engineering, data construction, LoRA, PEFT, RLHF, and DPO, along with practical deployment and inference acceleration strategies.

DPO · LLM · LoRA

0 likes · 17 min read

Network Intelligence Research Center (NIRC)

Apr 7, 2025 · Artificial Intelligence

Getting Started with Hugging Face TRL: Fine‑tune LLaVA using DPO

This guide introduces Hugging Face's TRL library, explains how to install it alongside Transformers, and walks through modifying LLaVA's trainer, dataset, and data collator to apply the DPO reinforcement‑learning algorithm for multimodal model fine‑tuning.

DPO · Hugging Face · LLaVA

0 likes · 4 min read

Data Thinking Notes

Mar 16, 2025 · Artificial Intelligence

Why DeepSeek R1 Swaps PPO for GRPO: A Deep Dive into RLHF Alternatives

DeepSeek‑R1 replaces the traditional PPO‑based RLHF approach with GRPO, reducing reliance on human‑labeled data by using pure reinforcement learning environments and carefully designed reward mechanisms; the article explains reinforcement learning fundamentals, compares PPO, DPO and GRPO, and offers practical application recommendations.

AI alignment · DPO · GRPO

0 likes · 14 min read

JD Retail Technology

Feb 28, 2025 · Artificial Intelligence

Generative Recommendation with DPO Alignment for JD Alliance Advertising: Multi‑Objective Optimization and Online Results

The paper presents a generative recommendation framework for JD Alliance advertising that combines semantic‑ID modeling, large‑model pre‑training and fine‑tuning, and Direct Preference Optimization (including Softmax‑DPO and β‑DPO) to jointly boost click‑through and conversion rates, achieving +0.6% UCTR and +8% UCVR in online tests while outlining future multi‑objective extensions.

Advertising · DPO · generative recommendation

0 likes · 12 min read

Bilibili Tech

Jan 14, 2025 · Artificial Intelligence

Technical Practices and Productization of Intelligent Advertising Title Generation for Bilibili

We built an LLM‑powered system for Bilibili that automatically creates ad titles from user keywords, employing fluency, style, and quality classifiers, mixed domain data cleaning, and alignment methods such as SFT, DPO and KTO, resulting in a product that now generates about ten percent of daily titles and drives significant ad spend.

AI alignment · Ad Title Generation · Bilibili

0 likes · 24 min read

Baobao Algorithm Notes

Jan 8, 2025 · Artificial Intelligence

Inside Llama 3.1, DeepSeek‑V3, TÜLU 3 & Qwen 2.5: A Deep Dive into Post‑Training Techniques

This article compiles and analyzes the post‑training pipelines of Llama 3.1, DeepSeek‑V3, TÜLU 3 and Qwen 2.5, detailing their data compositions, SFT, reward modeling, DPO, GRPO, RLVR methods, hyper‑parameters, and practical tricks for large‑language‑model alignment.

DPO · DeepSeek-V3 · Llama3.1

0 likes · 22 min read

Baobao Algorithm Notes

Nov 19, 2024 · Artificial Intelligence

Demystifying OpenRLHF Loss Functions: From GPTLM to KTO and Beyond

This article walks through the various loss functions used in OpenRLHF—including GPTLMLoss, KDLoss, DPOLoss, KTOLoss, and reward model losses—explaining their mathematical foundations, implementation details, and practical considerations for RLHF training.

DPO · KTO · Loss Functions

0 likes · 23 min read

Baobao Algorithm Notes

Nov 18, 2024 · Artificial Intelligence

Boosting Vision‑Language Model Performance: Prompt‑First vs. Fine‑Tuning Strategies

This guide explains when to rely on prompt engineering versus SFT fine‑tuning for Vision‑Language Models, emphasizing data quality, appropriate dataset sizes, training epochs, hyper‑parameter tuning, and practical steps to build robust VLM pipelines.

AI · DPO · Data Quality

0 likes · 10 min read

Baobao Algorithm Notes

Oct 22, 2024 · Artificial Intelligence

Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls

This article analytically explores the implicit assumptions behind the RLHF optimization objective, examines how they limit DPO and PPO methods, and proposes practical improvements such as rejection sampling and online on‑policy data selection to narrow the gap between theory and practice.

AI alignment · DPO · PPO

0 likes · 22 min read

Baobao Algorithm Notes

Oct 21, 2024 · Artificial Intelligence

Unraveling RLHF: From PPO to DPO and Beyond – A Comprehensive Guide

This article provides a thorough, four‑part overview of RLHF for large language models, covering preference‑optimization algorithms (PPO‑based and offline RL approaches), reward‑model training techniques, inference‑time exploration strategies, and practical implementation details including the OpenRLHF framework and resource‑allocation tricks.

DPO · LLM Optimization · OpenRLHF

0 likes · 27 min read

Baobao Algorithm Notes

Oct 15, 2024 · Artificial Intelligence

How DPO Simplifies RLHF: A Deep Dive into Direct Preference Optimization

This article breaks down how Direct Preference Optimization (DPO) mathematically reduces the two‑stage RLHF pipeline into a single‑stage SFT process, explains the underlying loss transformations, and discusses DPO's practical limitations and trade‑offs for large language model alignment.

DPO · Direct Preference Optimization · RLHF

0 likes · 9 min read

Baobao Algorithm Notes

Sep 28, 2024 · Artificial Intelligence

Inside Llama 3: A Complete Guide to Modern LLM Training, Architecture, and Optimization

This article provides a thorough, yet concise, overview of Llama 3’s training pipeline, data handling, model architecture, scaling laws, post‑training techniques like SFT and DPO, and inference optimizations such as KV‑Cache, GQA, PagedAttention, and FP8 quantization, highlighting practical insights and benchmark results.

DPO · KV cache · LLM training

0 likes · 32 min read

NewBeeNLP

Sep 23, 2024 · Artificial Intelligence

Why Post‑Training Is Redefining LLMs: DPO vs PPO, Synthetic Data, and Scaling Strategies

This article analyzes recent post‑training trends in large language models, comparing DPO and PPO, examining the scarcity of open‑source preference data, the iterative training process, the rise of synthetic data pipelines, and emerging methods for improving math and reasoning capabilities.

DPO · LLM · PPO

0 likes · 12 min read

Baobao Algorithm Notes

Sep 10, 2024 · Artificial Intelligence

How Direct Preference Optimization Simplifies LLM Alignment Without Reward Models

This article breaks down the mathematical derivation of Direct Preference Optimization (DPO), showing how it replaces the traditional RLHF‑PPO pipeline by directly training an alignment model from human preference data, eliminating the need for a separate reward model and simplifying the overall training process.

DPO · LLM alignment · Preference Optimization

0 likes · 17 min read

NewBeeNLP

Aug 7, 2024 · Artificial Intelligence

Can Intuitive Fine‑Tuning Replace Expensive RLHF and DPO for LLM Alignment?

This article analyses the shortcomings of current large language model training methods such as SFT, RLHF and DPO, explains why they incur high data and compute costs, and introduces Intuitive Fine‑Tuning (IFT) with temporal residual connections as a cheaper yet effective alternative that better aligns training objectives with real generation tasks.

DPO · Intuitive Fine-Tuning · LLM

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Jul 8, 2024 · Artificial Intelligence

How to Fine‑Tune Qwen2 with Direct Preference Optimization on Alibaba Cloud PAI

This guide explains the Direct Preference Optimization (DPO) algorithm for aligning large language models, demonstrates its advantages over RLHF, and provides a step‑by‑step tutorial on using Alibaba Cloud’s PAI‑QuickStart to fine‑tune the open‑source Qwen2 series, including data preparation, hyper‑parameter settings, training, deployment, and API usage.

AI alignment · Alibaba Cloud · DPO

0 likes · 14 min read

Baobao Algorithm Notes

May 30, 2024 · Artificial Intelligence

What’s the Latest RLHF Landscape? From PPO to ORPO Explained

This article surveys the current RLHF ecosystem, comparing on‑policy methods like PPO with off‑policy approaches such as DPO, and examines recent variants—including ReMax, GRPO, DPOP, TDPO, and ORPO—highlighting their algorithmic differences, resource trade‑offs, and practical performance insights.

DPO · LLM · PPO

0 likes · 23 min read

NewBeeNLP

May 13, 2024 · Artificial Intelligence

Why DPO Treats LLMs as Q‑Functions: A Deep Theoretical Dive

This article offers a detailed theoretical interpretation of the DPO algorithm, showing how large language models can be viewed as Q‑functions, unifying sequence‑wise and step‑wise decision perspectives, and discussing the resulting implications for reinforcement‑learning‑based alignment research.

DPO · LLM · Preference Optimization

0 likes · 14 min read
