Tagged articles
31 articles
PaperAgent
May 6, 2026 · Artificial Intelligence

How to Detect Introspective Awareness in LLMs – Boosting Detection Rates by 53% and 75%

Anthropic and MIT researchers reveal that large language models can sense injected steering vectors, a capability that emerges during post‑training (especially DPO), and they present a two‑stage detection circuit whose performance improves by up to 75% when reject directions are ablated or bias vectors are trained.

Circuit Analysis · DPO · Introspective Awareness
15 min read
Weekly Large Model Application
May 5, 2026 · Artificial Intelligence

Understanding Preference Alignment: Why Voice Output Needs an Extra Layer

The article explains that after task alignment, teams can produce functional demos, but true competitiveness requires preference alignment—optimizing for human comfort across dimensions like brevity, tone, and safety—and discusses how RLHF and DPO address this, especially the additional challenges of generating natural, responsive voice output.

AI Alignment · DPO · Human Feedback
7 min read
Lao Guo's Learning Space
Apr 2, 2026 · Artificial Intelligence

Large Model Pretraining and Fine‑Tuning: A 2026 Technical Guide from Scaling Laws to Post‑Training Revolution

This article explains the full lifecycle of large language models in 2026, covering pretraining fundamentals, the limits of classic Scaling Laws, data‑centric advances, fine‑tuning strategies, RLHF, DPO, and the emerging post‑training methods GRPO, DAPO and RLVR, with concrete benchmarks and cost analyses.

DAPO · DPO · Fine-tuning
17 min read
Baidu Intelligent Cloud Tech Hub
Jan 27, 2026 · Artificial Intelligence

Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide

This guide walks through setting up a Kunlun P800 XPU host, preparing Docker containers, deploying Qwen3‑8B/‑32B/‑VL models with vLLM‑Kunlun, benchmarking performance, and running full‑parameter DPO training using LLaMA‑Factory, providing scripts, configuration files, and troubleshooting tips for AI engineers.

DPO · Inference · Kunlun P800
32 min read
Amap Tech
Nov 19, 2025 · Artificial Intelligence

How Gaode’s Spacetime‑GR Model Boosts POI Recommendation with AI‑Powered SFT and DPO

Gaode transforms its map app into a dynamic, AI‑driven “living map” by fine‑tuning the large Spacetime‑GR model through embedding‑based and generative ranking SFT, DPO alignment, and multimodal augmentation, achieving significant offline CTR‑AUC improvements and online CTR gains in POI recommendation.

AI recommendation · DPO · SFT
12 min read
Zhuanzhuan Tech
Oct 29, 2025 · Artificial Intelligence

How Reinforcement Learning Boosts Stability and Speed in LLM QA Systems

This article examines how reinforcement‑learning techniques such as PPO, DPO, and GRPO are integrated into the Baixiaosheng QA system to improve answer stability, deepen domain knowledge understanding, and accelerate response generation, and it evaluates the impact of Reinforcement Fine‑Tuning (RFT) on real‑world performance.

AI · DPO · GRPO
16 min read
Alibaba Cloud Big Data AI Platform
Jul 23, 2025 · Artificial Intelligence

How to Distill Large Language Models for Efficient Text Generation with EasyDistill

This guide explains how to use the EasyDistill framework and Alibaba Cloud PAI to distill large language models for high‑quality text generation, covering model deployment, SFT and DPO training data construction, code examples, configuration files, and best practices for achieving resource‑efficient, high‑performance student models.

DPO · EasyDistill · PAI
14 min read
DataFunTalk
Jun 29, 2025 · Artificial Intelligence

Large Models Boost Douyin User Experience: Expert Insights

In an interview at the DA Digital Intelligence Conference, ByteDance AI specialist Cai Conghuai explains how large language models, combined with techniques like SFT, DPO, and RAG, are reshaping Douyin's user‑experience signal detection, root‑cause analysis, and evaluation, while outlining future AI‑agent breakthroughs.

AI · DPO · RAG
12 min read
DataFunSummit
Jun 22, 2025 · Artificial Intelligence

How Vivo’s BlueHeart AI Assistant Optimizes Post‑Conversation Recommendations with LLMs

In a detailed interview, Vivo AI engineer Liang Tianan explains how the BlueHeart Small V assistant leverages large language models, multi‑stage recall, ranking, and reward‑model fine‑tuning (SFT/DPO) to generate high‑quality, diverse post‑dialogue recommendation items while balancing latency, cost, and evaluation challenges.

DPO · LLM · SFT
15 min read
DaTaobao Tech
Jun 4, 2025 · Artificial Intelligence

Understanding Large Language Model Architecture, Parameters, Memory, Storage, and Fine‑Tuning Techniques

This article provides a comprehensive overview of large language models (LLMs), covering their transformer architecture, parameter counts, GPU memory and storage requirements, and detailed fine‑tuning methods such as prompt engineering, data construction, LoRA, PEFT, RLHF, and DPO, along with practical deployment and inference acceleration strategies.

DPO · Fine-tuning · LLM
17 min read
Data Thinking Notes
Mar 16, 2025 · Artificial Intelligence

Why DeepSeek R1 Swaps PPO for GRPO: A Deep Dive into RLHF Alternatives

DeepSeek‑R1 replaces the traditional PPO‑based RLHF approach with GRPO, reducing reliance on human‑labeled data by using pure reinforcement learning environments and carefully designed reward mechanisms; the article explains reinforcement learning fundamentals, compares PPO, DPO and GRPO, and offers practical application recommendations.

AI Alignment · DPO · GRPO
14 min read
JD Retail Technology
Feb 28, 2025 · Artificial Intelligence

Generative Recommendation with DPO Alignment for JD Alliance Advertising: Multi‑Objective Optimization and Online Results

The paper presents a generative recommendation framework for JD Alliance advertising that combines semantic‑ID modeling, large‑model pre‑training and fine‑tuning, and Direct Preference Optimization (including Softmax‑DPO and β‑DPO) to jointly boost click‑through and conversion rates, achieving +0.6% UCTR and +8% UCVR in online tests while outlining future multi‑objective extensions.

Advertising · DPO · Generative Recommendation
12 min read
Bilibili Tech
Jan 14, 2025 · Artificial Intelligence

Technical Practices and Productization of Intelligent Advertising Title Generation for Bilibili

We built an LLM‑powered system for Bilibili that automatically creates ad titles from user keywords, employing fluency, style, and quality classifiers, mixed domain data cleaning, and alignment methods such as SFT, DPO and KTO, resulting in a product that now generates about ten percent of daily titles and drives significant ad spend.

AI Alignment · Ad Title Generation · Bilibili
24 min read
Baobao Algorithm Notes
Oct 22, 2024 · Artificial Intelligence

Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls

This article analytically explores the implicit assumptions behind the RLHF optimization objective, examines how they limit DPO and PPO methods, and proposes practical improvements such as rejection sampling and online on‑policy data selection to narrow the gap between theory and practice.

AI Alignment · DPO · PPO
22 min read
Baobao Algorithm Notes
Oct 21, 2024 · Artificial Intelligence

Unraveling RLHF: From PPO to DPO and Beyond – A Comprehensive Guide

This article provides a thorough, four‑part overview of RLHF for large language models, covering preference‑optimization algorithms (PPO‑based and offline RL approaches), reward‑model training techniques, inference‑time exploration strategies, and practical implementation details including the OpenRLHF framework and resource‑allocation tricks.

DPO · LLM optimization · OpenRLHF
27 min read
Baobao Algorithm Notes
Oct 15, 2024 · Artificial Intelligence

How DPO Simplifies RLHF: A Deep Dive into Direct Preference Optimization

This article breaks down how Direct Preference Optimization (DPO) mathematically reduces the two‑stage RLHF pipeline into a single‑stage SFT process, explains the underlying loss transformations, and discusses DPO's practical limitations and trade‑offs for large language model alignment.

DPO · Direct Preference Optimization · RLHF
9 min read
Baobao Algorithm Notes
Sep 28, 2024 · Artificial Intelligence

Inside Llama 3: A Complete Guide to Modern LLM Training, Architecture, and Optimization

This article provides a thorough yet concise overview of Llama 3’s training pipeline, data handling, model architecture, scaling laws, post‑training techniques like SFT and DPO, and inference optimizations such as KV‑Cache, GQA, PagedAttention, and FP8 quantization, highlighting practical insights and benchmark results.

DPO · Inference · KV cache
32 min read
NewBeeNLP
Aug 7, 2024 · Artificial Intelligence

Can Intuitive Fine‑Tuning Replace Expensive RLHF and DPO for LLM Alignment?

This article analyses the shortcomings of current large language model training methods such as SFT, RLHF and DPO, explains why they incur high data and compute costs, and introduces Intuitive Fine‑Tuning (IFT) with temporal residual connections as a cheaper yet effective alternative that better aligns training objectives with real generation tasks.

DPO · Intuitive Fine-Tuning · LLM
15 min read
Alibaba Cloud Big Data AI Platform
Jul 8, 2024 · Artificial Intelligence

How to Fine‑Tune Qwen2 with Direct Preference Optimization on Alibaba Cloud PAI

This guide explains the Direct Preference Optimization (DPO) algorithm for aligning large language models, demonstrates its advantages over RLHF, and provides a step‑by‑step tutorial on using Alibaba Cloud’s PAI‑QuickStart to fine‑tune the open‑source Qwen2 series, including data preparation, hyper‑parameter settings, training, deployment, and API usage.

AI Alignment · Alibaba Cloud · DPO
14 min read
Baobao Algorithm Notes
May 30, 2024 · Artificial Intelligence

What’s the Latest RLHF Landscape? From PPO to ORPO Explained

This article surveys the current RLHF ecosystem, comparing on‑policy methods like PPO with off‑policy approaches such as DPO, and examines recent variants—including ReMax, GRPO, DPOP, TDPO, and ORPO—highlighting their algorithmic differences, resource trade‑offs, and practical performance insights.

Alignment · DPO · LLM
23 min read
NewBeeNLP
May 13, 2024 · Artificial Intelligence

Why DPO Treats LLMs as Q‑Functions: A Deep Theoretical Dive

This article offers a detailed theoretical interpretation of the DPO algorithm, showing how large language models can be viewed as Q‑functions, unifying sequence‑wise and step‑wise decision perspectives, and discussing the resulting implications for reinforcement‑learning‑based alignment research.

DPO · LLM · Q-Function
14 min read