Tagged articles
21 articles
Machine Heart
May 18, 2026 · Artificial Intelligence

How LLMs Raised the Steiner Ratio Lower Bound to 0.8559, Closing in on the Gilbert‑Pollak Conjecture

A team from Peking University built an LLM‑driven framework that iteratively generates verification functions and combines a reward model with divide‑and‑conquer, raising the best known lower bound on the planar Steiner ratio from the long‑standing 0.824 to 0.8559. The result was accepted at ICML 2026 and verified by human experts.

Gilbert‑Pollak conjecture, LLM, Mathematical AI
AIWalker
Mar 3, 2026 · Artificial Intelligence

RetouchIQ’s Instruction‑Driven AI Editing Overcomes Traditional Retouching Limits

RetouchIQ introduces an instruction‑driven AI retouching system that uses a general reward model to interpret abstract user commands, delivering precise image adjustments with higher semantic consistency and visual naturalness than existing multimodal large language models, thereby lowering the technical barrier for cinematic‑style edits.

AI Image Editing, RetouchIQ, Reward model
AI Algorithm Path
Jul 27, 2025 · Artificial Intelligence

Understanding RLHF: How Human Feedback Trains Modern LLMs

This article explains the RLHF (Reinforcement Learning from Human Feedback) pipeline that powers ChatGPT and other large language models, covering the limitations of traditional fine‑tuning, the creation of human‑feedback datasets, reward‑model training, loss design, and the final PPO‑based fine‑tuning step.

ChatGPT, Human Feedback, PPO
IT Services Circle
Jul 16, 2025 · Artificial Intelligence

How a Simple Colon Can Trick Top LLMs – The Master‑RM Fix

A recent study reveals that tiny symbols like colons or generic reasoning prefixes can cause large language models used as reward judges to issue false‑positive rewards, but an enhanced reward model called Master‑RM, trained with adversarial data, eliminates this vulnerability across multiple LLMs and languages.

AI Safety, LLM, Master-RM
JD Tech
Mar 26, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models (CAIG)

The JD advertising team proposes a CTR‑driven advertising image generation framework (CAIG) that leverages multimodal large language models, a novel reward model, and product‑centric preference optimization to produce ad images with superior click‑through performance, validated by extensive offline and online experiments.

CTR optimization, Reward model, advertising image generation
JD Cloud Developers
Mar 13, 2025 · Artificial Intelligence

Can Multimodal LLMs Boost Ad Click‑Through Rates? Introducing CTR‑Driven Image Generation

This paper presents a CTR‑driven advertising image generation framework that leverages multimodal large language models, reward modeling, and reinforcement learning to produce product‑centric ad visuals with higher click‑through performance, validated by extensive offline and online experiments.

CTR optimization, Reward model, advertising image generation
DaTaobao Tech
Mar 7, 2025 · Artificial Intelligence

Taobao Content AI: Summary of AIGC Content Generation and Multimodal Model Techniques

Taobao’s AIGC pipeline combines a human‑feedback multimodal reward model, audio‑visual joint pre‑training, and Mixture‑of‑Experts distillation to clean data, align outputs with user preferences, and achieve state‑of‑the‑art multimodal LLM performance that drives content cold‑start and conversion gains in e‑commerce.

AIGC, Content Generation, Reward model
JD Tech Talk
Feb 20, 2025 · Artificial Intelligence

Multi‑Agent Architecture for an E‑Commerce Business Assistant: Design, Planning, Evaluation, and Sample Generation

The document describes the evolution, design principles, key technologies, online inference workflow, evaluation methods, and sample‑generation techniques of a large‑language‑model‑based multi‑agent system that powers a 24/7 e‑commerce merchant assistant, highlighting its benefits, challenges, and future work.

AI Planning, LLM, Multi-Agent
AIWalker
Feb 4, 2025 · Artificial Intelligence

How Chain‑of‑Thought Boosts Text‑to‑Image Generation: The New o1 Inference Scheme

This article reviews a comprehensive study that applies Chain‑of‑Thought reasoning to autoregressive text‑to‑image generation, introducing extended test‑time computation, direct preference optimization, and two custom reward models (PARM and PARM++) that together improve generation quality by up to 15% over Stable Diffusion 3.

Direct Preference Optimization, Inference, Multimodal AI
Fighter's World
Nov 30, 2024 · Artificial Intelligence

How to Replicate OpenAI’s o1: A Detailed Step‑by‑Step Guide

This article breaks down the replication of OpenAI’s o1 model into four phases—assessment, journey‑learning foundation, component implementation, and training—while highlighting key challenges such as building scalable long‑thought data, reward models, and policy reasoning trees, and discusses the broader impact of o1’s reasoning abilities.

AI reasoning, LLM replication, OpenAI o1
Baobao Algorithm Notes
Sep 29, 2024 · Artificial Intelligence

Decoding OpenAI o1: Test‑Time Scaling, PRM Search & Inference Strategies

This article analyses the training tricks behind OpenAI's o1 model, explaining test/inference‑time scaling laws, post‑training techniques, process‑supervised reward models (PRM), various inference‑time search methods, data‑collection pipelines, and the trade‑offs between allocating compute to pre‑training versus inference.

LLM inference, OpenAI o1, Reward model
Baobao Algorithm Notes
Sep 18, 2024 · Artificial Intelligence

How OpenAI’s o1 Uses Self‑Play RL to Achieve Breakthrough Reasoning

This article provides an in‑depth technical analysis of OpenAI’s new multimodal model o1, explaining its self‑play reinforcement‑learning pipeline, novel train‑time and test‑time scaling laws, inference‑time thinking process, and possible architectural variants, while also discussing broader implications for large‑language‑model research.

OpenAI o1, Reward model, inference thinking
Xiaohongshu Tech REDtech
Sep 2, 2024 · Artificial Intelligence

How AIGC Transforms Advertising Material Creation on Xiaohongshu

This article analyzes how large‑model AIGC reshapes the production, evaluation, and deployment of advertising creatives on Xiaohongshu, detailing the business motivations, technical pipeline, controllable generation, reward‑model filtering, and experimental results that balance commercial efficiency with community tone.

AIGC, Advertising, Controllable Generation
NewBeeNLP
Sep 2, 2024 · Artificial Intelligence

Boosting Large Language Model Math Reasoning: Mixed Instructions, Synthetic Data, and Training Optimizations

This article presents a comprehensive technical walkthrough on enhancing large language model mathematical reasoning by reviewing model architectures, introducing mixed CoT‑PoT instructions, generating and filtering synthetic data, and applying multi‑stage training optimizations such as RFT, PPO, and DPO, with detailed experimental results and Q&A insights.

AI, Reward model, Training Optimization
DataFunTalk
Aug 24, 2024 · Artificial Intelligence

Improving the Mathematical Reasoning Ability of Large Language Models: Overview, Mixed Instructions, Synthetic Data, and Training Optimization

This article presents a comprehensive approach to enhancing large language models' mathematical reasoning by reviewing model architectures, introducing mixed CoT‑PoT instructions, generating and filtering synthetic data, and applying multi‑stage training optimizations such as RFT, PPO, and DPO, with detailed experimental results and Q&A.

AI, Reward model, large language models
NewBeeNLP
Apr 1, 2024 · Artificial Intelligence

How Llama 2 Uses RLHF, PPO, Rejection Sampling, and Ghost Attention

This article provides a detailed technical walkthrough of Llama 2's Reinforcement Learning with Human Feedback pipeline, covering human preference data collection, reward‑model design and training, iterative fine‑tuning with PPO and rejection sampling, the Ghost Attention technique for multi‑turn consistency, and the resulting experimental evaluations.

Ghost Attention, Llama-2, PPO
Baobao Algorithm Notes
Dec 11, 2023 · Artificial Intelligence

Boost Large‑Model Fine‑Tuning with Low‑Cost Data Selection and Construction

The article explains practical techniques for choosing and constructing fine‑tuning data for large language models, covering data diversity through similarity‑based clustering, semi‑supervised filtering with binary classifiers, and uncertainty‑driven sampling using perplexity or reward models to build an efficient, low‑cost pipeline.

Large Model, Reward model, active learning
21CTO
Jul 23, 2023 · Artificial Intelligence

What Nathan Lambert Reveals About Meta’s Llama 2: Key Insights and Technical Deep‑Dive

This article translates and analyzes Nathan Lambert’s commentary on Meta’s Llama 2 paper, detailing the model’s architecture, training data, RLHF pipeline, reward models, evaluation methods, safety improvements, licensing terms, and the broader implications for open‑source large language models.

Llama-2, Meta AI, Model Evaluation
DataFunSummit
Feb 25, 2023 · Artificial Intelligence

Understanding Reward Model Training in InstructGPT Using Ranking Sequences

This article explains how InstructGPT's reward model is trained by collecting human‑annotated ranking sequences instead of absolute scores, describes the rank‑loss formulation, provides Python code for the model and loss computation, and presents experimental results demonstrating the approach.

InstructGPT, Python, RLHF