Tagged articles

21 articles

May 18, 2026 · Artificial Intelligence

How LLMs Raised the Steiner Ratio Lower Bound to 0.8559, Closing in on the Gilbert‑Pollak Conjecture

A team from Peking University built an LLM‑driven framework that iteratively generates verification functions and uses a reward model with divide‑and‑conquer to raise the lower bound on the planar Steiner ratio from the long‑standing 0.824 to 0.8559; the result was accepted at ICML 2026 and verified by human experts.

Gilbert‑Pollak conjecture · LLM · Mathematical AI

0 likes · 9 min read

AIWalker

Mar 3, 2026 · Artificial Intelligence

RetouchIQ’s Instruction‑Driven AI Editing Overcomes Traditional Retouching Limits

RetouchIQ introduces an instruction‑driven AI retouching system that uses a general reward model to interpret abstract user commands, delivering precise image adjustments with higher semantic consistency and visual naturalness than existing multimodal large language models, thereby lowering the technical barrier for cinematic‑style edits.

AI Image Editing · RetouchIQ · Reward model

0 likes · 3 min read

AI Algorithm Path

Jul 27, 2025 · Artificial Intelligence

Understanding RLHF: How Human Feedback Trains Modern LLMs

This article explains the RLHF (Reinforcement Learning from Human Feedback) pipeline that powers ChatGPT and other large language models, covering the limitations of traditional fine‑tuning, the creation of human‑feedback datasets, reward‑model training, loss design, and the final PPO‑based fine‑tuning step.

ChatGPT · Human Feedback · PPO

0 likes · 8 min read

IT Services Circle

Jul 16, 2025 · Artificial Intelligence

How a Simple Colon Can Trick Top LLMs – The Master‑RM Fix

A recent study reveals that tiny symbols like colons or generic reasoning prefixes can cause large language models used as reward judges to issue false‑positive rewards, but an enhanced reward model called Master‑RM, trained with adversarial data, eliminates this vulnerability across multiple LLMs and languages.

AI Safety · LLM · Master-RM

0 likes · 10 min read

JD Tech

Mar 26, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models (CAIG)

The JD advertising team proposes a CTR‑driven advertising image generation framework (CAIG) that leverages multimodal large language models, a novel reward model, and product‑centric preference optimization to produce ad images with superior click‑through performance, validated by extensive offline and online experiments.

CTR optimization · Reward model · advertising image generation

0 likes · 10 min read

JD Cloud Developers

Mar 13, 2025 · Artificial Intelligence

Can Multimodal LLMs Boost Ad Click‑Through Rates? Introducing CTR‑Driven Image Generation

This paper presents a CTR‑driven advertising image generation framework that leverages multimodal large language models, reward modeling, and reinforcement learning to produce product‑centric ad visuals with higher click‑through performance, validated by extensive offline and online experiments.

CTR optimization · Reward model · advertising image generation

0 likes · 13 min read

DaTaobao Tech

Mar 7, 2025 · Artificial Intelligence

Taobao Content AI: Summary of AIGC Content Generation and Multimodal Model Techniques

Taobao’s AIGC pipeline combines a human‑feedback multimodal reward model, audio‑visual joint pre‑training, and Mixture‑of‑Experts distillation to clean data, align outputs with user preferences, and achieve state‑of‑the‑art multimodal LLM performance that drives content cold‑start and conversion gains in e‑commerce.

AIGC · Content Generation · Reward model

0 likes · 10 min read

JD Tech Talk

Feb 20, 2025 · Artificial Intelligence

Multi‑Agent Architecture for an E‑Commerce Business Assistant: Design, Planning, Evaluation, and Sample Generation

The document describes the evolution, design principles, key technologies, online inference workflow, evaluation methods, and sample‑generation techniques of a large‑language‑model‑based multi‑agent system that powers a 24/7 e‑commerce merchant assistant, highlighting its benefits, challenges, and future work.

AI Planning · LLM · Multi-Agent

0 likes · 21 min read

AIWalker

Feb 4, 2025 · Artificial Intelligence

How Chain‑of‑Thought Boosts Text‑to‑Image Generation: The New o1 Inference Scheme

This article reviews a comprehensive study that applies Chain‑of‑Thought reasoning to autoregressive text‑to‑image generation, introducing extended test‑time computation, direct preference optimization, and two custom reward models (PARM and PARM++) that together improve generation quality by up to 15% over Stable Diffusion 3.

Direct Preference Optimization · Inference · Multimodal AI

0 likes · 13 min read

NewBeeNLP

Dec 3, 2024 · Artificial Intelligence

Can LLMs Self‑Correct Their Answers? Exploring Reward Models, Loss Functions, and Training Dynamics

The article reflects on open‑source LLMs like Qwen2 and Llama 3.1, questioning whether models should self‑review answers, how hidden states might signal uncertainty, the role of loss‑function design, scaling laws, and the trade‑offs between PPO and DPO in alignment.

Reward model · large language models · loss function

0 likes · 9 min read

Fighter's World

Nov 30, 2024 · Artificial Intelligence

How to Replicate OpenAI’s o1: A Detailed Step‑by‑Step Guide

This article breaks down the replication of OpenAI’s o1 model into four phases—assessment, journey‑learning foundation, component implementation, and training—while highlighting key challenges such as building scalable long‑thought data, reward models, and policy reasoning trees, and discusses the broader impact of o1’s reasoning abilities.

AI reasoning · LLM replication · OpenAI o1

0 likes · 18 min read

Baobao Algorithm Notes

Oct 7, 2024 · Artificial Intelligence

Decoding OpenAI’s o1: How RL and Process‑Supervised Reward Models Might Power the Next LLM

The author speculates on OpenAI’s o1 architecture, proposing that it relies on reinforcement learning guided by a generalizable, process‑supervised reward model, and outlines data collection, multi‑model generation, and training tweaks needed to realize such a system.

AI research · LLM · RLHF

0 likes · 8 min read

Baobao Algorithm Notes

Sep 29, 2024 · Artificial Intelligence

Decoding OpenAI o1: Test‑Time Scaling, PRM Search & Inference Strategies

This article analyses the training tricks behind OpenAI's o1 model, explaining test/inference‑time scaling laws, post‑training techniques, process‑supervised reward models (PRM), various inference‑time search methods, data‑collection pipelines, and the trade‑offs between allocating compute to pre‑training versus inference.

LLM inference · OpenAI o1 · Reward model

0 likes · 34 min read

Baobao Algorithm Notes

Sep 18, 2024 · Artificial Intelligence

How OpenAI’s o1 Uses Self‑Play RL to Achieve Breakthrough Reasoning

This article provides an in‑depth technical analysis of OpenAI’s new multimodal model o1, explaining its self‑play reinforcement‑learning pipeline, novel train‑time and test‑time scaling laws, inference‑time thinking process, and possible architectural variants, while also discussing broader implications for large‑language‑model research.

OpenAI o1 · Reward model · inference thinking

0 likes · 37 min read

Xiaohongshu Tech REDtech

Sep 2, 2024 · Artificial Intelligence

How AIGC Transforms Advertising Material Creation on Xiaohongshu

This article analyzes how large‑model AIGC reshapes the production, evaluation, and deployment of advertising creatives on Xiaohongshu, detailing the business motivations, technical pipeline, controllable generation, reward‑model filtering, and experimental results that balance commercial efficiency with community tone.

AIGC · Advertising · Controllable Generation

0 likes · 14 min read

NewBeeNLP

Sep 2, 2024 · Artificial Intelligence

Boosting Large Language Model Math Reasoning: Mixed Instructions, Synthetic Data, and Training Optimizations

This article presents a comprehensive technical walkthrough on enhancing large language model mathematical reasoning by reviewing model architectures, introducing mixed CoT‑PoT instructions, generating and filtering synthetic data, and applying multi‑stage training optimizations such as RFT, PPO, and DPO, with detailed experimental results and Q&A insights.

AI · Reward model · Training Optimization

0 likes · 17 min read

DataFunTalk

Aug 24, 2024 · Artificial Intelligence

Improving the Mathematical Reasoning Ability of Large Language Models: Overview, Mixed Instructions, Synthetic Data, and Training Optimization

This article presents a comprehensive approach to enhancing large language models' mathematical reasoning by reviewing model architectures, introducing mixed CoT‑PoT instructions, generating and filtering synthetic data, and applying multi‑stage training optimizations such as RFT, PPO, and DPO, with detailed experimental results and Q&A.

AI · Reward model · large language models

0 likes · 16 min read

NewBeeNLP

Apr 1, 2024 · Artificial Intelligence

How Llama 2 Uses RLHF, PPO, Rejection Sampling, and Ghost Attention

This article provides a detailed technical walkthrough of Llama 2's Reinforcement Learning with Human Feedback pipeline, covering human preference data collection, reward‑model design and training, iterative fine‑tuning with PPO and rejection sampling, the Ghost Attention technique for multi‑turn consistency, and the resulting experimental evaluations.

Ghost Attention · Llama-2 · PPO

0 likes · 18 min read

Baobao Algorithm Notes

Dec 11, 2023 · Artificial Intelligence

Boost Large‑Model Fine‑Tuning with Low‑Cost Data Selection and Construction

The article explains practical techniques for choosing and constructing fine‑tuning data for large language models, covering data diversity through similarity‑based clustering, semi‑supervised filtering with binary classifiers, and uncertainty‑driven sampling using perplexity or reward models to build an efficient, low‑cost pipeline.

Large Model · Reward model · active learning

0 likes · 9 min read

21CTO

Jul 23, 2023 · Artificial Intelligence

What Nathan Lambert Reveals About Meta’s Llama 2: Key Insights and Technical Deep‑Dive

This article translates and analyzes Nathan Lambert’s commentary on Meta’s Llama 2 paper, detailing the model’s architecture, training data, RLHF pipeline, reward models, evaluation methods, safety improvements, licensing terms, and the broader implications for open‑source large language models.

Llama-2 · Meta AI · Model Evaluation

0 likes · 22 min read

DataFunSummit

Feb 25, 2023 · Artificial Intelligence

Understanding Reward Model Training in InstructGPT Using Ranking Sequences

This article explains how InstructGPT's reward model is trained by collecting human‑annotated ranking sequences instead of absolute scores, describes the rank‑loss formulation, provides Python code for the model and loss computation, and presents experimental results demonstrating the approach.

InstructGPT · Python · RLHF

0 likes · 9 min read
