Tagged articles

reward design

19 articles · Page 1 of 1

Machine Learning Algorithms & Natural Language Processing

Jul 2, 2026 · Artificial Intelligence

Perfect Scores, Hidden Flaws: Qwen & Fudan Reveal Coding Agent Reward Issues

The article analyses how coding agents exploit unit‑test rewards by rewriting tests, explains why reward signals are only proxies for underspecified human intent, and argues that trustworthy AI requires a co‑evolving verification system rather than a single perfect validator.

AI safety · coding agents · human intent

0 likes · 19 min read

Machine Heart

Jul 2, 2026 · Artificial Intelligence

Perfect Scores, Hidden Flaws: Qwen and Fudan Expose Reward Design Dilemmas in Coding Agents

The article analyzes how coding agents can game test‑based rewards by altering verification signals, argues that reward signals are merely proxies for human intent, and proposes a co‑evolving verification system—combining scalable, faithful, and robust components—to reliably guide reinforcement‑learning agents.

AI safety · coding agents · interactive judge

0 likes · 20 min read

Machine Heart

May 12, 2026 · Artificial Intelligence

DECS Cuts Overthinking in Models: Halve Inference Tokens and Raise Accuracy

DECS, a novel training framework introduced by researchers from Fudan, Shanghai Jiao Tong, and the Shanghai AI Lab, theoretically exposes the flaws of length‑penalty rewards and, through token‑level reward decoupling and dynamic batch scheduling, reduces inference token counts by over 50% while improving accuracy across multiple benchmarks.

DECS · Large Language Models · benchmark evaluation

0 likes · 9 min read

Alibaba Cloud Big Data AI Platform

Apr 9, 2026 · Artificial Intelligence

How Data Flywheels Accelerate Small Agentic Model Training

This article details a data‑flywheel framework for training compact agentic language models, describing synthetic task generation, mock environment simulation, rubric‑based reward design, iterative hard‑sample augmentation, and experimental results that show consistent performance gains across benchmarks.

Synthetic Environments · agentic models · data augmentation

0 likes · 17 min read

AI Frontier Lectures

Feb 28, 2026 · Artificial Intelligence

Can Reinforcement Learning Revolutionize Text-to-3D Generation? A Deep Dive

This article presents a systematic investigation of applying reinforcement learning to text‑to‑3D generation, detailing reward design, algorithm selection, a new 3D benchmark, a hierarchical GRPO framework, extensive ablations, and the resulting performance gains and limitations.

AI research · generative models · reinforcement learning

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

Feb 26, 2026 · Artificial Intelligence

How MiniMax’s Forge Architecture Achieves 40× Faster Agent RL Training

The article details MiniMax’s Forge system, an asynchronous native Agent‑RL architecture that standardizes Agent‑LLM interaction, introduces engineering optimizations, novel scheduling, prefix‑tree merging and reward designs, enabling million‑sample daily throughput, stable reward growth and up to 40‑fold training acceleration for the MiniMax M2.5 model.

Mixed Scheduling · Scalable Systems · Training Acceleration

0 likes · 17 min read

Baobao Algorithm Notes

Dec 7, 2025 · Artificial Intelligence

Key Lessons from Scaling Agent RL Training: Stability, Tooling, and Reward Design

Over recent months of extensive agent reinforcement‑learning experiments across search, data‑analysis, and multi‑source scenarios, the author shares twelve practical insights covering stability, environment‑reward‑algorithm priorities, tool‑call reliability, reward hacking pitfalls, evaluation alignment, and scaling tricks for larger models.

PPO EWMA · RL scaling · Tool Integration

0 likes · 7 min read

AntTech

Sep 19, 2025 · Artificial Intelligence

How Reinforcement Learning Cuts Hallucinations in Large Language Models: Ant Insurance’s Proven Approach

Ant Insurance’s tech team leveraged reinforcement learning, focused data selection, and a multi‑dimensional reward system to dramatically reduce hallucinations in LLMs, achieving top‑rank performance on the HHEM leaderboard and robust improvements across instruction‑following and reasoning‑enhanced models.

Hallucination Control · LLM · LLM-as-Judge

0 likes · 6 min read

Fighter's World

Sep 12, 2025 · Artificial Intelligence

Why Are Production‑Grade AI Agents So Hard to Build?

The article analyses why production‑grade AI agents remain unreliable, pinpointing the scarcity of high‑quality task‑action data, the limits of static benchmarks, and the need for massive data‑generation engines, simulation sandboxes, sophisticated RL reward design, and efficient context engineering.

AI Agent · Data Generation · Large Action Model

0 likes · 21 min read

AIWalker

Aug 5, 2025 · Artificial Intelligence

Perception‑R1: RL Gives Visual Insight Without Chain‑of‑Thought, Beats Four Tasks

The paper introduces Perception‑R1, a rule‑based reinforcement‑learning framework that trains multimodal large language models for visual perception tasks without relying on chain‑of‑thought reasoning, and demonstrates up to 17.9% performance gains on RefCOCO+, PixMo‑Count, PageOCR and COCO2017, while analyzing the key roles of perception confusion and reward design.

Benchmark · RLHF · multimodal LLM

0 likes · 24 min read

JD.com Experience Design Center

Jul 3, 2025 · Fundamentals

Why Paid Online Surveys Often Yield Bad Data—and How Professionals Ensure Quality

This article explores the evolution of questionnaire surveys from costly offline methods to modern online panels, reveals how monetary incentives create professional respondents and data fraud, and outlines rigorous methodologies—including diversified sampling, balanced reward design, and multi‑layered quality controls—to obtain high‑quality market research data.

Data Quality · market research · online panels

0 likes · 15 min read

Alibaba Cloud Developer

May 26, 2025 · Artificial Intelligence

How Multi‑Agent Planning Boosts Copilot 3.0 with DeepSeek R1 GRPO Training

This article examines Copilot 3.0’s planning module, explains how DeepSeek R1’s GRPO reinforcement‑learning pipeline enables flexible multi‑agent orchestration, addresses the limitations of Copilot 2.0, and presents experimental results that show a 61% reduction in reasoning length and a 9% relative gain in accuracy.

AI · Model Training · Planning

0 likes · 14 min read

Fighter's World

Apr 18, 2025 · Artificial Intelligence

Rethinking the AGI Roadmap: From Data Imitation to Experience‑Driven Superiority

The article analyzes the emerging "Era of Experience" in AI, arguing that reliance on static human data limits progress and that reinforcement learning‑based experiential learning—exemplified by AlphaZero—offers a path toward surpassing human knowledge, while outlining the technical, safety, and ethical challenges ahead.

AGI · AlphaZero · Artificial Intelligence

0 likes · 19 min read

Architect

Mar 17, 2025 · Artificial Intelligence

Can a 7B Language Model Solve Sudoku with Reinforcement Learning? Findings and Lessons

This article details a reinforcement‑learning experiment that teaches 7B‑ and 3B‑parameter language models to solve Sudoku, covering data preparation, GRPO‑based reward design, training configurations, performance comparisons, key insights, and future research directions.

GRPO · Language Models · Model Scaling

0 likes · 15 min read

Architect

Mar 9, 2025 · Artificial Intelligence

Experiments with Reinforcement Learning Fine‑Tuning of a 0.5B Qwen Model on the KK Dataset

The author reports a series of reinforcement‑learning‑based fine‑tuning experiments on a 0.5‑billion‑parameter Qwen instruct model using the KK dataset, detailing reward design adjustments, curriculum‑style data scaling, observed convergence issues, and hypotheses about why small models fail to develop long reasoning chains.

LLM fine-tuning · curriculum-learning · reinforcement learning

0 likes · 11 min read

Baobao Algorithm Notes

Mar 5, 2025 · Artificial Intelligence

Why My 0.5B LLM’s Reasoning Collapsed During RLHF on Logic Puzzles

The author experiments with reinforcement‑learning‑from‑human‑feedback on a 0.5B Qwen instruct model using Logic‑RL and Open‑R1, discovers that reward mis‑design and curriculum learning cause the model to produce overly short or incorrect reasoning chains on knight‑and‑knave puzzles, and analyses the underlying causes.

Artificial Intelligence · Large Language Model · Logic Reasoning

0 likes · 11 min read

58UXD

Apr 28, 2021 · Product Management

Turning Holiday Traffic into High‑Value Users: 58 Rental’s Layered Operation Strategy

This article examines how 58 Rental leverages diverse traffic sources during its Spring Festival campaign by segmenting users, aligning responsibilities with rewards, and designing clear, incentive‑driven scenarios—including games, VR experiences, and live streams—to maximize conversion and efficiently allocate limited resources.

Product Management · User Segmentation · behavior analysis

0 likes · 10 min read

58UXD

Oct 23, 2020 · Product Management

Designing an Effective Growth System for Community Platforms

This article breaks down growth system types, explains goal decomposition for a content community, identifies common pitfalls such as complexity and unclear reward links, and offers practical design strategies, validation metrics, and experimental results to improve user participation and retention.

Community Platform · Product Management · behavioral design

0 likes · 9 min read

Hujiang Design Center

Jun 26, 2017 · Product Management

Designing Effective Live‑Stream Reward Systems: A Practical Guide

This article explores the concept, value, and design considerations of tipping features in knowledge‑payment and live‑stream platforms, analyzes competitor implementations, and presents a detailed case study of CCtalk's reward system to help product teams create engaging and profitable tipping experiences.

Live Streaming · Product Management · reward design

0 likes · 13 min read
