Tagged articles
17 articles
Machine Heart
May 12, 2026 · Artificial Intelligence

DECS Cuts Overthinking in Models: Halve Inference Tokens and Raise Accuracy

DECS, a novel training framework introduced by researchers from Fudan, Shanghai Jiao Tong, and the Shanghai AI Lab, theoretically exposes the flaws of length‑penalty rewards and, through token‑level reward decoupling and dynamic batch scheduling, reduces inference token counts by over 50% while improving accuracy across multiple benchmarks.

DECS · Large Language Models · benchmark evaluation
9 min read
Alibaba Cloud Big Data AI Platform
Apr 9, 2026 · Artificial Intelligence

How Data Flywheels Accelerate Small Agentic Model Training

This article details a data‑flywheel framework for training compact agentic language models, describing synthetic task generation, mock environment simulation, rubric‑based reward design, iterative hard‑sample augmentation, and experimental results that show consistent performance gains across benchmarks.

Model Evaluation · Reinforcement Learning · Synthetic Environments
17 min read
AI Frontier Lectures
Feb 28, 2026 · Artificial Intelligence

Can Reinforcement Learning Revolutionize Text-to-3D Generation? A Deep Dive

This article presents a systematic investigation of applying reinforcement learning to text‑to‑3D generation, detailing reward design, algorithm selection, a new 3D benchmark, a hierarchical GRPO framework, extensive ablations, and the resulting performance gains and limitations.

AI research · Generative Models · Reinforcement Learning
13 min read
Machine Learning Algorithms & Natural Language Processing
Feb 26, 2026 · Artificial Intelligence

How MiniMax’s Forge Architecture Achieves 40× Faster Agent RL Training

The article details MiniMax’s Forge system, an asynchronous native Agent‑RL architecture that standardizes Agent‑LLM interaction, introduces engineering optimizations, novel scheduling, prefix‑tree merging and reward designs, enabling million‑sample daily throughput, stable reward growth and up to 40‑fold training acceleration for the MiniMax M2.5 model.

Agent Architecture · Asynchronous RL · Mixed Scheduling
17 min read
Baobao Algorithm Notes
Dec 7, 2025 · Artificial Intelligence

Key Lessons from Scaling Agent RL Training: Stability, Tooling, and Reward Design

Over recent months of extensive agent reinforcement‑learning experiments across search, data‑analysis, and multi‑source scenarios, the author shares twelve practical insights covering stability, environment‑reward‑algorithm priorities, tool‑call reliability, reward hacking pitfalls, evaluation alignment, and scaling tricks for larger models.

PPO EWMA · RL scaling · Reinforcement Learning
7 min read
AntTech
Sep 19, 2025 · Artificial Intelligence

How Reinforcement Learning Cuts Hallucinations in Large Language Models: Ant Insurance’s Proven Approach

Ant Insurance’s tech team leveraged reinforcement learning, focused data selection, and a multi‑dimensional reward system to dramatically reduce hallucinations in LLMs, achieving top‑rank performance on the HHEM leaderboard and robust improvements across instruction‑following and reasoning‑enhanced models.

Hallucination Control · LLM · LLM-as-judge
6 min read
Fighter's World
Sep 12, 2025 · Artificial Intelligence

Why Are Production‑Grade AI Agents So Hard to Build?

The article analyses why production‑grade AI agents remain unreliable, pinpointing the scarcity of high‑quality task‑action data, the limits of static benchmarks, and the need for massive data‑generation engines, simulation sandboxes, sophisticated RL reward design, and efficient context engineering.

AI Agent · Context Engineering · Data Generation
21 min read
AIWalker
Aug 5, 2025 · Artificial Intelligence

Perception‑R1: RL Gives Visual Insight Without Chain‑of‑Thought, Beats Four Tasks

The paper introduces Perception‑R1, a rule‑based reinforcement‑learning framework that trains multimodal large language models for visual perception tasks without relying on chain‑of‑thought reasoning, and demonstrates up to 17.9% performance gains on RefCOCO+, PixMo‑Count, PageOCR and COCO2017, while analyzing the key roles of perception confusion and reward design.

RLHF · Reinforcement Learning · Visual Perception
24 min read
JD.com Experience Design Center
Jul 3, 2025 · Fundamentals

Why Paid Online Surveys Often Yield Bad Data—and How Professionals Ensure Quality

This article explores the evolution of questionnaire surveys from costly offline methods to modern online panels, reveals how monetary incentives create professional respondents and data fraud, and outlines rigorous methodologies—including diversified sampling, balanced reward design, and multi‑layered quality controls—to obtain high‑quality market research data.

Data Quality · market research · online panels
15 min read
Alibaba Cloud Developer
May 26, 2025 · Artificial Intelligence

How Multi‑Agent Planning Boosts Copilot 3.0 with DeepSeek R1 GRPO Training

This article examines Copilot 3.0’s planning module, explains how DeepSeek R1’s GRPO reinforcement‑learning pipeline enables flexible multi‑agent orchestration, addresses the limitations of Copilot 2.0, and presents experimental results that show a 61% reduction in reasoning length and a 9% relative gain in accuracy.

Model Training · Multi-Agent · Planning
14 min read
Fighter's World
Apr 18, 2025 · Artificial Intelligence

Rethinking the AGI Roadmap: From Data Imitation to Experience‑Driven Superiority

The article analyzes the emerging "Era of Experience" in AI, arguing that reliance on static human data limits progress and that reinforcement learning‑based experiential learning—exemplified by AlphaZero—offers a path toward surpassing human knowledge, while outlining the technical, safety, and ethical challenges ahead.

AGI · AlphaZero · Artificial Intelligence
19 min read
Architect
Mar 17, 2025 · Artificial Intelligence

Can a 7B Language Model Solve Sudoku with Reinforcement Learning? Findings and Lessons

This article details a reinforcement‑learning experiment that teaches 7B‑ and 3B‑parameter language models to solve Sudoku, covering data preparation, GRPO‑based reward design, training configurations, performance comparisons, key insights, and future research directions.

GRPO · Model Scaling · Reinforcement Learning
15 min read
Architect
Mar 9, 2025 · Artificial Intelligence

Experiments with Reinforcement Learning Fine‑Tuning of a 0.5B Qwen Model on the KK Dataset

The author reports a series of reinforcement‑learning fine‑tuning experiments on a 0.5‑billion‑parameter Qwen instruct model using the KK dataset, detailing reward design adjustments, curriculum‑style data scaling, observed convergence issues, and hypotheses about why small models fail to develop long reasoning chains.

LLM fine-tuning · Reinforcement Learning · curriculum learning
11 min read
Baobao Algorithm Notes
Mar 5, 2025 · Artificial Intelligence

Why My 0.5B LLM’s Reasoning Collapsed During RLHF on Logic Puzzles

The author experiments with reinforcement‑learning‑from‑human‑feedback on a 0.5B Qwen instruct model using Logic‑RL and Open‑R1, discovers that reward mis‑design and curriculum learning cause the model to produce overly short or incorrect reasoning chains on knight‑and‑knave puzzles, and analyses the underlying causes.

Artificial Intelligence · Logic Reasoning · RLHF
11 min read
58UXD
Apr 28, 2021 · Product Management

Turning Holiday Traffic into High‑Value Users: 58 Rental’s Layered Operation Strategy

This article examines how 58 Rental leverages diverse traffic sources during its Spring Festival campaign by segmenting users, aligning responsibilities with rewards, and designing clear, incentive‑driven scenarios—including games, VR experiences, and live streams—to maximize conversion and efficiently allocate limited resources.

User Segmentation · behavior analysis · operation strategy
10 min read
58UXD
Oct 23, 2020 · Product Management

Designing an Effective Growth System for Community Platforms

This article breaks down growth system types, explains goal decomposition for a content community, identifies common pitfalls such as complexity and unclear reward links, and offers practical design strategies, validation metrics, and experimental results to improve user participation and retention.

Community Platform · behavioral design · growth system
9 min read
Hujiang Design Center
Jun 26, 2017 · Product Management

Designing Effective Live‑Stream Reward Systems: A Practical Guide

This article explores the concept, value, and design considerations of tipping features in knowledge‑payment and live‑stream platforms, analyzes competitor implementations, and presents a detailed case study of CCtalk's reward system to help product teams create engaging and profitable tipping experiences.

live streaming · product-management · reward design
13 min read