Tagged articles

reinforcement learning

743 articles · Page 2 of 8

May 7, 2026 · Artificial Intelligence

Dual‑Phase RL‑LLM Framework DARA for Few‑Shot Online Advertising Budget Allocation

The DARA framework splits online advertising budget allocation into a few‑shot LLM reasoning stage and a fine‑grained optimizer stage, enhanced by a dynamically updated RL‑fine‑tuning algorithm (GRPO‑Adaptive), achieving significantly lower ROI variance than traditional baselines in both real and simulated environments.

LLM · Online Advertising · budget allocation

0 likes · 16 min read

PaperAgent

May 7, 2026 · Artificial Intelligence

190 Must-Read AI Agent Papers + 321 Google Implementation Cases – Free Resource Pack

The article provides a free compiled resource containing 190 essential AI Agent papers—from fundamentals to cutting‑edge topics—along with 321 Google‑released implementation cases and 500 open‑source agent applications, all with source code to help beginners and researchers quickly understand the field and reproduce results.

AI Agent · LLM · Research Papers

0 likes · 6 min read

Machine Heart

May 6, 2026 · Artificial Intelligence

Can Adaptive Guidance Unlock Small Model Reasoning? Introducing G²RPO‑A

The paper identifies reward sparsity as the core obstacle for small language models in reinforcement‑learning‑based reasoning, proposes G²RPO‑A, which injects high‑quality thinking trajectories and dynamically adjusts guidance length, and demonstrates large accuracy gains on math and code benchmarks: Qwen3‑1.7B improves from 50.96% to 67.21% on MATH500 and from 46.08% to 75.93% on HumanEval.

G²RPO‑A · adaptive guidance · code generation

0 likes · 10 min read

Machine Heart

May 6, 2026 · Artificial Intelligence

PromptEcho: Leveraging Frozen Multimodal Models for High‑Quality Text‑to‑Image Rewards Without Labels

PromptEcho computes a continuous reward for text‑to‑image generation by measuring how well a frozen vision‑language model can reconstruct the original prompt from the generated image, eliminating the need for annotated data or a trained reward model and outperforming prior methods across multiple benchmarks.

PromptEcho · Reward Modeling · benchmark

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 5, 2026 · Artificial Intelligence

LLMBeginner: A Project‑Based Roadmap for Zero‑Base Mastery of Large Language Models

The LLMBeginner project from the MLNLP community offers a staged, project‑oriented learning path—covering big‑picture concepts, deep learning and reinforcement learning fundamentals, LLM theory and practice, and agent development—to guide beginners from fragmented resources to systematic mastery, with both concise and detailed versions hosted on GitHub.

Agent · GitHub · LLM

0 likes · 5 min read

Data Party THU

May 4, 2026 · Artificial Intelligence

Understanding the Mathematical Foundations of Reinforcement Learning

This article provides a concise overview of a ten‑chapter reinforcement‑learning textbook, outlining the progression from basic concepts such as states and rewards to advanced algorithms like policy gradients and actor‑critic methods, and explains how each chapter builds on the previous ones.

Bellman equation · Monte Carlo · actor-critic

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

May 2, 2026 · Artificial Intelligence

Real-World Large-Scale Test Shows Robots Learning While Deploying Outperform Baselines on Eight Tasks

The article presents the LWD (Learning While Deploying) framework, detailing its reinforcement‑learning‑driven data flywheel, the DIVL value‑evaluation and QAM policy‑optimization modules, and experimental results where a dual‑arm robot improves success rates by up to 17% and reduces cycle time by 23.75 seconds across eight real‑world tasks, surpassing strong baselines.

DIVL · Data Flywheel · LWD

0 likes · 12 min read

AI Explorer

May 2, 2026 · Industry Insights

AI Industry Highlights May 2, 2026: Funding Surge, New Tools, and Research Breakthroughs

In May 2026, the AI sector saw a 77% rise in capital spending by the four biggest tech firms, Meta's acquisition of robot startup ARI, reinforcement‑learning advances boosting LLM inference, OpenAI's ChatGPT Images 2.0 launch, Tencent's Hy‑MT model outperforming Google, Microsoft's legal‑AI assistant, a 400B model running on iPhone, and notable research from CMU and independent scholars.

AI Investment · CMU research · Meta

0 likes · 5 min read

Machine Heart

May 1, 2026 · Artificial Intelligence

From PPO to MaxRL: The Evolution of Reinforcement Learning for LLM Inference

This article surveys the rapid evolution of reinforcement‑learning algorithms for large‑language‑model inference from early REINFORCE and PPO to newer approaches such as GRPO, RLOO, DAPO, CISPO, DPPO, ScaleRL and MaxRL, highlighting their design motivations, mathematical formulations, empirical trade‑offs and open research challenges.

GRPO · LLM · MaxRL

0 likes · 27 min read

Machine Heart

Apr 30, 2026 · Artificial Intelligence

Why GPT‑5 Models Keep Talking About Goblins: RL Reward Leakage Uncovered

The article analyzes how DeepSeek’s "极" bug and OpenAI’s recurring "goblin" output stem from unclean training data and an unintended reinforcement‑learning reward bias, showing how a persona‑specific habit leaked into general model behavior and how engineers responded.

GPT-5 · Goblin bug · Nerdy persona

0 likes · 8 min read

Machine Heart

Apr 30, 2026 · Artificial Intelligence

How LWD Redefines Embodied AI Training with Fleet‑Scale Reinforcement Learning

LWD (Learning While Deploying) introduces a distributed multi‑robot reinforcement‑learning framework that continuously improves VLA policies during real‑world deployment, leveraging DIVL, QAM, dynamic n‑step TD and an asynchronous actor‑learner architecture to achieve over 90% success on five‑minute tasks and outperform traditional behavior‑cloning, HG‑Dagger and RECAP baselines.

Embodied AI · LWD · VLA

0 likes · 13 min read

PaperAgent

Apr 30, 2026 · Artificial Intelligence

Why Reinforcement Learning Is the Future: 2026 Top‑Conference RL Paper Collection

The article highlights the rapid rise of reinforcement learning across major 2026 conferences, curates 181 RL papers from eight top venues, and provides detailed summaries of innovative works such as MSRL and MedVR, offering free access to the papers and code.

Agentic RL · Multimodal AI · Reward Modeling

0 likes · 6 min read

PaperAgent

Apr 30, 2026 · Artificial Intelligence

How Agentic AI is Redefining World Modeling

The article reviews the paper "Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond", introducing a two‑axis framework (capability levels L1‑L3 and law domains) to map diverse world‑modeling systems, highlighting that most current systems stall at L1, that explicit law encoding is crucial for long‑term stability, and that L3 represents the ultimate, self‑evolving model.

AI agents · AI research · Agentic AI

1 like · 6 min read

SuanNi

Apr 28, 2026 · Artificial Intelligence

ASI‑EVOLVE: AI Designs AI and Beats Human SOTA by Almost Three‑Fold

The open‑source ASI‑EVOLVE framework lets AI autonomously design AI across model architecture, data curation, and reinforcement‑learning algorithms, achieving up to three times the human‑level state‑of‑the‑art performance and demonstrating cross‑domain gains in drug‑target prediction.

AI-driven AI · ASI-EVOLVE · Cross-domain AI

0 likes · 12 min read

Machine Learning Algorithms & Natural Language Processing

Apr 28, 2026 · Artificial Intelligence

Can Reasoning Models Keep Improving? TEMPO Uses EM to Stop Reward Drift

The paper introduces TEMPO, a test‑time training framework inspired by the Expectation‑Maximization algorithm, which alternates policy optimization (M‑step) with Critic calibration (E‑step) to prevent reward‑signal drift, and demonstrates on Qwen3 and OLMO3 models that it continuously improves reasoning performance and maintains output diversity beyond the saturation point of existing TTT methods.

EM algorithm · Test-Time Training · large language models

0 likes · 14 min read

AI2ML AI to Machine Learning

Apr 28, 2026 · Artificial Intelligence

Which of the Three Types of AI Agents Are You Building?

The article classifies today’s booming AI agents into three categories—foundation‑model RL agents, OpenClaw‑style autonomous agents, and ontology‑driven agents—detailing their architectures, key components, comparative strengths, and how they converge toward the envisioned L4/L5 AGI stages.

AI agents · LLM · Multimodal

0 likes · 9 min read

Machine Heart

Apr 28, 2026 · Artificial Intelligence

Can LLMs Answer More Accurately While Writing Less? Introducing SHAPE’s Reasoning Tax

The SHAPE framework (Stage‑aware Hierarchical Advantage via Potential Estimation) adds a milestone‑based “reasoning tax” to large language model inference, providing step‑wise correctness signals and penalizing verbosity, which yields an average 3% accuracy gain and a 30% reduction in token consumption across multiple math‑reasoning benchmarks.

ACL 2026 · LLM · SHAPE

0 likes · 10 min read

Machine Heart

Apr 28, 2026 · Artificial Intelligence

World’s First Open‑Source Large Model for Real‑World Medical Video Understanding

The article introduces uAI‑NEXUS‑MedVLM, the world's first open‑source large model for real‑world medical video understanding, built on the MedVidBench dataset and the MedGRPO training framework, which together overcome data scarcity, evaluation gaps, and task specialization challenges in surgical video AI, achieving state‑of‑the‑art performance across eight benchmark tasks.

AI in Surgery · Large Language Model · MedVidBench

0 likes · 18 min read

PMTalk Product Manager Community

Apr 28, 2026 · Artificial Intelligence

First Principle for Agent Product Managers: Choosing Between Single Agent, Multi‑Agent Collaboration, and Workflow

The article presents a decision framework for AI product managers, mapping workflow determinism and context certainty to four technical patterns—traditional RPA + AI, single Agent + RAG/knowledge graph, end‑to‑end RL Agent, and multi‑Agent collaboration—each with concrete use‑case examples and selection guidelines.

AI agents · Multi-Agent Systems · RPA

0 likes · 6 min read

360 Tech Engineering

Apr 28, 2026 · Artificial Intelligence

How 360 AI Institute Boosted Airline Translation Accuracy from 70% to 96%

The 360 AI Research Institute tackled the zero‑tolerance translation demands of airline maintenance by building a specialized parallel corpus and applying RAG‑enhanced, SFT‑fine‑tuned, and RL‑reinforced models, raising Chinese‑to‑English translation accuracy from 70% to 96% and enabling a one‑month rollout.

AI translation · RAG · SFT

0 likes · 5 min read

AI Explorer

Apr 27, 2026 · Artificial Intelligence

Reinforcement Learning Scaling Law Shows How RL Fine‑Tuning Boosts Large Model Reasoning

A new study by USTC and Shanghai AI Lab uncovers a power‑law scaling relationship between RL fine‑tuning compute and large‑model reasoning performance, offering a quantitative way to predict and control AI capability growth.

AI research · Scaling Law · large language models

0 likes · 7 min read

Machine Heart

Apr 27, 2026 · Artificial Intelligence

ACL 2026: Unveiling a Predictive Scaling Law for Reinforcement Learning Fine‑Tuning of Large Models

The paper presents a systematic empirical study that derives a power‑law scaling formula for reinforcement‑learning post‑training of large language models, demonstrating accurate inter‑ and intra‑model performance prediction, learning‑efficiency saturation, data‑reuse benefits, and cross‑architecture validity.

Data Reuse · Llama 3 · Model Efficiency

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

Apr 25, 2026 · Artificial Intelligence

From Classic Multi-Agent Paradigms to Future Large-Foundation-Model-Driven Systems

This review surveys classic multi-agent systems and the emerging large-foundation-model-driven MAS paradigm, comparing their architectures, perception, communication, decision-making and control, and discusses how integrating LFMs enables semantic reasoning, greater adaptability, and new research challenges.

Agentic AI · Large Foundation Models · Multi-Agent Systems

0 likes · 8 min read

Alibaba Cloud Developer

Apr 24, 2026 · Artificial Intelligence

How Hermes Agent Achieves Self‑Evolution: A Deep Dive into Prompt, Context, and Harness Design

This article provides a detailed technical analysis of Hermes Agent, explaining how its dynamic skill generation and reinforcement‑learning loop enable true self‑evolution, and examines the prompt engineering, context compression, memory architecture, harness mechanisms, error handling, and plugin ecosystem that differentiate it from OpenClaw and Claude Code.

Agent framework · Hermes Agent · Prompt engineering

0 likes · 41 min read

Bighead's Algorithm Notes

Apr 22, 2026 · Artificial Intelligence

How DeepAries’s Adaptive Rebalancing Timing Boosts Portfolio Returns

DeepAries is a novel deep reinforcement‑learning framework that jointly learns when to rebalance a portfolio and how to allocate assets by combining a Transformer‑based state encoder with PPO, and extensive experiments on four major markets show it significantly outperforms fixed‑frequency baselines in risk‑adjusted return, transaction cost, and drawdown.

DeepAries · PPO · Portfolio Management

0 likes · 15 min read

AntTech

Apr 22, 2026 · Artificial Intelligence

How Multi‑Agent MCTS and Information‑Gain Rewards Are Transforming Mobile GUI and Search Agents

This article reviews two recent ICLR 2026 papers—M²‑Miner, a multi‑agent Monte‑Carlo Tree Search framework for low‑cost mobile GUI data mining, and IGPO, an information‑gain‑based reinforcement‑learning method that provides dense rewards for multi‑turn search agents—detailing their designs, experiments, and open‑source releases.

GUI Data Mining · Information Gain · LLM Agents

0 likes · 8 min read

Java Architect Essentials

Apr 21, 2026 · Artificial Intelligence

Why Cursor’s Composer 2 Beats Claude Opus 4.6 in Performance and Cost

Cursor’s new Composer 2 model outperforms Claude Opus 4.6 on benchmarks like Terminal‑Bench 2.0, slashes pricing to $0.5/2.5 USD per million tokens, and introduces a self‑summary reinforcement‑learning technique that dramatically reduces context loss in long‑running coding tasks.

AI programming · Composer 2 · Cursor

0 likes · 9 min read

Machine Heart

Apr 21, 2026 · Artificial Intelligence

Monet Enables Multimodal Models to Perform Human‑like Abstract Visual Thinking

Monet introduces a training paradigm that lets multimodal large language models reason directly in a continuous latent visual space, replacing external tool calls with implicit visual embeddings, and demonstrates significant gains on both in‑distribution perception tasks and out‑of‑distribution abstract visual reasoning through a three‑stage supervised fine‑tuning and a novel visual‑latent policy optimization.

Latent Embedding · MLLM · Multimodal

0 likes · 15 min read

AIWalker

Apr 20, 2026 · Artificial Intelligence

How VA‑π Bridges Tokenizers and Autoregressive Generators for Pixel‑Perfect Images

VA‑π introduces a lightweight post‑training framework that uses variational inference and reinforcement learning to align tokenizers with visual autoregressive generators, achieving dramatic quality gains, extreme training efficiency, and robust pixel‑level reconstruction across diverse image generation tasks.

Autoregressive Models · Pixel Alignment · post-training

0 likes · 14 min read

Data Party THU

Apr 20, 2026 · Artificial Intelligence

How MemPO Uses Reinforcement Learning to Turn Agent Memory into a Trainable Policy

MemPO introduces a self‑memory policy optimization framework that lets long‑horizon LLM agents autonomously manage and refine their memory via reinforcement learning, using global‑trajectory and informative‑memory advantage estimates, achieving up to 25.98% F1 gain and 73% token reduction on benchmark tasks.

LLM · Long-Horizon Agents · MemPO

0 likes · 8 min read

Baidu Maps Tech Team

Apr 20, 2026 · Artificial Intelligence

How Baidu Maps Reinvents LBS Search with Multi‑Agent AI and RL

Facing the shift from keyword indexing to generative AI, Baidu Maps overhauled its LBS architecture by introducing a native multi‑agent system, a context‑engineering (ACE) framework, and reinforcement‑learning alignment, enabling dynamic routing, knowledge evolution, and a 36% boost in planning compliance while maintaining zero tolerance for factual errors.

AI agents · LLM · Location‑based services

0 likes · 10 min read

Old Zhang's AI Learning

Apr 19, 2026 · Artificial Intelligence

From Zero to Deployment: A Complete Qwen3.5 Fine‑Tuning Guide

This guide shows how to fine‑tune Qwen3.5 models—from 0.8B to 122B—using Unsloth Studio or pure code, covering text SFT, vision fine‑tuning, MoE models, reinforcement‑learning (GRPO), extensive GGUF quantization benchmarks, hardware requirements, export formats, and deployment tips.

LLM · Qwen3.5 · Unsloth

0 likes · 12 min read

Machine Heart

Apr 19, 2026 · Artificial Intelligence

World Engine: How Post‑Training Is Launching a New Era of Physical AGI

World Engine introduces a post‑training pipeline that combines high‑fidelity 3DGS simulation, hard‑case mining with diffusion generation, and reinforcement‑learning optimization to give autonomous‑driving models true decision‑making ability, surpassing data‑scaling limits and achieving significant safety gains in both industrial simulations and real‑world tests.

Simulation · autonomous driving · hard case mining

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

Apr 16, 2026 · Artificial Intelligence

Efficient Reasoning with Reward Shaping: Compressing Qwen 30B‑Series Chains by 20‑40%

The article analyzes how reward‑shaping techniques can shorten the chain‑of‑thought outputs of Qwen 30B‑series models by 20‑40% while preserving or slightly improving performance on AIME‑25 and out‑of‑distribution benchmarks, and it details the experimental design, strategic considerations, and practical insights behind this efficient reasoning approach.

Efficient Inference · Qwen · reinforcement learning

0 likes · 16 min read

AI Explorer

Apr 16, 2026 · Artificial Intelligence

How NVIDIA, HKU, and MIT’s Sol‑RL Framework Supercharges Diffusion Model Training

NVIDIA, the University of Hong Kong (HKU), and MIT introduced the Sol‑RL framework, which uses reinforcement‑learning‑guided sampling to cut diffusion model training time severalfold without sacrificing image quality, potentially lowering entry barriers for small teams and shifting the AIGC industry toward efficiency‑driven competition.

AIGC · Diffusion Models · NVIDIA

0 likes · 6 min read

Xiaohongshu Tech REDtech

Apr 15, 2026 · Artificial Intelligence

How Relax Powers Scalable Multi‑Modal RL Training with Full‑Async Pipelines

Relax, an open‑source reinforcement‑learning engine from Xiaohongshu AI Platform, combines service‑oriented fault‑tolerant architecture, a distributed checkpoint service, and an asynchronous training pipeline to achieve up to 76% speed‑up and near‑zero overhead for multi‑modal RL workloads.

Asynchronous Pipeline · Multi-modal · Ray Serve

0 likes · 10 min read

SuanNi

Apr 12, 2026 · Artificial Intelligence

How MemPO Gives AI Agents Long‑Term Memory and Cuts Costs by 70%

The paper introduces MemPO, a self‑memory policy optimization algorithm that lets large language model agents actively manage their memory, dramatically improving accuracy on complex multi‑step tasks while reducing token consumption by up to 73%, and validates the approach with extensive experiments and analysis.

AI · Efficiency · Memory optimization

0 likes · 11 min read

CodeTrend

Apr 11, 2026 · Artificial Intelligence

Inside OpenClaw: Architecture, Core Technologies, and Security Risks

The article provides a detailed technical analysis of the OpenClaw AI‑agent framework, covering its three‑layer architecture, prompt compiler, heartbeat mechanism, file‑based memory, skill system, ReAct loop, model‑agnostic routing, reinforcement‑learning extension, security concerns, and a side‑by‑side comparison with Hermes Agent.

Agent framework · OpenClaw · file-based memory

0 likes · 13 min read

Machine Heart

Apr 11, 2026 · Artificial Intelligence

How 100,000 Hours of Human Data Propelled Psi‑R2 to Lead MolmoSpaces

Lingchu AI demonstrates that scaling human‑operation data to nearly 100,000 hours, combined with a two‑model system and reinforcement learning, can replace costly robot‑teleoperation data and achieve top performance on the MolmoSpaces benchmark.

Embodied AI · Psi-R2 · Psi-W0

0 likes · 12 min read

AI2ML AI to Machine Learning

Apr 10, 2026 · Artificial Intelligence

Why HermesAgent Outperforms OpenClaw: A Deep Source‑Code Analysis

The article dissects HermesAgent’s architecture, showing how it extends OpenClaw with self‑learning, reinforcement‑learning modules, and advanced prompt‑evolution techniques to mitigate token‑hole costs and achieve more deterministic results, while also detailing its TUI‑driven CLI and evaluation workflow.

DSPy · GEPA · HermesAgent

0 likes · 8 min read

Machine Heart

Apr 10, 2026 · Artificial Intelligence

AdaGen: Enabling Adaptive, Data‑Driven Strategies for Image Generation Models

AdaGen replaces handcrafted static schedules in multi‑step image generators with a universal, learnable policy network trained via reinforcement learning, using an MDP formulation, adversarial rewards and action smoothing, achieving consistent quality and efficiency gains across diffusion, autoregressive, mask and flow models while adding negligible overhead.

MDP · action smoothing · adaptive policy

0 likes · 11 min read

Machine Heart

Apr 9, 2026 · Artificial Intelligence

How TDM‑R1 Boosts Few‑Step Image Generation: GenEval Jumps from 61% to 92% and Beats GPT‑4o

The TDM‑R1 framework introduces a two‑stage reinforcement learning pipeline that lets 4‑step diffusion models achieve a GenEval score of 92%, surpassing 80‑step baselines and GPT‑4o while also fixing instruction compliance, text rendering, and compositional generation issues.

GenEval · OCR improvement · TDM-R1

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Apr 9, 2026 · Artificial Intelligence

How Data Flywheels Accelerate Small Agentic Model Training

This article details a data‑flywheel framework for training compact agentic language models, describing synthetic task generation, mock environment simulation, rubric‑based reward design, iterative hard‑sample augmentation, and experimental results that show consistent performance gains across benchmarks.

Data Augmentation · Synthetic Environments · agentic models

0 likes · 17 min read

Machine Heart

Apr 9, 2026 · Artificial Intelligence

From Direct Generation to Agentic Text-to-Image: Introducing the Open-Source Gen-Searcher

Gen-Searcher equips text-to-image models with search, reasoning, and web‑browsing capabilities, turning the traditional direct‑generation pipeline into an agentic system that fetches and verifies real‑world knowledge, dramatically improving accuracy and quality across multiple benchmarks.

Agentic AI · Gen-Searcher · KnowGen

0 likes · 7 min read

Machine Heart

Apr 8, 2026 · Artificial Intelligence

Meta Unveils Muse Spark: The First Model from Its Superintelligence Lab

Meta has launched Muse Spark, its inaugural model from the newly formed Superintelligence Lab, showcasing multimodal capabilities, tool use, visual chain‑of‑thought, and multi‑agent orchestration, while detailing pretraining scaling gains, reinforcement‑learning improvements, and test‑time reasoning efficiencies.

AI scaling · Meta · Muse Spark

0 likes · 9 min read

AIWalker

Apr 6, 2026 · Artificial Intelligence

How TIR‑Agent Turns Image‑Restoration Tools into a Learnable Decision‑Making Agent

The paper introduces TIR‑Agent, an image‑restoration agent that learns a tool‑calling policy via supervised fine‑tuning and reinforcement learning, addressing exploration stagnation and multi‑objective reward imbalance, and demonstrates over 2.5× faster inference and superior multi‑metric performance on synthetic and real degradation datasets.

Tool Scheduling · agent-based AI · computer vision

0 likes · 18 min read

DataFunSummit

Apr 5, 2026 · Industry Insights

How Datus AI Is Redefining Data Engineering with an Open‑Source Data Agent

This article examines Datus AI’s open‑source Data Engineering Agent, detailing its architecture, interactive context engineering, evaluation results, and future roadmap, and explains how it tackles the challenges of scaling AI‑driven data workflows.

AI agents · NL2SQL · open source

0 likes · 20 min read

Machine Heart

Apr 5, 2026 · Artificial Intelligence

Cut Token Costs by 68% with Dynamic Multi‑Agent Collaborative Coding

The paper introduces AgentConductor, a 3‑billion‑parameter orchestrator that generates adaptive YAML‑based multi‑agent topologies, dynamically re‑plans when code errors occur, achieving a 14.6% accuracy boost and up to 68% token‑cost reduction compared to existing static agent pipelines.

AgentConductor · LLM code generation · YAML topology

0 likes · 9 min read

AI Engineer Programming

Apr 5, 2026 · Artificial Intelligence

How Kimi, Cursor, and Chroma Use Reinforcement Learning to Train Agent Models

The article analyzes three recent technical reports—Moonshot AI's Kimi K2.5, Cursor's Composer 2, and Chroma's Context‑1—detailing how each system trains agent models with reinforcement learning, parallel orchestration, self‑summarization, and self‑editing, and highlights shared methodological themes and performance gains.

Chroma Context-1 · Cursor Composer · Kimi

0 likes · 19 min read

Machine Learning Algorithms & Natural Language Processing

Apr 4, 2026 · Artificial Intelligence

Why the Best SFT Checkpoint May Hurt RL Performance: Adaptive Early‑Stop Loss (AESL) for LLM Cold‑Start

The paper reveals that over‑optimizing supervised fine‑tuning (SFT) for large language models can diminish their reinforcement‑learning (RL) potential, proposes an Adaptive Early‑Stop Loss (AESL) that balances accuracy and output diversity during cold‑start, and demonstrates across multiple LLMs that AESL consistently yields superior RL results.

AI training · Adaptive Early‑Stop Loss · LLM

0 likes · 11 min read

Machine Heart

Apr 3, 2026 · Artificial Intelligence

Beyond Token Entropy: ReLaX Uses Latent Dynamics to Rethink Exploration‑Exploitation in LLM RL

The paper introduces ReLaX, a framework that shifts focus from token‑level entropy to the latent‑space dynamics of large models, employing Koopman operators and a Dynamic Spectral Divergence metric to quantitatively guide exploration‑exploitation balance, and demonstrates state‑of‑the‑art performance on both pure‑text and multimodal RL benchmarks.

Koopman operator · ReLaX · dynamic spectral divergence

0 likes · 12 min read

Machine Heart

Apr 2, 2026 · Artificial Intelligence

HSImul3R: Bridging Perception and Simulation for Physics‑Ready 3D Human‑Scene Interaction

HSImul3R introduces a physics‑in‑the‑loop reconstruction pipeline that closes the perception‑simulation gap by jointly optimizing human motion and scene geometry, leveraging reinforcement learning, direct simulation‑reward optimization, and a new HSIBench dataset to produce simulation‑ready 3D human‑scene interactions.

3D reconstruction · DSRO · HSIBench

0 likes · 12 min read

Machine Heart

Apr 2, 2026 · Artificial Intelligence

Breaking the Multi‑Robot Barrier: Sequential World‑Model Decomposition (ICLR 2026)

SeqWM introduces a sequential causal decomposition of joint dynamics, allowing each robot to model its marginal contribution conditioned on prior agents, which simplifies world‑model learning, enables intent‑sharing planning via MPPI, and achieves superior performance in challenging simulation benchmarks and real‑robot tests.

MPPI · SeqWM · model-based RL

0 likes · 7 min read

Lao Guo's Learning Space

Apr 1, 2026 · Artificial Intelligence

Humans Achieve 100% While Top AI Models Score Below 0.4% on ARC‑AGI‑3 Benchmark

In the ARC‑AGI‑3 test, 486 randomly selected human participants solved all 150+ game‑based puzzles, a 100% success rate, in a median of 7.4 minutes, whereas leading models such as GPT‑5, Claude Opus 4.6, Gemini 3.1 Pro and Grok 4.20 managed at most 0.37%, exposing a stark gap in meta‑cognitive reasoning.

AGI · ARC-AGI-3 · benchmark

0 likes · 9 min read

Bighead's Algorithm Notes

Mar 31, 2026 · Artificial Intelligence

Top AI-Driven Quantitative Finance Papers from AAAI 2026

This article curates and summarizes recent AI research papers presented at AAAI 2026 that advance quantitative finance, covering controllable market generation, LLM‑powered alpha factor mining, risk‑aware multi‑agent portfolio management, foundation models for market data, and reinforcement‑learning trading policies.

AI · Diffusion Models · Financial Market Simulation

0 likes · 12 min read

Machine Heart

Mar 31, 2026 · Artificial Intelligence

Can LLM Judges Be Trusted? TrustJudge Leverages Full Probability Distributions

LLM judges often produce contradictory scores and non‑transitive preferences; the TrustJudge framework replaces discrete scoring with distribution‑sensitive scoring and likelihood‑aware aggregation, dramatically reducing both score‑comparison and pairwise‑transitivity inconsistencies across multiple model families, improving accuracy and even serving as a reward signal for RL training.

LLM evaluation · Reward Modeling · TrustJudge

0 likes · 12 min read

Shi's AI Notebook

Mar 30, 2026 · Artificial Intelligence

AI Daily Digest March 30, 2026: Open‑Source Tools, Model Releases, and Research Highlights

The March 30 AI daily digest curates recent open‑source voice input and TypeScript libraries, new development workflows, a 30B parameter model that runs on 24 GB GPUs, and NVIDIA's PivotRL research that reduces reinforcement‑learning rollouts while matching end‑to‑end performance, all with concrete benchmarks and links.

AI tools · TypeScript · agent workflow

0 likes · 13 min read

Machine Heart

Mar 30, 2026 · Artificial Intelligence

Proactive Interaction for Video Multimodal Models: MMDuet2 & ProactiveVideoQA

This article surveys the ICLR 2026 papers ProactiveVideoQA and MMDuet2, detailing how video multimodal large models can decide when to reply autonomously, the PAUC benchmark for evaluating timeliness and accuracy, a reinforcement‑learning training pipeline that requires no precise timestamps, and experimental findings on data construction, frame‑sampling density, and SOTA performance.

MMDuet2 · PAUC · benchmark

0 likes · 17 min read

Bighead's Algorithm Notes

Mar 29, 2026 · Artificial Intelligence

How MetaTrader Uses Reinforcement Learning to Boost Trading Strategy Generalization

The article reviews the MetaTrader method, which formulates sequential portfolio optimization as a partially offline reinforcement‑learning problem, introduces a double‑layer RL algorithm and a conservative TD objective to improve out‑of‑distribution generalization, and demonstrates superior performance on CSI‑300 and NASDAQ‑100 datasets compared with existing baselines.

Financial Trading · MetaTrader · OOD data augmentation

0 likes · 15 min read

DataFunSummit

Mar 29, 2026 · Artificial Intelligence

How Code Intelligence Is Evolving: From Foundation Models to Repository‑Level Agents

This article reviews the rapid evolution of code intelligence, covering the history of code foundation models, reinforcement‑learning optimizations, scaling‑law insights, the LoopCoder architecture, rigorous multi‑level evaluation suites, and the emergence of repository‑level code agents, while highlighting open‑source contributions such as Qwen‑Coder.

code evaluation · code-intelligence · reinforcement learning

0 likes · 15 min read

Machine Heart

Mar 29, 2026 · Artificial Intelligence

Scaling World Model Dynamics to Over a Thousand Steps in Two ICLR Papers

The article reviews two ICLR papers by Haoxin Lin that advance world‑model dynamics from single‑step bootstrapping to any‑step direct prediction, introduce structured uncertainty via backtracking, and achieve stable full‑horizon roll‑outs of over a thousand steps, dramatically improving both online and offline reinforcement‑learning performance.

Offline RL · any-step prediction · dynamics modeling

0 likes · 16 min read

PaperAgent

Mar 29, 2026 · Industry Insights

From Reasoning to Agentic Thinking: How Harnesses Are Redefining AI Development

The article examines the shift from traditional reasoning‑based large‑language‑model pipelines to agentic, harness‑driven AI systems, outlining the definition of a harness, its engineering challenges, architectural components, and the broader implications for training, reinforcement learning, and future research directions.

AI Harness · Intelligent agents · Model Training

0 likes · 16 min read

Bighead's Algorithm Notes

Mar 26, 2026 · Artificial Intelligence

Paper Reading: ArchetypeTrader – A Reinforcement‑Learning Framework for Selecting and Optimizing Crypto Trading Strategies

The article reviews the ArchetypeTrader framework, which addresses market‑segmentation and demonstration‑data issues in cryptocurrency reinforcement learning by discovering discrete trading archetypes, selecting them via a hierarchical RL agent, and refining actions with a regret‑aware adapter, achieving superior profit and risk‑adjusted returns across multiple markets.

cryptocurrency trading · hierarchical reinforcement learning · regret-aware optimization

0 likes · 16 min read

Alibaba Cloud Big Data AI Platform

Mar 25, 2026 · Artificial Intelligence

Scaling Multimodal Reinforcement Learning with NVIDIA Isaac Lab and TiledCamera

This article explains how to use NVIDIA Isaac Lab and the TiledCamera component to run large‑scale, multimodal reinforcement learning on GPU clusters, covering environment setup, noVNC visualization, command‑line execution, distributed training with torchrun, and performance analysis across multiple GPU configurations.

GPU scaling · NVIDIA Isaac Lab · TiledCamera

0 likes · 12 min read

Bighead's Algorithm Notes

Mar 24, 2026 · Artificial Intelligence

How an Interactive Imitation‑Learning Agent Framework Trains Robust Trading Strategies

The article analyzes the simulation‑reality gap in algorithmic trading and proposes an interactive market simulator that combines a pool of imitation‑learning agents, an action‑synthesis network, and a DDPG‑based reinforcement‑learning trader, showing superior robustness and downside protection on QQQ data.

Agent-based Modeling · DDPG · financial AI

0 likes · 16 min read

SuanNi

Mar 24, 2026 · Artificial Intelligence

How Memento‑Skills Enables Self‑Evolving LLMs Without Fine‑Tuning

Memento‑Skills is a novel framework that freezes LLM parameters while an external skill library iteratively reads, writes, and refines capabilities, achieving up to 116% accuracy gains on the GAIA and HLE benchmarks and demonstrating scalable self‑evolution without costly model fine‑tuning.

LLM · reinforcement learning · self-evolution

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

Mar 22, 2026 · Artificial Intelligence

NS-Diff: Adding a Physics Engine to Diffusion Models for Fluid and Rigid‑Body Dynamics

The CVPR 2026 paper introduces NS‑Diff, a physics‑guided video diffusion framework that combines a noise‑robust dynamics detector, a physical‑condition latent injection module, and reinforcement‑learning optimization to reduce jerk error by 43% and fluid divergence by 33%, achieving superior physical realism and visual quality across multiple benchmarks.

CVPR 2026 · NS‑Diff · Navier-Stokes

0 likes · 13 min read

DataFunTalk

Mar 22, 2026 · Artificial Intelligence

Why Cursor’s Composer 2 Beats Claude Opus 4.6 in Performance and Price

Cursor’s new Composer 2 programming model outperforms Claude Opus 4.6 on benchmarks like Terminal‑Bench 2.0 and SWE‑bench Multilingual, while slashing token costs to $0.5 per million input tokens and $2.5 per million output tokens, thanks to a novel self‑summary reinforcement‑learning technique that enables efficient long‑context processing.

AI · Large Language Model · pricing

0 likes · 8 min read

PaperAgent

Mar 22, 2026 · Artificial Intelligence

Can LLM Agents Self‑Evolve Without Retraining? Inside Memento‑Skills

The article analyzes the Memento‑Skills framework, which treats external memory as executable skills to enable deployment‑time continual learning for frozen LLM agents, detailing its read‑write reflective loop, skill‑as‑memory design, behavior‑trained skill router, experimental validation on GAIA and HLE benchmarks, and theoretical guarantees without gradient updates.

AI · Agent · Continual Learning

0 likes · 9 min read

AI Engineering

Mar 21, 2026 · Industry Insights

Is Cursor’s Composer 2 Powered by Kimi? The Truth Is More Complex

A developer uncovered that Cursor’s Composer 2 actually runs on the Kimi K2.5 model with reinforcement learning, prompting a rapid licensing dispute that ended with official confirmation and highlights the opaque yet collaborative nature of today’s open AI model ecosystem.

AI model licensing · Composer 2 · Cursor

0 likes · 4 min read

Bighead's Algorithm Notes

Mar 20, 2026 · Artificial Intelligence

Weekly Quantitative Finance Paper Summaries (Mar 14‑Mar 20, 2026)

This article compiles abstracts of four recent AI‑driven quantitative finance papers, covering an autonomous factor‑investing framework, a program‑level factor‑mining system, an adaptive regime‑aware stock‑price predictor with reinforcement learning, and a comprehensive analysis of AI agents in financial markets.

AI agents · factor investing · large language models

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

Mar 20, 2026 · Artificial Intelligence

Cursor’s Composer 2 Beats Claude Opus 4.6 with ‘Ankle‑Cut’ Pricing via New Reinforcement‑Learning Method

Cursor’s newly released Composer 2 model surpasses Claude Opus 4.6 on benchmarks such as Terminal‑Bench 2.0, offers dramatically lower token pricing, and achieves these gains by introducing a novel self‑summary reinforcement‑learning technique that compresses long‑context tasks while preserving critical information.

Composer 2 · Cursor · LLM

0 likes · 9 min read

AI Explorer

Mar 20, 2026 · Industry Insights

Key AI Breakthroughs and Market Moves on March 20 2026

On March 20, 2026, Alibaba’s Qwen 3.5‑Max topped the LMArena blind test, OpenAI bought Astral to boost AI coding, Zhejiang University released a real‑time 4D world model, Meta’s Agent leaked data, and a series of AI‑driven innovations from Nvidia and robotics to drug discovery reshaped the industry.

AI · AI design tools · AI hardware

0 likes · 7 min read

Machine Learning Algorithms & Natural Language Processing

Mar 19, 2026 · Artificial Intelligence

From Solving to Evolving: How RETROAGENT Gives AI Agents Real Retrospective Learning

The article analyzes the RETROAGENT framework, showing how its dual intrinsic feedback and memory‑buffer mechanisms enable LLM agents to move beyond solving tasks toward continual evolution, and presents benchmark results that demonstrate significant performance gains and strong test‑time adaptation across four challenging environments.

LLM Agents · RETROAGENT · dual intrinsic feedback

0 likes · 7 min read

Machine Learning Algorithms & Natural Language Processing

Mar 19, 2026 · Artificial Intelligence

From Language Modeling to World Modeling: Limits of Large Language Models

Speaker Li Yixia from Southern University of Science and Technology presents a talk on using large language models as textual world models, defining a three‑layer evaluation framework and showing through experiments that fine‑tuned models improve next‑state prediction and agent performance, yet face limits tied to behavior coverage and environment complexity.

agent performance · evaluation framework · large language models

0 likes · 4 min read

Xiaomi Tech

Mar 18, 2026 · Artificial Intelligence

Xiaomi Unveils MiMo-V2-TTS: Giving Agents a Voice with Soul

Xiaomi introduces MiMo-V2-TTS, a self‑developed speech‑synthesis large model that combines a custom audio tokenizer, multi‑codebook architecture, massive pre‑training on over a hundred million hours of data and multi‑dimensional reinforcement learning to deliver fine‑grained style control, dialect support, role‑play and high‑quality singing, aiming to give AI agents expressive, human‑like voices.

Speech synthesis · audio tokenizer · large model

0 likes · 6 min read

AI Explorer

Mar 17, 2026 · Artificial Intelligence

RISE Enables Breakthrough in Vision‑Language‑Action Learning for Embodied AI

The article examines the limitations of vision‑language‑action (VLA) models in real‑world tasks, explains how the RISE technique from Hong Kong University uses internal simulation, reflection and imagination to cut training costs by an order of magnitude, and discusses its implications for future embodied AI.

Embodied AI · RISE · VLA

0 likes · 6 min read

Software Engineering 3.0 Era

Mar 17, 2026 · Artificial Intelligence

How Learning Theory Drives AI‑Powered Software Engineering 3.0

The article explains how machine‑learning theory, especially large‑language‑model training and Reinforcement Learning from Human Feedback, underpins Software Engineering 3.0 by turning code generation into a data‑driven learning process, reshaping cognition, alignment, and continuous system evolution.

Distributed Cognition · RLHF · large language models

0 likes · 12 min read

Machine Learning Algorithms & Natural Language Processing

Mar 15, 2026 · Artificial Intelligence

Is RL Dead in LLM Post-Training? MIT’s RandOpt Challenges Traditional Methods

The MIT‑CSAIL paper introduces RandOpt, a single‑step, gradient‑free, fully parallel post‑training algorithm that adds Gaussian noise to pretrained LLM weights and ensembles the results, achieving or surpassing PPO/GRPO performance by exploiting dense "neural thickets" that emerge as model scale grows.

Ensemble · LLM · RandOpt

0 likes · 12 min read

SuanNi

Mar 12, 2026 · Artificial Intelligence

How OpenClaw‑RL Turns Everyday Interactions into Self‑Evolving AI

OpenClaw‑RL, a new reinforcement‑learning framework from Princeton, captures hidden evaluative and instructional signals in daily user interactions, converts them into real‑time training data, and uses a decoupled asynchronous architecture with binary RL and online policy distillation to achieve superior performance in both personal‑device and cloud‑scale scenarios.

AI Feedback · Asynchronous Architecture · Online Distillation

0 likes · 10 min read

AIWalker

Mar 12, 2026 · Artificial Intelligence

BeautyGRPO: RL‑Driven Realistic Portrait Retouching Ends Over‑Beautification (CVPR 2026)

The paper introduces BeautyGRPO, a reinforcement‑learning framework that combines a fine‑grained preference dataset (FRPref‑10K) with Dynamic Path Guidance to balance aesthetic enhancement and high‑fidelity preservation in portrait retouching, achieving superior metrics and user preference over existing SFT and RL models.

AI aesthetics · CVPR 2026 · dynamic path guidance

0 likes · 11 min read

Didi Tech

Mar 12, 2026 · Artificial Intelligence

How STAPO Improves Large‑Model Fine‑Tuning by Silencing Spurious Tokens

The STAPO (Spurious‑Token‑Aware Policy Optimization) algorithm, introduced by Tsinghua University's iDLab and Didi's Deep Sea Lab, tackles policy‑entropy instability and performance oscillation in reinforcement‑learning fine‑tuning of large models by mathematically analyzing token collision probability, defining spurious tokens, and applying a Silencing Spurious Tokens mechanism that yields state‑of‑the‑art results on multiple math‑reasoning benchmarks.

AI safety · STAPO · fine-tuning

0 likes · 7 min read

DataFunTalk

Mar 11, 2026 · Artificial Intelligence

Agent Lightning: Decoupling Optimizers to Empower AI Agents via Reinforcement Learning

Agent Lightning, an open‑source system from Microsoft Research Asia, introduces a novel optimizer‑agent disaggregation architecture that enables any AI agent to benefit from reinforcement learning, offering non‑intrusive experience capture, programmable pipelines, and flexible signal passing, while addressing real‑world challenges of scalability, multi‑step tasks, and zero‑code integration.

Agent Lightning · Experience Capture · Learning Systems

0 likes · 21 min read

DataFunSummit

Mar 10, 2026 · Artificial Intelligence

How Agent Lightning Redefines AI Agent Learning with Optimizer‑Agent Decoupling

The article explores the paradigm shift toward AI agents in 2025, detailing the open‑source Agent Lightning project’s architecture, non‑intrusive experience capture, programmable pipelines, and experimental results that demonstrate its ability to enable reinforcement learning for any agent with minimal code changes.

Agent Lightning · Open Source Framework · machine learning

0 likes · 20 min read

PaperAgent

Mar 10, 2026 · Artificial Intelligence

How MemSifter Delivers High‑Precision, Low‑Cost Long‑Term Memory for LLMs

MemSifter introduces a lightweight agent that outsources memory retrieval for large language models, using a Think‑and‑Rank pipeline and a task‑result‑oriented reinforcement‑learning training paradigm to achieve superior retrieval accuracy and efficiency across eight benchmark tasks while keeping inference overhead minimal.

Agent · Efficiency · LLM

0 likes · 13 min read

AI Explorer

Mar 6, 2026 · Artificial Intelligence

AReaL: Lightning‑Fast Asynchronous RL Engine for Building High‑Performance LLM Agents

AReaL, an open‑source, fully asynchronous reinforcement‑learning platform co‑developed by Tsinghua University and Ant Group, dramatically speeds up training of complex LLM agents, offering a simple, stable, and hardware‑flexible solution for developers seeking industrial‑grade AI agents.

AI Infrastructure · AReaL · Asynchronous Training

0 likes · 7 min read

Tencent Cloud Developer

Mar 5, 2026 · Artificial Intelligence

20 Cutting‑Edge RAG Optimization Techniques: From Semantic Chunking to Self‑RAG

This article systematically presents twenty practical RAG (Retrieval‑Augmented Generation) optimization methods—covering semantic chunking, chunk‑size evaluation, context‑enhanced retrieval, query transformation, re‑ranking, feedback loops, multimodal and graph RAG, hierarchical retrieval, HyDE, Self‑RAG and reinforcement‑learning‑enhanced RAG—each with clear Python code examples, advantages, limitations and ideal use‑cases.

AI · LLM · RAG

0 likes · 57 min read

Kuaishou Tech

Mar 4, 2026 · Artificial Intelligence

How LLMs Are Revolutionizing Reinforcement Learning for Recommendation Systems

This survey examines the emerging LLM‑RL collaborative recommendation paradigm, outlining its research background, five main collaboration patterns, standardized evaluation protocols, and the key challenges and future directions for building smarter, more robust recommender systems.

LLM · Recommendation Systems · artificial-intelligence

0 likes · 14 min read

Woodpecker Software Testing

Mar 4, 2026 · Artificial Intelligence

Deep Dive into Adversarial Testing Performance Optimization for AI Systems

The article examines Adversarial Testing Performance Optimization (ATPO) as a new industrial-quality paradigm, detailing how adversarial samples expose hidden performance bottlenecks across AI pipelines, presenting three typical adversarial loads with corresponding optimization targets, common implementation pitfalls, and emerging intelligent approaches using reinforcement learning and digital twins.

AI pipelines · Digital Twin · Performance Optimization

0 likes · 8 min read

PaperAgent

Mar 3, 2026 · Artificial Intelligence

How CharacterFlywheel Scales Engaging LLMs: 15 Iterations of Production Optimization

The article presents CharacterFlywheel, a 15‑generation flywheel methodology that iteratively improves social‑dialogue LLMs in production using data‑driven reward models, rejection sampling, and a mix of SFT, DPO, and RL, with detailed experiments and best‑practice insights.

AI safety · LLM Optimization · Reward Modeling

0 likes · 12 min read

PaperAgent

Mar 2, 2026 · Artificial Intelligence

SKILLRL: Boosting LLM Agents with Skill Distillation and Recursive Evolution

SKILLRL introduces a novel framework that transforms raw LLM agent trajectories into compact, reusable skills via experience‑driven distillation, hierarchical skill banks, and recursive skill evolution, achieving up to 90% success on ALFWorld and 73% on WebShop while reducing token usage by over 10% compared to memory‑based baselines.

LLM Agents · SKILLRL · hierarchical skill bank

0 likes · 10 min read

AI Explorer

Mar 2, 2026 · Artificial Intelligence

OpenSandbox: Alibaba’s Open‑Source AI Sandbox for Secure, Scalable Agent Execution

OpenSandbox, an open‑source sandbox platform from Alibaba, offers a unified, secure, and extensible execution environment for AI agents, code execution, and reinforcement‑learning workloads, leveraging Docker and high‑performance Kubernetes runtimes, with multi‑language SDKs and fine‑grained network controls.

AI agents · AI sandbox · Docker

0 likes · 7 min read

Xiaomi Tech

Mar 2, 2026 · Artificial Intelligence

How Xiaomi’s Tactile‑Enabled Robot Graduates from Lab to Automotive Assembly Line

The article details Xiaomi Robotics' transition of its VLA‑based robot with TacRefineNet tactile perception from laboratory experiments to a real automotive factory, achieving a 90.2% dual‑side success rate over three hours while meeting a 76‑second production cycle, and explains the end‑to‑end data‑driven control, multimodal sensing, whole‑body motion strategy, failure cases, and open resources.

TacRefineNet · VLA model · Xiaomi Robotics

0 likes · 8 min read

AI Frontier Lectures

Feb 28, 2026 · Artificial Intelligence

Can Reinforcement Learning Revolutionize Text-to-3D Generation? A Deep Dive

This article presents a systematic investigation of applying reinforcement learning to text‑to‑3D generation, detailing reward design, algorithm selection, a new 3D benchmark, a hierarchical GRPO framework, extensive ablations, and the resulting performance gains and limitations.

AI research · generative models · reinforcement learning

0 likes · 13 min read

Baobao Algorithm Notes

Feb 24, 2026 · Artificial Intelligence

The Bitter Lesson of Building Agentic RL in Terminal Environments

This article recounts the challenges of moving from single‑step RL with verifiable rewards to multi‑step agentic reinforcement learning in terminal environments, detailing infrastructure design, asynchronous pipelines, data quality checks, masking strategies, curriculum training, chunk‑based optimization, and practical lessons learned from large‑scale experiments.

Agentic RL · Asynchronous Training · Credit Assignment

0 likes · 33 min read

Machine Learning Algorithms & Natural Language Processing

Feb 23, 2026 · Artificial Intelligence

System Engineering Behind Billions of Parameters: Insider Training Details from Seven Top AI Labs

This article systematically dissects the engineering decisions behind frontier large‑language‑model training—covering architecture choices, attention variants, optimizer evolution, data‑curation strategies, scaling‑law insights, and post‑training SFT/RL pipelines—based on open‑source reports from seven leading AI laboratories.

Mixture of Experts · Model Training · large language models

0 likes · 26 min read

HyperAI Super Neural

Feb 19, 2026 · Artificial Intelligence

World Model & VLA Breakthroughs: Top Papers from NVIDIA, ByteDance, Tsinghua and Others

This roundup highlights six recent embodied AI papers that advance world models and vision‑language‑action (VLA) techniques, covering DreamDojo's massive first‑person video model, LingBot‑World simulator, Agent World Model generator, BagelVLA, ACoT‑VLA, and the closed‑loop World‑VLA‑Loop framework.

Embodied AI · Synthetic Environments · reinforcement learning

0 likes · 8 min read

Machine Learning Algorithms & Natural Language Processing

Feb 18, 2026 · Artificial Intelligence

Microsoft’s 671B LLM Unifies Offline Ad Tasks—Can It Cut Compute Costs?

Microsoft’s AdNanny replaces a forest of specialized offline models with a single 671 B LLM, using a three‑stage data factory to generate reasoning‑rich corpora, dynamic task re‑weighting, RL‑based metric alignment, and a hybrid 31‑pipeline‑parallel architecture that halves compute cost while boosting performance on core ad‑ranking tasks.

AdNanny · LLM · dynamic weighting

0 likes · 9 min read

Old Zhang's AI Learning

Feb 16, 2026 · Artificial Intelligence

Qwen3.5 Deep Dive: Multimodal Architecture, Benchmarks, and Deployment Guide

This article provides a detailed analysis of Qwen3.5, covering its multimodal MoE design, massive inference speedups, extensive benchmark results against GPT‑5.2, Claude 4.5 Opus and Gemini‑3 Pro, RL scaling strategies, training infrastructure innovations, and practical usage via API and local deployment.

FP8 training · Large Language Model · Multimodal AI

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

Feb 15, 2026 · Artificial Intelligence

Embedding Error Correction into the Policy Space: How Search‑R2 Redefines Search‑Enhanced Reasoning

The Search‑R2 framework integrates error detection, localization, and correction into a reinforcement‑learning loop for search‑enhanced reasoning, achieving notably larger accuracy gains on difficult multi‑hop QA tasks than baseline methods, even when those baselines receive higher sampling budgets.

Agentic AI · Error Correction · Multi-hop QA

0 likes · 15 min read