Tagged articles

reinforcement learning

743 articles · Page 2 of 8
Alimama Tech
May 7, 2026 · Artificial Intelligence

Dual‑Phase RL‑LLM Framework DARA for Few‑Shot Online Advertising Budget Allocation

The DARA framework splits online advertising budget allocation into a few‑shot LLM reasoning stage and a fine‑grained optimizer stage, enhanced by a dynamically updated RL‑fine‑tuning algorithm (GRPO‑Adaptive), achieving significantly lower ROI variance than traditional baselines in both real and simulated environments.

LLM · Online Advertising · budget allocation
0 likes · 16 min read
PaperAgent
May 7, 2026 · Artificial Intelligence

190 Must-Read AI Agent Papers + 321 Google Implementation Cases – Free Resource Pack

The article provides a free compiled resource containing 190 essential AI Agent papers—from fundamentals to cutting‑edge topics—along with 321 Google‑released implementation cases and 500 open‑source agent applications, all with source code to help beginners and researchers quickly understand the field and reproduce results.

AI Agent · LLM · Research Papers
0 likes · 6 min read
Machine Heart
May 6, 2026 · Artificial Intelligence

Can Adaptive Guidance Unlock Small Model Reasoning? Introducing G²RPO‑A

The paper identifies reward sparsity as the core obstacle for small language models in reinforcement‑learning‑based reasoning and proposes G²RPO‑A, which injects high‑quality thinking trajectories and dynamically adjusts guidance length. It reports large accuracy gains on math and code benchmarks: Qwen3‑1.7B improves from 50.96% to 67.21% on MATH500 and from 46.08% to 75.93% on HumanEval.

G²RPO‑A · adaptive guidance · code generation
0 likes · 10 min read
Machine Heart
May 6, 2026 · Artificial Intelligence

PromptEcho: Leveraging Frozen Multimodal Models for High‑Quality Text‑to‑Image Rewards Without Labels

PromptEcho computes a continuous reward for text‑to‑image generation by measuring how well a frozen vision‑language model can reconstruct the original prompt from the generated image, eliminating the need for annotated data or a trained reward model and outperforming prior methods across multiple benchmarks.

PromptEcho · Reward Modeling · benchmark
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
May 5, 2026 · Artificial Intelligence

LLMBeginner: A Project‑Based Roadmap for Zero‑Base Mastery of Large Language Models

The LLMBeginner project from the MLNLP community offers a staged, project‑oriented learning path—covering big‑picture concepts, deep learning and reinforcement learning fundamentals, LLM theory and practice, and agent development—to guide beginners from fragmented resources to systematic mastery, with both concise and detailed versions hosted on GitHub.

Agent · GitHub · LLM
0 likes · 5 min read
Data Party THU
May 4, 2026 · Artificial Intelligence

Understanding the Mathematical Foundations of Reinforcement Learning

This article provides a concise overview of a ten‑chapter reinforcement‑learning textbook, outlining the progression from basic concepts such as states and rewards to advanced algorithms like policy gradients and actor‑critic methods, and explains how each chapter builds on the previous ones.

Bellman equation · Monte Carlo · actor-critic
0 likes · 11 min read
Machine Learning Algorithms & Natural Language Processing
May 2, 2026 · Artificial Intelligence

Real-World Large-Scale Test Shows Robots Learning While Deploying Outperform Baselines on Eight Tasks

The article presents the LWD (Learning While Deploying) framework, detailing its reinforcement‑learning‑driven data flywheel, the DIVL value‑evaluation and QAM policy‑optimization modules, and experimental results where a dual‑arm robot improves success rates by up to 17% and reduces cycle time by 23.75 seconds across eight real‑world tasks, surpassing strong baselines.

DIVL · Data Flywheel · LWD
0 likes · 12 min read
AI Explorer
May 2, 2026 · Industry Insights

AI Industry Highlights May 2, 2026: Funding Surge, New Tools, and Research Breakthroughs

In May 2026, the AI sector saw a 77% rise in capital spending by the four biggest tech firms, Meta's acquisition of robot startup ARI, reinforcement‑learning advances boosting LLM inference, OpenAI's ChatGPT Images 2.0 launch, Tencent's Hy‑MT model outperforming Google, Microsoft's legal‑AI assistant, a 400B model running on iPhone, and notable research from CMU and independent scholars.

AI Investment · CMU research · Meta
0 likes · 5 min read
Machine Heart
May 1, 2026 · Artificial Intelligence

From PPO to MaxRL: The Evolution of Reinforcement Learning for LLM Inference

This article surveys the rapid evolution of reinforcement‑learning algorithms for large‑language‑model inference from early REINFORCE and PPO to newer approaches such as GRPO, RLOO, DAPO, CISPO, DPPO, ScaleRL and MaxRL, highlighting their design motivations, mathematical formulations, empirical trade‑offs and open research challenges.

GRPO · LLM · MaxRL
0 likes · 27 min read
Machine Heart
Apr 30, 2026 · Artificial Intelligence

Why GPT‑5 Models Keep Talking About Goblins: RL Reward Leakage Uncovered

The article analyzes how DeepSeek’s "极" bug and OpenAI’s recurring "goblin" output stem from unclean training data and an unintended reinforcement‑learning reward bias, showing how a persona‑specific habit leaked into general model behavior and how engineers responded.

GPT-5 · Goblin bug · Nerdy persona
0 likes · 8 min read
Machine Heart
Apr 30, 2026 · Artificial Intelligence

How LWD Redefines Embodied AI Training with Fleet‑Scale Reinforcement Learning

LWD (Learning While Deploying) introduces a distributed multi‑robot reinforcement‑learning framework that continuously improves VLA policies during real‑world deployment, leveraging DIVL, QAM, dynamic n‑step TD and an asynchronous actor‑learner architecture to achieve over 90% success on five‑minute tasks and outperform traditional behavior‑cloning, HG‑Dagger and RECAP baselines.

Embodied AI · LWD · VLA
0 likes · 13 min read
PaperAgent
Apr 30, 2026 · Artificial Intelligence

How Agentic AI is Redefining World Modeling

The article reviews the paper "Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond", introducing a two‑axis framework (capability levels L1‑L3 and law domains) to map diverse world‑modeling systems, highlighting that most current systems stall at L1, that explicit law encoding is crucial for long‑term stability, and that L3 represents the ultimate, self‑evolving model.

AI agents · AI research · Agentic AI
1 like · 6 min read
SuanNi
Apr 28, 2026 · Artificial Intelligence

ASI‑EVOLVE: AI Designs AI and Beats Human SOTA by Almost Three‑Fold

The open‑source ASI‑EVOLVE framework lets AI autonomously design AI across model architecture, data curation, and reinforcement‑learning algorithms, achieving up to three times the human‑level state‑of‑the‑art performance and demonstrating cross‑domain gains in drug‑target prediction.

AI-driven AI · ASI-EVOLVE · Cross-domain AI
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
Apr 28, 2026 · Artificial Intelligence

Can Reasoning Models Keep Improving? TEMPO Uses EM to Stop Reward Drift

The paper introduces TEMPO, a test‑time training framework inspired by the Expectation‑Maximization algorithm, which alternates policy optimization (M‑step) with critic calibration (E‑step) to prevent reward‑signal drift. Experiments on Qwen3 and OLMO3 models show that it keeps improving reasoning performance and maintains output diversity beyond the saturation point of existing TTT methods.

EM algorithm · Test-Time Training · large language models
0 likes · 14 min read
AI2ML AI to Machine Learning
Apr 28, 2026 · Artificial Intelligence

Which of the Three Types of AI Agents Are You Building?

The article classifies today’s booming AI agents into three categories—foundation‑model RL agents, OpenClaw‑style autonomous agents, and ontology‑driven agents—detailing their architectures, key components, comparative strengths, and how they converge toward the envisioned L4/L5 AGI stages.

AI agents · LLM · Multimodal
0 likes · 9 min read
Machine Heart
Apr 28, 2026 · Artificial Intelligence

Can LLMs Answer More Accurately While Writing Less? Introducing SHAPE’s Reasoning Tax

The SHAPE framework (Stage‑aware Hierarchical Advantage via Potential Estimation) adds a milestone‑based “reasoning tax” to large language model inference, providing step‑wise correctness signals and penalizing verbosity, which yields an average 3% accuracy gain and a 30% reduction in token consumption across multiple math‑reasoning benchmarks.

ACL 2026 · LLM · SHAPE
0 likes · 10 min read
Machine Heart
Apr 28, 2026 · Artificial Intelligence

World’s First Open‑Source Large Model for Real‑World Medical Video Understanding

The article introduces uAI‑NEXUS‑MedVLM, described as the world's first open‑source large model for real‑world medical video understanding. Built on the MedVidBench dataset and the MedGRPO training framework, it addresses data scarcity, evaluation gaps, and task specialization challenges in surgical video AI and achieves state‑of‑the‑art performance across eight benchmark tasks.

AI in Surgery · Large Language Model · MedVidBench
0 likes · 18 min read
PMTalk Product Manager Community
Apr 28, 2026 · Artificial Intelligence

First Principle for Agent Product Managers: Choosing Between Single Agent, Multi‑Agent Collaboration, and Workflow

The article presents a decision framework for AI product managers, mapping workflow determinism and context certainty to four technical patterns—traditional RPA + AI, single Agent + RAG/knowledge graph, end‑to‑end RL Agent, and multi‑Agent collaboration—each with concrete use‑case examples and selection guidelines.

AI agents · Multi-Agent Systems · RPA
0 likes · 6 min read
360 Tech Engineering
Apr 28, 2026 · Artificial Intelligence

How 360 AI Institute Boosted Airline Translation Accuracy from 70% to 96%

The 360 AI Research Institute tackled the zero‑tolerance translation demands of airline maintenance by building a specialized parallel corpus and applying RAG‑enhanced, SFT‑fine‑tuned, and RL‑reinforced models, raising Chinese‑to‑English translation accuracy from 70% to 96% and enabling a one‑month rollout.

AI translation · RAG · SFT
0 likes · 5 min read
Machine Heart
Apr 27, 2026 · Artificial Intelligence

ACL 2026: Unveiling a Predictive Scaling Law for Reinforcement Learning Fine‑Tuning of Large Models

The paper presents a systematic empirical study that derives a power‑law scaling formula for reinforcement‑learning post‑training of large language models, demonstrating accurate inter‑ and intra‑model performance prediction, learning‑efficiency saturation, data‑reuse benefits, and cross‑architecture validity.

Data Reuse · Llama 3 · Model Efficiency
0 likes · 11 min read
Machine Learning Algorithms & Natural Language Processing
Apr 25, 2026 · Artificial Intelligence

From Classic Multi-Agent Paradigms to Future Large-Foundation-Model-Driven Systems

This review surveys classic multi-agent systems and the emerging large-foundation-model-driven MAS paradigm, comparing their architectures, perception, communication, decision-making and control, and discusses how integrating LFMs enables semantic reasoning, greater adaptability, and new research challenges.

Agentic AI · Large Foundation Models · Multi-Agent Systems
0 likes · 8 min read
Alibaba Cloud Developer
Apr 24, 2026 · Artificial Intelligence

How Hermes Agent Achieves Self‑Evolution: A Deep Dive into Prompt, Context, and Harness Design

This article provides a detailed technical analysis of Hermes Agent, explaining how its dynamic skill generation and reinforcement‑learning loop enable true self‑evolution, and examines the prompt engineering, context compression, memory architecture, harness mechanisms, error handling, and plugin ecosystem that differentiate it from OpenClaw and Claude Code.

Agent framework · Hermes Agent · Prompt engineering
0 likes · 41 min read
Bighead's Algorithm Notes
Apr 22, 2026 · Artificial Intelligence

How DeepAries’s Adaptive Rebalancing Timing Boosts Portfolio Returns

DeepAries is a novel deep reinforcement‑learning framework that jointly learns when to rebalance a portfolio and how to allocate assets by combining a Transformer‑based state encoder with PPO, and extensive experiments on four major markets show it significantly outperforms fixed‑frequency baselines in risk‑adjusted return, transaction cost, and drawdown.

DeepAries · PPO · Portfolio Management
0 likes · 15 min read
AntTech
Apr 22, 2026 · Artificial Intelligence

How Multi‑Agent MCTS and Information‑Gain Rewards Are Transforming Mobile GUI and Search Agents

This article reviews two recent ICLR 2026 papers—M²‑Miner, a multi‑agent Monte‑Carlo Tree Search framework for low‑cost mobile GUI data mining, and IGPO, an information‑gain‑based reinforcement‑learning method that provides dense rewards for multi‑turn search agents—detailing their designs, experiments, and open‑source releases.

GUI Data Mining · Information Gain · LLM Agents
0 likes · 8 min read
Machine Heart
Apr 21, 2026 · Artificial Intelligence

Monet Enables Multimodal Models to Perform Human‑like Abstract Visual Thinking

Monet introduces a training paradigm that lets multimodal large language models reason directly in a continuous latent visual space, replacing external tool calls with implicit visual embeddings, and demonstrates significant gains on both in‑distribution perception tasks and out‑of‑distribution abstract visual reasoning through a three‑stage supervised fine‑tuning and a novel visual‑latent policy optimization.

Latent Embedding · MLLM · Multimodal
0 likes · 15 min read
AIWalker
Apr 20, 2026 · Artificial Intelligence

How VA‑π Bridges Tokenizers and Autoregressive Generators for Pixel‑Perfect Images

VA‑π introduces a lightweight post‑training framework that uses variational inference and reinforcement learning to align tokenizers with visual autoregressive generators, achieving dramatic quality gains, extreme training efficiency, and robust pixel‑level reconstruction across diverse image generation tasks.

Autoregressive Models · Pixel Alignment · post-training
0 likes · 14 min read
Data Party THU
Apr 20, 2026 · Artificial Intelligence

How MemPO Uses Reinforcement Learning to Turn Agent Memory into a Trainable Policy

MemPO introduces a self‑memory policy optimization framework that lets long‑horizon LLM agents autonomously manage and refine their memory via reinforcement learning, using global‑trajectory and informative‑memory advantage estimates, achieving up to 25.98% F1 gain and 73% token reduction on benchmark tasks.

LLM · Long-Horizon Agents · MemPO
0 likes · 8 min read
Baidu Maps Tech Team
Apr 20, 2026 · Artificial Intelligence

How Baidu Maps Reinvents LBS Search with Multi‑Agent AI and RL

Facing the shift from keyword indexing to generative AI, Baidu Maps overhauled its LBS architecture by introducing a native multi‑agent system, context‑engineering (ACE) framework, and reinforcement‑learning alignment, enabling dynamic routing, knowledge evolution, and a 36% boost in planning compliance while maintaining zero‑tolerance for factual errors.

AI agents · LLM · Location‑based services
0 likes · 10 min read
Old Zhang's AI Learning
Apr 19, 2026 · Artificial Intelligence

From Zero to Deployment: A Complete Qwen3.5 Fine‑Tuning Guide

This guide shows how to fine‑tune Qwen3.5 models—from 0.8B to 122B—using Unsloth Studio or pure code, covering text SFT, vision fine‑tuning, MoE models, reinforcement‑learning (GRPO), extensive GGUF quantization benchmarks, hardware requirements, export formats, and deployment tips.

LLM · Qwen3.5 · Unsloth
0 likes · 12 min read
Machine Heart
Apr 19, 2026 · Artificial Intelligence

World Engine: How Post‑Training Is Launching a New Era of Physical AGI

World Engine introduces a post‑training pipeline that combines high‑fidelity 3DGS simulation, hard‑case mining with diffusion generation, and reinforcement‑learning optimization to give autonomous‑driving models true decision‑making ability, surpassing data‑scaling limits and achieving significant safety gains in both industrial simulations and real‑world tests.

Simulation · autonomous driving · hard case mining
0 likes · 11 min read
Machine Learning Algorithms & Natural Language Processing
Apr 16, 2026 · Artificial Intelligence

Efficient Reasoning with Reward Shaping: Compressing Qwen 30B‑Series Chains by 20‑40%

The article analyzes how reward‑shaping techniques can shorten the chain‑of‑thought outputs of Qwen 30B‑series models by 20‑40% while preserving or slightly improving performance on AIME‑25 and out‑of‑distribution benchmarks, and it details the experimental design, strategic considerations, and practical insights behind this efficient reasoning approach.

Efficient Inference · Qwen · reinforcement learning
0 likes · 16 min read
AI Explorer
Apr 16, 2026 · Artificial Intelligence

How NVIDIA, HKU, and MIT’s Sol‑RL Framework Supercharges Diffusion Model Training

NVIDIA, the University of Hong Kong (HKU), and MIT introduced the Sol‑RL framework, which uses reinforcement‑learning‑guided sampling to cut diffusion model training time several‑fold without sacrificing image quality, potentially lowering entry barriers for small teams and shifting the AIGC industry toward efficiency‑driven competition.

AIGC · Diffusion Models · NVIDIA
0 likes · 6 min read
Xiaohongshu Tech REDtech
Apr 15, 2026 · Artificial Intelligence

How Relax Powers Scalable Multi‑Modal RL Training with Full‑Async Pipelines

Relax, an open‑source reinforcement‑learning engine from Xiaohongshu AI Platform, combines service‑oriented fault‑tolerant architecture, a distributed checkpoint service, and an asynchronous training pipeline to achieve up to 76% speed‑up and near‑zero overhead for multi‑modal RL workloads.

Asynchronous Pipeline · Multi-modal · Ray Serve
0 likes · 10 min read
SuanNi
Apr 12, 2026 · Artificial Intelligence

How MemPO Gives AI Agents Long‑Term Memory and Cuts Costs by 70%

The paper introduces MemPO, a self‑memory strategy optimization algorithm that lets large language model agents actively manage their memory, dramatically improving accuracy on complex multi‑step tasks while reducing token consumption by up to 73%, and validates the approach with extensive experiments and analysis.

AI · Efficiency · Memory optimization
0 likes · 11 min read
CodeTrend
Apr 11, 2026 · Artificial Intelligence

Inside OpenClaw: Architecture, Core Technologies, and Security Risks

The article provides a detailed technical analysis of the OpenClaw AI‑agent framework, covering its three‑layer architecture, prompt compiler, heartbeat mechanism, file‑based memory, skill system, ReAct loop, model‑agnostic routing, reinforcement‑learning extension, security concerns, and a side‑by‑side comparison with Hermes Agent.

Agent framework · OpenClaw · file-based memory
0 likes · 13 min read
Machine Heart
Apr 11, 2026 · Artificial Intelligence

How 100,000 Hours of Human Data Propelled Psi‑R2 to Lead MolmoSpaces

Lingchu AI demonstrates that scaling human‑operation data to nearly 100,000 hours, combined with a two‑model system and reinforcement learning, can replace costly robot‑teleoperation data and achieve top performance on the MolmoSpaces benchmark.

Embodied AI · Psi-R2 · Psi-W0
0 likes · 12 min read
AI2ML AI to Machine Learning
Apr 10, 2026 · Artificial Intelligence

Why HermesAgent Outperforms OpenClaw: A Deep Source‑Code Analysis

The article dissects HermesAgent’s architecture, showing how it extends OpenClaw with self‑learning, reinforcement‑learning modules, and advanced prompt‑evolution techniques to mitigate token‑hole costs and achieve more deterministic results, while also detailing its TUI‑driven CLI and evaluation workflow.

DSPy · GEPA · HermesAgent
0 likes · 8 min read
Machine Heart
Apr 10, 2026 · Artificial Intelligence

AdaGen: Enabling Adaptive, Data‑Driven Strategies for Image Generation Models

AdaGen replaces handcrafted static schedules in multi‑step image generators with a universal, learnable policy network trained via reinforcement learning, using an MDP formulation, adversarial rewards and action smoothing, achieving consistent quality and efficiency gains across diffusion, autoregressive, masked and flow models while adding negligible overhead.

MDP · action smoothing · adaptive policy
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Apr 9, 2026 · Artificial Intelligence

How Data Flywheels Accelerate Small Agentic Model Training

This article details a data‑flywheel framework for training compact agentic language models, describing synthetic task generation, mock environment simulation, rubric‑based reward design, iterative hard‑sample augmentation, and experimental results that show consistent performance gains across benchmarks.

Data Augmentation · Synthetic Environments · agentic models
0 likes · 17 min read
Machine Heart
Apr 8, 2026 · Artificial Intelligence

Meta Unveils Muse Spark: The First Model from Its Superintelligence Lab

Meta has launched Muse Spark, its inaugural model from the newly formed Superintelligence Lab, showcasing multimodal capabilities, tool use, visual chain‑of‑thought, and multi‑agent orchestration, while detailing pretraining scaling gains, reinforcement‑learning improvements, and test‑time reasoning efficiencies.

AI scaling · Meta · Muse Spark
0 likes · 9 min read
AIWalker
Apr 6, 2026 · Artificial Intelligence

How TIR‑Agent Turns Image‑Restoration Tools into a Learnable Decision‑Making Agent

The paper introduces TIR‑Agent, an image‑restoration agent that learns a tool‑calling policy via supervised fine‑tuning and reinforcement learning, addressing exploration stagnation and multi‑objective reward imbalance, and demonstrates over 2.5× faster inference and superior multi‑metric performance on synthetic and real degradation datasets.

Tool Scheduling · agent-based AI · computer vision
0 likes · 18 min read
Machine Heart
Apr 5, 2026 · Artificial Intelligence

Cut Token Costs by 68% with Dynamic Multi‑Agent Collaborative Coding

The paper introduces AgentConductor, a 3‑billion‑parameter orchestrator that generates adaptive YAML‑based multi‑agent topologies, dynamically re‑plans when code errors occur, achieving a 14.6% accuracy boost and up to 68% token‑cost reduction compared to existing static agent pipelines.

AgentConductor · LLM code generation · YAML topology
0 likes · 9 min read
AI Engineer Programming
Apr 5, 2026 · Artificial Intelligence

How Kimi, Cursor, and Chroma Use Reinforcement Learning to Train Agent Models

The article analyzes three recent technical reports—Moonshot AI's Kimi K2.5, Cursor's Composer 2, and Chroma's Context‑1—detailing how each system trains agent models with reinforcement learning, parallel orchestration, self‑summarization, and self‑editing, and highlights shared methodological themes and performance gains.

Chroma Context-1 · Cursor Composer · Kimi
0 likes · 19 min read
Machine Learning Algorithms & Natural Language Processing
Apr 4, 2026 · Artificial Intelligence

Why the Best SFT Checkpoint May Hurt RL Performance: Adaptive Early‑Stop Loss (AESL) for LLM Cold‑Start

The paper reveals that over‑optimizing supervised fine‑tuning (SFT) for large language models can diminish their reinforcement‑learning (RL) potential, proposes an Adaptive Early‑Stop Loss (AESL) that balances accuracy and output diversity during cold‑start, and demonstrates across multiple LLMs that AESL consistently yields superior RL results.

AI training · Adaptive Early‑Stop Loss · LLM
0 likes · 11 min read
Machine Heart
Apr 3, 2026 · Artificial Intelligence

Beyond Token Entropy: ReLaX Uses Latent Dynamics to Rethink Exploration‑Exploitation in LLM RL

The paper introduces ReLaX, a framework that shifts focus from token‑level entropy to the latent‑space dynamics of large models, employing Koopman operators and a Dynamic Spectral Divergence metric to quantitatively guide exploration‑exploitation balance, and demonstrates state‑of‑the‑art performance on both pure‑text and multimodal RL benchmarks.

Koopman operator · ReLaX · dynamic spectral divergence
0 likes · 12 min read
Machine Heart
Apr 2, 2026 · Artificial Intelligence

HSImul3R: Bridging Perception and Simulation for Physics‑Ready 3D Human‑Scene Interaction

HSImul3R introduces a physics‑in‑the‑loop reconstruction pipeline that closes the perception‑simulation gap by jointly optimizing human motion and scene geometry, leveraging reinforcement learning, direct simulation‑reward optimization, and a new HSIBench dataset to produce simulation‑ready 3D human‑scene interactions.

3D reconstruction · DSRO · HSIBench
0 likes · 12 min read
Machine Heart
Apr 2, 2026 · Artificial Intelligence

Breaking the Multi‑Robot Barrier: Sequential World‑Model Decomposition (ICLR 2026)

SeqWM introduces a sequential causal decomposition of joint dynamics, allowing each robot to model its marginal contribution conditioned on prior agents, which simplifies world‑model learning, enables intent‑sharing planning via MPPI, and achieves superior performance in challenging simulation benchmarks and real‑robot tests.

MPPI · SeqWM · model-based RL
0 likes · 7 min read
Bighead's Algorithm Notes
Mar 31, 2026 · Artificial Intelligence

Top AI-Driven Quantitative Finance Papers from AAAI 2026

This article curates and summarizes recent AI research papers presented at AAAI 2026 that advance quantitative finance, covering controllable market generation, LLM‑powered alpha factor mining, risk‑aware multi‑agent portfolio management, foundation models for market data, and reinforcement‑learning trading policies.

AI · Diffusion Models · Financial Market Simulation
0 likes · 12 min read
Machine Heart
Mar 31, 2026 · Artificial Intelligence

Can LLM Judges Be Trusted? TrustJudge Leverages Full Probability Distributions

LLM judges often produce contradictory scores and non‑transitive preferences; the TrustJudge framework replaces discrete scoring with distribution‑sensitive scoring and likelihood‑aware aggregation, dramatically reducing both score‑comparison and pairwise‑transitivity inconsistencies across multiple model families, improving accuracy and even serving as a reward signal for RL training.

LLM evaluation · Reward Modeling · TrustJudge
0 likes · 12 min read
Shi's AI Notebook
Mar 30, 2026 · Artificial Intelligence

AI Daily Digest March 30, 2026: Open‑Source Tools, Model Releases, and Research Highlights

The March 30 AI daily digest curates recent open‑source voice input and TypeScript libraries, new development workflows, a 30B parameter model that runs on 24 GB GPUs, and NVIDIA's PivotRL research that reduces reinforcement‑learning rollouts while matching end‑to‑end performance, all with concrete benchmarks and links.

AI tools · TypeScript · agent workflow
0 likes · 13 min read
Machine Heart
Mar 30, 2026 · Artificial Intelligence

Proactive Interaction for Video Multimodal Models: MMDuet2 & ProactiveVideoQA

This article surveys the ICLR 2026 papers ProactiveVideoQA and MMDuet2, detailing how video multimodal large models can decide when to reply autonomously, the PAUC benchmark for evaluating timeliness and accuracy, a reinforcement‑learning training pipeline that requires no precise timestamps, and experimental findings on data construction, frame‑sampling density, and SOTA performance.

MMDuet2 · PAUC · benchmark
0 likes · 17 min read
Bighead's Algorithm Notes
Mar 29, 2026 · Artificial Intelligence

How MetaTrader Uses Reinforcement Learning to Boost Trading Strategy Generalization

The article reviews the MetaTrader method, which formulates sequential portfolio optimization as a partially offline reinforcement‑learning problem, introduces a double‑layer RL algorithm and a conservative TD objective to improve out‑of‑distribution generalization, and demonstrates superior performance on CSI‑300 and NASDAQ‑100 datasets compared with existing baselines.

Financial Trading · MetaTrader · OOD data augmentation
0 likes · 15 min read
DataFunSummit
Mar 29, 2026 · Artificial Intelligence

How Code Intelligence Is Evolving: From Foundation Models to Repository‑Level Agents

This article reviews the rapid evolution of code intelligence, covering the history of code foundation models, reinforcement‑learning optimizations, scaling‑law insights, the LoopCoder architecture, rigorous multi‑level evaluation suites, and the emergence of repository‑level code agents, while highlighting open‑source contributions such as Qwen‑Coder.

code evaluation · code-intelligence · reinforcement learning
0 likes · 15 min read
Machine Heart
Mar 29, 2026 · Artificial Intelligence

Scaling World Model Dynamics to Over a Thousand Steps in Two ICLR Papers

The article reviews two ICLR papers by Haoxin Lin that advance world‑model dynamics from single‑step bootstrapping to any‑step direct prediction, introduce structured uncertainty via backtracking, and achieve stable full‑horizon roll‑outs of over a thousand steps, dramatically improving both online and offline reinforcement‑learning performance.

Offline RL · any-step prediction · dynamics modeling
0 likes · 16 min read
PaperAgent
Mar 29, 2026 · Industry Insights

From Reasoning to Agentic Thinking: How Harnesses Are Redefining AI Development

The article examines the shift from traditional reasoning‑based large‑language‑model pipelines to agentic, harness‑driven AI systems, outlining the definition of a harness, its engineering challenges, architectural components, and the broader implications for training, reinforcement learning, and future research directions.

AI Harness · Intelligent agents · Model Training
0 likes · 16 min read
Bighead's Algorithm Notes
Mar 26, 2026 · Artificial Intelligence

Paper Reading: ArchetypeTrader – A Reinforcement‑Learning Framework for Selecting and Optimizing Crypto Trading Strategies

The article reviews the ArchetypeTrader framework, which addresses market‑segmentation and demonstration‑data issues in crypto‑currency reinforcement learning by discovering discrete trading archetypes, selecting them via a hierarchical RL agent, and refining actions with a regret‑aware adapter, achieving superior profit and risk‑adjusted returns across multiple markets.

cryptocurrency trading · hierarchical reinforcement learning · regret-aware optimization
0 likes · 16 min read
Alibaba Cloud Big Data AI Platform
Mar 25, 2026 · Artificial Intelligence

Scaling Multimodal Reinforcement Learning with NVIDIA Isaac Lab and TiledCamera

This article explains how to use NVIDIA Isaac Lab and the TiledCamera component to run large‑scale, multimodal reinforcement learning on GPU clusters, covering environment setup, noVNC visualization, command‑line execution, distributed training with torchrun, and performance analysis across multiple GPU configurations.

GPU scaling · NVIDIA Isaac Lab · TiledCamera
0 likes · 12 min read
Bighead's Algorithm Notes
Mar 24, 2026 · Artificial Intelligence

How an Interactive Imitation‑Learning Agent Framework Trains Robust Trading Strategies

The article analyzes the simulation‑reality gap in algorithmic trading and proposes an interactive market simulator that combines a pool of imitation‑learning agents, an action‑synthesis network, and a DDPG‑based reinforcement‑learning trader, showing superior robustness and downside protection on QQQ data.

Agent-based Modeling · DDPG · financial AI
0 likes · 16 min read
SuanNi
Mar 24, 2026 · Artificial Intelligence

How Memento‑Skills Enables Self‑Evolving LLMs Without Fine‑Tuning

Memento‑Skills is a novel framework that freezes LLM parameters while an external skill library iteratively reads, writes, and refines capabilities; it achieves up to 116% accuracy gains on the GAIA and HLE benchmarks and demonstrates scalable self‑evolution without costly model fine‑tuning.

LLM · reinforcement learning · self-evolution
0 likes · 11 min read
Machine Learning Algorithms & Natural Language Processing
Mar 22, 2026 · Artificial Intelligence

NS-Diff: Adding a Physics Engine to Diffusion Models for Fluid and Rigid‑Body Dynamics

The CVPR 2026 paper introduces NS‑Diff, a physics‑guided video diffusion framework that combines a noise‑robust dynamics detector, a physical‑condition latent injection module, and reinforcement‑learning optimization to reduce jerk error by 43 % and fluid divergence by 33 %, achieving superior physical realism and visual quality across multiple benchmarks.

CVPR 2026 · NS‑Diff · Navier-Stokes
0 likes · 13 min read
DataFunTalk
Mar 22, 2026 · Artificial Intelligence

Why Cursor’s Composer 2 Beats Claude Opus 4.6 in Performance and Price

Cursor’s new Composer 2 programming model outperforms Claude Opus 4.6 on benchmarks like Terminal‑Bench 2.0 and SWE‑bench Multilingual while cutting token costs to $0.5 per million input tokens and $2.5 per million output tokens, thanks to a novel self‑summary reinforcement‑learning technique that enables efficient long‑context processing.

AI · Large Language Model · pricing
0 likes · 8 min read
PaperAgent
Mar 22, 2026 · Artificial Intelligence

Can LLM Agents Self‑Evolve Without Retraining? Inside Memento‑Skills

The article analyzes the Memento‑Skills framework, which treats external memory as executable skills to enable deployment‑time continual learning for frozen LLM agents, detailing its read‑write reflective loop, skill‑as‑memory design, behavior‑trained skill router, experimental validation on GAIA and HLE benchmarks, and theoretical guarantees without gradient updates.

AI · Agent · Continual Learning
0 likes · 9 min read
AI Engineering
Mar 21, 2026 · Industry Insights

Is Cursor’s Composer 2 Powered by Kimi? The Truth Is More Complex

A developer uncovered that Cursor’s Composer 2 actually runs on the Kimi K2.5 model with reinforcement learning, prompting a rapid licensing dispute that ended with official confirmation and highlights the opaque yet collaborative nature of today’s open AI model ecosystem.

AI model licensing · Composer 2 · Cursor
0 likes · 4 min read
Bighead's Algorithm Notes
Mar 20, 2026 · Artificial Intelligence

Weekly Quantitative Finance Paper Summaries (Mar 14‑Mar 20, 2026)

This article compiles abstracts of four recent AI‑driven quantitative finance papers, covering an autonomous factor‑investing framework, a program‑level factor‑mining system, an adaptive regime‑aware stock‑price predictor with reinforcement learning, and a comprehensive analysis of AI agents in financial markets.

AI agents · factor investing · large language models
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
Mar 20, 2026 · Artificial Intelligence

Cursor’s Composer 2 Beats Claude Opus 4.6 with ‘Ankle‑Cut’ Pricing via New Reinforcement‑Learning Method

Cursor’s newly released Composer 2 model surpasses Claude Opus 4.6 on benchmarks such as Terminal‑Bench 2.0, offers dramatically lower token pricing, and achieves these gains by introducing a novel self‑summary reinforcement‑learning technique that compresses long‑context tasks while preserving critical information.

Composer 2 · Cursor · LLM
0 likes · 9 min read
AI Explorer
Mar 20, 2026 · Industry Insights

Key AI Breakthroughs and Market Moves on March 20 2026

On March 20, 2026, Alibaba’s Qwen 3.5‑Max topped the LMArena blind test, OpenAI bought Astral to boost AI coding, Zhejiang University released a real‑time 4D world model, Meta’s Agent leaked data, and a series of AI‑driven innovations, from Nvidia and robotics to drug discovery, reshaped the industry.

AI · AI design tools · AI hardware
0 likes · 7 min read
Machine Learning Algorithms & Natural Language Processing
Mar 19, 2026 · Artificial Intelligence

From Solving to Evolving: How RETROAGENT Gives AI Agents Real Retrospective Learning

The article analyzes the RETROAGENT framework, showing how its dual intrinsic feedback and memory‑buffer mechanisms enable LLM agents to move beyond solving tasks toward continual evolution, and presents benchmark results that demonstrate significant performance gains and strong test‑time adaptation across four challenging environments.

LLM Agents · RETROAGENT · dual intrinsic feedback
0 likes · 7 min read
Machine Learning Algorithms & Natural Language Processing
Mar 19, 2026 · Artificial Intelligence

From Language Modeling to World Modeling: Limits of Large Language Models

Speaker Li Yixia from Southern University of Science and Technology presents a talk on using large language models as textual world models, defining a three‑layer evaluation framework and showing through experiments that fine‑tuned models improve next‑state prediction and agent performance, yet face limits tied to behavior coverage and environment complexity.

agent performance · evaluation framework · large language models
0 likes · 4 min read
Xiaomi Tech
Mar 18, 2026 · Artificial Intelligence

Xiaomi Unveils MiMo-V2-TTS: Giving Agents a Voice with Soul

Xiaomi introduces MiMo-V2-TTS, a self‑developed speech‑synthesis large model that combines a custom audio tokenizer, multi‑codebook architecture, massive pre‑training on over a hundred million hours of data and multi‑dimensional reinforcement learning to deliver fine‑grained style control, dialect support, role‑play and high‑quality singing, aiming to give AI agents expressive, human‑like voices.

Speech synthesis · audio tokenizer · large model
0 likes · 6 min read
AI Explorer
Mar 17, 2026 · Artificial Intelligence

RISE Enables Breakthrough in Vision‑Language‑Action Learning for Embodied AI

The article examines the limitations of vision‑language‑action (VLA) models in real‑world tasks, explains how the RISE technique from Hong Kong University uses internal simulation, reflection and imagination to cut training costs by an order of magnitude, and discusses its implications for future embodied AI.

Embodied AI · RISE · VLA
0 likes · 6 min read
Software Engineering 3.0 Era
Mar 17, 2026 · Artificial Intelligence

How Learning Theory Drives AI‑Powered Software Engineering 3.0

The article explains how machine‑learning theory, especially large‑language‑model training and Reinforcement Learning from Human Feedback, underpins Software Engineering 3.0 by turning code generation into a data‑driven learning process, reshaping cognition, alignment, and continuous system evolution.

Distributed Cognition · RLHF · large language models
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
Mar 15, 2026 · Artificial Intelligence

Is RL Dead in LLM Post-Training? MIT’s RandOpt Challenges Traditional Methods

The MIT‑CSAIL paper introduces RandOpt, a single‑step, gradient‑free, fully parallel post‑training algorithm that adds Gaussian noise to pretrained LLM weights and ensembles the results, achieving or surpassing PPO/GRPO performance by exploiting dense "neural thickets" that emerge as model scale grows.
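
A toy sketch of the perturb, select, and ensemble pattern this summary describes, with a small linear classifier standing in for an LLM. The selection rule (keep the top‑k perturbations by training accuracy), the majority vote, and every name and hyperparameter here are illustrative assumptions rather than the paper's actual procedure.

```python
import numpy as np

def predict(w, x):
    """Binary prediction of a toy linear classifier (stand-in for an LLM)."""
    return (x @ w > 0).astype(int)

def sample_and_select(w0, x, y, n_samples=64, top_k=5, sigma=0.5, seed=0):
    """Draw Gaussian perturbations of w0 in one step and keep the best top_k."""
    rng = np.random.default_rng(seed)
    # All candidates are independent, so they could be evaluated in parallel.
    candidates = w0 + sigma * rng.standard_normal((n_samples, w0.size))
    scores = np.array([(predict(w, x) == y).mean() for w in candidates])
    top = np.argsort(scores)[-top_k:]  # no gradients are used anywhere
    return candidates[top]

def ensemble_predict(members, x):
    """Majority vote over the selected perturbed models (top_k is odd)."""
    votes = np.stack([predict(w, x) for w in members])
    return (votes.mean(axis=0) > 0.5).astype(int)

# Synthetic data and imperfect "pretrained" weights (assumed for illustration).
rng = np.random.default_rng(1)
x = rng.standard_normal((200, 4))
true_w = np.array([1.0, -2.0, 0.5, 0.0])
y = predict(true_w, x)
w0 = true_w + rng.standard_normal(4)

members = sample_and_select(w0, x, y)
accuracy = (ensemble_predict(members, x) == y).mean()
print(f"ensemble accuracy on the toy data: {accuracy:.2f}")
```

The toy says nothing about whether good perturbations are dense around real LLM weights; it only shows the mechanics of sampling around a fixed starting point and voting.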

Ensemble · LLM · RandOpt
0 likes · 12 min read
SuanNi
Mar 12, 2026 · Artificial Intelligence

How OpenClaw‑RL Turns Everyday Interactions into Self‑Evolving AI

OpenClaw‑RL, a new reinforcement‑learning framework from Princeton, captures hidden evaluative and instructional signals in daily user interactions, converts them into real‑time training data, and uses a decoupled asynchronous architecture with binary RL and online policy distillation to achieve superior performance in both personal‑device and cloud‑scale scenarios.

AI Feedback · Asynchronous Architecture · Online Distillation
0 likes · 10 min read
AIWalker
Mar 12, 2026 · Artificial Intelligence

BeautyGRPO: RL‑Driven Realistic Portrait Retouching Ends Over‑Beautification (CVPR 2026)

The paper introduces BeautyGRPO, a reinforcement‑learning framework that combines a fine‑grained preference dataset (FRPref‑10K) with Dynamic Path Guidance to balance aesthetic enhancement and high‑fidelity preservation in portrait retouching, achieving superior metrics and user preference over existing SFT and RL models.

AI aesthetics · CVPR 2026 · dynamic path guidance
0 likes · 11 min read
Didi Tech
Mar 12, 2026 · Artificial Intelligence

How STAPO Improves Large‑Model Fine‑Tuning by Silencing Spurious Tokens

The STAPO (Spurious‑Token‑Aware Policy Optimization) algorithm, introduced by Tsinghua University's iDLab and Didi's Deep Sea Lab, tackles policy‑entropy instability and performance oscillation in reinforcement‑learning fine‑tuning of large models by mathematically analyzing token collision probability, defining spurious tokens, and applying a Silencing Spurious Tokens mechanism that yields state‑of‑the‑art results on multiple math‑reasoning benchmarks.

AI safety · STAPO · fine-tuning
0 likes · 7 min read
DataFunTalk
Mar 11, 2026 · Artificial Intelligence

Agent Lightning: Decoupling Optimizers to Empower AI Agents via Reinforcement Learning

Agent Lightning, an open‑source system from Microsoft Research Asia, introduces a novel optimizer‑agent disaggregation architecture that enables any AI agent to benefit from reinforcement learning, offering non‑intrusive experience capture, programmable pipelines, and flexible signal passing, while addressing real‑world challenges of scalability, multi‑step tasks, and zero‑code integration.

Agent Lightning · Experience Capture · Learning Systems
0 likes · 21 min read
DataFunSummit
Mar 10, 2026 · Artificial Intelligence

How Agent Lightning Redefines AI Agent Learning with Optimizer‑Agent Decoupling

The article explores the paradigm shift toward AI agents in 2025, detailing the open‑source Agent Lightning project’s architecture, non‑intrusive experience capture, programmable pipelines, and experimental results that demonstrate its ability to enable reinforcement learning for any agent with minimal code changes.

Agent Lightning · Open Source Framework · machine learning
0 likes · 20 min read
PaperAgent
Mar 10, 2026 · Artificial Intelligence

How MemSifter Delivers High‑Precision, Low‑Cost Long‑Term Memory for LLMs

MemSifter introduces a lightweight agent that outsources memory retrieval for large language models, using a Think‑and‑Rank pipeline and a task‑result‑oriented reinforcement‑learning training paradigm to achieve superior retrieval accuracy and efficiency across eight benchmark tasks while keeping inference overhead minimal.

Agent · Efficiency · LLM
0 likes · 13 min read
AI Explorer
Mar 6, 2026 · Artificial Intelligence

AReaL: Lightning‑Fast Asynchronous RL Engine for Building High‑Performance LLM Agents

AReaL, an open‑source, fully asynchronous reinforcement‑learning platform co‑developed by Tsinghua University and Ant Group, dramatically speeds up training of complex LLM agents, offering a simple, stable, and hardware‑flexible solution for developers seeking industrial‑grade AI agents.

AI Infrastructure · AReaL · Asynchronous Training
0 likes · 7 min read
Tencent Cloud Developer
Mar 5, 2026 · Artificial Intelligence

20 Cutting‑Edge RAG Optimization Techniques: From Semantic Chunking to Self‑RAG

This article systematically presents twenty practical RAG (Retrieval‑Augmented Generation) optimization methods—covering semantic chunking, chunk‑size evaluation, context‑enhanced retrieval, query transformation, re‑ranking, feedback loops, multimodal and graph RAG, hierarchical retrieval, HyDE, Self‑RAG and reinforcement‑learning‑enhanced RAG—each with clear Python code examples, advantages, limitations and ideal use‑cases.

AI · LLM · RAG
0 likes · 57 min read
Kuaishou Tech
Mar 4, 2026 · Artificial Intelligence

How LLMs Are Revolutionizing Reinforcement Learning for Recommendation Systems

This survey examines the emerging LLM‑RL collaborative recommendation paradigm, outlining its research background, five main collaboration patterns, standardized evaluation protocols, and the key challenges and future directions for building smarter, more robust recommender systems.

LLM · Recommendation Systems · artificial-intelligence
0 likes · 14 min read
Woodpecker Software Testing
Mar 4, 2026 · Artificial Intelligence

Deep Dive into Adversarial Testing Performance Optimization for AI Systems

The article examines Adversarial Testing Performance Optimization (ATPO) as a new industrial-quality paradigm, detailing how adversarial samples expose hidden performance bottlenecks across AI pipelines, presenting three typical adversarial loads with corresponding optimization targets, common implementation pitfalls, and emerging intelligent approaches using reinforcement learning and digital twins.

AI pipelines · Digital Twin · Performance Optimization
0 likes · 8 min read
PaperAgent
Mar 3, 2026 · Artificial Intelligence

How CharacterFlywheel Scales Engaging LLMs: 15 Iterations of Production Optimization

The article presents CharacterFlywheel, a 15‑generation flywheel methodology that iteratively improves social‑dialogue LLMs in production using data‑driven reward models, rejection sampling, and a mix of SFT, DPO, and RL, with detailed experiments and best‑practice insights.
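
For readers unfamiliar with the rejection‑sampling step named in this summary, a minimal best‑of‑n sketch follows. The generator, the reward function, the candidate count, and the acceptance threshold are all hypothetical placeholders; the article's actual reward models and pipeline are not reproduced here.

```python
from typing import Callable, Optional, Tuple

def rejection_sample(
    prompt: str,
    generate: Callable[[str, int], str],
    reward: Callable[[str, str], float],
    n: int = 4,
    threshold: float = 0.5,
) -> Optional[Tuple[str, str]]:
    """Sample n replies, keep the highest-reward one if it clears the threshold."""
    candidates = [generate(prompt, i) for i in range(n)]
    scored = [(reward(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    # Accepted pairs could later serve as supervised fine-tuning data.
    return (prompt, best) if best_score >= threshold else None

# Toy stand-ins for a dialogue model and a learned reward model (assumed).
def toy_generate(prompt: str, i: int) -> str:
    replies = ["ok", "Sure, tell me more!", "no",
               "That sounds fun, what happened next?"]
    return replies[i % len(replies)]

def toy_reward(prompt: str, reply: str) -> float:
    # Placeholder heuristic: longer replies score higher, capped at 1.0.
    return min(len(reply) / 30.0, 1.0)

accepted = rejection_sample("I went hiking today.", toy_generate, toy_reward)
rejected = rejection_sample("I went hiking today.", toy_generate, toy_reward,
                            threshold=1.1)
print(accepted)
print(rejected)
```

With the toy reward, the longest reply is accepted, and raising the threshold above the reward's cap rejects every candidate.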

AI safety · LLM Optimization · Reward Modeling
0 likes · 12 min read
PaperAgent
Mar 2, 2026 · Artificial Intelligence

SKILLRL: Boosting LLM Agents with Skill Distillation and Recursive Evolution

SKILLRL introduces a novel framework that transforms raw LLM agent trajectories into compact, reusable skills via experience‑driven distillation, hierarchical skill banks, and recursive skill evolution, achieving up to 90% success on ALFWorld and 73% on WebShop while reducing token usage by over 10% compared to memory‑based baselines.

LLM Agents · SKILLRL · hierarchical skill bank
0 likes · 10 min read
AI Explorer
Mar 2, 2026 · Artificial Intelligence

OpenSandbox: Alibaba’s Open‑Source AI Sandbox for Secure, Scalable Agent Execution

OpenSandbox, an open‑source sandbox platform from Alibaba, offers a unified, secure, and extensible execution environment for AI agents, code execution, and reinforcement‑learning workloads, leveraging Docker and high‑performance Kubernetes runtimes, with multi‑language SDKs and fine‑grained network controls.

AI agents · AI sandbox · Docker
0 likes · 7 min read
Xiaomi Tech
Mar 2, 2026 · Artificial Intelligence

How Xiaomi’s Tactile‑Enabled Robot Graduates from Lab to Automotive Assembly Line

The article details Xiaomi Robotics' transition of its VLA‑based robot with TacRefineNet tactile perception from laboratory experiments to a real automotive factory, achieving a 90.2% dual‑side success rate over three hours while meeting a 76‑second production cycle, and explains the end‑to‑end data‑driven control, multimodal sensing, whole‑body motion strategy, failure cases, and open resources.

TacRefineNet · VLA model · Xiaomi Robotics
0 likes · 8 min read
AI Frontier Lectures
Feb 28, 2026 · Artificial Intelligence

Can Reinforcement Learning Revolutionize Text-to-3D Generation? A Deep Dive

This article presents a systematic investigation of applying reinforcement learning to text‑to‑3D generation, detailing reward design, algorithm selection, a new 3D benchmark, a hierarchical GRPO framework, extensive ablations, and the resulting performance gains and limitations.

AI research · generative models · reinforcement learning
0 likes · 13 min read
Baobao Algorithm Notes
Feb 24, 2026 · Artificial Intelligence

The Bitter Lesson of Building Agentic RL in Terminal Environments

This article recounts the challenges of moving from single‑step RL with verifiable rewards to multi‑step agentic reinforcement learning in terminal environments, detailing infrastructure design, asynchronous pipelines, data quality checks, masking strategies, curriculum training, chunk‑based optimization, and practical lessons learned from large‑scale experiments.

Agentic RL · Asynchronous Training · Credit Assignment
0 likes · 33 min read
Machine Learning Algorithms & Natural Language Processing
Feb 23, 2026 · Artificial Intelligence

System Engineering Behind Billions of Parameters: Insider Training Details from Seven Top AI Labs

This article systematically dissects the engineering decisions behind frontier large‑language‑model training—covering architecture choices, attention variants, optimizer evolution, data‑curation strategies, scaling‑law insights, and post‑training SFT/RL pipelines—based on open‑source reports from seven leading AI laboratories.

Mixture of Experts · Model Training · large language models
0 likes · 26 min read
HyperAI Super Neural
Feb 19, 2026 · Artificial Intelligence

World Model & VLA Breakthroughs: Top Papers from NVIDIA, ByteDance, Tsinghua and Others

This roundup highlights six recent embodied AI papers that advance world models and vision‑language‑action (VLA) techniques, covering DreamDojo's massive first‑person video model, LingBot‑World simulator, Agent World Model generator, BagelVLA, ACoT‑VLA, and the closed‑loop World‑VLA‑Loop framework.

Embodied AI · Synthetic Environments · reinforcement learning
0 likes · 8 min read
Machine Learning Algorithms & Natural Language Processing
Feb 18, 2026 · Artificial Intelligence

Microsoft’s 671B LLM Unifies Offline Ad Tasks—Can It Cut Compute Costs?

Microsoft’s AdNanny replaces a forest of specialized offline models with a single 671 B LLM, using a three‑stage data factory to generate reasoning‑rich corpora, dynamic task re‑weighting, RL‑based metric alignment, and a hybrid 31‑pipeline‑parallel architecture that halves compute cost while boosting performance on core ad‑ranking tasks.

AdNanny · LLM · dynamic weighting
0 likes · 9 min read
Old Zhang's AI Learning
Feb 16, 2026 · Artificial Intelligence

Qwen3.5 Deep Dive: Multimodal Architecture, Benchmarks, and Deployment Guide

This article provides a detailed analysis of Qwen3.5, covering its multimodal MoE design, massive inference speedups, extensive benchmark results against GPT‑5.2, Claude 4.5 Opus and Gemini‑3 Pro, RL scaling strategies, training infrastructure innovations, and practical usage via API and local deployment.

FP8 training · Large Language Model · Multimodal AI
0 likes · 13 min read
Machine Learning Algorithms & Natural Language Processing
Feb 15, 2026 · Artificial Intelligence

Embedding Error Correction into the Policy Space: How Search‑R2 Redefines Search‑Enhanced Reasoning

The Search‑R2 framework integrates error detection, localization, and correction into a reinforcement‑learning loop for search‑enhanced reasoning, achieving notably larger accuracy gains on difficult multi‑hop QA tasks than baseline methods, even when those baselines receive higher sampling budgets.

Agentic AI · Error Correction · Multi-hop QA
0 likes · 15 min read