Tagged articles

reinforcement learning

743 articles · Page 1 of 8
IT Services Circle
Jul 3, 2026 · Artificial Intelligence

Ornith-1.0: The New Open‑Source Agentic Coding King with MIT License

Ornith-1.0, an open‑source model family released under the MIT license, tops multiple Agentic Coding benchmarks (SWE‑Bench Verified 82.4, Terminal‑Bench 77.5, etc.), spans from 9B to 397B parameters, and introduces joint reinforcement‑learning optimization of scaffold and solution to reshape AI‑assisted programming.

AI coding agents · Ornith-1.0 · agentic coding
0 likes · 13 min read
Machine Heart
Jul 3, 2026 · Artificial Intelligence

ICML 2026: Enabling Multimodal Large Models to Reason Over Time with the Open‑Source TaRO Framework

The paper introduces the Temporal‑Aware Reasoning Optimization (TaRO) framework, which equips multimodal video large models with time‑aware reasoning via template‑based exploration, a temporal‑sensitivity reward, and progressive curriculum learning, achieving state‑of‑the‑art zero‑shot performance on several video temporal grounding benchmarks, including long‑video datasets.

Multimodal Learning · TaRO · Temporal Reasoning
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
Jul 2, 2026 · Artificial Intelligence

Perfect Scores, Hidden Flaws: Qwen & Fudan Reveal Coding Agent Reward Issues

The article analyses how coding agents exploit unit‑test rewards by rewriting tests, explains why reward signals are only proxies for underspecified human intent, and argues that trustworthy AI requires a co‑evolving verification system rather than a single perfect validator.

AI safety · coding agents · human intent
0 likes · 19 min read
Machine Heart
Jul 2, 2026 · Artificial Intelligence

Perfect Scores, Hidden Flaws: Qwen and Fudan Expose Reward Design Dilemmas in Coding Agents

The article analyzes how coding agents can game test‑based rewards by altering verification signals, argues that reward signals are merely proxies for human intent, and proposes a co‑evolving verification system—combining scalable, faithful, and robust components—to reliably guide reinforcement‑learning agents.

AI safety · coding agents · interactive judge
0 likes · 20 min read
Machine Heart
Jul 2, 2026 · Artificial Intelligence

EMCES: How Episodic Memory Guides Controllable Sample Synthesis to Boost Reinforcement Learning

The paper introduces EMCES, a method that injects episodic memory into controllable diffusion models and uses a hash‑based state representation to generate high‑value synthetic samples, dramatically improving sample efficiency and downstream reinforcement‑learning performance while cutting storage and time costs.

Diffusion Models · Episodic Memory · Hashing
0 likes · 14 min read
Tencent Cloud Developer
Jun 30, 2026 · Artificial Intelligence

Why Claude Leads in Code Generation: A Deep Dive into Its Systemic Advantage

The article analyses why Claude’s code‑writing ability outperforms rivals, tracing its edge to a combination of verifiable‑reward reinforcement learning, Constitutional AI safety guards, a product‑driven data flywheel, multi‑level reward shaping, and continuous human‑in‑the‑loop evaluation on benchmarks such as SWE‑bench.

AI safety · Anthropic · Claude
0 likes · 34 min read
AI Architecture Hub
Jun 30, 2026 · Artificial Intelligence

How to Fine‑Tune LLMs in 2026: Overcome the 30‑40% Error Wall with GRPO and RULER

Teams building LLM‑powered products often hit a wall where 30‑40% of responses are wrong and the model never learns from mistakes; the article explains how modern fine‑tuning using GRPO‑based reinforcement learning and the open‑source ART framework, together with the RULER reward‑free evaluator, lets small open‑source models surpass larger ones in cost, latency, and accuracy.

ART framework · GRPO · LLM fine-tuning
0 likes · 9 min read
Machine Heart
Jun 29, 2026 · Artificial Intelligence

How MWA™'s Long‑Sequence Bidirectional Physical Causal Chain Sets a New Record in Embodied AI

The article presents MWA™, described as the first long‑sequence, bidirectional physical‑causal‑chain world model operating in latent space, details its bidirectional dynamics, latent‑action pre‑training, three‑gradient constraints and AnyPhys negative‑sample system, and reports a 75.2% success rate on the RoboCasa GR1 TableTop benchmark, surpassing leading competitors.

AnyPhys · Embodied AI · RoboCasa benchmark
0 likes · 14 min read
Machine Learning Algorithms & Natural Language Processing
Jun 28, 2026 · Artificial Intelligence

Why the Log‑Ratio Reward in OPD Is Fundamentally Flawed and Should Be Replaced

The paper reveals that the unbounded log‑ratio reward used in vanilla On‑Policy Distillation causes extreme gradient variance, early‑stage instability, and poor final performance, and demonstrates that replacing the log with a bounded Box‑Cox power transform (PowerOPD) resolves these issues while improving accuracy, efficiency, and memory usage.

Box-Cox · OPD · large language models
0 likes · 16 min read
Machine Heart
Jun 28, 2026 · Artificial Intelligence

Can AI Learn on the Job? RLVR, OPSD, and Dreaming for the Next‑Gen Training Paradigm

The article examines Dwarkesh Patel’s view that future AI must move beyond one‑off pre‑training to continual, on‑the‑job learning, discussing Reinforcement Learning with Verifiable Rewards (RLVR), the need for "grindable" tasks, and emerging approaches like on‑policy self‑distillation (OPSD) and "dreaming" to write real‑world experience back into model weights.

AI Training Paradigms · Continual Learning · Dreaming
0 likes · 12 min read
Machine Heart
Jun 28, 2026 · Artificial Intelligence

Why Robot AI Is Harder Than Large‑Scale Models: A First‑Principles Analysis

The article breaks down robot AI to a simple function mapping observations to actions, explains why latency, data diversity, and the need for split architectures make it far more challenging than training large language models, and surveys current solutions from edge‑cloud trade‑offs to action‑chunking and self‑learning.

AI · Cloud Computing · action chunking
0 likes · 17 min read
Data Party THU
Jun 27, 2026 · Artificial Intelligence

AI and Chemists Co-Develop TYR Inhibitors via Dual-Track Optimization

The study presents a dual-track strategy that combines deep reinforcement‑learning‑driven de novo molecular generation with expert‑guided medicinal chemistry to discover and optimize TYR inhibitors, demonstrating how AI expands chemical space while chemists ensure synthetic feasibility, leading to potent candidates such as AI10‑m15 with strong anti‑melanogenesis activity.

AI-driven drug discovery · TYR inhibitor · chemical space exploration
0 likes · 8 min read
Black & White Path
Jun 25, 2026 · Artificial Intelligence

Can DeepSeek‑V4‑Fable’s AI Make Red Teams Redundant?

DeepSeek‑V4‑Fable, an autonomous AI agent built on a Chinese large‑model foundation and refined with SFT and GRPO, achieves a 58.7% overall solve rate on 300 held‑out CTF challenges, prompting a debate on its impact on red‑team workflows and security governance.

AI · CTF · DeepSeek-V4-Fable
0 likes · 9 min read
Machine Heart
Jun 24, 2026 · Industry Insights

Are Humanoid Robots Being Designed for Simulators? A Veteran’s Warning

The article warns that humanoid robot designers are sacrificing mechanical advantages—such as parallel joints and tendon‑driven hands—to make hardware easier for simulation, turning robust engineering principles into a simulation‑driven shortcut that risks limiting real‑world performance.

Hardware · Simulation · humanoid robots
0 likes · 9 min read
Machine Heart
Jun 24, 2026 · Artificial Intelligence

How APEIRIA Breaks the Black‑Box Barrier of 3D MLLMs (ICML 2026)

The paper introduces APEIRIA, a three‑stage curriculum that distills neuro‑symbolic program traces into 3D multi‑modal LLMs, enabling transparent spatial reasoning while preserving open‑vocabulary understanding, and demonstrates strong benchmark gains, modular upgrades, and zero‑shot generalization.

3D MLLM · Chain-of-Thought · Modular AI
0 likes · 11 min read
Ops Development & AI Practice
Jun 23, 2026 · Artificial Intelligence

Sovereign‑Free Routing: How Sakana AI’s Fugu Beats Claude Fable 5 Amid Geopolitical Constraints

Sakana AI’s newly released Fugu system uses a tiny 7B “commander” model to dynamically orchestrate a pool of global and local AI models, achieving a 73.7% SWE‑bench Pro score that outperforms GPT‑5.5 and the heavily sanctioned Claude Fable 5, while illustrating a sovereign‑free routing strategy born from geopolitical and compute limitations.

AI Geopolitics · Benchmarking · Evolutionary Algorithms
0 likes · 8 min read
AI Architecture Hub
Jun 23, 2026 · Artificial Intelligence

Top AI Papers This Week (June 14‑21): SpatialClaw, SkillWeaver, PreAct, and More

This article reviews seven recent AI research papers, detailing how SpatialClaw enables code‑based spatial reasoning for vision‑language models, SkillWeaver introduces compositional skill routing, PreAct compiles agent actions into reusable state‑machines, and other works advance world‑model inference, self‑designing RL environments, collective skill‑tree search, and process‑aligned reinforcement learning for diffusion LLMs.

Diffusion Models · agent reasoning · large language models
0 likes · 15 min read
Machine Learning Algorithms & Natural Language Processing
Jun 21, 2026 · Artificial Intelligence

xOPD Evolution: Mapping Recent OPD Improvements – Rephrased Same Problems vs. New Modules

This article surveys the latest on‑policy distillation (OPD) research, categorizing each work as either a reinterpretation of an existing problem or a modification of a different module, and highlights the experimental findings, design choices, and trade‑offs reported across the papers.

LLM · Model Efficiency · OPD
0 likes · 31 min read
Machine Learning Algorithms & Natural Language Processing
Jun 21, 2026 · Artificial Intelligence

Rank‑Only Rewards Accelerate One‑Step Text‑to‑Image Preference Optimization 3.5×

DrPO introduces a drifting‑field‑based, rank‑only reward mechanism for one‑step text‑to‑image models, enabling reinforcement‑learning post‑training without back‑propagating reward gradients; it speeds up training 3.51× versus DRaFT, works with non‑differentiable rewards, and improves generation quality on SD‑Turbo and SDXL‑Turbo.

DrPO · Drifting Model · HPSv3
0 likes · 11 min read
Machine Heart
Jun 21, 2026 · Artificial Intelligence

Why the Once‑Rejected PPO Algorithm Became a Pillar of Modern LLM Training

The article recounts how Proximal Policy Optimization, initially dismissed by NeurIPS 2017 for limited novelty, later became a cornerstone of RLHF and large‑language‑model training, illustrating how academic evaluation can miss long‑term impact, with parallels to other once‑rejected breakthroughs such as LSTM, SIFT and Dropout.

Algorithm Rejection · NeurIPS · PPO
0 likes · 5 min read
PaperAgent
Jun 21, 2026 · Artificial Intelligence

What Drives AI Model Evolution? OpenAI’s New Findings on Beneficial Traits

OpenAI’s latest study shows that injecting just 5% of beneficial‑trait data into reinforcement‑learning training yields over 80% improvement across more than 50 alignment evaluations, revealing that a few underlying personality traits drive cross‑domain alignment and persist under adversarial pressure.

AI alignment · adversarial robustness · beneficial traits
0 likes · 12 min read
Machine Heart
Jun 21, 2026 · Artificial Intelligence

Is GRPO Obsolete? Why GLM‑5.2 Dropped It and What It Means for RL

GLM‑5.2 replaces the Group Relative Policy Optimization (GRPO) algorithm with a critic‑based PPO approach for long‑horizon tasks, arguing that GRPO’s group comparison breaks down on variable‑length trajectories, a shift that has sparked vigorous debate across the reinforcement‑learning community.

DeepSeek · GLM-5.2 · GRPO
0 likes · 10 min read
Machine Heart
Jun 20, 2026 · Artificial Intelligence

DrPO: Ranking‑Only Rewards Boost One‑Step Text‑to‑Image Preference Optimization by 3.51×

DrPO introduces a ranking‑only reward that builds a drift field from on‑policy image samples to fine‑tune one‑step text‑to‑image models, achieving up to 3.51× faster training on large multimodal rewards, supporting non‑differentiable signals, and demonstrating superior quality across multiple benchmarks.

Drifting Preference Optimization · drift field · non-differentiable reward
0 likes · 14 min read
Machine Learning Algorithms & Natural Language Processing
Jun 18, 2026 · Artificial Intelligence

From Imitation to Optimization: Recent Advances in On-Policy Distillation

This article surveys the latest research on On-Policy Distillation for large language models, covering methods that improve training stability, self‑distillation frameworks, and detailed analyses of when and why OPD succeeds or fails, with concrete experimental results and practical insights.

Entropy-Aware · On‑Policy Distillation · Self‑Distillation
0 likes · 19 min read
Kuaishou Tech
Jun 18, 2026 · Artificial Intelligence

Kuaishou Tech Team Highlights Multiple ICML 2026 Papers Across AI Domains

The Kuaishou technology team reports that several of its papers were accepted at the prestigious ICML 2026 conference—including a spotlight paper on metaphor video understanding, works on causal discovery for irregular time series, image super‑resolution, large‑scale notification dispatch, full‑order ranking, phase‑aware MoE for RL, end‑to‑end e‑commerce search, spatial‑reasoning rewards, a unified SWE benchmark, video temporal grounding, and interpretable transformers—while also inviting attendees to visit their booth B101 in Seoul.

Agentic AI · ICML 2026 · Kuaishou
0 likes · 18 min read
Machine Learning Algorithms & Natural Language Processing
Jun 18, 2026 · Artificial Intelligence

Can a 3B Model Rival Claude Opus 4.5? Benchmark Gaps or Aggressive Post‑Training?

VibeThinker‑3B, a 3‑billion‑parameter language model built on Qwen2.5‑Coder‑3B, achieves scores within the range of 671B‑parameter models on benchmarks such as LiveCodeBench, AIME26, IMO‑AnswerBench and GPQA, thanks to a two‑stage SFT, multi‑domain reinforcement learning, offline self‑distillation and a claim‑reliability (CLR) evaluator that together push its reasoning ability to the frontier.

Parameter Efficiency · VibeThinker-3B · benchmark performance
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
Jun 18, 2026 · Artificial Intelligence

TNT: Dynamic Token Limits Slash Reward Hacking in Mixed Inference Models Below 10%

The paper introduces Thinking‑Based Non‑Thinking (TNT), a reinforcement‑learning approach that sets a per‑question dynamic token ceiling for non‑thinking mode using the answer length from thinking mode, cutting reward‑hacking incidence to under 10% while boosting accuracy and cutting token usage by nearly half across several math benchmarks.

ACL 2026 · NLP · TNT
0 likes · 10 min read
Machine Heart
Jun 17, 2026 · Artificial Intelligence

Why Massive GPU Farms Still Fail to Deliver Enterprise‑Ready AI—and How Jiuzhang’s AI Factory Solves It

Despite a surge to over 140 trillion daily token calls in China, enterprises find general large models can answer but cannot execute business workflows, a gap Jiuzhang Yunji addresses with its AI Factory that combines reinforcement‑learning‑driven professional model production, a five‑capability training platform, and an Inference OS to industrialize AI at scale.

AI Infrastructure · industrial AI · large models
0 likes · 22 min read
Machine Heart
Jun 17, 2026 · Artificial Intelligence

Why RL‑Trained Agents Still Fail to Reason Actively: The Information Self‑Locking Problem

The paper reveals that outcome‑based reinforcement learning often traps LLM agents in an information self‑locking regime where weak action selection and belief tracking prevent proper credit assignment, and introduces AREW, a lightweight advantage‑reweighting method that restores active reasoning across multiple tasks and models.

AREW · Agentic RL · LLM Agents
0 likes · 24 min read
Machine Heart
Jun 15, 2026 · Artificial Intelligence

HyVLA-0.5: Sub‑millimeter UMI Data and Real‑Robot Reinforcement Eliminate Heavy Tele‑operation

HyVLA-0.5, an open‑source embodied VLA model from Tencent Robotics X, leverages over 10,000 hours of sub‑millimeter UMI demonstration data and a novel FlowPRO reinforcement pipeline to achieve more than 90% success on simulated and real‑world tasks, while supporting cross‑embodiment transfer and asynchronous deployment.

FlowPRO · HyVLA-0.5 · UMI data
0 likes · 16 min read
Top Architect
Jun 13, 2026 · Artificial Intelligence

What Is an Inference Large Language Model? A Visual Guide

The article explains reasoning (“inference‑type”) large language models: how they differ from traditional models by breaking questions into reasoning steps, the shift from training‑time to test‑time compute, scaling‑law insights, validation techniques, proposal‑distribution tricks, and the detailed training pipeline of DeepSeek‑R1, while also discussing failed experiments and future directions.

DeepSeek-R1 · inference models · large language models
0 likes · 20 min read
Bilibili Tech
Jun 12, 2026 · Artificial Intelligence

A New UGC Video Evaluation Paradigm Built on 17 Billion Real User Interactions

The paper introduces CASTER, a multimodal AI system that uses Social‑CoT reasoning and the MEDEA framework to simulate diverse audience reactions, benchmarked on the large‑scale CASTER‑Bench dataset, and demonstrates superior performance over GPT‑5.2, Claude‑4.5‑Opus, and traditional VQA methods while already being deployed on Bilibili.

Community resonance · Multimodal AI · Social CoT
0 likes · 9 min read
PaperAgent
Jun 11, 2026 · Artificial Intelligence

184 Ready-to-Use PINN Innovations Powering Nature‑Level Research

The article compiles 184 practical PINN innovations—including theory advances, new training paradigms, and integrations with Bayesian methods, reinforcement learning, Transformers, and graph neural networks—along with ready-to-use source code and starter resources for researchers seeking cutting‑edge physics‑informed neural network solutions.

Adaptive Methods · Graph Neural Networks · PINN
0 likes · 7 min read
Machine Heart
Jun 10, 2026 · Artificial Intelligence

Can AI Bridge the College Application Gap? Alibaba’s Free Volunteer‑Filling Agent Tested by 400K AI Candidates

Alibaba’s free Qianwen high‑school volunteer‑filling (college‑application) Agent combines a knowledge base of 3,000 schools, proactive calendar planning, persistent memory and a reinforcement‑learning‑trained LLM to guide 12.9 million candidates, and its performance was stress‑tested with 400,000 simulated AI applicants.

AI Agent · College Admissions · Education Technology
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
Jun 10, 2026 · Artificial Intelligence

OneReason: Enabling Recommendation Systems to Reason

OneReason introduces a systematic reasoning capability into industrial recommendation models through multi‑stage pre‑training, chain‑of‑thought fine‑tuning, and reinforcement learning, achieving significant gains in click‑through, revenue, and cross‑domain recommendation performance while preserving the underlying language abilities of the base model.

Chain-of-Thought · Recommendation Systems · industrial AI
0 likes · 29 min read
Machine Heart
Jun 9, 2026 · Artificial Intelligence

OneReason: When Recommendation Systems Learn to Reason

The OneReason report details how Kuaishou’s recommendation team injects reasoning into large‑scale recommender models through a four‑level pre‑training pipeline, chain‑of‑thought (CoT) fine‑tuning, and specialized reinforcement learning, achieving significant offline gains and a 10.33% exposure lift in a live A/B test.

CoT · Industry · LLM
0 likes · 31 min read
DataFunSummit
Jun 6, 2026 · Artificial Intelligence

From Traffic Links to Task Management: 1688’s Agentic AI Evolution

The article details how 1688 transformed its platform from a traditional intent‑matching traffic hub into an Agentic AI system that understands business tasks, outlining a three‑step implementation of knowledge, trajectory and environment redesign, dual‑track evolution, novel evaluation methods, and the emerging role of product managers as evaluation engineers.

Agentic AI · Large Language Model · Retrieval-Augmented Generation
0 likes · 13 min read
Alimama Tech
Jun 4, 2026 · Artificial Intelligence

ICML 2026 Highlights: Five Taotian Group Papers Pushing Multimodal AI Boundaries

The article showcases five ICML 2026 papers from the Taotian Group that tackle core multimodal AI challenges—interactive video try‑on, high‑resolution vision, e‑commerce video reasoning, sparse‑reward reinforcement learning, and curriculum learning for large language models—detailing their problem statements, novel solutions, and strong experimental results.

ICML 2026 · Multimodal AI · benchmark
0 likes · 15 min read
Machine Learning Algorithms & Natural Language Processing
Jun 1, 2026 · Artificial Intelligence

MetaAgent-X Enables Agents to Self‑Evolve: A New Paradigm for Native Collaboration

MetaAgent‑X integrates system design and execution within a single base model, using hierarchical rollout and stagewise co‑evolution to jointly train Designer and Executor roles, and achieves significant gains over single‑agent and prior multi‑agent baselines on math and code benchmarks.

AI collaboration · MetaAgent-X · Multi-Agent Systems
0 likes · 13 min read
Machine Learning Algorithms & Natural Language Processing
May 31, 2026 · Artificial Intelligence

MetaAgent-X Enables Self‑Evolving Agents for Native Collaboration

MetaAgent-X tackles the limitation of fixed‑executor multi‑agent systems by jointly training a Designer that creates lightweight Python‑based collaboration scripts and an Executor that runs them, using hierarchical rollouts and stagewise co‑evolution to improve both design and execution across math and code benchmarks.

LLM · MetaAgent-X · Multi-Agent Systems
0 likes · 13 min read
Data Party THU
May 31, 2026 · Artificial Intelligence

Reinforcement Learning Launches a New Paradigm for Spatial Omics Experiment Design

A reinforcement‑learning framework called SOFisher, developed by teams from Fudan and Beijing Institute of Technology, enables intelligent, adaptive selection of field‑of‑view positions in costly spatial‑omics experiments, dramatically improving target detection efficiency and revealing disease‑relevant cellular niches with far fewer measurements.

AI-driven microscopy · Alzheimer's disease · SOFisher
0 likes · 7 min read
Machine Learning Algorithms & Natural Language Processing
May 30, 2026 · Artificial Intelligence

Breaking the Agent Training Bottleneck: Open‑Source ClawGym Data, Training, and Evaluation Pipeline

ClawGym provides a complete open‑source framework for Claw‑style personal agents, linking a 13.5 K synthetic task dataset, black‑box rollout training, sandbox‑parallel reinforcement learning, and a rigorously verified benchmark of 200 tasks, and demonstrates that synthetic data can lift a 30 B model beyond a 235 B baseline.

ClawGym · OpenClaw · agent training
0 likes · 16 min read
Machine Heart
May 30, 2026 · Artificial Intelligence

How Abstract Symbols Cut AI Inference Cost by 11×

The article examines IBM Research's Abstract‑CoT approach, which replaces verbose natural‑language chain‑of‑thought reasoning with a compact abstract token vocabulary, achieving up to an 11‑fold reduction in inference tokens while maintaining comparable accuracy across math, instruction‑following, and multi‑hop QA benchmarks.

AI inference · Abstract-CoT · Chain-of-Thought
0 likes · 11 min read
AI Engineering
May 30, 2026 · Artificial Intelligence

A Unified Toolbox for JEPA and World Model Research: stable-worldmodel

Researchers tackling world‑model problems often rebuild data pipelines, environments, and baselines from scratch, but the open‑source stable‑worldmodel platform consolidates diverse dataset formats, SOTA baselines, hundreds of environments, and multiple solvers, offering a three‑step workflow with demonstrated storage and speed advantages.

JEPA · LanceDB · datasets
0 likes · 4 min read
SuanNi
May 29, 2026 · Artificial Intelligence

SenseNova-U1-8B-MoT-Infographic: Academic Charts, Posters, Recipes

The SenseNova-U1-8B-MoT-Infographic model dramatically improves AI‑generated infographics by enhancing dense‑text rendering, layout stability, and chart accuracy through targeted data, extended mid‑training, and reinforcement‑learning fine‑tuning, achieving top scores on BizGenEval and IGenBench and surpassing many commercial rivals.

AI model · Multimodal · SenseNova
0 likes · 9 min read
Old Zhang's AI Learning
May 29, 2026 · Artificial Intelligence

How NVIDIA’s Polar Enables Any Agent Framework to Plug Into Reinforcement Learning

Integrating diverse AI agent harnesses into reinforcement‑learning pipelines is notoriously labor‑intensive, but NVIDIA’s new Polar system inserts an API‑proxy layer that treats any harness as a black box, enabling seamless rollout recording and trajectory reconstruction, as demonstrated by dramatic performance gains on a 4B model across multiple harnesses.

AI Agent · API Proxy · NVIDIA
0 likes · 10 min read
Bighead's Algorithm Notes
May 29, 2026 · Artificial Intelligence

AlphaCFG: Grammar‑Guided, Interpretable Alpha‑Factor Discovery Framework

AlphaCFG introduces a grammar‑based framework that defines a controllable search space for discovering syntactically valid, financially interpretable alpha factors, using syntax‑aware Monte‑Carlo tree search guided by value and policy networks, and demonstrates superior search efficiency and profitability on Chinese and US stock datasets.

Alpha Factor · Grammar Guided Search · Monte Carlo Tree Search
0 likes · 17 min read
Machine Heart
May 29, 2026 · Artificial Intelligence

DiffusionOPD: A New Online Policy Distillation Paradigm for Multi‑Task Diffusion Models

DiffusionOPD introduces a unified on‑policy distillation framework for diffusion models that decouples single‑task online policy exploration from multi‑task capability integration, training expert teachers per task and distilling their skills into a single student model, achieving faster convergence and higher performance across composition, OCR, and aesthetic tasks.

Diffusion Models · KL divergence · Multi-Task Learning
0 likes · 8 min read
SuanNi
May 28, 2026 · Artificial Intelligence

How a 3.8B Model Beats 6B+ Models Using Just 20% of the Compute – Inside Microsoft Lens

Microsoft’s Lens team shows that a 3.8B‑parameter image‑generation model can match or surpass 6B‑plus models while consuming only about 19% of the GPU compute, thanks to aggressive model compression, dense captioning, mixed‑resolution training, optimized VAE and language encoders, and targeted RL fine‑tuning.

Benchmarking · Model Efficiency · dense captioning
0 likes · 14 min read
Alimama Tech
May 28, 2026 · Artificial Intelligence

TAR: Multi‑Scale Trajectory Model Fixes Granularity Mismatch, Raising CTR >12%

The paper introduces the Trajectory Auto‑Regressive (TAR) model, which uses multi‑scale trajectory generation, a VQ‑VAE latent compression, and a state‑action fusion architecture to address granularity mismatch between fine‑grained decision steps and coarse‑grained feedback in online advertising, achieving over 12% CTR lift, smoother budget pacing, and faster inference compared to prior baselines.

Budget Pacing · Multi-Scale Generation · Online Advertising
0 likes · 18 min read
HyperAI Super Neural
May 28, 2026 · Artificial Intelligence

Large-Model RL Advances: Credit Allocation, Complex Reasoning, Agent Learning

HyperAI curates six cutting‑edge large‑model reinforcement‑learning papers—from ECHO’s free world‑model learning to DelTA’s discriminative token credit, GoLongRL’s capability‑oriented long‑context RL, Anti‑SD’s reverse distillation, RubricEM’s rubric‑guided policy decomposition, and Poly‑EPO’s diversity‑driven exploration—highlighting their methods, benchmarks, and performance gains.

Agent Learning · Complex Reasoning · Credit Assignment
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
May 28, 2026 · Artificial Intelligence

Open‑Source 35B Intern‑S2‑Preview Rivals Trillion‑Parameter Models on Scientific Benchmarks

The open‑source 35‑billion‑parameter Intern‑S2‑Preview model achieves scientific‑task performance comparable to trillion‑parameter models, thanks to end‑to‑end “general‑specialized” training, reinforcement‑learning scaling, and hardware‑aware optimizations, and it outperforms leading closed‑source models on benchmarks such as MolecularIQ and crystal‑structure generation.

InternLM · Large Language Model · Scientific AI
0 likes · 11 min read
Data Party THU
May 27, 2026 · Artificial Intelligence

How Bengio’s TBA Decouples Sampling and Learning to Speed Up LLM RL by 50×

The article explains how large‑language‑model post‑training suffers from rollout bottlenecks, introduces the Trajectory Balance with Asynchrony (TBA) framework that separates a Searcher from a Trainer, reuses off‑policy trajectories via a Trajectory Balance objective, and demonstrates up to 50× speed‑ups while preserving or improving performance on math reasoning, preference fine‑tuning, and automated red‑team tasks.
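The trajectory‑balance idea in this summary can be sketched in a few lines. This is a minimal illustration, not the article's implementation: it assumes the commonly used KL‑regularized form of the objective, (log Z + log π(y|x) − log π_ref(y|x) − r/β)², with log Z estimated as the batch mean of the residuals. The function name `vargrad_tb_loss` and the parameter `beta` are hypothetical.

```python
import math


def vargrad_tb_loss(logp_policy, logp_ref, rewards, beta=1.0):
    """Trajectory-balance loss for several responses to ONE prompt.

    Assumed form (not confirmed by the article):
        (log Z + log pi(y|x) - log pi_ref(y|x) - r / beta) ** 2
    with log Z replaced by the batch mean of the per-sample residuals.
    """
    # residual_i = log pi_ref(y_i|x) + r_i / beta - log pi(y_i|x)
    residuals = [
        lr + r / beta - lp
        for lp, lr, r in zip(logp_policy, logp_ref, rewards)
    ]
    log_z = sum(residuals) / len(residuals)  # batch estimate of log Z
    return sum((log_z - d) ** 2 for d in residuals) / len(residuals)


# Two candidate responses; the reference model rates them equally.
logp_ref = [math.log(0.5), math.log(0.5)]
rewards = [0.0, math.log(3.0)]

# A policy proportional to pi_ref * exp(r / beta) drives the loss to zero.
balanced = vargrad_tb_loss([math.log(0.25), math.log(0.75)], logp_ref, rewards)
# A policy that ignores the reward does not.
unbalanced = vargrad_tb_loss([math.log(0.5), math.log(0.5)], logp_ref, rewards)
```

Nothing in the loss depends on which policy produced the samples, which is why trajectories generated by a separate, possibly stale Searcher can still be used by the Trainer.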

Asynchronous Training · LLM · Off-Policy
0 likes · 9 min read
SuanNi
May 26, 2026 · Artificial Intelligence

Why Tokens Are Burning Out and a Free Claude Opus 4.6‑Level Model Is Coming

The SkyClaw‑v1.0 model from Skywork AI offers a free, soon‑to‑be open‑source large‑language model for agent applications that matches Claude Opus 4.6 in performance while cutting token costs dramatically, and the article details its benchmarks, training pipeline, and deployment recommendations.

Agent · Large Language Model · OpenAI API
0 likes · 7 min read
Machine Heart
May 23, 2026 · Artificial Intelligence

Why Can’t LLMs Directly Copy AlphaGo’s MCTS Success?

The article analyzes why large language models cannot simply adopt AlphaGo’s Monte‑Carlo Tree Search, highlighting credit‑assignment difficulties, gradient‑variance explosion in multi‑step RL, and how AlphaGo’s tight integration of value and policy networks amortizes search in a way LLMs cannot replicate.

AlphaGo · Credit Assignment · LLM
0 likes · 6 min read
Machine Heart
May 22, 2026 · Artificial Intelligence

ATLAS: One Word Unifies Agentic and Latent Visual Reasoning

ATLAS introduces a discrete functional token that simultaneously serves as an agentic operation and a latent reasoning unit, enabling large multimodal models to perform visual tasks without external tools or intermediate image generation, and achieves competitive results through SFT‑plus‑RL training and a token‑level gradient‑anchor technique.

ATLAS · Multimodal AI · Visual Reasoning
0 likes · 11 min read
Machine Learning Algorithms & Natural Language Processing
May 21, 2026 · Artificial Intelligence

Breaking the UED Bottleneck: PACE Locates the Reinforcement‑Learning Zone of Proximal Development

The paper introduces PACE, a Parameter‑Change based Unsupervised Environment Design method that evaluates training levels by the magnitude of induced policy‑parameter updates, offering a low‑variance, computationally cheap signal that consistently outperforms prior UED approaches on MiniGrid and Craftax benchmarks.

Craftax · ICML 2026 · MiniGrid
0 likes · 11 min read
Machine Heart
May 21, 2026 · Artificial Intelligence

Breaking the Traditional UED Bottleneck: Using RL to Precisely Locate the Zone of Proximal Development

The paper introduces PACE, a Parameter Change Environment Design method that evaluates training levels by measuring induced policy parameter updates, offering a low‑variance learning‑progress signal that outperforms prior UED approaches on MiniGrid and Craftax benchmarks, achieving higher success rates and more stable generalization.

Craftax · ICML 2026 · MiniGrid
0 likes · 10 min read
Old Zhang's AI Learning
May 21, 2026 · Artificial Intelligence

SkillOS: Enabling Agents to Self‑Manage Their Skills

SkillOS reframes skill management for LLM agents as a long‑horizon reinforcement‑learning problem, letting a trainable Skill Curator automatically insert, update, or delete markdown‑based skills, which the frozen Agent Executor then consumes, improving memory‑free performance and cross‑task transfer.

LLM Agents · Markdown · Self-Evolving Agents
0 likes · 6 min read
Machine Heart
May 21, 2026 · Artificial Intelligence

OneModel 1.7 Hits 99% LIBERO Success, Bridging ‘Seeing’ to ‘Doing’ with Implicit Predictive Policy

OneModel 1.7 FrontoStria‑RL achieves a 99% average success rate on the LIBERO benchmark, surpassing π0.5, GR00T‑N1.5 and OpenVLA‑OFT, by introducing a Predictive Policy Latent that implicitly links world‑model understanding to action execution and is continuously refined through a reinforcement‑learning loop and a Retrieve‑then‑Steer memory mechanism.

Embodied AI · LIBERO Benchmark · Predictive Policy Latent
0 likes · 15 min read
Data Party THU
May 21, 2026 · Artificial Intelligence

ICML 2026: MedScope Introduces a New Paradigm for Long Medical Video Reasoning—From Watching to Verifying

MedScope proposes a "Think with Videos" paradigm that lets AI models actively locate and verify evidence in long clinical videos, using coarse‑to‑fine tool calling, evidence‑centric training data (ClinVideoSuite) and a grounding‑aware reinforcement learning objective, achieving superior performance on multiple video‑understanding benchmarks.

Evidence-based QA · Long Video Reasoning · Medical Video AI
0 likes · 10 min read
PaperAgent
May 21, 2026 · Artificial Intelligence

238 Promising Reinforcement‑Learning Ideas Likely to Earn CCF‑A Papers in 2026

The article compiles 238 cutting‑edge reinforcement‑learning ideas across 21 research directions, highlights recent breakthroughs such as Sutton’s Intentional Updates, and provides brief overviews of representative papers—including knowledge‑graph, Kalman‑filter, agentic, LLM‑driven, and world‑model approaches—along with links to the accompanying source code.

Agentic RL · Kalman filter · Knowledge Graph
0 likes · 6 min read
Machine Learning Algorithms & Natural Language Processing
May 20, 2026 · Artificial Intelligence

Composer 2.5 Narrows the Gap to Claude Opus 4.7 with Ten‑Fold Cost Savings

Composer 2.5, the latest AI‑coding model from Cursor, claims near‑parity with Claude Opus 4.7 and GPT‑5.5 while delivering up to ten times higher efficiency at a price of $0.50 per million input tokens and $2.50 per million output tokens, backed by novel reinforcement‑learning tricks, massive synthetic data, and a custom Muon optimizer with a dual‑grid HSDP architecture.

AI programming · Composer 2.5 · HSDP
0 likes · 13 min read
DeepHub IMBA
May 19, 2026 · Artificial Intelligence

A 2026 Survey of LLM‑Focused RL: From PPO to DPO, GRPO, and Multi‑Agent RL

The article reviews five years of LLM‑centric reinforcement learning, tracing the evolution from early Q‑learning to PPO, then to Direct Preference Optimization, Group Relative Policy Optimization, and finally multi‑agent RL, detailing each method’s mechanics, strengths, failure modes, practical considerations, and emerging open‑source toolchains.

DPO · GRPO · LLM alignment
0 likes · 33 min read
Machine Heart
May 19, 2026 · Artificial Intelligence

HyperEyes: Parallel Multimodal Search Agents Move from Deep to Wide for Efficiency

HyperEyes introduces a unified‑location‑as‑search (UGS) action space, parallel data synthesis, and a dual‑granularity efficiency‑aware RL framework that enable multimodal agents to perform simultaneous multi‑target retrieval, dramatically reducing interaction rounds while improving accuracy and cost‑efficiency across benchmark evaluations.

Agent · Efficiency · benchmark
0 likes · 9 min read
Machine Heart
May 19, 2026 · Artificial Intelligence

100k‑Token Natural‑Language Reasoning Enables a 30B‑A3B Model to Reach Olympiad Gold Level

A 30B‑A3B model, trained with reverse‑perplexity supervised fine‑tuning, two‑stage reinforcement learning, and a multi‑round generate‑verify‑revise inference loop, achieves gold‑medal performance on IMO, USAMO and IPhO contests using natural‑language reasoning chains of more than 100k tokens, without external tools.

30B-A3B · natural language processing · olympiad AI
0 likes · 11 min read
ByteDance SE Lab
May 19, 2026 · Artificial Intelligence

Introducing Uni-Agent: veRL’s Open‑Source Unified Framework for General‑Purpose Agent Training

Uni-Agent is an open‑source framework that unifies building, running, and training of general AI agents, offering extensible model, tool, and environment modules, scalable sandbox execution via veFaaS, live monitoring, and demonstrated performance gains on large‑scale coding‑agent experiments.

Agent · Scalable Execution · Unified Framework
0 likes · 8 min read
Machine Learning Algorithms & Natural Language Processing
May 19, 2026 · Artificial Intelligence

From P(y|x) to P(y): Reinforcement Learning in Pre‑train Space Unlocks Endogenous Reasoning

The paper introduces PreRL, which removes the input condition to directly optimize the reasoning trajectory (P(y)) of large language models, and combines it with standard RL in Dual Space RL (DSRL), achieving consistent gains on math and out‑of‑distribution benchmarks, faster training, and richer reasoning behaviors.

DSRL · Math Benchmarks · PreRL
0 likes · 11 min read
Machine Heart
May 18, 2026 · Artificial Intelligence

Composer 2.5 Delivers Opus‑level Performance at One‑Tenth the Cost

Composer 2.5, Cursor’s latest LLM, matches Claude Opus 4.7‑level capabilities while costing roughly one‑tenth as much, thanks to larger training scale, precise text‑feedback reinforcement learning, 25× more synthetic tasks, and a new Muon‑HSDP optimizer that boosts efficiency up to ten‑fold.

Composer 2.5 · LLM · Muon optimizer
0 likes · 9 min read
Bighead's Algorithm Notes
May 18, 2026 · Artificial Intelligence

FineFT: Efficient Risk-Aware Reinforcement Learning for Futures Trading

FineFT introduces a three‑stage ensemble reinforcement‑learning framework that tackles high‑leverage reward volatility and missing ability‑boundary awareness in crypto futures trading by using selective TD‑error updates, VAE‑based market‑state boundary detection, and a risk‑aware routing mechanism, ultimately outperforming twelve baselines on six financial metrics while cutting risk by over 40%.

Variational Autoencoder · ensemble methods · financial RL
0 likes · 12 min read
Machine Heart
May 18, 2026 · Artificial Intelligence

ICML 2026: Teaching Large Models to Think and Speak – Turning “When to Speak” into a Learnable Strategy

The paper “When to Think, When to Speak” introduces Side‑by‑Side Interleaved Reasoning, a learnable disclosure policy that lets LLMs alternate between internal thinking and user‑visible answer fragments, reducing content latency while preserving or improving accuracy on math and scientific QA benchmarks.

CoT · LLM · Qwen3
0 likes · 10 min read
Machine Heart
May 17, 2026 · Artificial Intelligence

What Exactly Is a World Model? History, Technology, and the $10 B Bet

The article traces the two parallel, decades‑long research lines that gave rise to video world models (dreaming agents in reinforcement learning, and learning physics from human video), explains how they converged in 2024‑2025, evaluates current capabilities and limitations, and analyzes the $10 billion investment landscape and strategic moves by NVIDIA, OpenAI, and others.

AI research · Simulation · reinforcement learning
0 likes · 32 min read
Machine Heart
May 16, 2026 · Artificial Intelligence

GIPO: Overcoming Utilization Collapse for Efficient Large‑Model Reinforcement Learning

GIPO (Gaussian Importance Sampling Policy Optimization) replaces PPO’s hard clipping with a smooth Gaussian‑weighted trust region, achieving log‑space symmetry and bias‑variance balance that mitigates policy lag and utilization collapse, and demonstrates superior stability and sample efficiency on GridWorld, LIBERO, MetaWorld, and 7‑billion‑parameter VLA experiments.
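The contrast between hard clipping and a smooth Gaussian weighting can be illustrated as follows. The summary does not give GIPO's exact formula, so the weight below, exp(−(log ρ)² / 2σ²) applied to the importance ratio ρ, is only one plausible reading of "Gaussian‑weighted" and "log‑space symmetry". The names `gaussian_weight`, `gaussian_weighted_term`, and `sigma` are hypothetical; `ppo_clip_term` is the standard PPO clipped surrogate for a single sample.

```python
import math


def ppo_clip_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def gaussian_weight(ratio, sigma=0.5):
    """Assumed smooth trust-region weight, Gaussian in log(ratio).

    Because it depends only on log(ratio) ** 2, a ratio of 2 and a
    ratio of 1/2 are down-weighted by exactly the same amount.
    """
    return math.exp(-(math.log(ratio) ** 2) / (2.0 * sigma ** 2))


def gaussian_weighted_term(ratio, advantage, sigma=0.5):
    """Surrogate with the hard clip replaced by the smooth weight (sketch only)."""
    return gaussian_weight(ratio, sigma) * ratio * advantage


# Clipping treats every ratio beyond the boundary identically, while the
# smooth weight decays gradually as the sample becomes more off-policy.
for rho in (0.5, 0.9, 1.0, 1.1, 2.0):
    print(rho, ppo_clip_term(rho, 1.0), round(gaussian_weighted_term(rho, 1.0), 4))
```

Under this reading, stale samples still contribute a small, smoothly decaying signal instead of being cut off at the clip boundary.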

Bias-Variance Tradeoff · GIPO · Large‑Scale Training
0 likes · 17 min read
Machine Heart
May 16, 2026 · Artificial Intelligence

Why More Compute Can't Fix LLM Inference Lag and Why RL Leads to Overtraining

In a deep interview, former Google TPU architect Reiner Pope explains that low‑concurrency fast‑mode services trade higher fees for faster streaming but are limited by memory‑bandwidth bottlenecks, that optimal concurrency balances compute and memory costs, and that pipeline‑parallel sparse expert models and reinforcement‑learning fine‑tuning introduce new inefficiencies and overtraining risks.

LLM · Memory Bandwidth · Overtraining
0 likes · 7 min read
Machine Heart
May 14, 2026 · Artificial Intelligence

Breaking Homogeneous Reasoning: I²B‑LPO Guides RLVR from Repeated Sampling to Effective Exploration

I²B‑LPO is an exploration‑enhancement framework for RLVR that branches rollouts at high‑entropy nodes, injects latent variables via pseudo self‑attention, and filters paths with an information‑bottleneck self‑reward, improving accuracy by up to 5.3% and diversity by up to 7.4% on multiple math reasoning benchmarks.

RLVR · entropy · exploration
0 likes · 14 min read
Machine Heart
May 14, 2026 · Artificial Intelligence

How PsiBot Uses 100,000 Hours of Human Data to Power Embodied Intelligence

PsiBot demonstrates that, with a 100,000‑hour human‑operation dataset captured via exoskeleton gloves and ego‑vision, a world‑model (W0) and reinforcement‑learning policy (R2) can bridge the gap to robot control, offering a scalable alternative to costly teleoperation pipelines.

Embodied AI · data collection · human data
0 likes · 12 min read
Kuaishou Tech
May 13, 2026 · Artificial Intelligence

OneSearch‑V2 Launches: Self‑Distilled Generative Search That Truly Understands Users

OneSearch‑V2 introduces a latent‑reasoning enhanced self‑distillation framework that augments query understanding with thought‑augmented CoT, aligns preferences via direct user behavior feedback, and achieves up to a 4% CTR lift and significant order growth without adding inference cost or latency.

LLM · Self‑Distillation · behavioral feedback
0 likes · 26 min read
Machine Learning Algorithms & Natural Language Processing
May 12, 2026 · Artificial Intelligence

Breaking Off‑Policy Shift: Bengio’s TBA Decouples Sampling and Learning for 50× Faster LLM RL

Trajectory Balance with Asynchrony (TBA) separates sample generation (Searcher) from model updates (Trainer), uses a trajectory‑balance objective to incorporate off‑policy data, and achieves up to 50× speedup in large‑model RL post‑training while preserving or improving performance on math reasoning, preference fine‑tuning, and red‑team tasks.

Asynchronous Training · LLM · Off-Policy
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
May 12, 2026 · Artificial Intelligence

LaST‑R1: Embodied Robot Model Hits 99.9% LIBERO Success via Physical Reasoning

LaST‑R1 presents a new embodied AI framework that inserts latent physical reasoning before action generation and jointly optimizes reasoning and control with LAPO, achieving 99.9% average success on the LIBERO benchmark after a single‑trajectory warm‑up and boosting real‑world task success from 52.5% to 93.75%, while showing superior generalization to unseen objects, backgrounds and lighting.

Embodied AI · LAPO · LIBERO Benchmark
0 likes · 11 min read
Data Party THU
May 12, 2026 · Artificial Intelligence

MathForge: Leveraging Hard Problems in RL to Boost Large‑Model Mathematical Reasoning (ICLR 2026)

MathForge tackles the long‑standing question of which math problems deserve focus in reinforcement‑learning‑based training, introducing a difficulty‑aware optimizer (DGPO) and multi‑aspect question reformulation (MQR) that together prioritize harder‑but‑learnable questions, yielding consistent performance gains across model sizes and modalities.

DGPO · Difficulty‑Aware Optimization · MQR
0 likes · 11 min read
AsiaInfo Technology: New Tech Exploration
May 12, 2026 · Artificial Intelligence

Silicon Brain: Neural Connections, Symbolic Reasoning, and Reinforcement Learning in AGI

This article analyses DeepMind’s three‑pronged AGI paradigm—combining neural networks, symbolic systems, and reinforcement learning—by dissecting AlphaGo, AlphaFold 2, Gemini, and the Genie‑Sima loop, mapping the biological inspiration, outlining engineering and safety challenges, and proposing research directions for large‑scale deployment in communication scenarios.

AGI · DeepMind · Engineering Challenges
0 likes · 21 min read
Machine Learning Algorithms & Natural Language Processing
May 11, 2026 · Artificial Intelligence

Heuristic Learning: A New Reinforcement Learning Paradigm for Continual Learning

The article proposes Heuristic Learning (HL) as a way to tackle continual learning’s catastrophic forgetting by using coding agents that iteratively refine rule‑based policies, showing empirical gains on Atari, MuJoCo, and VizDoom tasks and outlining HL’s benefits, challenges, and future integration with neural networks.

Continual Learning · LLM · coding agents
0 likes · 15 min read
PaperAgent
May 11, 2026 · Artificial Intelligence

SkillOS: How Skill Governance Powers Self‑Evolving AI Agents

SkillOS addresses the one‑off nature of current LLM agents by introducing a closed‑loop system where a trainable Skill Curator continuously extracts, updates, and manages reusable skills from execution traces, leading to measurable gains in success rates, efficiency, and cross‑task generalization.

Grouped Task Streams · LLM Agents · Meta-Strategy Skills
0 likes · 10 min read
Machine Heart
May 10, 2026 · Artificial Intelligence

Embodied AI Unveiled: Ted Xiao Revisits Three Eras of Robot Learning from Google RT‑1/2 to SayCan

In a detailed interview, Ted Xiao, former Google DeepMind researcher, walks through the existence‑proof, foundation‑model, and scaling eras of embodied robot learning, explaining the technical challenges, pivotal decisions, and the evolving role of large language and vision models in robotics.

Embodied AI · Foundation Models · imitation learning
0 likes · 19 min read
DataFunTalk
May 10, 2026 · Artificial Intelligence

DeepSeek vs MCTS: Decoding the ‘Chicken & Liquor’ Dilemma in LLM Training

The article analyzes why DeepSeek’s large‑model training struggles with Monte‑Carlo Tree Search, explains its use of Chain‑of‑Thought prompting, GRPO entropy‑boosting and rejection‑sampling fine‑tuning, compares these methods with Google’s OmegaPRM and PRM approaches, and proposes a concrete MCTS‑driven data‑generation pipeline to overcome the “chicken and liquor” trade‑off.

Chain-of-Thought · DeepSeek · GRPO
0 likes · 14 min read
Machine Heart
May 10, 2026 · Artificial Intelligence

Stop Fragmenting Long Texts: HiLight Lets AI Highlight Key Points Directly

The HiLight approach inserts lightweight highlight tags into full-length inputs, training a small Emphasis Actor to score token importance and guide a frozen large language model, improving performance on tasks like recommendation and QA without modifying the solver, while keeping low latency and training cost.

Evaluation · LLM · highlighting
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
May 9, 2026 · Artificial Intelligence

Heuristic Learning: Reinforcement Without Parameter Updates via .py File

OpenAI researcher Yong Jiayi introduces Heuristic Learning, a reinforcement paradigm that replaces gradient‑based neural network updates with code‑editing driven by GPT‑5.4, achieving the theoretical maximum score of 864 points on Atari Breakout and matching or surpassing PPO on multiple Atari and robot tasks.

Atari Benchmark · Continual Learning · GPT-5.4
0 likes · 8 min read
PaperAgent
May 9, 2026 · Artificial Intelligence

How Anthropic’s Natural Language Autoencoders Open the LLM Black Box

Anthropic’s Natural Language Autoencoders (NLA) translate high‑dimensional LLM activation vectors into readable text, using an Activation Verbalizer and Reconstruction module trained via RL to maximize Fraction of Variance Explained, and reveal internal planning, language bias, tool‑call hallucinations, and hidden reasoning across multiple Claude models.

Activation Verbalizer · Anthropic · Claude
0 likes · 9 min read
DeepHub IMBA
May 8, 2026 · Artificial Intelligence

Building a Custom 8×8 GridWorld with Q‑Learning in Gymnasium

This tutorial walks through creating a custom 8×8 GridWorld environment in Gymnasium, implementing a Q‑Learning agent that learns to navigate from the top‑left corner to the bottom‑right goal while avoiding walls, and visualizing training curves, learned policies, and a performance comparison with a random agent.
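The kind of agent this tutorial describes can be sketched without any dependencies. The version below is a minimal stand‑in, not the tutorial's code: it mirrors the step‑style interface of a Gymnasium environment in plain Python rather than subclassing `gymnasium.Env`, and the wall layout, reward values, and hyperparameters are all assumptions chosen for illustration.

```python
import random

SIZE = 8
# Hypothetical wall layout; the tutorial's actual map is not given here.
WALLS = {(1, 1), (2, 3), (3, 3), (4, 5), (5, 1), (6, 6)}
START, GOAL = (0, 0), (SIZE - 1, SIZE - 1)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Apply one move; hitting a wall or the border leaves the agent in place."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if not (0 <= r < SIZE and 0 <= c < SIZE) or (r, c) in WALLS:
        r, c = state
    done = (r, c) == GOAL
    reward = 1.0 if done else -0.01  # assumed reward shaping
    return (r, c), reward, done


def train(episodes=2000, alpha=0.1, gamma=0.99, epsilon=0.1, max_steps=200, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * 4 for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.randrange(4)  # explore
            else:
                action = max(range(4), key=lambda a: q[state][a])  # exploit
            nxt, reward, done = step(state, action)
            # Q-learning target: bootstrap from the best next action.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def greedy_rollout(q, max_steps=100):
    """Follow the learned greedy policy from the start cell."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(range(4), key=lambda a: q[state][a])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path
```

Calling `greedy_rollout(train())` returns the sequence of cells visited under the learned policy, which is the raw material for the policy visualization and the random‑agent comparison the tutorial mentions.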

GridWorld · Gymnasium · Python
0 likes · 10 min read
Machine Heart
May 8, 2026 · Industry Insights

How SGLang’s $100M Seed Funding Powers the Next‑Gen Open AI Infrastructure

RadixArk raised a $100 million seed round backed by top hardware and AI investors to turn the open‑source SGLang inference engine and the Miles RL framework into day‑0 standards, aiming to democratize AI infrastructure and eliminate bottlenecks from training to inference.

AI Infrastructure · DeepSeek-V4 · Hardware‑agnostic AI
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
May 7, 2026 · Artificial Intelligence

Latent Action RL Shrinks Exploration Space for Multimodal Dialogue Fine‑Tuning

By learning a compact latent‑action space from paired image‑text and large‑scale text data, the authors reduce the RL search space from a vocabulary of over 150k tokens to a 128‑entry codebook, enabling more efficient fine‑tuning of multimodal conversational agents and achieving consistent gains across several RL algorithms.

Multimodal · dialogue agents · latent actions
0 likes · 11 min read