Tagged articles

reinforcement learning

743 articles · Page 3 of 8
PaperAgent
Feb 15, 2026 · Artificial Intelligence

Why Memory Is the Next Critical Infrastructure for AI Agents

This survey reviews over 200 papers to propose a three‑dimensional classification framework for foundation‑agent memory, analyzes paradigm shifts from model‑centric to utility‑centric AI, and outlines memory substrates, cognitive mechanisms, operation strategies, learning paradigms, evaluation metrics, applications, and future research directions.

AI agents · Foundation Models · Memory Mechanisms
0 likes · 10 min read
Top Architect
Feb 14, 2026 · Artificial Intelligence

Why Test‑Time Compute Is the Next Breakthrough for Large Language Models

The article explains how reasoning‑oriented large language models shift the focus from training‑time resources to test‑time computation, detailing scaling laws, verification techniques, reinforcement‑learning pipelines such as DeepSeek‑R1, and methods for distilling reasoning abilities into smaller, consumer‑grade models.

Prompt engineering · inference compute · large language models
0 likes · 19 min read
Baidu Intelligent Cloud Tech Hub
Feb 12, 2026 · Artificial Intelligence

Deploying GLM-5 on Baidu Kunlun P800 XPU with vLLM‑Kunlun Plugin

This article explains how the new GLM-5 large model is adapted to Baidu's Kunlun P800 XPU, detailing the asynchronous reinforcement‑learning framework Slime and optimization techniques such as INT8 quantization and tensor parallelism, and provides step‑by‑step deployment commands using the open‑source vLLM‑Kunlun plugin.

AI acceleration · GLM-5 · INT8 Quantization
0 likes · 6 min read
DeWu Technology
Feb 11, 2026 · Artificial Intelligence

How Generative Models Transform Re‑ranking Architecture for Faster, More Diverse Recommendations

This article examines the evolution of re‑ranking systems from traditional pointwise models to a two‑stage generation‑evaluation framework, compares autoregressive and non‑autoregressive generative approaches, details inference speed optimizations with GPU and model‑server upgrades, and outlines a future end‑to‑end sequence generation architecture enhanced by reinforcement learning and contrastive learning.

AI · Inference Optimization · Recommendation Systems
0 likes · 14 min read
Ximalaya Technology Team
Feb 11, 2026 · Artificial Intelligence

How Ximalaya Used Generative AI to Revolutionize Audio Recommendations

This article details Ximalaya's journey from traditional multi‑stage recommendation pipelines to generative AI‑driven models, covering business challenges, architectural and model differences, phased deployments, knowledge distillation, semantic ID encoding, decoder‑only strategies, extensive offline and online evaluations, and future research directions.

Encoder-Decoder · Generative AI · Recommendation Systems
0 likes · 24 min read
AI Frontier Lectures
Feb 10, 2026 · Artificial Intelligence

Can an 8B Model Outperform GPT‑4 in Faithfulness Detection? Inside FaithLens

FaithLens is an 8‑billion‑parameter model that surpasses GPT‑4.1 and other large models on twelve hallucination‑detection benchmarks while providing high‑quality natural‑language explanations, thanks to a novel data‑synthesis pipeline, three‑dimensional filtering, and rule‑based reinforcement learning.

Efficient Inference · LLM hallucination · explainable AI
0 likes · 12 min read
AI Frontier Lectures
Feb 6, 2026 · Artificial Intelligence

Can Merging Text‑Only and Grounded Visual Reasoning Unlock Better Vision‑Language Models?

The paper introduces Mixture‑of‑Visual‑Thoughts (MoVT), a context‑adaptive reasoning paradigm that integrates pure‑text and visually‑grounded inference modes within a single model, and presents the two‑stage AdaVaR training framework with a novel AdaGRPO reinforcement‑learning algorithm to automatically select the optimal mode for each visual‑language task, achieving consistent gains across eight benchmarks and surpassing strong baselines including GPT‑4o.

AdaVaR · Mixture-of-Visual-Thoughts · Visual Reasoning
0 likes · 16 min read
HyperAI Super Neural
Feb 6, 2026 · Artificial Intelligence

Latest Advances in AI Agents: PaperBanana, SDPO, Lumine, Idea2Story, and Insight Agents

This weekly roundup highlights five recent AI agent papers—PaperBanana for automated academic illustration, SDPO's self‑distillation reinforcement learning, Lumine's open‑world generalist agent, Idea2Story's pipeline for turning research ideas into narratives, and Insight Agents' fast e‑commerce insights—showcasing diverse breakthroughs in multi‑agent frameworks, self‑feedback learning, and real‑world deployment.

AI agents · Multi-Agent Systems · Self‑Distillation
0 likes · 8 min read
Alimama Tech
Feb 5, 2026 · Artificial Intelligence

Can Few-Shot Reinforcement Learning Supercharge Budget-Constrained Auto-Bidding?

This paper introduces ABPlanner, a few‑shot, context‑aware budget planner that enhances budget‑constrained auto‑bidding in online advertising by hierarchically allocating budgets across short‑term stages and training a sequential decision‑maker with deep reinforcement learning, achieving significant gains in simulated and real‑world A/B tests.

Online Advertising · auto-bidding · budget allocation
0 likes · 13 min read
Baobao Algorithm Notes
Feb 4, 2026 · Artificial Intelligence

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning Compiler · LLM Optimization · Linear Attention
0 likes · 22 min read
Xiaomi Tech
Feb 3, 2026 · Artificial Intelligence

Xiaomi’s AI Research Secures Spots at ICLR 2026 – Papers and Key Findings

The International Conference on Learning Representations (ICLR) 2026 accepted multiple Xiaomi papers covering multimodal reasoning, reinforcement learning, GUI agents, autonomous driving, audio generation and benchmark design, each presenting novel frameworks, data‑centric training tricks and strong experimental results that advance the state of the art.

ICLR 2026 · Multimodal Learning · Xiaomi
0 likes · 17 min read
Baidu Geek Talk
Feb 2, 2026 · Artificial Intelligence

How Cloud AI Infra Powers the Next Wave of Embodied Intelligence

This article outlines the rapid rise of embodied intelligence, the explosion of Vision‑Language‑Action (VLA) research, and how cloud‑based AI infrastructure—including multi‑level IaaS, data pipelines, dual‑system model designs, and reinforcement‑learning workflows—addresses emerging scaling and deployment challenges.

VLA · multimodal models · reinforcement learning
0 likes · 13 min read
Data Party THU
Jan 31, 2026 · Artificial Intelligence

Can LLMs Learn While Being Tested? Inside the TTT-Discover Breakthrough

The article examines the Test‑Time Training to Discover (TTT‑Discover) approach, which applies reinforcement learning during inference to let large language models continuously improve on single test problems, and reports strong results across mathematics, GPU kernel optimization, algorithm design, and biology.

AI research · LLM · Scientific Discovery
0 likes · 9 min read
JD Tech
Jan 31, 2026 · Artificial Intelligence

How JD's 9N‑LLM Engine Powers Scalable Generative Recommendation at Massive Scale

This article details JD Retail's 9N‑LLM unified training framework that tackles the massive data, hardware heterogeneity, and algorithmic challenges of generative recommendation by integrating TensorFlow and PyTorch, supporting GPU/NPU, and delivering high‑throughput sample processing, sparse/dense optimization, and flexible reinforcement‑learning capabilities.

GPU/NPU · Ray · large-scale AI
0 likes · 26 min read
JD Retail Technology
Jan 30, 2026 · Artificial Intelligence

How JD’s 9N‑LLM Engine Powers Scalable Generative Recommendation at Industrial Scale

The article details JD Retail’s 9N‑LLM unified training engine, which supports TensorFlow and PyTorch, GPU and NPU, and both traditional and generative recommendation scenarios. It explains the engine's architecture, high‑throughput sample engine, distributed sparse embedding system, five‑stage pipeline, UniAttention accelerator, and reinforcement‑learning capabilities, which together enable terabyte‑scale data, billion‑scale dense parameters, and efficient RL training for real‑world recommendation services.

GPU/NPU · UniAttention · distributed training
0 likes · 26 min read
DaTaobao Tech
Jan 30, 2026 · Artificial Intelligence

Human‑like LLM Replies for Live Digital Hosts: ASR‑Based Style Transfer and Reward Modeling

This article proposes an ASR‑driven pipeline that creates high‑quality AI‑reply vs. human‑like reply pairs, trains a rewrite model and a reward model, and uses GRPO reinforcement learning to generate natural, helpful, and less AI‑sounding responses in digital‑human live streaming, achieving 92% accuracy and 97% helpfulness while improving user experience.

ASR data · LLM · Qwen
0 likes · 20 min read
AI Engineering
Jan 30, 2026 · Artificial Intelligence

Why Letting LLMs Argue Improves Their Reasoning Quality

Google’s recent study of over 8,000 reasoning tasks shows that advanced LLMs like DeepSeek‑R1 spontaneously develop multiple internal “expert” personas that debate, and that activating a discovered “social switch” dramatically raises accuracy, revealing that engineered conflict can enhance AI reasoning.

AI debate · Feature Control · LLM
0 likes · 8 min read
PaperAgent
Jan 30, 2026 · Artificial Intelligence

How LLM‑in‑Sandbox Turns Large Models into General‑Purpose Agents Without Extra Training

The LLM‑in‑Sandbox framework places large language models inside a virtual machine that provides external tool access, persistent storage, and code execution, yielding up to a 24.2% performance boost across six benchmark tasks without additional training, and it scales from zero‑shot to reinforcement‑learning‑enhanced agents while remaining cost‑effective.

Agentic AI · Efficiency · LLM
0 likes · 6 min read
Meituan Technology Team
Jan 29, 2026 · Artificial Intelligence

How LongCat‑Flash‑Thinking‑2601 Achieves Real‑World Generalization for Agents

LongCat‑Flash‑Thinking‑2601, a 560‑billion‑parameter MoE model, combines environment expansion, multi‑environment RL, systematic noise training, a heavy‑thinking reasoning mode, and Zigzag sparse attention to deliver strong benchmark performance and robust real‑world agent capabilities.

Environment Expansion · Large Language Model · Zigzag Attention
0 likes · 14 min read
Alibaba Cloud Developer
Jan 28, 2026 · Artificial Intelligence

How We Built a High‑Performance AI Rental Advisor with One‑Model Tool‑Use and Reinforcement Learning

This article details the design, challenges, and performance gains of an AI‑driven rental recommendation system that replaces a multi‑agent architecture with a single LLM using dynamic tool‑use, introduces a two‑stage reinforcement‑learning pipeline, and achieves sub‑second latency and higher accuracy for complex rental scenarios.

AI recommendation · Large Language Model · Tool Use
0 likes · 19 min read
PaperAgent
Jan 25, 2026 · Artificial Intelligence

How Deep GraphRAG Solves Retrieval’s Three‑Way Dilemma with Hierarchical Search

Deep GraphRAG tackles the three‑fold dilemma of traditional Retrieval‑Augmented Generation by introducing hierarchical global‑to‑local retrieval, a beam‑search dynamic reordering that cuts latency, and a DW‑GRPO reinforcement‑learning module that adaptively weights rewards, achieving near‑state‑of‑the‑art performance with up to 86% faster inference.

AI research · GraphRAG · Hierarchical Retrieval
0 likes · 5 min read
Meituan Technology Team
Jan 23, 2026 · Artificial Intelligence

How EvoCUA Set a New Open‑Source SOTA for Computer‑Use Agents with Evolutionary Learning

EvoCUA, a native computer‑use agent from Meituan, combines a verifiable data‑synthesis engine, a ten‑thousand‑scale sandbox infrastructure, and an experience‑driven learning paradigm to overcome data‑scaling and feedback challenges, achieving a 56.7% success rate on the OSWorld benchmark and surpassing previous open‑source models.

AI Agent · Computer Use · Data Synthesis
0 likes · 27 min read
Amazon Cloud Developers
Jan 22, 2026 · Artificial Intelligence

How Amazon’s New Bedrock and SageMaker Features Cut AI Agent Costs and Speed Up Customization

The article explains how Amazon Bedrock’s reinforcement fine‑tuning and SageMaker AI’s new serverless model customization dramatically lower the cost and latency of AI agents, delivering up to a 73% accuracy boost and shrinking model‑building cycles from months to days for enterprises of any size.

AI Agent · Amazon Bedrock · Amazon SageMaker
0 likes · 9 min read
Tencent Advertising Technology
Jan 22, 2026 · Artificial Intelligence

How Tencent’s Bidding Algorithms Evolved from GMPC to GRB: A Deep Dive into Generative RL for Ads

The article reviews the 2025 evolution of Tencent advertising’s bidding system—from the second‑generation GMPC control algorithm through the third‑generation MRB reinforcement‑learning model to the fourth‑generation generative RL GRB—detailing architectural upgrades, multi‑channel modeling, training pipelines, and experimental gains, and outlines the 2026 AI‑agent roadmap.

Advertising · algorithm · bidding
0 likes · 15 min read
Tencent Cloud Developer
Jan 20, 2026 · Artificial Intelligence

From Transformers to Agents: A Complete Timeline of Large Language Model Evolution

This article traces the evolution of large language models from the 2017 Transformer breakthrough through successive milestones such as BERT, GPT‑3, RLHF alignment, multimodal extensions, open‑source alternatives, and the rise of retrieval‑augmented generation, AI agents, and emerging protocols that shape modern AI applications.

Prompt engineering · RAG · large language models
0 likes · 44 min read
PaperAgent
Jan 19, 2026 · Artificial Intelligence

How Reinforcement Learning Can Boost LLM Reasoning by Shaping Token Distributions

Recent research shows that applying reinforcement learning to large language models can dramatically improve reasoning performance, but that its effectiveness depends on the token distribution produced during pre‑training. This motivates a novel reformulation of cross‑entropy as a single‑step policy gradient with controllable entropy parameters.

LLM · Model Optimization · RL
0 likes · 6 min read
PaperAgent
Jan 16, 2026 · Artificial Intelligence

How a 4B Model Beats 30B Giants: Inside AgentCPM-Explore’s SOTA Performance

AgentCPM-Explore, a 4‑billion‑parameter open‑source model, achieves state‑of‑the‑art results on long‑range exploration tasks, matching or surpassing larger 8B and even 30B models, thanks to a full‑stack infrastructure, novel training tricks, and extensive benchmark evaluations across eight agent‑centric datasets.

Agent · AgentCPM-Explore · Large Language Model
0 likes · 10 min read
Xiaohongshu Tech REDtech
Jan 15, 2026 · Information Security

How Hi-Guard Improves Trustworthy Multimodal Content Moderation with Policy‑Aligned Reasoning

The Hi-Guard framework transforms content moderation by aligning multimodal models with policy rules through hierarchical prompting, a structured taxonomy, and soft‑margin reinforcement learning, achieving significant gains in accuracy, precision, recall, and explainability for large‑scale user‑generated content platforms.

Multimodal AI · content moderation · explainability
0 likes · 9 min read
Bighead's Algorithm Notes
Jan 11, 2026 · Artificial Intelligence

FinRpt: A Multi‑Agent Framework for Automatic Generation and Evaluation of Stock Research Reports

FinRpt introduces a novel multi‑agent pipeline that builds a high‑quality equity research report (ERR) dataset from six financial data sources, defines a comprehensive 11‑metric evaluation suite, and demonstrates that supervised‑fine‑tuned and reinforcement‑learned LLM agents significantly outperform single‑LLM baselines in both accuracy and efficiency.

FinRpt · Financial NLP · LLM
0 likes · 14 min read
AI Engineering
Jan 10, 2026 · Artificial Intelligence

Teaching LLMs to Manage Memory Autonomously, Dropping Manual Rules

Alibaba's new AgeMem framework turns long‑term and short‑term memory management for large language model agents into a learnable reinforcement‑learning task, replacing handcrafted rules with a three‑stage training process and achieving significant benchmark gains.

AgeMem · GRPO · LLM
0 likes · 9 min read
Bighead's Algorithm Notes
Jan 8, 2026 · Artificial Intelligence

Alpha‑R1: Reinforcement‑Learning‑Driven Large‑Model Alpha Factor Selection

Alpha‑R1 integrates reinforcement learning with an 8‑billion‑parameter LLM to jointly process price and news data, creating context‑aware factor embeddings that outperform traditional quantitative and generic LLM baselines on CSI 300 and CSI 1000 portfolios, demonstrating robust alpha‑decay resistance and zero‑shot generalization.

Large Language Model · alpha factor selection · financial AI
0 likes · 16 min read
Amap Tech
Jan 8, 2026 · Artificial Intelligence

How AI Powers Fancy Video Generation for Real‑World POI Scenes

This article details the AI techniques behind Gaode's "Street Ranking" project, explaining the Fancy video concept, the dual training and production pipelines, and the use of SFT, reinforcement learning, MoE‑LoRA, distribution‑matching distillation, and quality‑filtering to achieve 25× faster generation with high aesthetic fidelity.

AI video generation · Distillation · Multimodal
0 likes · 24 min read
Tencent Advertising Technology
Jan 8, 2026 · Artificial Intelligence

How Tencent Boosted Ad Experience by Up to 20% Using Reinforcement‑Learning‑Based Ranking

Tencent's ad tech team redesigned its ad ranking system by adding a parallel user‑experience‑optimized pipeline and evolving from manual CEM tuning to DDPG‑based reinforcement learning, achieving 10‑20% improvements in CTR, repeat‑view rates, and other experience metrics while maintaining overall spend.

Advertising · Ranking · multi-objective optimization
0 likes · 17 min read
Data Party THU
Jan 7, 2026 · Artificial Intelligence

Why the Common KL Penalty in LLM RL Training Is Biased—and How to Fix It

A recent study reveals that the widely used KL regularization in LLM reinforcement learning (RLVR) is mathematically biased, leading to unstable training and poorer generalization, and shows that moving the KL term back to the reward with a simple K1 estimator can boost out‑of‑domain performance by up to 20%.

AI research · KL regularization · LLM training
0 likes · 10 min read
Bighead's Algorithm Notes
Jan 6, 2026 · Artificial Intelligence

FinRS: A Risk‑Sensitive Trading Framework for Real‑World Financial Markets

FinRS integrates hierarchical market analysis, dual decision agents, and multi‑time‑scale reward feedback to enable risk‑aware multi‑stage trading, achieving higher cumulative returns, better Sharpe ratios, and lower maximum drawdowns than existing LLM‑based and reinforcement‑learning baselines across diverse stocks.

FinRS · LLM · financial markets
0 likes · 14 min read
Bighead's Algorithm Notes
Jan 4, 2026 · Artificial Intelligence

How VTA Combines Large‑Model Reasoning for Precise and Explainable Stock Time‑Series Forecasting

The VTA framework integrates large language model reasoning with textual annotation of technical indicators, employs a Time‑GRPO reinforcement‑learning objective and multi‑stage joint conditional training, and achieves state‑of‑the‑art accuracy and expert‑rated interpretability on US, Chinese and European stock datasets.

LLM · VTA · explainability
0 likes · 19 min read
PaperAgent
Dec 29, 2025 · Artificial Intelligence

Unveiling Bottom‑up Policy Optimization: Boosting LLM Reasoning with Internal Strategies

This article introduces Bottom‑up Policy Optimization (BuPO), a novel reinforcement‑learning framework that treats large language models as collections of internal layer and modular policies, revealing distinct inference entropy patterns in Llama and Qwen‑3 and demonstrating superior performance on complex mathematical reasoning benchmarks.

AI research · Bottom-up Optimization · Internal Policy
0 likes · 10 min read
Data Party THU
Dec 28, 2025 · Artificial Intelligence

How Causal Reinforcement Learning Is Shaping Robust, Explainable AI

This comprehensive survey examines the emerging field of Causal Reinforcement Learning, classifies its core techniques, introduces eleven benchmark environments, evaluates four novel algorithms, and outlines challenges and future research directions for building robust, generalizable, and interpretable AI systems.

AI robustness · algorithm evaluation · benchmark environments
0 likes · 12 min read
DataFunTalk
Dec 25, 2025 · Artificial Intelligence

How DeepAgent Redefines General AI Reasoning with Scalable Toolsets

DeepAgent, a new end‑to‑end reasoning agent, integrates autonomous thinking, dynamic tool search, and execution to handle over 16,000 APIs, embodied tasks, and research assistance, achieving state‑of‑the‑art performance on benchmarks like TMDB, ToolBench, ALFWorld, WebShop, and GAIA.

Large Language Model · Memory Management · reasoning
0 likes · 15 min read
PaperAgent
Dec 23, 2025 · Artificial Intelligence

CATArena: A Competitive Benchmark That Turns Agent Scoring into Evolutionary Learning

CATArena introduces a tournament‑style evaluation framework where AI agents iteratively code, compete, and improve across classic board games, using three‑dimensional quantitative scores to measure strategy programming, global learning, and generalization, and reveals how different LLM‑based agents learn and adapt over multiple rounds.

AI benchmark · Agent evaluation · CATArena
0 likes · 8 min read
AI Info Trend
Dec 23, 2025 · Industry Insights

How AI Will Boost Collective Productivity: Key Takeaways from Microsoft’s 2025 Future of Work Report

Microsoft’s 2025 New Future of Work report reveals that AI, driven by breakthroughs in reinforcement learning, is shifting from answering questions to executing complex tasks, while investment and corporate adoption surge unevenly and employee involvement emerges as a critical factor for sustainable productivity gains.

AI · Generative AI · Microsoft Report
0 likes · 8 min read
Bilibili Tech
Dec 19, 2025 · Artificial Intelligence

SABER: Switchable and Balanced Training for Efficient LLM Reasoning

SABER introduces a reinforcement‑learning framework that lets large language models dynamically switch among four token‑budgeted reasoning modes, dramatically cutting inference length while preserving or improving accuracy across math, code, and logic tasks.

Budgeted Computation · Chain-of-Thought · Efficient Reasoning
0 likes · 13 min read
Instant Consumer Technology Team
Dec 16, 2025 · Artificial Intelligence

How Mind Lab Trained a Trillion‑Parameter Agentic Memory with Only 10% GPU Power

This article explains how the Mind Lab team tackled the challenges of training a 1‑trillion‑parameter mixture‑of‑experts model for agentic memory using reinforcement learning, LoRA, and a custom Megatron‑Bridge architecture, achieving a ten‑fold speedup while consuming just a fraction of the usual GPU resources.

AI · Agentic Apps · LoRA
0 likes · 9 min read
Network Intelligence Research Center (NIRC)
Dec 15, 2025 · Artificial Intelligence

Turning LLM-Generated Network Configurations into Verified, Safe Updates with Artanis

The paper introduces Artanis, an intent‑based network configuration update framework that combines large‑language‑model generation with a verification‑feedback loop and reinforcement‑learning optimization, addressing hallucination‑induced errors and ensuring safe, policy‑compliant deployments across diverse network scales.

Intent-based Networking · LLM · configuration management
0 likes · 9 min read
AntTech
Dec 11, 2025 · Artificial Intelligence

Unlock Scalable RL: AReaL’s Decoupled Agentic Framework & Single‑Controller Design

This article explains how the open‑source AReaL framework boosts large‑scale reinforcement learning by separating agent execution from training logic, introducing a decoupled Agentic RL service and a Single‑Controller architecture that improves data flow, fault tolerance, and GPU utilization.

Agentic AI · Open-source · Scalable RL
0 likes · 14 min read
AI Frontier Lectures
Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance Sampling · Mixture of Experts · large language models
0 likes · 12 min read
Data Party THU
Dec 9, 2025 · Artificial Intelligence

Can Robots Learn Human Moves Directly from AI‑Generated Videos? The GenMimic Breakthrough

The GenMimic paper introduces a novel framework that enables humanoid robots to zero‑shot imitate human actions generated by AI video models, presenting a new dataset, a two‑stage 4D reconstruction pipeline, and a reinforcement‑learning strategy with weighted‑tracking and symmetry losses, validated in simulation and on a real 23‑DoF robot.

humanoid robots · reinforcement learning · robotics
0 likes · 11 min read
Baidu Tech Salon
Dec 8, 2025 · Artificial Intelligence

How Baidu’s HuiBosheng AI Live Platform Generates Super‑Human Scripts and Real‑Time Interaction

The article details Baidu HuiBosheng's end‑to‑end AI live‑streaming platform, covering merchant workflow, multimodal product understanding, style‑aware script generation, reinforcement‑learning‑driven smart control, voice and avatar cloning, and a data‑flywheel that continuously improves model performance, illustrated with real‑world GMV results.

AI · Data Flywheel · Live Streaming
0 likes · 20 min read
Bighead's Algorithm Notes
Dec 7, 2025 · Artificial Intelligence

AlphaQuanter: An End‑to‑End Tool‑Orchestrating Agent Using Reinforcement Learning for Stock Trading

AlphaQuanter tackles the three major limitations of existing LLM trading agents by introducing a single‑agent framework that dynamically orchestrates market tools, learns transparent decision policies via reinforcement learning, and achieves state‑of‑the‑art performance on key financial metrics across extensive stock‑level experiments.

AlphaQuanter · LLM Agent · financial AI
0 likes · 13 min read
Baobao Algorithm Notes
Dec 7, 2025 · Artificial Intelligence

Key Lessons from Scaling Agent RL Training: Stability, Tooling, and Reward Design

Drawing on recent months of extensive agent reinforcement‑learning experiments across search, data‑analysis, and multi‑source scenarios, the author shares twelve practical insights covering stability, environment‑reward‑algorithm priorities, tool‑call reliability, reward hacking pitfalls, evaluation alignment, and scaling tricks for larger models.

PPO EWMA · RL scaling · Tool Integration
0 likes · 7 min read
Baobao Algorithm Notes
Dec 7, 2025 · Artificial Intelligence

Can RL Really Boost LLM Reasoning? A Critical Review of Recent Findings

This article critically examines recent RL‑for‑LLM studies, revealing that reinforcement learning improves search efficiency but does not extend the intrinsic reasoning capabilities of base models, and explores the underlying model‑conditioned optimization bias, comparisons with SFT distillation, and the trade‑off with catastrophic forgetting.

Catastrophic Forgetting · LLM · Model Optimization
0 likes · 11 min read
Amazon Cloud Developers
Dec 4, 2025 · Artificial Intelligence

Building Reliable, Efficient AI Agents: Key Takeaways from Swami’s re:Invent 2025 Talk

Swami Sivasubramanian’s re:Invent 2025 keynote outlines a four‑pillar framework—Easy to Build, Efficiency, Trust, Reliability—to move AI agents from POC jail to production, detailing innovations such as Strands Agents SDK, Amazon Nova Act’s 90% reliability, Bedrock’s +66% fine‑tuning accuracy, Episodic Memory, and Reinforcement Fine‑Tuning, all backed by real‑world demos and benchmark results.

AI agents · AWS · Agentic AI
0 likes · 16 min read
AntTech
Dec 4, 2025 · Artificial Intelligence

How AState Reduces Trillion‑Parameter RL Weight Sync to 6 Seconds

AState is a general‑purpose state data management system for reinforcement‑learning tasks that tackles low IO efficiency, slow weight synchronization, and state‑recovery challenges, achieving sub‑10‑second weight sync for trillion‑parameter models through a three‑layer architecture, zero‑redundancy transfers, and hardware‑aware co‑design, with the code openly available on GitHub.

AState · High-performance computing · large models
0 likes · 23 min read
Model Perspective
Dec 1, 2025 · Artificial Intelligence

From AI to Everyday Life: How Reinforcement Learning Shapes Our Choices

This article explains the core concepts of reinforcement learning, illustrates how its reward‑based mechanism appears in media creation, career advancement, education and social media, and warns of the pitfalls of over‑optimizing external rewards while offering practical ways to balance intrinsic motivation and reflective thinking.

Motivation · Self‑Improvement · artificial-intelligence
0 likes · 12 min read
PaperAgent
Dec 1, 2025 · Artificial Intelligence

How Deep Research Turns LLMs into Autonomous AI Scientists

This article surveys the emerging Deep Research (DR) paradigm that upgrades large language models into research agents capable of autonomous planning, multi‑source evidence gathering, memory management, and verifiable long‑form report generation, outlining its stages, core components, training pipeline, and evaluation benchmarks.

AI agents · AI research automation · Deep Research
0 likes · 6 min read
Data Party THU
Nov 29, 2025 · Artificial Intelligence

Unlocking AI Agents: From Fundamentals to Building Your First LLM‑Powered Agent

This comprehensive guide explores the concept of AI agents, detailing their definitions, classifications, and core interaction loops, then walks you through building a functional LLM‑driven travel assistant with step‑by‑step code, tool integration, and practical insights on agent versus workflow paradigms.

AI agents · LLM · Tool Integration
0 likes · 39 min read
Bighead's Algorithm Notes
Nov 28, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Nov 22‑28, 2025)

This digest summarizes five recent arXiv papers on AI-driven portfolio optimization and financial time‑series forecasting, covering G‑Learning with GIRL, transfer‑learning strategies, hybrid LSTM‑PPO frameworks, time‑series foundation models, and a KAN versus LSTM performance comparison, highlighting their methods, datasets, and reported Sharpe improvements.

financial AI · portfolio optimization · reinforcement learning
0 likes · 9 min read
Tencent Advertising Technology
Nov 28, 2025 · Artificial Intelligence

How Retrv-R1 Redefines Universal Multimodal Retrieval with Reasoning‑Driven MLLM

Retrv‑R1, a reasoning‑driven multimodal large language model framework, tackles the precision‑efficiency dilemma of universal multimodal retrieval by introducing a two‑stage coarse‑to‑fine pipeline, an information‑compression module, a detail‑inspection mechanism, and a three‑stage training strategy, achieving SOTA performance across accuracy, efficiency, and generalization benchmarks.

Efficiency · MLLM · detail inspection
0 likes · 21 min read
Alimama Tech
Nov 26, 2025 · Artificial Intelligence

How Alibaba’s ROCK & ROLL Enable Scalable Agentic AI Training

Alibaba’s open‑source ROCK environment sandbox and the ROLL reinforcement‑learning engine together provide a standardized, high‑throughput training loop that lets developers scale Agentic AI from a single machine to thousands of parallel instances while simplifying debugging and resource management.

Agentic AI · Scalable Training · infrastructure
0 likes · 12 min read
ITPUB
Nov 24, 2025 · Artificial Intelligence

Why Memory, Not Size, Is the Next Bottleneck for Large Language Models

In a detailed interview, the CTO of Memory Tensor (Shanghai) explains how limited memory capacity hampers large models, outlines the MemOS memory operating system, discusses information‑theoretic metrics, multimodal extensions, and reinforcement‑learning strategies for scalable, secure, and explainable AI memory management.

AI Architecture · Multimodal AI · information theory
0 likes · 23 min read
Data Party THU
Nov 23, 2025 · Artificial Intelligence

Can a Drone Learn to Land Itself? A Deep Reinforcement Learning Walkthrough

This article walks through the fundamentals of reinforcement learning, builds a custom drone‑landing simulation, defines state and action spaces, designs reward functions, implements a neural‑network policy with Bernoulli sampling, and trains it using REINFORCE with baseline techniques, while exposing common pitfalls such as reward‑cheating.

OpenAI Gym · Python · drone landing
0 likes · 22 min read
AntTech
Nov 21, 2025 · Artificial Intelligence

How Awex Enables Sub‑Second TB‑Scale Weight Sync for Trillion‑Parameter RL Models

Awex is a high‑performance Python framework that synchronizes training and inference weights for trillion‑parameter reinforcement‑learning models in seconds, using unified conversion, metadata management, and NCCL/RDMA transfer plans, dramatically reducing RL training latency and supporting diverse parallel strategies.

High-performance computing · Python · distributed training
0 likes · 17 min read
Xiaohongshu Tech REDtech
Nov 20, 2025 · Artificial Intelligence

How DeepAgent Achieves End‑to‑End Reasoning with 16,000+ Scalable Tools

DeepAgent is a new end‑to‑end reasoning agent that unifies autonomous thinking, dynamic tool search, and execution, handling over 16,000 real APIs, supporting embodied environments and research assistance, and achieving state‑of‑the‑art results across multiple benchmarks through its unified reasoning core, memory‑folding mechanisms, structured memory, and the ToolPO training framework.

AI agents · General AI · Tool Integration
0 likes · 14 min read
360 Zhihui Cloud Developer
Nov 20, 2025 · Artificial Intelligence

How DeepAgent Redefines AI Agents with Memory Folding and ToolPO

This article breaks down the DeepAgent paper, explaining its novel "main model + auxiliary model" architecture, the memory‑folding mechanism that compresses long‑context reasoning, and the ToolPO reinforcement strategy that enables efficient tool discovery and usage.

AI agents · ToolPO · large language models
0 likes · 8 min read
Baobao Algorithm Notes
Nov 20, 2025 · Artificial Intelligence

Why Reinforcement Learning Preserves LLM Generality Better Than Supervised Fine‑Tuning

The article analyzes why reinforcement learning (RL) fine‑tuning retains a large language model's general abilities better than supervised fine‑tuning (SFT), explaining the off‑policy distribution shift of SFT and the on‑policy data consistency, KL penalty, and trust‑region mechanisms that give RL its anti‑forgetting properties.

Catastrophic Forgetting · LLM · On-Policy Data
0 likes · 8 min read
Instant Consumer Technology Team
Nov 19, 2025 · Artificial Intelligence

How We Built an AI‑Powered Automated Video Editing Pipeline for Short‑Form Marketing

This article details the end‑to‑end AIGC video automation system we created—from raw material ingestion and multimodal content understanding to script generation, AI‑driven editing, rendering, and multi‑channel distribution—highlighting architecture, key modules, technical choices, performance results, and lessons learned.

AIGC · Multimodal AI · Script Generation
0 likes · 16 min read
AI Tech Publishing
Nov 17, 2025 · Artificial Intelligence

Frontier AI Models in RL Environments Reveal an Agent Capability Hierarchy

The article evaluates nine cutting‑edge AI models on 150 simulated workplace tasks, showing that even the strongest models complete fewer than 40% of tasks, and uses these results to propose a hierarchical framework of agentic capabilities ranging from tool use to common‑sense reasoning.

AI model evaluation · Tool Use · agentic capabilities
0 likes · 19 min read
Data Party THU
Nov 15, 2025 · Artificial Intelligence

How Reinforcement Learning Powers Intelligent AI Agents and LangGraph Workflows

This article explains how reinforcement learning (RL) underpins intelligent AI agents, covering the Markov Decision Process fundamentals, key RL components, multi‑hop reasoning on knowledge graphs, and a step‑by‑step LangGraph example that integrates an RL‑driven tutoring policy with Python code.

AI agents · Knowledge Graph · LangGraph
0 likes · 17 min read
Kuaishou Tech
Nov 14, 2025 · Artificial Intelligence

How GRPO‑Guard Stops Over‑Optimization in Flow‑Based Visual Generators

This article explains the over‑optimization problem in GRPO‑based flow models, analyzes why importance‑ratio clipping fails, and introduces GRPO‑Guard with RatioNorm and cross‑step gradient balancing, showing through extensive experiments that it stabilizes training and improves image quality across multiple diffusion backbones and tasks.

GRPO-Guard · Generative AI · flow matching
0 likes · 9 min read
Smart Era Software Development
Nov 14, 2025 · Artificial Intelligence

AsyncThink: How Microsoft’s Agentic Organization Turns LLMs into Project Managers

The paper introduces AsyncThink, a novel "agentic organization" paradigm that lets large language models dynamically fork, join, and coordinate multiple reasoning agents, achieving higher accuracy and lower latency than traditional chain‑of‑thought or parallel‑thinking approaches across math, Sudoku, graph, and genetics tasks.

Agentic Organization · AsyncThink · Fork‑Join
0 likes · 8 min read
Bighead's Algorithm Notes
Nov 13, 2025 · Artificial Intelligence

Paper Review: AlphaGAT’s Two‑Stage Learning for Adaptive Portfolio Selection

AlphaGAT introduces a two‑stage learning framework that first extracts robust alpha factors with a CATimeMixer model and a novel loss, then dynamically weights these factors via reinforcement learning (PPO) and a graph attention network, achieving superior portfolio performance across DJIA, HSI, CSI‑100 and crypto markets despite noisy data and distribution shifts.

AlphaGAT · financial AI · graph attention network
0 likes · 14 min read
Alimama Tech
Nov 11, 2025 · Artificial Intelligence

Accelerating LLM RL with Async Training, Mini‑Critics, and Attention Rewards

This article introduces the 3A collaborative framework—Async architecture, Asymmetric PPO mini‑critics, and an attention‑based reasoning rhythm—demonstrating how decoupled, fine‑grained parallel training and structure‑aware reward allocation dramatically improve efficiency, scalability, and interpretability of reinforcement learning for large language models.

Asynchronous Training · attention mechanisms · large language models
0 likes · 23 min read
DataFunTalk
Nov 7, 2025 · Artificial Intelligence

Training-Free GRPO: Low‑Cost Reinforcement Learning for Large Language Models

Training-Free GRPO, proposed by Tencent Youtu Lab, replaces parameter updates with an iteratively built experience knowledge base. This makes reinforcement learning for large language models far cheaper, cutting training expenses from thousands of dollars to under $20 while maintaining strong performance across math reasoning and web search tasks.

AI · cost reduction · reinforcement learning
0 likes · 6 min read
Architect's Guide
Nov 7, 2025 · Artificial Intelligence

Why Multi-Agent Communication Protocols Are Crucial for Next-Gen AI

The article examines the need for Multi‑Agent Communication Protocols (MCP), outlines the limitations of single‑agent and centralized systems, compares MCP with other interaction methods, reviews current research and industrial applications, and highlights future directions such as hardware integration, bio‑inspired mechanisms, and blockchain convergence.

Graph Neural Networks · Multi-Agent Systems · blockchain
0 likes · 9 min read
Kuaishou Tech
Nov 5, 2025 · Artificial Intelligence

How HiPO Gives LLMs a Smart Thinking Switch to Cut Costs and Boost Accuracy

This article explains the overthinking problem of large language models, introduces the HiPO framework with hybrid data cold‑start and reinforcement‑learning reward mechanisms that let models decide when to think deeply or answer directly, and shows experimental results demonstrating significant efficiency gains and accuracy improvements across multiple benchmarks.

Efficiency · Hybrid Policy Optimization · LLM
0 likes · 13 min read
Network Intelligence Research Center (NIRC)
Nov 4, 2025 · Artificial Intelligence

SEAgent: A Self‑Evolving Computer Agent that Learns Software Use Autonomously

SEAgent introduces a self‑evolving framework that enables a GUI agent to master unfamiliar software through autonomous exploration and experience learning, leveraging a curriculum generator, a world‑state model, and GRPO‑based reinforcement with adversarial imitation, achieving state‑of‑the‑art performance on OSWorld.

GUI automation · SEAgent · autonomous learning
0 likes · 6 min read
Bilibili Tech
Oct 31, 2025 · Artificial Intelligence

RIVAL: Adversarial RL Framework Elevates Conversational Subtitle Translation

RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization) introduces an adversarial game between a reward model and a translation LLM, combining qualitative preference rewards with quantitative metrics like BLEU, to overcome distribution shift in RLHF and achieve superior performance on conversational subtitle and WMT translation tasks.

BLEU · LLM · Machine Translation
0 likes · 13 min read
Baobao Algorithm Notes
Oct 31, 2025 · Artificial Intelligence

How Risk‑Sensitive Reinforcement Learning Improves LLM Pass@K Performance

This article analyzes why standard reinforcement learning can degrade Pass@K metrics after fine‑tuning large language models, introduces a risk‑sensitive RL objective that reshapes the advantage estimator, and demonstrates through bandit and mathematical‑reasoning experiments that the RS‑GRPO method consistently boosts diversity and overall Pass@K scores across multiple LLMs.

Exploration-Exploitation · LLM fine-tuning · RS-GRPO
0 likes · 12 min read
Baobao Algorithm Notes
Oct 31, 2025 · Artificial Intelligence

Unlocking LLM RL Scaling: The Best Practices from Meta’s New Study

Meta’s recent paper reveals a sigmoid‑shaped scaling law for LLM reinforcement learning, presents extensive 40‑k GPU‑hour experiments, compares various RL designs such as PPO‑off‑policy‑k and Pipeline‑RL‑k, and distills the findings into a practical “ScaleRL” recipe that improves performance and efficiency.

LLM · RL Optimization · Scaling Law
0 likes · 10 min read
DataFunTalk
Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while reducing compute cost by up to tenfold, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

Model Efficiency · On‑Policy Distillation · knowledge distillation
0 likes · 15 min read
Baobao Algorithm Notes
Oct 30, 2025 · Artificial Intelligence

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.

AI · LLM · Reward Modeling
0 likes · 11 min read
Instant Consumer Technology Team
Oct 28, 2025 · Artificial Intelligence

How 7B AgentFlow Beats 200B GPT-4o: Small Models, Big Wins

AgentFlow, a Stanford-led multi‑agent system built on a 7B model, outperforms massive models like GPT‑4o across ten benchmarks by leveraging modular agents, on‑policy learning, and a novel Flow‑GRPO training engine that solves sparse‑reward, long‑horizon challenges.

AgentFlow · Multi-Agent Systems · Small Model Performance
0 likes · 12 min read
Data Party THU
Oct 24, 2025 · Artificial Intelligence

BREEZE: Enhancing Zero‑Shot Reinforcement Learning with Behavioral Regularization

The paper introduces BREEZE, a behavior‑regularized zero‑shot RL framework that improves stability, policy extraction, and representation quality by combining in‑sample learning, task‑conditioned diffusion models, and expressive attention‑based architectures, achieving near‑state‑of‑the‑art performance on benchmarks like ExORL and D4RL Kitchen.

Offline RL · behavioral regularization · diffusion model
0 likes · 3 min read
Data Party THU
Oct 22, 2025 · Artificial Intelligence

Demystifying Large‑Model Reinforcement Learning: From MDP Basics to Bellman and Advantage Functions

This article provides a comprehensive introduction to reinforcement learning for large language models, covering the Markov Decision Process formulation, the four core elements of RL, state‑value and action‑value functions, Bellman equations, and the advantage function that underpins modern policy‑gradient algorithms.

AI Fundamentals · Bellman equation · Large Language Model
0 likes · 13 min read
Data Party THU
Oct 21, 2025 · Artificial Intelligence

Why DQN Overestimates Q‑Values and How Double DQN Fixes It

The article explains how DQN’s use of the max operator introduces a maximization bias that leads to overestimated Q‑values, and shows how Double DQN decouples action selection from value evaluation to reduce this bias, improving stability and performance on Atari benchmarks.

DQN · Double DQN · algorithm analysis
0 likes · 7 min read
Data Thinking Notes
Oct 19, 2025 · Artificial Intelligence

How GSPO Improves Stability in Large Language Model Training

GSPO (Group Sequence Policy Optimization) is a reinforcement‑learning algorithm for LLMs that replaces token‑level GRPO with sequence‑level optimization, addressing instability in ultra‑large model training, especially for long‑sequence and MoE architectures, by aligning reward granularity and reducing variance.

GRPO · GSPO · large language models
0 likes · 11 min read
Xiaohe Frontend Team
Oct 15, 2025 · Artificial Intelligence

REFRAG: Using Tiny Models to Compress RAG for Faster, Smarter AI

Meta’s new REFRAG framework lets a lightweight encoder compress retrieved text into semantic tags, enabling large language models to answer queries with far fewer tokens, lower latency, and higher throughput, while preserving core meaning and allowing flexible placement of compressed information within prompts.

LLM efficiency · RAG · model compression
0 likes · 8 min read
Meituan Technology Team
Oct 15, 2025 · Artificial Intelligence

What’s New in Large Model Research? Top Meituan AI Papers Up to Oct 2025

This curated list showcases Meituan’s latest large‑model breakthroughs and academic papers up to October 2025, spanning LLM system optimizations, multimodal generation, evaluation benchmarks, quantization techniques, and reinforcement‑learning‑driven improvements, offering researchers valuable insights and resources across the AI landscape.

AI research · Benchmarking · Multimodal AI
0 likes · 10 min read