Tagged articles

reinforcement learning

743 articles · Page 3 of 8

Feb 15, 2026 · Artificial Intelligence

Why Memory Is the Next Critical Infrastructure for AI Agents

This survey reviews over 200 papers to propose a three‑dimensional classification framework for foundation‑agent memory, analyzes paradigm shifts from model‑centric to utility‑centric AI, and outlines memory substrates, cognitive mechanisms, operation strategies, learning paradigms, evaluation metrics, applications, and future research directions.

AI agents · Foundation Models · Memory Mechanisms

0 likes · 10 min read

Top Architect

Feb 14, 2026 · Artificial Intelligence

Why Test‑Time Compute Is the Next Breakthrough for Large Language Models

The article explains how reasoning‑oriented large language models shift the focus from training‑time resources to test‑time computation, detailing scaling laws, verification techniques, reinforcement‑learning pipelines such as DeepSeek‑R1, and methods for distilling reasoning abilities into smaller, consumer‑grade models.

Prompt engineering · inference compute · large language models

0 likes · 19 min read

Baidu Intelligent Cloud Tech Hub

Feb 12, 2026 · Artificial Intelligence

Deploying GLM-5 on Baidu Kunlun P800 XPU with vLLM‑Kunlun Plugin

This article explains how the GLM-5 large model is adapted to Baidu's Kunlun P800 XPU, covers the asynchronous reinforcement learning framework Slime and optimization techniques such as INT8 quantization and tensor parallelism, and provides step‑by‑step deployment commands using the open‑source vLLM‑Kunlun plugin.

AI acceleration · GLM-5 · INT8 Quantization

0 likes · 6 min read

Open Source Tech Hub

Feb 12, 2026 · Artificial Intelligence

How GLM-5 Advances AI with Bigger Scale, Sparse Attention, and Agent Capabilities

GLM-5, a new large language model with 744 B parameters trained on 28.5 T tokens, incorporates DeepSeek Sparse Attention and an asynchronous RL system called slime, delivering strong benchmark gains on complex systems engineering and long‑horizon agent tasks and surpassing many open‑source competitors.

AI · Benchmarking · GLM-5

0 likes · 6 min read

DeWu Technology

Feb 11, 2026 · Artificial Intelligence

How Generative Models Transform Re‑ranking Architecture for Faster, More Diverse Recommendations

This article examines the evolution of re‑ranking systems from traditional pointwise models to a two‑stage generation‑evaluation framework, compares autoregressive and non‑autoregressive generative approaches, details inference speed optimizations with GPU and model‑server upgrades, and outlines a future end‑to‑end sequence generation architecture enhanced by reinforcement learning and contrastive learning.

AI · Inference Optimization · Recommendation Systems

0 likes · 14 min read

Ximalaya Technology Team

Feb 11, 2026 · Artificial Intelligence

How Ximalaya Used Generative AI to Revolutionize Audio Recommendations

This article details Ximalaya's journey from traditional multi‑stage recommendation pipelines to generative AI‑driven models, covering business challenges, architectural and model differences, phased deployments, knowledge distillation, semantic ID encoding, decoder‑only strategies, extensive offline and online evaluations, and future research directions.

Encoder-Decoder · Generative AI · Recommendation Systems

0 likes · 24 min read

Machine Learning Algorithms & Natural Language Processing

Feb 10, 2026 · Artificial Intelligence

Why Self‑Distillation Is the 2026 Keyword for Continual Learning in Large Models

At the start of 2026, self‑distillation dominates the most cited LLM papers, offering a teacher‑free way for large models to continually acquire new knowledge while preserving existing capabilities.

Continual Learning · Self‑Distillation · large language models

0 likes · 9 min read

AI Frontier Lectures

Feb 10, 2026 · Artificial Intelligence

Can an 8B Model Outperform GPT‑4 in Faithfulness Detection? Inside FaithLens

FaithLens is an 8‑billion‑parameter model that surpasses GPT‑4.1 and other large models on twelve hallucination‑detection benchmarks while providing high‑quality natural‑language explanations, thanks to a novel data‑synthesis pipeline, three‑dimensional filtering, and rule‑based reinforcement learning.

Efficient Inference · LLM hallucination · explainable AI

0 likes · 12 min read

AI Frontier Lectures

Feb 10, 2026 · Artificial Intelligence

How SE‑Bench Uncovers the Hidden Challenges of Knowledge Internalization in Self‑Evolving AI

The paper introduces SE‑Bench, a code‑based benchmark that isolates knowledge internalization by obfuscating NumPy APIs, and uses it to reveal the Open‑Book paradox, the RL gap, and the potential of self‑play for true self‑evolution in large language models.

AI · SE-Bench · knowledge internalization

0 likes · 17 min read

PaperAgent

Feb 7, 2026 · Artificial Intelligence

Can 13 Parameters Match Full‑Scale Fine‑Tuning? TinyLoRA’s RL Breakthrough

TinyLoRA, a Meta‑proposed method that fine‑tunes Qwen2.5‑7B with only 13 trainable parameters (26 bytes), achieves 91% accuracy on GSM8K under reinforcement learning, revealing that ultra‑low‑parameter RL can rival full‑scale supervised fine‑tuning.

GSM8K · Qwen2.5 · TinyLoRA

0 likes · 7 min read

AI Frontier Lectures

Feb 6, 2026 · Artificial Intelligence

Can Merging Text‑Only and Grounded Visual Reasoning Unlock Better Vision‑Language Models?

The paper introduces Mixture‑of‑Visual‑Thoughts (MoVT), a context‑adaptive reasoning paradigm that integrates pure‑text and visually‑grounded inference modes within a single model, and presents the two‑stage AdaVaR training framework with a novel AdaGRPO reinforcement‑learning algorithm to automatically select the optimal mode for each visual‑language task, achieving consistent gains across eight benchmarks and surpassing strong baselines including GPT‑4o.

AdaVaR · Mixture-of-Visual-Thoughts · Visual Reasoning

0 likes · 16 min read

HyperAI Super Neural

Feb 6, 2026 · Artificial Intelligence

Latest Advances in AI Agents: PaperBanana, SDPO, Lumine, Idea2Story, and Insight Agents

This weekly roundup highlights five recent AI agent papers—PaperBanana for automated academic illustration, SDPO's self‑distillation reinforcement learning, Lumine's open‑world generalist agent, Idea2Story's pipeline for turning research ideas into narratives, and Insight Agents' fast e‑commerce insights—showcasing diverse breakthroughs in multi‑agent frameworks, self‑feedback learning, and real‑world deployment.

AI agents · Multi-Agent Systems · Self‑Distillation

0 likes · 8 min read

Alimama Tech

Feb 5, 2026 · Artificial Intelligence

Can Few-Shot Reinforcement Learning Supercharge Budget-Constrained Auto-Bidding?

This paper introduces ABPlanner, a few‑shot, context‑aware budget planner that enhances budget‑constrained auto‑bidding in online advertising by hierarchically allocating budgets across short‑term stages and training a sequential decision‑maker with deep reinforcement learning, achieving significant gains in simulated and real‑world A/B tests.

Online Advertising · auto-bidding · budget allocation

0 likes · 13 min read

Baobao Algorithm Notes

Feb 4, 2026 · Artificial Intelligence

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning Compiler · LLM Optimization · Linear Attention

0 likes · 22 min read

Baobao Algorithm Notes

Feb 4, 2026 · Artificial Intelligence

Mastering Reinforcement Learning: From Basics to Advanced Agentic RL Techniques

This comprehensive guide walks through reinforcement learning fundamentals, MDP modeling, value functions, Bellman equations, and key algorithms such as Q‑learning, REINFORCE, PPO, DPO, and GRPO, then contrasts LLM‑RL with Agentic‑RL and surveys leading industry frameworks and real‑world applications.

Agentic RL · LLM · RL Algorithms

0 likes · 42 min read

Xiaomi Tech

Feb 3, 2026 · Artificial Intelligence

Xiaomi’s AI Research Secures Spots on ICLR 2026 – Papers and Key Findings

The International Conference on Learning Representations (ICLR) 2026 accepted multiple Xiaomi papers covering multimodal reasoning, reinforcement learning, GUI agents, autonomous driving, audio generation and benchmark design, each presenting novel frameworks, data‑centric training tricks and strong experimental results that advance the state of the art.

ICLR 2026 · Multimodal Learning · Xiaomi

0 likes · 17 min read

Baidu Geek Talk

Feb 2, 2026 · Artificial Intelligence

How Cloud AI Infra Powers the Next Wave of Embodied Intelligence

This article outlines the rapid rise of embodied intelligence, the explosion of Vision‑Language‑Action (VLA) research, and how cloud‑based AI infrastructure—including multi‑level IaaS, data pipelines, dual‑system model designs, and reinforcement‑learning workflows—addresses emerging scaling and deployment challenges.

VLA · multimodal models · reinforcement learning

0 likes · 13 min read

DaTaobao Tech

Feb 2, 2026 · Operations

How Policy Regularization Boosts Deep Reinforcement Learning for Large‑Scale Inventory Management

This article presents DeepStock, a deep reinforcement learning framework with policy regularization that integrates classic inventory heuristics, achieving 7% turnover reduction and multi‑million cost savings across millions of SKU‑warehouse pairs in Alibaba's self‑operated ecosystem.

deep learning · industrial AI · inventory management

0 likes · 18 min read

Data Party THU

Jan 31, 2026 · Artificial Intelligence

Can LLMs Learn While Being Tested? Inside the TTT-Discover Breakthrough

The article examines the Test‑Time Training to Discover (TTT‑Discover) approach, which applies reinforcement learning during inference to let large language models continuously improve on single test problems, and reports strong results across mathematics, GPU kernel optimization, algorithm design, and biology.

AI research · LLM · Scientific Discovery

0 likes · 9 min read

JD Tech

Jan 31, 2026 · Artificial Intelligence

How JD's 9N‑LLM Engine Powers Scalable Generative Recommendation at Massive Scale

This article details JD Retail's 9N‑LLM unified training framework that tackles the massive data, hardware heterogeneity, and algorithmic challenges of generative recommendation by integrating TensorFlow and PyTorch, supporting GPU/NPU, and delivering high‑throughput sample processing, sparse/dense optimization, and flexible reinforcement‑learning capabilities.

GPU/NPU · Ray · large-scale AI

0 likes · 26 min read

JD Retail Technology

Jan 30, 2026 · Artificial Intelligence

How JD’s 9N‑LLM Engine Powers Scalable Generative Recommendation at Industrial Scale

The article details JD Retail’s 9N‑LLM unified training engine, which supports TensorFlow and PyTorch, GPU and NPU, and both traditional and generative recommendation scenarios. It explains the architecture, high‑throughput sample engine, distributed sparse embedding system, five‑stage pipeline, UniAttention accelerator, and reinforcement‑learning capabilities that together enable TB‑scale data, billion‑scale dense parameters, and efficient RL training for real‑world recommendation services.

GPU/NPU · UniAttention · distributed training

0 likes · 26 min read

DaTaobao Tech

Jan 30, 2026 · Artificial Intelligence

Human‑like LLM Replies for Live Digital Hosts: ASR‑Based Style Transfer and Reward Modeling

This article proposes an ASR‑driven pipeline that creates high‑quality AI‑reply vs. human‑like reply pairs, trains a rewrite model and a reward model, and uses GRPO reinforcement learning to generate natural, helpful, and less AI‑sounding responses in digital‑human live streaming, achieving 92% accuracy and 97% helpfulness while improving user experience.

ASR data · LLM · Qwen

0 likes · 20 min read

AI Engineering

Jan 30, 2026 · Artificial Intelligence

Why Letting LLMs Argue Improves Their Reasoning Quality

Google’s recent study of over 8,000 reasoning tasks shows that advanced LLMs like DeepSeek‑R1 spontaneously develop multiple internal “expert” personas that debate, and that activating a discovered “social switch” dramatically raises accuracy, revealing that engineered conflict can enhance AI reasoning.

AI debate · Feature Control · LLM

0 likes · 8 min read

PaperAgent

Jan 30, 2026 · Artificial Intelligence

How LLM‑in‑Sandbox Turns Large Models into General‑Purpose Agents Without Extra Training

The LLM‑in‑Sandbox framework places large language models inside a virtual machine that provides external tool access, persistent storage, and code execution, yielding up to a 24.2% performance boost across six benchmark tasks without additional training, and it scales from zero‑shot to reinforcement‑learning‑enhanced agents while remaining cost‑effective.

Agentic AI · Efficiency · LLM

0 likes · 6 min read

Meituan Technology Team

Jan 29, 2026 · Artificial Intelligence

How LongCat‑Flash‑Thinking‑2601 Achieves Real‑World Generalization for Agents

LongCat‑Flash‑Thinking‑2601, a 560‑billion‑parameter MoE model, combines environment expansion, multi‑environment RL, systematic noise training, a heavy‑thinking reasoning mode, and Zigzag sparse attention to deliver strong benchmark performance and robust real‑world agent capabilities.

Environment Expansion · Large Language Model · Zigzag Attention

0 likes · 14 min read

Alibaba Cloud Developer

Jan 28, 2026 · Artificial Intelligence

How We Built a High‑Performance AI Rental Advisor with One‑Model Tool‑Use and Reinforcement Learning

This article details the design, challenges, and performance gains of an AI‑driven rental recommendation system that replaces a multi‑agent architecture with a single LLM using dynamic tool‑use, introduces a two‑stage reinforcement‑learning pipeline, and achieves sub‑second latency and higher accuracy for complex rental scenarios.

AI recommendation · Large Language Model · Tool Use

0 likes · 19 min read

PaperAgent

Jan 25, 2026 · Artificial Intelligence

How Deep GraphRAG Solves Retrieval’s Three‑Way Dilemma with Hierarchical Search

Deep GraphRAG tackles the three‑fold dilemma of traditional Retrieval‑Augmented Generation by introducing hierarchical global‑to‑local retrieval, a beam‑search dynamic reordering that cuts latency, and a DW‑GRPO reinforcement‑learning module that adaptively weights rewards, achieving near‑state‑of‑the‑art performance with up to 86% faster inference.

AI research · GraphRAG · Hierarchical Retrieval

0 likes · 5 min read

Meituan Technology Team

Jan 23, 2026 · Artificial Intelligence

How EvoCUA Set a New Open‑Source SOTA for Computer‑Use Agents with Evolutionary Learning

EvoCUA, a native computer‑use agent from Meituan, combines a verifiable data‑synthesis engine, a ten‑thousand‑scale sandbox infrastructure, and an experience‑driven learning paradigm to overcome data‑scaling and feedback challenges, achieving a 56.7% success rate on the OSWorld benchmark and surpassing previous open‑source models.

AI Agent · Computer Use · Data Synthesis

0 likes · 27 min read

Amazon Cloud Developers

Jan 22, 2026 · Artificial Intelligence

How Amazon’s New Bedrock and SageMaker Features Cut AI Agent Costs and Speed Up Customization

The article explains how Amazon Bedrock’s reinforcement fine‑tuning and SageMaker AI’s new serverless model customization dramatically lower the cost and latency of AI agents, delivering up to a 73% accuracy boost and shrinking model‑building cycles from months to days for enterprises of any size.

AI Agent · Amazon Bedrock · Amazon SageMaker

0 likes · 9 min read

Tencent Advertising Technology

Jan 22, 2026 · Artificial Intelligence

How Tencent’s Bidding Algorithms Evolved from GMPC to GRB: A Deep Dive into Generative RL for Ads

The article reviews the 2025 evolution of Tencent advertising’s bidding system—from the second‑generation GMPC control algorithm through the third‑generation MRB reinforcement‑learning model to the fourth‑generation generative RL GRB—detailing architectural upgrades, multi‑channel modeling, training pipelines, and experimental gains, and outlines the 2026 AI‑agent roadmap.

Advertising · algorithm · bidding

0 likes · 15 min read

Tencent Cloud Developer

Jan 20, 2026 · Artificial Intelligence

From Transformers to Agents: A Complete Timeline of Large Language Model Evolution

This article traces the evolution of large language models from the 2017 Transformer breakthrough through successive milestones such as BERT, GPT‑3, RLHF alignment, multimodal extensions, open‑source alternatives, and the rise of retrieval‑augmented generation, AI agents, and emerging protocols that shape modern AI applications.

Prompt engineering · RAG · large language models

0 likes · 44 min read

PaperAgent

Jan 19, 2026 · Artificial Intelligence

How Reinforcement Learning Can Boost LLM Reasoning by Shaping Token Distributions

Recent research shows that applying reinforcement learning to large language models can dramatically improve reasoning performance, but its effectiveness depends on the token distribution produced during pre‑training, prompting a novel rewrite of cross‑entropy as a single‑step policy gradient with controllable entropy parameters.

LLM · Model Optimization · RL

0 likes · 6 min read

PaperAgent

Jan 16, 2026 · Artificial Intelligence

How a 4B Model Beats 30B Giants: Inside AgentCPM-Explore’s SOTA Performance

AgentCPM-Explore, a 4‑billion‑parameter open‑source model, achieves state‑of‑the‑art results on long‑range exploration tasks, matching or surpassing larger 8B and even 30B models, thanks to a full‑stack infrastructure, novel training tricks, and extensive benchmark evaluations across eight agent‑centric datasets.

Agent · AgentCPM-Explore · Large Language Model

0 likes · 10 min read

Xiaohongshu Tech REDtech

Jan 15, 2026 · Information Security

How Hi-Guard Improves Trustworthy Multimodal Content Moderation with Policy‑Aligned Reasoning

The Hi-Guard framework transforms content moderation by aligning multimodal models with policy rules through hierarchical prompting, a structured taxonomy, and soft‑margin reinforcement learning, achieving significant gains in accuracy, precision, recall, and explainability for large‑scale user‑generated content platforms.

Multimodal AI · content moderation · explainability

0 likes · 9 min read

Amap Tech

Jan 14, 2026 · Artificial Intelligence

How ArenaRL Enables Open‑World Travel Agents to Learn via Comparative Reinforcement Learning

Gaode Maps and Tongyi DeepResearch unveil ArenaRL, an open‑domain reinforcement‑learning framework that replaces absolute scoring with relative ranking, uses self‑play and a linear‑complexity tournament, and demonstrates measurable gains on POI ranking and complex travel‑planning tasks.

ArenaRL · Ranking · open-domain

0 likes · 8 min read

Bighead's Algorithm Notes

Jan 11, 2026 · Artificial Intelligence

FinRpt: A Multi‑Agent Framework for Automatic Generation and Evaluation of Stock Research Reports

FinRpt introduces a novel multi‑agent pipeline that builds a high‑quality equity research report (ERR) dataset from six financial data sources, defines a comprehensive 11‑metric evaluation suite, and demonstrates that supervised‑fine‑tuned and reinforcement‑learned LLM agents significantly outperform single LLM baselines in both accuracy and efficiency.

FinRpt · Financial NLP · LLM

0 likes · 14 min read

AI Engineering

Jan 10, 2026 · Artificial Intelligence

Teaching LLMs to Manage Memory Autonomously, Dropping Manual Rules

Alibaba's new AgeMem framework turns long‑term and short‑term memory management for large language model agents into a learnable reinforcement‑learning task, replacing handcrafted rules with a three‑stage training process and achieving significant benchmark gains.

AgeMem · GRPO · LLM

0 likes · 9 min read

Bighead's Algorithm Notes

Jan 8, 2026 · Artificial Intelligence

Alpha‑R1: Reinforcement‑Learning‑Driven Large‑Model Alpha Factor Selection

Alpha‑R1 integrates reinforcement learning with an 8‑billion‑parameter LLM to jointly process price and news data, creating context‑aware factor embeddings that outperform traditional quantitative and generic LLM baselines on CSI 300 and CSI 1000 portfolios, demonstrating robust alpha‑decay resistance and zero‑shot generalization.

Large Language Model · alpha factor selection · financial AI

0 likes · 16 min read

Amap Tech

Jan 8, 2026 · Artificial Intelligence

How AI Powers Fancy Video Generation for Real‑World POI Scenes

This article details the AI techniques behind Gaode's "Street Ranking" project, explaining the Fancy video concept, the dual training and production pipelines, and the use of SFT, reinforcement learning, MoE‑LoRA, distribution‑matching distillation, and quality‑filtering to achieve 25× faster generation with high aesthetic fidelity.

AI video generation · Distillation · Multimodal

0 likes · 24 min read

Tencent Advertising Technology

Jan 8, 2026 · Artificial Intelligence

How Tencent Boosted Ad Experience by Up to 20% Using Reinforcement‑Learning‑Based Ranking

Tencent's ad tech team redesigned its ad ranking system by adding a parallel user‑experience‑optimized pipeline and evolving from manual CEM tuning to DDPG‑based reinforcement learning, achieving 10‑20% improvements in CTR, repeat‑view rates, and other experience metrics while maintaining overall spend.

Advertising · Ranking · multi-objective optimization

0 likes · 17 min read

Data Party THU

Jan 7, 2026 · Artificial Intelligence

Why the Common KL Penalty in LLM RL Training Is Biased—and How to Fix It

A recent study reveals that the widely used KL regularization in reinforcement learning with verifiable rewards (RLVR) for LLMs is mathematically biased, leading to unstable training and poorer generalization, and shows that moving the KL term back to the reward with a simple K1 estimator can boost out‑of‑domain performance by up to 20%.

AI research · KL regularization · LLM training

0 likes · 10 min read

Bighead's Algorithm Notes

Jan 6, 2026 · Artificial Intelligence

FinRS: A Risk‑Sensitive Trading Framework for Real‑World Financial Markets

FinRS integrates hierarchical market analysis, dual decision agents, and multi‑time‑scale reward feedback to enable risk‑aware multi‑stage trading, achieving higher cumulative returns, better Sharpe ratios, and lower maximum drawdowns than existing LLM‑based and reinforcement‑learning baselines across diverse stocks.

FinRS · LLM · financial markets

0 likes · 14 min read

Bighead's Algorithm Notes

Jan 4, 2026 · Artificial Intelligence

How VTA Combines Large‑Model Reasoning for Precise and Explainable Stock Time‑Series Forecasting

The VTA framework integrates large language model reasoning with textual annotation of technical indicators, employs a Time‑GRPO reinforcement‑learning objective and multi‑stage joint conditional training, and achieves state‑of‑the‑art accuracy and expert‑rated interpretability on US, Chinese and European stock datasets.

LLM · VTA · explainability

0 likes · 19 min read

PaperAgent

Dec 29, 2025 · Artificial Intelligence

Unveiling Bottom‑up Policy Optimization: Boosting LLM Reasoning with Internal Strategies

This article introduces Bottom‑up Policy Optimization (BuPO), a novel reinforcement‑learning framework that treats large language models as collections of internal layer and modular policies, revealing distinct inference entropy patterns in Llama and Qwen‑3 and demonstrating superior performance on complex mathematical reasoning benchmarks.

AI research · Bottom-up Optimization · Internal Policy

0 likes · 10 min read

Data Party THU

Dec 28, 2025 · Artificial Intelligence

How Causal Reinforcement Learning Is Shaping Robust, Explainable AI

This comprehensive survey examines the emerging field of Causal Reinforcement Learning, classifies its core techniques, introduces eleven benchmark environments, evaluates four novel algorithms, and outlines challenges and future research directions for building robust, generalizable, and interpretable AI systems.

AI robustness · algorithm evaluation · benchmark environments

0 likes · 12 min read

DataFunTalk

Dec 25, 2025 · Artificial Intelligence

How DeepAgent Redefines General AI Reasoning with Scalable Toolsets

DeepAgent, a new end‑to‑end reasoning agent, integrates autonomous thinking, dynamic tool search, and execution to handle over 16,000 APIs, embodied tasks, and research assistance, achieving state‑of‑the‑art performance on benchmarks like TMDB, ToolBench, ALFWorld, WebShop, and GAIA.

Large Language Model · Memory Management · reasoning

0 likes · 15 min read

Tencent Advertising Technology

Dec 25, 2025 · Artificial Intelligence

How RAVEN Leverages Reinforcement Reasoning for Precise Ad Video Violation Grounding

RAVEN is a reinforcement‑reasoning framework that combines curriculum learning with hierarchical rewards to enable multimodal large language models to accurately locate and classify violation segments in advertisement videos, even under noisy, large‑scale industrial data.

Advertising · curriculum-learning · multimodal LLM

0 likes · 17 min read

PaperAgent

Dec 23, 2025 · Artificial Intelligence

CATArena: A Competitive Benchmark That Turns Agent Scoring into Evolutionary Learning

CATArena introduces a tournament‑style evaluation framework where AI agents iteratively code, compete, and improve across classic board games, using three‑dimensional quantitative scores to measure strategy programming, global learning, and generalization, and reveals how different LLM‑based agents learn and adapt over multiple rounds.

AI benchmark · Agent evaluation · CATArena

0 likes · 8 min read

AI Info Trend

Dec 23, 2025 · Industry Insights

How AI Will Boost Collective Productivity: Key Takeaways from Microsoft’s 2025 Future of Work Report

Microsoft’s 2025 New Future of Work report reveals that AI, driven by breakthroughs in reinforcement learning, is shifting from answering questions to executing complex tasks, while investment and corporate adoption surge unevenly and employee involvement emerges as a critical factor for sustainable productivity gains.

AI · Generative AI · Microsoft Report

0 likes · 8 min read

Bilibili Tech

Dec 19, 2025 · Artificial Intelligence

SABER: Switchable and Balanced Training for Efficient LLM Reasoning

SABER introduces a reinforcement‑learning framework that lets large language models dynamically switch among four token‑budgeted reasoning modes, dramatically cutting reasoning length while preserving or improving accuracy across math, code, and logic tasks.

Budgeted Computation · Chain-of-Thought · Efficient Reasoning

0 likes · 13 min read

Instant Consumer Technology Team

Dec 16, 2025 · Artificial Intelligence

How Mind Lab Trained a Trillion‑Parameter Agentic Memory with Only 10% GPU Power

This article explains how the Mind Lab team tackled the challenges of training a 1‑trillion‑parameter mixture‑of‑experts model for agentic memory using reinforcement learning, LoRA, and a custom Megatron‑Bridge architecture, achieving a ten‑fold speedup while consuming just a fraction of the usual GPU resources.

AI · Agentic Apps · LoRA

0 likes · 9 min read

Network Intelligence Research Center (NIRC)

Dec 15, 2025 · Artificial Intelligence

Turning LLM-Generated Network Configurations into Verified, Safe Updates with Artanis

The paper introduces Artanis, an intent‑based network configuration update framework that combines large‑language‑model generation with a verification‑feedback loop and reinforcement‑learning optimization, addressing hallucination‑induced errors and ensuring safe, policy‑compliant deployments across diverse network scales.

Intent-based Networking · LLM · configuration management

0 likes · 9 min read

Bighead's Algorithm Notes

Dec 13, 2025 · Artificial Intelligence

Key Quantitative Finance Papers (Dec 6‑12 2025) – AI‑Driven Insights

This article summarizes ten recent arXiv papers (Dec 6‑12 2025) that explore AI‑driven techniques—from neural‑network ranking and reinforcement learning to quantum models and LLM agents—for quantitative finance and investment decision‑making.

cryptocurrency · financial networks · large language models

0 likes · 18 min read

AntTech

Dec 11, 2025 · Artificial Intelligence

Unlock Scalable RL: AReaL’s Decoupled Agentic Framework & Single‑Controller Design

This article explains how the open‑source AReaL framework boosts large‑scale reinforcement learning by separating agent execution from training logic, introducing a decoupled Agentic RL service and a Single‑Controller architecture that improves data flow, fault tolerance, and GPU utilization.

Agentic AI · Open-source · Scalable RL

0 likes · 14 min read

AI Frontier Lectures

Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance Sampling · Mixture of Experts · large language models

0 likes · 12 min read

Data Party THU

Dec 9, 2025 · Artificial Intelligence

Can Robots Learn Human Moves Directly from AI‑Generated Videos? The GenMimic Breakthrough

The GenMimic paper introduces a novel framework that enables humanoid robots to zero‑shot imitate human actions generated by AI video models, presenting a new dataset, a two‑stage 4D reconstruction pipeline, and a reinforcement‑learning strategy with weighted‑tracking and symmetry losses, validated in simulation and on a real 23‑DoF robot.

humanoid robots · reinforcement learning · robotics

0 likes · 11 min read

Baidu Tech Salon

Dec 8, 2025 · Artificial Intelligence

How Baidu’s HuiBosheng AI Live Platform Generates Super‑Human Scripts and Real‑Time Interaction

The article details Baidu HuiBosheng's end‑to‑end AI live‑streaming platform, covering merchant workflow, multimodal product understanding, style‑aware script generation, reinforcement‑learning‑driven smart control, voice and avatar cloning, and a data‑flywheel that continuously improves model performance, illustrated with real‑world GMV results.

AI · Data Flywheel · Live Streaming

0 likes · 20 min read

Bighead's Algorithm Notes

Dec 7, 2025 · Artificial Intelligence

AlphaQuanter: An End‑to‑End Tool‑Orchestrating Agent Using Reinforcement Learning for Stock Trading

AlphaQuanter tackles the three major limitations of existing LLM trading agents by introducing a single‑agent framework that dynamically orchestrates market tools, learns transparent decision policies via reinforcement learning, and achieves state‑of‑the‑art performance on key financial metrics across extensive stock‑level experiments.

AlphaQuanter · LLM Agent · financial AI

0 likes · 13 min read

Baobao Algorithm Notes

Dec 7, 2025 · Artificial Intelligence

Key Lessons from Scaling Agent RL Training: Stability, Tooling, and Reward Design

Drawing on several months of extensive agent reinforcement‑learning experiments across search, data‑analysis, and multi‑source scenarios, the author shares twelve practical insights covering stability, environment‑reward‑algorithm priorities, tool‑call reliability, reward hacking pitfalls, evaluation alignment, and scaling tricks for larger models.

PPO EWMA · RL scaling · Tool Integration

0 likes · 7 min read

Baobao Algorithm Notes

Dec 7, 2025 · Artificial Intelligence

Can RL Really Boost LLM Reasoning? A Critical Review of Recent Findings

This article critically examines recent RL‑for‑LLM studies, revealing that reinforcement learning improves search efficiency but does not extend the intrinsic reasoning capabilities of base models, and explores the underlying model‑conditioned optimization bias, comparisons with SFT distillation, and the trade‑off with catastrophic forgetting.

Catastrophic Forgetting · LLM · Model Optimization

0 likes · 11 min read

Amazon Cloud Developers

Dec 4, 2025 · Artificial Intelligence

Building Reliable, Efficient AI Agents: Key Takeaways from Swami’s re:Invent 2025 Talk

Swami Sivasubramanian’s re:Invent 2025 keynote outlines a four‑pillar framework—Easy to Build, Efficiency, Trust, Reliability—to move AI agents from POC jail to production, detailing innovations such as Strands Agents SDK, Amazon Nova Act’s 90% reliability, Bedrock’s +66% fine‑tuning accuracy, Episodic Memory, and Reinforcement Fine‑Tuning, all backed by real‑world demos and benchmark results.

AI agents · AWS · Agentic AI

0 likes · 16 min read

AntTech

Dec 4, 2025 · Artificial Intelligence

How AState Reduces Trillion‑Parameter RL Weight Sync to 6 Seconds

AState is a general‑purpose state data management system for reinforcement‑learning tasks that tackles low IO efficiency, slow weight synchronization, and state‑recovery challenges, achieving sub‑10‑second weight sync for trillion‑parameter models through a three‑layer architecture, zero‑redundancy transfers, and hardware‑aware co‑design, with the code openly available on GitHub.

AState · High-performance computing · large models

0 likes · 23 min read

Model Perspective

Dec 1, 2025 · Artificial Intelligence

From AI to Everyday Life: How Reinforcement Learning Shapes Our Choices

This article explains the core concepts of reinforcement learning, illustrates how its reward‑based mechanism appears in media creation, career advancement, education and social media, and warns of the pitfalls of over‑optimizing external rewards while offering practical ways to balance intrinsic motivation and reflective thinking.

Motivation · Self‑Improvement · artificial-intelligence

0 likes · 12 min read

PaperAgent

Dec 1, 2025 · Artificial Intelligence

How Deep Research Turns LLMs into Autonomous AI Scientists

This article surveys the emerging Deep Research (DR) paradigm that upgrades large language models into research agents capable of autonomous planning, multi‑source evidence gathering, memory management, and verifiable long‑form report generation, outlining its stages, core components, training pipeline, and evaluation benchmarks.

AI agents · AI research automation · Deep Research

0 likes · 6 min read

Data Party THU

Nov 29, 2025 · Artificial Intelligence

Unlocking AI Agents: From Fundamentals to Building Your First LLM‑Powered Agent

This comprehensive guide explores the concept of AI agents, detailing their definitions, classifications, and core interaction loops, then walks you through building a functional LLM‑driven travel assistant with step‑by‑step code, tool integration, and practical insights on agent versus workflow paradigms.

AI agents · LLM · Tool Integration

0 likes · 39 min read

Bighead's Algorithm Notes

Nov 28, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Nov 22‑28, 2025)

This digest summarizes five recent arXiv papers on AI-driven portfolio optimization and financial time‑series forecasting, covering G‑Learning with GIRL, transfer‑learning strategies, hybrid LSTM‑PPO frameworks, time‑series foundation models, and a KAN versus LSTM performance comparison, highlighting their methods, datasets, and reported Sharpe improvements.

financial AI · portfolio optimization · reinforcement learning

0 likes · 9 min read

Tencent Advertising Technology

Nov 28, 2025 · Artificial Intelligence

How Retrv-R1 Redefines Universal Multimodal Retrieval with Reasoning‑Driven MLLM

Retrv‑R1, a reasoning‑driven multimodal large language model framework, tackles the precision‑efficiency dilemma of universal multimodal retrieval by introducing a two‑stage coarse‑to‑fine pipeline, an information‑compression module, a detail‑inspection mechanism, and a three‑stage training strategy, achieving SOTA performance across accuracy, efficiency, and generalization benchmarks.

Efficiency · MLLM · detail inspection

0 likes · 21 min read

Alimama Tech

Nov 26, 2025 · Artificial Intelligence

How Alibaba’s ROCK & ROLL Enable Scalable Agentic AI Training

Alibaba’s open‑source ROCK environment sandbox and the ROLL reinforcement‑learning engine together provide a standardized, high‑throughput training loop that lets developers scale Agentic AI from a single machine to thousands of parallel instances while simplifying debugging and resource management.

Agentic AI · Scalable Training · infrastructure

0 likes · 12 min read

ITPUB

Nov 24, 2025 · Artificial Intelligence

Why Memory, Not Size, Is the Next Bottleneck for Large Language Models

In a detailed interview, the CTO of Memory Tensor (Shanghai) explains how limited memory capacity hampers large models, outlines the MemOS memory operating system, discusses information‑theoretic metrics, multimodal extensions, and reinforcement‑learning strategies for scalable, secure, and explainable AI memory management.

AI Architecture · Multimodal AI · information theory

0 likes · 23 min read

Data Party THU

Nov 23, 2025 · Artificial Intelligence

Can a Drone Learn to Land Itself? A Deep Reinforcement Learning Walkthrough

This article walks through the fundamentals of reinforcement learning, builds a custom drone‑landing simulation, defines state and action spaces, designs reward functions, implements a neural‑network policy with Bernoulli sampling, and trains it using REINFORCE with baseline techniques, while exposing common pitfalls such as reward‑cheating.

OpenAI Gym · Python · drone landing

0 likes · 22 min read

AntTech

Nov 21, 2025 · Artificial Intelligence

How Awex Enables Sub‑Second TB‑Scale Weight Sync for Trillion‑Parameter RL Models

Awex is a high‑performance Python framework that synchronizes training and inference weights for trillion‑parameter reinforcement‑learning models in seconds, using unified conversion, metadata management, and NCCL/RDMA transfer plans, dramatically reducing RL training latency and supporting diverse parallel strategies.

High-performance computing · Python · distributed training

0 likes · 17 min read

Xiaohongshu Tech REDtech

Nov 20, 2025 · Artificial Intelligence

How DeepAgent Achieves End‑to‑End Reasoning with 16,000+ Scalable Tools

DeepAgent is a new end‑to‑end reasoning agent that unifies autonomous thinking, dynamic tool search, and execution, handling over 16,000 real APIs, supporting embodied environments and research assistance, and achieving state‑of‑the‑art results across multiple benchmarks through its unified reasoning core, memory‑folding mechanisms, structured memory, and the ToolPO training framework.

AI agents · General AI · Tool Integration

0 likes · 14 min read

360 Zhihui Cloud Developer

Nov 20, 2025 · Artificial Intelligence

How DeepAgent Redefines AI Agents with Memory Folding and ToolPO

This article breaks down the DeepAgent paper, explaining its novel "main model + auxiliary model" architecture, the memory‑folding mechanism that compresses long‑context reasoning, and the ToolPO reinforcement strategy that enables efficient tool discovery and usage.

AI agents · ToolPO · large language models

0 likes · 8 min read

Baobao Algorithm Notes

Nov 20, 2025 · Artificial Intelligence

Why Reinforcement Learning Preserves LLM Generality Better Than Supervised Fine‑Tuning

The article analyzes why reinforcement learning (RL) fine‑tuning retains a large language model's general abilities better than supervised fine‑tuning (SFT), explaining the off‑policy distribution shift of SFT and the on‑policy data consistency, KL penalty, and trust‑region mechanisms that give RL its anti‑forgetting properties.

Catastrophic Forgetting · LLM · On-Policy Data

0 likes · 8 min read

Instant Consumer Technology Team

Nov 19, 2025 · Artificial Intelligence

How We Built an AI‑Powered Automated Video Editing Pipeline for Short‑Form Marketing

This article details the end‑to‑end AIGC video automation system we created—from raw material ingestion and multimodal content understanding to script generation, AI‑driven editing, rendering, and multi‑channel distribution—highlighting architecture, key modules, technical choices, performance results, and lessons learned.

AIGC · Multimodal AI · Script Generation

0 likes · 16 min read

AI Tech Publishing

Nov 17, 2025 · Artificial Intelligence

Frontier AI Models in RL Environments Reveal an Agent Capability Hierarchy

The article evaluates nine cutting‑edge AI models on 150 simulated workplace tasks, showing that even the strongest models complete fewer than 40% of tasks, and uses these results to propose a hierarchical framework of agentic capabilities ranging from tool use to common‑sense reasoning.

AI model evaluation · Tool Use · agentic capabilities

0 likes · 19 min read

Data Party THU

Nov 15, 2025 · Artificial Intelligence

How Reinforcement Learning Powers Intelligent AI Agents and LangGraph Workflows

This article explains how reinforcement learning (RL) underpins intelligent AI agents, covering the Markov Decision Process fundamentals, key RL components, multi‑hop reasoning on knowledge graphs, and a step‑by‑step LangGraph example that integrates an RL‑driven tutoring policy with Python code.

AI agents · Knowledge Graph · LangGraph

0 likes · 17 min read

Kuaishou Tech

Nov 14, 2025 · Artificial Intelligence

How GRPO‑Guard Stops Over‑Optimization in Flow‑Based Visual Generators

This article explains the over‑optimization problem in GRPO‑based flow models, analyzes why importance‑ratio clipping fails, and introduces GRPO‑Guard with RatioNorm and cross‑step gradient balancing, showing through extensive experiments that it stabilizes training and improves image quality across multiple diffusion backbones and tasks.

GRPO-Guard · Generative AI · flow matching

0 likes · 9 min read

Smart Era Software Development

Nov 14, 2025 · Artificial Intelligence

AsyncThink: How Microsoft’s Agentic Organization Turns LLMs into Project Managers

The paper introduces AsyncThink, a novel "agentic organization" paradigm that lets large language models dynamically fork, join, and coordinate multiple reasoning agents, achieving higher accuracy and lower latency than traditional chain‑of‑thought or parallel‑thinking approaches across math, Sudoku, graph, and genetics tasks.

Agentic Organization · AsyncThink · Fork‑Join

0 likes · 8 min read

Bighead's Algorithm Notes

Nov 13, 2025 · Artificial Intelligence

Paper Review: AlphaGAT’s Two‑Stage Learning for Adaptive Portfolio Selection

AlphaGAT introduces a two‑stage learning framework that first extracts robust alpha factors with a CATimeMixer model and a novel loss, then dynamically weights these factors via reinforcement learning (PPO) and a graph attention network, achieving superior portfolio performance across DJIA, HSI, CSI‑100 and crypto markets despite noisy data and distribution shifts.

AlphaGAT · financial AI · graph attention network

0 likes · 14 min read

Alimama Tech

Nov 11, 2025 · Artificial Intelligence

Accelerating LLM RL with Async Training, Mini‑Critics, and Attention Rewards

This article introduces the 3A collaborative framework—Async architecture, Asymmetric PPO mini‑critics, and an attention‑based reasoning rhythm—demonstrating how decoupled, fine‑grained parallel training and structure‑aware reward allocation dramatically improve efficiency, scalability, and interpretability of reinforcement learning for large language models.

Asynchronous Training · attention mechanisms · large language models

0 likes · 23 min read

DataFunTalk

Nov 7, 2025 · Artificial Intelligence

Training-Free GRPO: Low‑Cost Reinforcement Learning for Large Language Models

Training-Free GRPO, proposed by Tencent Youtu Lab, eliminates parameter updates by iteratively building an experience knowledge base. This makes reinforcement learning for large language models far cheaper, cutting training expenses from thousands of dollars to under $20 while maintaining strong performance on math reasoning and web search tasks.

AI · cost reduction · reinforcement learning

0 likes · 6 min read

Architect's Guide

Nov 7, 2025 · Artificial Intelligence

Why Multi-Agent Communication Protocols Are Crucial for Next-Gen AI

The article examines the need for Multi‑Agent Communication Protocols (MCP), outlines the limitations of single‑agent and centralized systems, compares MCP with other interaction methods, reviews current research and industrial applications, and highlights future directions such as hardware integration, bio‑inspired mechanisms, and blockchain convergence.

Graph Neural Networks · Multi-Agent Systems · blockchain

0 likes · 9 min read

Kuaishou Tech

Nov 5, 2025 · Artificial Intelligence

How HiPO Gives LLMs a Smart Thinking Switch to Cut Costs and Boost Accuracy

This article explains the overthinking problem of large language models, introduces the HiPO framework with hybrid data cold‑start and reinforcement‑learning reward mechanisms that let models decide when to think deeply or answer directly, and shows experimental results demonstrating significant efficiency gains and accuracy improvements across multiple benchmarks.

Efficiency · Hybrid Policy Optimization · LLM

0 likes · 13 min read

Network Intelligence Research Center (NIRC)

Nov 4, 2025 · Artificial Intelligence

SEAgent: A Self‑Evolving Computer Agent that Learns Software Use Autonomously

SEAgent introduces a self‑evolving framework that enables a GUI agent to master unfamiliar software through autonomous exploration and experience learning, leveraging a curriculum generator, a world‑state model, and GRPO‑based reinforcement with adversarial imitation, achieving state‑of‑the‑art performance on OSWorld.

GUI automation · SEAgent · autonomous learning

0 likes · 6 min read

DataFunSummit

Nov 3, 2025 · Artificial Intelligence

Boosting Private Agentic AI: LLM Post‑Training, DPO, and End‑to‑End Evaluation

This article shares practical experience on deploying private Agentic AI, covering background, architecture design, challenges, data generation, reinforcement learning with DPO, automated multi‑dimensional evaluation, and future plans for open‑source models and richer tool integration.

Agentic AI · DPO · LLM fine-tuning

0 likes · 16 min read

Data Party THU

Oct 31, 2025 · Artificial Intelligence

How SPG’s Sandwich Gradient Boosts Diffusion Language Models Across Four Benchmarks

The SPG algorithm introduces a sandwiched policy gradient that uses computable lower and upper evidence bounds to align reinforcement learning for discrete diffusion language models, achieving faster convergence, higher peaks, and lower variance on four major reasoning benchmarks.

Diffusion language model · EUBO · SPG

0 likes · 9 min read

Bilibili Tech

Oct 31, 2025 · Artificial Intelligence

RIVAL: Adversarial RL Framework Elevates Conversational Subtitle Translation

RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization) introduces an adversarial game between a reward model and a translation LLM, combining qualitative preference rewards with quantitative metrics like BLEU, to overcome distribution shift in RLHF and achieve superior performance on conversational subtitle and WMT translation tasks.

BLEU · LLM · Machine Translation

0 likes · 13 min read

Baobao Algorithm Notes

Oct 31, 2025 · Artificial Intelligence

How Risk‑Sensitive Reinforcement Learning Improves LLM Pass@K Performance

This article analyzes why standard reinforcement learning can degrade Pass@K metrics after fine‑tuning large language models, introduces a risk‑sensitive RL objective that reshapes the advantage estimator, and demonstrates through bandit and mathematical‑reasoning experiments that the RS‑GRPO method consistently boosts diversity and overall Pass@K scores across multiple LLMs.

Exploration-Exploitation · LLM fine-tuning · RS-GRPO

0 likes · 12 min read

Baobao Algorithm Notes

Oct 31, 2025 · Artificial Intelligence

Unlocking LLM RL Scaling: The Best Practices from Meta’s New Study

Meta’s recent paper reveals a sigmoid‑shaped scaling law for LLM reinforcement learning, presents extensive 40‑k GPU‑hour experiments, compares various RL designs such as PPO‑off‑policy‑k and Pipeline‑RL‑k, and distills the findings into a practical “ScaleRL” recipe that improves performance and efficiency.

LLM · RL Optimization · Scaling Law

0 likes · 10 min read

DataFunTalk

Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while reducing compute cost by up to a factor of ten, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

Model Efficiency · On‑Policy Distillation · knowledge distillation

0 likes · 15 min read

Baobao Algorithm Notes

Oct 30, 2025 · Artificial Intelligence

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.

AI · LLM · Reward Modeling

0 likes · 11 min read

Instant Consumer Technology Team

Oct 28, 2025 · Artificial Intelligence

How 7B AgentFlow Beats 200B GPT-4o: Small Models, Big Wins

AgentFlow, a Stanford-led multi‑agent system built on a 7B model, outperforms massive models like GPT‑4o across ten benchmarks by leveraging modular agents, on‑policy learning, and a novel Flow‑GRPO training engine that solves sparse‑reward, long‑horizon challenges.

AgentFlow · Multi-Agent Systems · Small Model Performance

0 likes · 12 min read

Data Party THU

Oct 24, 2025 · Artificial Intelligence

BREEZE: Enhancing Zero‑Shot Reinforcement Learning with Behavioral Regularization

The paper introduces BREEZE, a behavior‑regularized zero‑shot RL framework that improves stability, policy extraction, and representation quality by combining in‑sample learning, task‑conditioned diffusion models, and expressive attention‑based architectures, achieving near‑state‑of‑the‑art performance on benchmarks like ExORL and D4RL Kitchen.

Offline RL · behavioral regularization · diffusion model

0 likes · 3 min read

Data Party THU

Oct 22, 2025 · Artificial Intelligence

Demystifying Large‑Model Reinforcement Learning: From MDP Basics to Bellman and Advantage Functions

This article provides a comprehensive introduction to reinforcement learning for large language models, covering the Markov Decision Process formulation, the four core elements of RL, state‑value and action‑value functions, Bellman equations, and the advantage function that underpins modern policy‑gradient algorithms.

AI Fundamentals · Bellman equation · Large Language Model

0 likes · 13 min read

Data Party THU

Oct 21, 2025 · Artificial Intelligence

Why DQN Overestimates Q‑Values and How Double DQN Fixes It

The article explains how DQN’s use of the max operator introduces a maximization bias that leads to overestimated Q‑values, and shows how Double DQN separates action selection from value evaluation to eliminate this bias, improving stability and performance in Atari benchmarks.

DQN · Double DQN · algorithm analysis

0 likes · 7 min read

Data Thinking Notes

Oct 19, 2025 · Artificial Intelligence

How GSPO Improves Stability in Large Language Model Training

GSPO (Group Sequence Policy Optimization) is a reinforcement‑learning algorithm for LLMs that replaces token‑level GRPO with sequence‑level optimization, addressing instability in ultra‑large model training, especially for long‑sequence and MoE architectures, by aligning reward granularity and reducing variance.

GRPO · GSPO · large language models

0 likes · 11 min read

Bilibili Tech

Oct 17, 2025 · Artificial Intelligence

How Bilibili’s Multimodal Team Won 2nd Place at ICCV MIPI with a Novel SFT+GRPO Strategy

This article details how Bilibili’s multimedia lab leveraged a multimodal training pipeline combining data‑compressed SFT and the GRPO reinforcement‑learning algorithm to achieve a 13.5% metric boost and secure second place in the ICCV MIPI Detailed Image Quality Assessment competition.

GRPO · MIPI competition · SFT

0 likes · 15 min read

Xiaohe Frontend Team

Oct 15, 2025 · Artificial Intelligence

REFRAG: Using Tiny Models to Compress RAG for Faster, Smarter AI

Meta’s new REFRAG framework lets a lightweight encoder compress retrieved text into semantic tags, enabling large language models to answer queries with far fewer tokens, lower latency, and higher throughput, while preserving core meaning and allowing flexible placement of compressed information within prompts.

LLM efficiency · RAG · model compression

0 likes · 8 min read

Meituan Technology Team

Oct 15, 2025 · Artificial Intelligence

What’s New in Large Model Research? Top Meituan AI Papers Up to Oct 2025

This curated list showcases Meituan’s latest large‑model breakthroughs and academic papers up to October 2025, spanning LLM system optimizations, multimodal generation, evaluation benchmarks, quantization techniques, and reinforcement‑learning‑driven improvements, offering researchers valuable insights and resources across the AI landscape.

AI research · Benchmarking · Multimodal AI

0 likes · 10 min read
