Tagged articles
658 articles
AI Explorer
Mar 17, 2026 · Artificial Intelligence

RISE Enables Breakthrough in Vision‑Language‑Action Learning for Embodied AI

The article examines the limitations of vision‑language‑action (VLA) models in real‑world tasks, explains how the RISE technique from Hong Kong University uses internal simulation, reflection and imagination to cut training costs by an order of magnitude, and discusses its implications for future embodied AI.

Embodied AI · RISE · Robotics
6 min read
Machine Learning Algorithms & Natural Language Processing
Mar 15, 2026 · Artificial Intelligence

Is RL Dead in LLM Post-Training? MIT’s RandOpt Challenges Traditional Methods

The MIT‑CSAIL paper introduces RandOpt, a single‑step, gradient‑free, fully parallel post‑training algorithm that adds Gaussian noise to pretrained LLM weights and ensembles the results, achieving or surpassing PPO/GRPO performance by exploiting dense "neural thickets" that emerge as model scale grows.

LLM · RandOpt · ensemble
12 min read
SuanNi
Mar 12, 2026 · Artificial Intelligence

How OpenClaw‑RL Turns Everyday Interactions into Self‑Evolving AI

OpenClaw‑RL, a new reinforcement‑learning framework from Princeton, captures hidden evaluative and instructional signals in daily user interactions, converts them into real‑time training data, and uses a decoupled asynchronous architecture with binary RL and online policy distillation to achieve superior performance in both personal‑device and cloud‑scale scenarios.

AI Feedback · Asynchronous Architecture · Online Distillation
10 min read
AIWalker
Mar 12, 2026 · Artificial Intelligence

BeautyGRPO: RL‑Driven Realistic Portrait Retouching Ends Over‑Beautification (CVPR 2026)

The paper introduces BeautyGRPO, a reinforcement‑learning framework that combines a fine‑grained preference dataset (FRPref‑10K) with Dynamic Path Guidance to balance aesthetic enhancement and high‑fidelity preservation in portrait retouching, achieving superior metrics and user preference over existing SFT and RL models.

AI aesthetics · CVPR 2026 · dynamic path guidance
11 min read
Didi Tech
Mar 12, 2026 · Artificial Intelligence

How STAPO Improves Large‑Model Fine‑Tuning by Silencing Spurious Tokens

The STAPO (Spurious‑Token‑Aware Policy Optimization) algorithm, introduced by Tsinghua University's iDLab and Didi's Deep Sea Lab, tackles policy‑entropy instability and performance oscillation in reinforcement‑learning fine‑tuning of large models by mathematically analyzing token collision probability, defining spurious tokens, and applying a Silencing Spurious Tokens mechanism that yields state‑of‑the‑art results on multiple math‑reasoning benchmarks.

AI Safety · Fine-tuning · Large Model
7 min read
DataFunTalk
Mar 11, 2026 · Artificial Intelligence

Agent Lightning: Decoupling Optimizers to Empower AI Agents via Reinforcement Learning

Agent Lightning, an open‑source system from Microsoft Research Asia, introduces a novel optimizer‑agent disaggregation architecture that enables any AI agent to benefit from reinforcement learning, offering non‑intrusive experience capture, programmable pipelines, and flexible signal passing, while addressing real‑world challenges of scalability, multi‑step tasks, and zero‑code integration.

Agent Lightning · Experience Capture · Learning Systems
21 min read
DataFunSummit
Mar 10, 2026 · Artificial Intelligence

How Agent Lightning Redefines AI Agent Learning with Optimizer‑Agent Decoupling

The article explores the paradigm shift toward AI agents in 2025, detailing the open‑source Agent Lightning project’s architecture, non‑intrusive experience capture, programmable pipelines, and experimental results that demonstrate its ability to enable reinforcement learning for any agent with minimal code changes.

Agent Lightning · Open‑source Framework · machine learning
20 min read
PaperAgent
Mar 10, 2026 · Artificial Intelligence

How MemSifter Delivers High‑Precision, Low‑Cost Long‑Term Memory for LLMs

MemSifter introduces a lightweight agent that outsources memory retrieval for large language models, using a Think‑and‑Rank pipeline and a task‑result‑oriented reinforcement‑learning training paradigm to achieve superior retrieval accuracy and efficiency across eight benchmark tasks while keeping inference overhead minimal.

Agent · Benchmark · LLM
13 min read
Tencent Cloud Developer
Mar 5, 2026 · Artificial Intelligence

20 Cutting‑Edge RAG Optimization Techniques: From Semantic Chunking to Self‑RAG

This article systematically presents twenty practical RAG (Retrieval‑Augmented Generation) optimization methods—covering semantic chunking, chunk‑size evaluation, context‑enhanced retrieval, query transformation, re‑ranking, feedback loops, multimodal and graph RAG, hierarchical retrieval, HyDE, Self‑RAG and reinforcement‑learning‑enhanced RAG—each with clear Python code examples, advantages, limitations and ideal use‑cases.

AI · LLM · RAG
57 min read
Kuaishou Tech
Mar 4, 2026 · Artificial Intelligence

How LLMs Are Revolutionizing Reinforcement Learning for Recommendation Systems

This survey examines the emerging LLM‑RL collaborative recommendation paradigm, outlining its research background, five main collaboration patterns, standardized evaluation protocols, and the key challenges and future directions for building smarter, more robust recommender systems.

LLM · Recommendation Systems · artificial intelligence
14 min read
Woodpecker Software Testing
Mar 4, 2026 · Artificial Intelligence

Deep Dive into Adversarial Testing Performance Optimization for AI Systems

The article examines Adversarial Testing Performance Optimization (ATPO) as a new industrial-quality paradigm, detailing how adversarial samples expose hidden performance bottlenecks across AI pipelines, presenting three typical adversarial loads with corresponding optimization targets, common implementation pitfalls, and emerging intelligent approaches using reinforcement learning and digital twins.

AI pipelines · Digital Twin · Performance Optimization
8 min read
PaperAgent
Mar 3, 2026 · Artificial Intelligence

How CharacterFlywheel Scales Engaging LLMs: 15 Iterations of Production Optimization

The article presents CharacterFlywheel, a 15‑generation flywheel methodology that iteratively improves social‑dialogue LLMs in production using data‑driven reward models, rejection sampling, and a mix of SFT, DPO, and RL, with detailed experiments and best‑practice insights.

AI Safety · LLM optimization · Reward Modeling
12 min read
PaperAgent
Mar 2, 2026 · Artificial Intelligence

SKILLRL: Boosting LLM Agents with Skill Distillation and Recursive Evolution

SKILLRL introduces a novel framework that transforms raw LLM agent trajectories into compact, reusable skills via experience‑driven distillation, hierarchical skill banks, and recursive skill evolution, achieving up to 90% success on ALFWorld and 73% on WebShop while reducing token usage by over 10% compared to memory‑based baselines.

LLM agents · SKILLRL · hierarchical skill bank
10 min read
AI Explorer
Mar 2, 2026 · Artificial Intelligence

OpenSandbox: Alibaba’s Open‑Source AI Sandbox for Secure, Scalable Agent Execution

OpenSandbox, an open‑source sandbox platform from Alibaba, offers a unified, secure, and extensible execution environment for AI agents, code execution, and reinforcement‑learning workloads, leveraging Docker and high‑performance Kubernetes runtimes, with multi‑language SDKs and fine‑grained network controls.

AI agents · AI sandbox · Docker
7 min read
AI Frontier Lectures
Feb 28, 2026 · Artificial Intelligence

Can Reinforcement Learning Revolutionize Text-to-3D Generation? A Deep Dive

This article presents a systematic investigation of applying reinforcement learning to text‑to‑3D generation, detailing reward design, algorithm selection, a new 3D benchmark, a hierarchical GRPO framework, extensive ablations, and the resulting performance gains and limitations.

AI research · Generative Models · reinforcement learning
13 min read
Baobao Algorithm Notes
Feb 24, 2026 · Artificial Intelligence

The Bitter Lesson of Building Agentic RL in Terminal Environments

This article recounts the challenges of moving from single‑step RL with verifiable rewards to multi‑step agentic reinforcement learning in terminal environments, detailing infrastructure design, asynchronous pipelines, data quality checks, masking strategies, curriculum training, chunk‑based optimization, and practical lessons learned from large‑scale experiments.

Credit Assignment · Environment Augmentation · RL Infrastructure
33 min read
Machine Learning Algorithms & Natural Language Processing
Feb 23, 2026 · Artificial Intelligence

System Engineering Behind Billions of Parameters: Insider Training Details from Seven Top AI Labs

This article systematically dissects the engineering decisions behind frontier large‑language‑model training—covering architecture choices, attention variants, optimizer evolution, data‑curation strategies, scaling‑law insights, and post‑training SFT/RL pipelines—based on open‑source reports from seven leading AI laboratories.

Mixture of Experts · Model Training · large language models
26 min read
HyperAI Super Neural
Feb 19, 2026 · Artificial Intelligence

World Model & VLA Breakthroughs: Top Papers from NVIDIA, ByteDance, Tsinghua and Others

This roundup highlights six recent embodied AI papers that advance world models and vision‑language‑action (VLA) techniques, covering DreamDojo's massive first‑person video model, LingBot‑World simulator, Agent World Model generator, BagelVLA, ACoT‑VLA, and the closed‑loop World‑VLA‑Loop framework.

Embodied AI · Robotics · Synthetic Environments
8 min read
Machine Learning Algorithms & Natural Language Processing
Feb 18, 2026 · Artificial Intelligence

Microsoft’s 671B LLM Unifies Offline Ad Tasks—Can It Cut Compute Costs?

Microsoft’s AdNanny replaces a forest of specialized offline models with a single 671 B LLM, using a three‑stage data factory to generate reasoning‑rich corpora, dynamic task re‑weighting, RL‑based metric alignment, and a hybrid 31‑pipeline‑parallel architecture that halves compute cost while boosting performance on core ad‑ranking tasks.

AdNanny · LLM · Large Model
9 min read
Old Zhang's AI Learning
Feb 16, 2026 · Artificial Intelligence

Qwen3.5 Deep Dive: Multimodal Architecture, Benchmarks, and Deployment Guide

This article provides a detailed analysis of Qwen3.5, covering its multimodal MoE design, massive inference speedups, extensive benchmark results against GPT‑5.2, Claude 4.5 Opus and Gemini‑3 Pro, RL scaling strategies, training infrastructure innovations, and practical usage via API and local deployment.

Benchmark · FP8 training · Multimodal AI
13 min read
Machine Learning Algorithms & Natural Language Processing
Feb 15, 2026 · Artificial Intelligence

Embedding Error Correction into the Policy Space: How Search‑R2 Redefines Search‑Enhanced Reasoning

The Search‑R2 framework integrates error detection, localization, and correction into a reinforcement‑learning loop for search‑enhanced reasoning, achieving notably larger accuracy gains on difficult multi‑hop QA tasks than baseline methods, even when those baselines receive higher sampling budgets.

Agentic AI · Error Correction · Multi-hop QA
15 min read
PaperAgent
Feb 15, 2026 · Artificial Intelligence

Why Memory Is the Next Critical Infrastructure for AI Agents

This survey reviews over 200 papers to propose a three‑dimensional classification framework for foundation‑agent memory, analyzes paradigm shifts from model‑centric to utility‑centric AI, and outlines memory substrates, cognitive mechanisms, operation strategies, learning paradigms, evaluation metrics, applications, and future research directions.

AI agents · Agent Architecture · Memory Mechanisms
10 min read
Top Architect
Feb 14, 2026 · Artificial Intelligence

Why Test‑Time Compute Is the Next Breakthrough for Large Language Models

The article explains how reasoning‑oriented large language models shift the focus from training‑time resources to test‑time computation, detailing scaling laws, verification techniques, reinforcement‑learning pipelines such as DeepSeek‑R1, and methods for distilling reasoning abilities into smaller, consumer‑grade models.

Prompt engineering · inference compute · large language models
19 min read
Baidu Intelligent Cloud Tech Hub
Feb 12, 2026 · Artificial Intelligence

Deploying GLM-5 on Baidu Kunlun P800 XPU with vLLM‑Kunlun Plugin

This article explains how the new GLM-5 large model is adapted to Baidu's Kunlun P800 XPU, detailing the asynchronous reinforcement‑learning framework Slime and optimization techniques such as INT8 quantization and tensor parallelism, and provides step‑by‑step deployment commands using the open‑source vLLM‑Kunlun plugin.

AI acceleration · GLM-5 · INT8 Quantization
6 min read
DeWu Technology
Feb 11, 2026 · Artificial Intelligence

How Generative Models Transform Re‑ranking Architecture for Faster, More Diverse Recommendations

This article examines the evolution of re‑ranking systems from traditional pointwise models to a two‑stage generation‑evaluation framework, compares autoregressive and non‑autoregressive generative approaches, details inference speed optimizations with GPU and model‑server upgrades, and outlines a future end‑to‑end sequence generation architecture enhanced by reinforcement learning and contrastive learning.

AI · Generative Models · Inference Optimization
14 min read
Ximalaya Technology Team
Feb 11, 2026 · Artificial Intelligence

How Ximalaya Used Generative AI to Revolutionize Audio Recommendations

This article details Ximalaya's journey from traditional multi‑stage recommendation pipelines to generative AI‑driven models, covering business challenges, architectural and model differences, phased deployments, knowledge distillation, semantic ID encoding, decoder‑only strategies, extensive offline and online evaluations, and future research directions.

Encoder-Decoder · Recommendation Systems · audio recommendation
24 min read
AI Frontier Lectures
Feb 10, 2026 · Artificial Intelligence

Can an 8B Model Outperform GPT‑4 in Faithfulness Detection? Inside FaithLens

FaithLens is an 8‑billion‑parameter model that surpasses GPT‑4.1 and other large models on twelve hallucination‑detection benchmarks while providing high‑quality natural‑language explanations, thanks to a novel data‑synthesis pipeline, three‑dimensional filtering, and rule‑based reinforcement learning.

LLM hallucination · efficient inference · explainable AI
12 min read
AI Frontier Lectures
Feb 6, 2026 · Artificial Intelligence

Can Merging Text‑Only and Grounded Visual Reasoning Unlock Better Vision‑Language Models?

The paper introduces Mixture‑of‑Visual‑Thoughts (MoVT), a context‑adaptive reasoning paradigm that integrates pure‑text and visually‑grounded inference modes within a single model, and presents the two‑stage AdaVaR training framework with a novel AdaGRPO reinforcement‑learning algorithm to automatically select the optimal mode for each visual‑language task, achieving consistent gains across eight benchmarks and surpassing strong baselines including GPT‑4o.

AdaVaR · Mixture-of-Visual-Thoughts · Visual Reasoning
16 min read
HyperAI Super Neural
Feb 6, 2026 · Artificial Intelligence

Latest Advances in AI Agents: PaperBanana, SDPO, Lumine, Idea2Story, and Insight Agents

This weekly roundup highlights five recent AI agent papers—PaperBanana for automated academic illustration, SDPO's self‑distillation reinforcement learning, Lumine's open‑world generalist agent, Idea2Story's pipeline for turning research ideas into narratives, and Insight Agents' fast e‑commerce insights—showcasing diverse breakthroughs in multi‑agent frameworks, self‑feedback learning, and real‑world deployment.

AI agents · automated scientific narrative · multi-agent systems
8 min read
Alimama Tech
Feb 5, 2026 · Artificial Intelligence

Can Few-Shot Reinforcement Learning Supercharge Budget-Constrained Auto-Bidding?

This paper introduces ABPlanner, a few‑shot, context‑aware budget planner that enhances budget‑constrained auto‑bidding in online advertising by hierarchically allocating budgets across short‑term stages and training a sequential decision‑maker with deep reinforcement learning, achieving significant gains in simulated and real‑world A/B tests.

Few‑Shot Learning · auto-bidding · budget allocation
13 min read
Baobao Algorithm Notes
Feb 4, 2026 · Artificial Intelligence

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning Compiler · LLM optimization · Linear Attention
22 min read
Baidu Geek Talk
Feb 2, 2026 · Artificial Intelligence

How Cloud AI Infra Powers the Next Wave of Embodied Intelligence

This article outlines the rapid rise of embodied intelligence, the explosion of Vision‑Language‑Action (VLA) research, and how cloud‑based AI infrastructure—including multi‑level IaaS, data pipelines, dual‑system model designs, and reinforcement‑learning workflows—addresses emerging scaling and deployment challenges.

VLA · multimodal models · reinforcement learning
13 min read
Data Party THU
Jan 31, 2026 · Artificial Intelligence

Can LLMs Learn While Being Tested? Inside the TTT-Discover Breakthrough

The article examines the Test‑Time Training to Discover (TTT‑Discover) approach, which applies reinforcement learning during inference to let large language models continuously improve on single test problems, and reports strong results across mathematics, GPU kernel optimization, algorithm design, and biology.

AI research · LLM · Scientific Discovery
9 min read
JD Tech
Jan 31, 2026 · Artificial Intelligence

How JD's 9N‑LLM Engine Powers Scalable Generative Recommendation at Massive Scale

This article details JD Retail's 9N‑LLM unified training framework that tackles the massive data, hardware heterogeneity, and algorithmic challenges of generative recommendation by integrating TensorFlow and PyTorch, supporting GPU/NPU, and delivering high‑throughput sample processing, sparse/dense optimization, and flexible reinforcement‑learning capabilities.

GPU/NPU · Ray · large-scale AI
26 min read
JD Retail Technology
Jan 30, 2026 · Artificial Intelligence

How JD’s 9N‑LLM Engine Powers Scalable Generative Recommendation at Industrial Scale

The article details JD Retail’s 9N‑LLM unified training engine, which supports TensorFlow and PyTorch, GPU and NPU, and both traditional and generative recommendation scenarios. It explains the engine's architecture, high‑throughput sample engine, distributed sparse embedding system, five‑stage pipeline, UniAttention accelerator, and reinforcement‑learning capabilities, which together enable terabyte‑scale data, billion‑scale dense parameters, and efficient RL training for real‑world recommendation services.

Distributed Training · GPU/NPU · UniAttention
26 min read
DaTaobao Tech
Jan 30, 2026 · Artificial Intelligence

Human‑like LLM Replies for Live Digital Hosts: ASR‑Based Style Transfer and Reward Modeling

This article proposes an ASR‑driven pipeline that creates high‑quality AI‑reply vs. human‑like reply pairs, trains a rewrite model and a reward model, and uses GRPO reinforcement learning to generate natural, helpful, and less AI‑sounding responses in digital‑human live streaming, achieving 92% accuracy and 97% helpfulness while improving user experience.

ASR data · LLM · Qwen
20 min read
AI Engineering
Jan 30, 2026 · Artificial Intelligence

Why Letting LLMs Argue Improves Their Reasoning Quality

Google’s recent study of over 8,000 reasoning tasks shows that advanced LLMs like DeepSeek‑R1 spontaneously develop multiple internal “expert” personas that debate, and that activating a discovered “social switch” dramatically raises accuracy, revealing that engineered conflict can enhance AI reasoning.

AI debate · Feature Control · LLM
8 min read
PaperAgent
Jan 30, 2026 · Artificial Intelligence

How LLM‑in‑Sandbox Turns Large Models into General‑Purpose Agents Without Extra Training

The LLM‑in‑Sandbox framework places large language models inside a virtual machine that provides external tool access, persistent storage, and code execution, yielding up to a 24.2% performance boost across six benchmark tasks without additional training, and it scales from zero‑shot to reinforcement‑learning‑enhanced agents while remaining cost‑effective.

Agentic AI · LLM · efficiency
6 min read
Meituan Technology Team
Jan 29, 2026 · Artificial Intelligence

How LongCat‑Flash‑Thinking‑2601 Achieves Real‑World Generalization for Agents

LongCat‑Flash‑Thinking‑2601, a 560‑billion‑parameter MoE model, combines environment expansion, multi‑environment RL, systematic noise training, a heavy‑thinking reasoning mode, and Zigzag sparse attention to deliver strong benchmark performance and robust real‑world agent capabilities.

Environment Expansion · Zigzag Attention · agent training
14 min read
Alibaba Cloud Developer
Jan 28, 2026 · Artificial Intelligence

How We Built a High‑Performance AI Rental Advisor with One‑Model Tool‑Use and Reinforcement Learning

This article details the design, challenges, and performance gains of an AI‑driven rental recommendation system that replaces a multi‑agent architecture with a single LLM using dynamic tool‑use, introduces a two‑stage reinforcement‑learning pipeline, and achieves sub‑second latency and higher accuracy for complex rental scenarios.

AI recommendation · System Architecture · Tool Use
19 min read
PaperAgent
Jan 25, 2026 · Artificial Intelligence

How Deep GraphRAG Solves Retrieval’s Three‑Way Dilemma with Hierarchical Search

Deep GraphRAG tackles the three‑fold dilemma of traditional Retrieval‑Augmented Generation by introducing hierarchical global‑to‑local retrieval, a beam‑search dynamic reordering that cuts latency, and a DW‑GRPO reinforcement‑learning module that adaptively weights rewards, achieving near‑state‑of‑the‑art performance with up to 86% faster inference.

AI research · GraphRAG · Hierarchical Retrieval
5 min read
Meituan Technology Team
Jan 23, 2026 · Artificial Intelligence

How EvoCUA Set a New Open‑Source SOTA for Computer‑Use Agents with Evolutionary Learning

EvoCUA, a native computer‑use agent from Meituan, combines a verifiable data‑synthesis engine, a sandbox infrastructure that scales to tens of thousands of instances, and an experience‑driven learning paradigm to overcome data‑scaling and feedback challenges, achieving a 56.7% success rate on the OSWorld benchmark and surpassing previous open‑source models.

AI Agent · Computer Use · OSWorld
27 min read
Tencent Advertising Technology
Jan 22, 2026 · Artificial Intelligence

How Tencent’s Bidding Algorithms Evolved from GMPC to GRB: A Deep Dive into Generative RL for Ads

The article reviews the 2025 evolution of Tencent advertising’s bidding system—from the second‑generation GMPC control algorithm through the third‑generation MRB reinforcement‑learning model to the fourth‑generation generative RL GRB—detailing architectural upgrades, multi‑channel modeling, training pipelines, and experimental gains, and outlines the 2026 AI‑agent roadmap.

Advertising · Generative Models · Online Learning
15 min read
Tencent Cloud Developer
Jan 20, 2026 · Artificial Intelligence

From Transformers to Agents: A Complete Timeline of Large Language Model Evolution

This article traces the evolution of large language models from the 2017 Transformer breakthrough through successive milestones such as BERT, GPT‑3, RLHF alignment, multimodal extensions, open‑source alternatives, and the rise of retrieval‑augmented generation, AI agents, and emerging protocols that shape modern AI applications.

Open-source models · Prompt engineering · RAG
44 min read
PaperAgent
Jan 19, 2026 · Artificial Intelligence

How Reinforcement Learning Can Boost LLM Reasoning by Shaping Token Distributions

Recent research shows that applying reinforcement learning to large language models can dramatically improve reasoning performance, but its effectiveness depends on the token distribution produced during pre‑training. This motivates a novel rewrite of cross‑entropy as a single‑step policy gradient with controllable entropy parameters.

LLM · Model Optimization · RL
6 min read
PaperAgent
Jan 16, 2026 · Artificial Intelligence

How a 4B Model Beats 30B Giants: Inside AgentCPM-Explore’s SOTA Performance

AgentCPM-Explore, a 4‑billion‑parameter open‑source model, achieves state‑of‑the‑art results on long‑range exploration tasks, matching or surpassing larger 8B and even 30B models, thanks to a full‑stack infrastructure, novel training tricks, and extensive benchmark evaluations across eight agent‑centric datasets.

Agent · AgentCPM-Explore · Benchmark
10 min read
Xiaohongshu Tech REDtech
Jan 15, 2026 · Information Security

How Hi-Guard Improves Trustworthy Multimodal Content Moderation with Policy‑Aligned Reasoning

The Hi-Guard framework transforms content moderation by aligning multimodal models with policy rules through hierarchical prompting, a structured taxonomy, and soft‑margin reinforcement learning, achieving significant gains in accuracy, precision, recall, and explainability for large‑scale user‑generated content platforms.

Multimodal AI · content moderation · explainability
9 min read
Bighead's Algorithm Notes
Jan 11, 2026 · Artificial Intelligence

FinRpt: A Multi‑Agent Framework for Automatic Generation and Evaluation of Stock Research Reports

FinRpt introduces a novel multi‑agent pipeline that builds a high‑quality equity research report (ERR) dataset from six financial data sources, defines a comprehensive 11‑metric evaluation suite, and demonstrates that supervised‑fine‑tuned and reinforcement‑learned LLM agents significantly outperform single LLM baselines in both accuracy and efficiency.

Dataset · FinRpt · LLM
14 min read
AI Engineering
Jan 10, 2026 · Artificial Intelligence

Teaching LLMs to Manage Memory Autonomously, Dropping Manual Rules

Alibaba's new AgeMem framework turns long‑term and short‑term memory management for large language model agents into a learnable reinforcement‑learning task, replacing handcrafted rules with a three‑stage training process and achieving significant benchmark gains.

AgeMem · Benchmark · GRPO
9 min read
Bighead's Algorithm Notes
Jan 8, 2026 · Artificial Intelligence

Alpha‑R1: Reinforcement‑Learning‑Driven Large‑Model Alpha Factor Selection

Alpha‑R1 integrates reinforcement learning with an 8‑billion‑parameter LLM to jointly process price and news data, creating context‑aware factor embeddings that outperform traditional quantitative and generic LLM baselines on CSI 300 and CSI 1000 portfolios, demonstrating robust alpha‑decay resistance and zero‑shot generalization.

Financial AI · alpha factor selection · large language model
0 likes · 16 min read
Amap Tech
Jan 8, 2026 · Artificial Intelligence

How AI Powers Fancy Video Generation for Real‑World POI Scenes

This article details the AI techniques behind Gaode's "Street Ranking" project, explaining the Fancy video concept, the dual training and production pipelines, and the use of SFT, reinforcement learning, MoE‑LoRA, distribution‑matching distillation, and quality‑filtering to achieve 25× faster generation with high aesthetic fidelity.

AI video generation · Distillation · model fine-tuning
0 likes · 24 min read
Tencent Advertising Technology
Jan 8, 2026 · Artificial Intelligence

How Tencent Boosted Ad Experience by Up to 20% Using Reinforcement‑Learning‑Based Ranking

Tencent's ad tech team redesigned its ad ranking system by adding a parallel user‑experience‑optimized pipeline and evolving from manual CEM tuning to DDPG‑based reinforcement learning, achieving 10‑20% improvements in CTR, repeat‑view rates, and other experience metrics while maintaining overall spend.

Advertising · User experience · multi-objective optimization
0 likes · 17 min read
Data Party THU
Jan 7, 2026 · Artificial Intelligence

Why the Common KL Penalty in LLM RL Training Is Biased—and How to Fix It

A recent study reveals that the KL regularization widely used in LLM reinforcement learning with verifiable rewards (RLVR) is mathematically biased, leading to unstable training and poorer generalization, and shows that moving the KL term back into the reward with a simple K1 estimator can boost out‑of‑domain performance by up to 20%.

AI research · KL regularization · LLM training
0 likes · 10 min read
Bighead's Algorithm Notes
Jan 6, 2026 · Artificial Intelligence

FinRS: A Risk‑Sensitive Trading Framework for Real‑World Financial Markets

FinRS integrates hierarchical market analysis, dual decision agents, and multi‑time‑scale reward feedback to enable risk‑aware multi‑stage trading, achieving higher cumulative returns, better Sharpe ratios, and lower maximum drawdowns than existing LLM‑based and reinforcement‑learning baselines across diverse stocks.

FinRS · LLM · financial markets
0 likes · 14 min read
Bighead's Algorithm Notes
Jan 4, 2026 · Artificial Intelligence

How VTA Combines Large‑Model Reasoning for Precise and Explainable Stock Time‑Series Forecasting

The VTA framework integrates large language model reasoning with textual annotation of technical indicators, employs a Time‑GRPO reinforcement‑learning objective and multi‑stage joint conditional training, and achieves state‑of‑the‑art accuracy and expert‑rated interpretability on US, Chinese and European stock datasets.

LLM · Stock Prediction · Time Series
0 likes · 19 min read
PaperAgent
Dec 29, 2025 · Artificial Intelligence

Unveiling Bottom‑up Policy Optimization: Boosting LLM Reasoning with Internal Strategies

This article introduces Bottom‑up Policy Optimization (BuPO), a novel reinforcement‑learning framework that treats large language models as collections of internal layer and modular policies, revealing distinct inference entropy patterns in Llama and Qwen‑3 and demonstrating superior performance on complex mathematical reasoning benchmarks.

AI research · Bottom-up Optimization · Internal Policy
0 likes · 10 min read
Data Party THU
Dec 28, 2025 · Artificial Intelligence

How Causal Reinforcement Learning Is Shaping Robust, Explainable AI

This comprehensive survey examines the emerging field of Causal Reinforcement Learning, classifies its core techniques, introduces eleven benchmark environments, evaluates four novel algorithms, and outlines challenges and future research directions for building robust, generalizable, and interpretable AI systems.

AI Robustness · algorithm evaluation · benchmark environments
0 likes · 12 min read
DataFunTalk
Dec 25, 2025 · Artificial Intelligence

How DeepAgent Redefines General AI Reasoning with Scalable Toolsets

DeepAgent, a new end‑to‑end reasoning agent, integrates autonomous thinking, dynamic tool search, and execution to handle over 16,000 APIs, embodied tasks, and research assistance, achieving state‑of‑the‑art performance on benchmarks like TMDB, ToolBench, ALFWorld, WebShop, and GAIA.

Memory Management · large language model · reasoning
0 likes · 15 min read
PaperAgent
Dec 23, 2025 · Artificial Intelligence

CATArena: A Competitive Benchmark That Turns Agent Scoring into Evolutionary Learning

CATArena introduces a tournament‑style evaluation framework where AI agents iteratively code, compete, and improve across classic board games, using three‑dimensional quantitative scores to measure strategy programming, global learning, and generalization, and reveals how different LLM‑based agents learn and adapt over multiple rounds.

AI Benchmark · Agent Evaluation · CATArena
0 likes · 8 min read
AI Info Trend
Dec 23, 2025 · Industry Insights

How AI Will Boost Collective Productivity: Key Takeaways from Microsoft’s 2025 Future of Work Report

Microsoft’s 2025 New Future of Work report reveals that AI, driven by breakthroughs in reinforcement learning, is shifting from answering questions to executing complex tasks, while investment and corporate adoption surge unevenly and employee involvement emerges as a critical factor for sustainable productivity gains.

AI · Microsoft Report · future of work
0 likes · 8 min read
Bilibili Tech
Dec 19, 2025 · Artificial Intelligence

SABER: Switchable and Balanced Training for Efficient LLM Reasoning

SABER introduces a reinforcement‑learning framework that lets large language models dynamically switch among four token‑budgeted reasoning modes, dramatically cutting inference length while preserving or improving accuracy across math, code, and logic tasks.

Budgeted Computation · Efficient Reasoning · LLM
0 likes · 13 min read
Instant Consumer Technology Team
Dec 16, 2025 · Artificial Intelligence

How Mind Lab Trained a Trillion‑Parameter Agentic Memory with Only 10% GPU Power

This article explains how the Mind Lab team tackled the challenges of training a 1‑trillion‑parameter mixture‑of‑experts model for agentic memory using reinforcement learning, LoRA, and a custom Megatron‑Bridge architecture, achieving a ten‑fold speedup while consuming just a fraction of the usual GPU resources.

AI · Agentic Apps · LoRA
0 likes · 9 min read
Network Intelligence Research Center (NIRC)
Dec 15, 2025 · Artificial Intelligence

Turning LLM-Generated Network Configurations into Verified, Safe Updates with Artanis

The paper introduces Artanis, an intent‑based network configuration update framework that combines large‑language‑model generation with a verification‑feedback loop and reinforcement‑learning optimization, addressing hallucination‑induced errors and ensuring safe, policy‑compliant deployments across diverse network scales.

Configuration Management · Intent-based Networking · LLM
0 likes · 9 min read
AntTech
Dec 11, 2025 · Artificial Intelligence

Unlock Scalable RL: AReaL’s Decoupled Agentic Framework & Single‑Controller Design

This article explains how the open‑source AReaL framework boosts large‑scale reinforcement learning by separating agent execution from training logic, introducing a decoupled Agentic RL service and a Single‑Controller architecture that improves data flow, fault tolerance, and GPU utilization.

Agentic AI · Distributed Training · Open-source
0 likes · 14 min read
AI Frontier Lectures
Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance Sampling · Mixture of Experts · large language models
0 likes · 12 min read
Data Party THU
Dec 9, 2025 · Artificial Intelligence

Can Robots Learn Human Moves Directly from AI‑Generated Videos? The GenMimic Breakthrough

The GenMimic paper introduces a novel framework that enables humanoid robots to zero‑shot imitate human actions generated by AI video models, presenting a new dataset, a two‑stage 4D reconstruction pipeline, and a reinforcement‑learning strategy with weighted‑tracking and symmetry losses, validated in simulation and on a real 23‑DoF robot.

Humanoid Robots · Robotics · Video Generation
0 likes · 11 min read
Baidu Tech Salon
Dec 8, 2025 · Artificial Intelligence

How Baidu’s HuiBosheng AI Live Platform Generates Super‑Human Scripts and Real‑Time Interaction

The article details Baidu HuiBosheng's end‑to‑end AI live‑streaming platform, covering merchant workflow, multimodal product understanding, style‑aware script generation, reinforcement‑learning‑driven smart control, voice and avatar cloning, and a data‑flywheel that continuously improves model performance, illustrated with real‑world GMV results.

AI · Data Flywheel · Script Generation
0 likes · 20 min read
Bighead's Algorithm Notes
Dec 7, 2025 · Artificial Intelligence

AlphaQuanter: An End‑to‑End Tool‑Orchestrating Agent Using Reinforcement Learning for Stock Trading

AlphaQuanter tackles the three major limitations of existing LLM trading agents by introducing a single‑agent framework that dynamically orchestrates market tools, learns transparent decision policies via reinforcement learning, and achieves state‑of‑the‑art performance on key financial metrics across extensive stock‑level experiments.

AlphaQuanter · Financial AI · LLM agent
0 likes · 13 min read
Baobao Algorithm Notes
Dec 7, 2025 · Artificial Intelligence

Key Lessons from Scaling Agent RL Training: Stability, Tooling, and Reward Design

Over recent months of extensive agent reinforcement‑learning experiments across search, data‑analysis, and multi‑source scenarios, the author shares twelve practical insights covering stability, environment‑reward‑algorithm priorities, tool‑call reliability, reward hacking pitfalls, evaluation alignment, and scaling tricks for larger models.

PPO EWMA · RL scaling · Tool integration
0 likes · 7 min read
Baobao Algorithm Notes
Dec 7, 2025 · Artificial Intelligence

Can RL Really Boost LLM Reasoning? A Critical Review of Recent Findings

This article critically examines recent RL‑for‑LLM studies, revealing that reinforcement learning improves search efficiency but does not extend the intrinsic reasoning capabilities of base models, and explores the underlying model‑conditioned optimization bias, comparisons with SFT distillation, and the trade‑off with catastrophic forgetting.

Catastrophic Forgetting · LLM · Model Optimization
0 likes · 11 min read
AntTech
Dec 4, 2025 · Artificial Intelligence

How AState Reduces Trillion‑Parameter RL Weight Sync to 6 Seconds

AState is a general‑purpose state data management system for reinforcement‑learning tasks that tackles low IO efficiency, slow weight synchronization, and state‑recovery challenges, achieving sub‑10‑second weight sync for trillion‑parameter models through a three‑layer architecture, zero‑redundancy transfers, and hardware‑aware co‑design, with the code openly available on GitHub.

AState · High‑performance computing · Weight Synchronization
0 likes · 23 min read
Model Perspective
Dec 1, 2025 · Artificial Intelligence

From AI to Everyday Life: How Reinforcement Learning Shapes Our Choices

This article explains the core concepts of reinforcement learning, illustrates how its reward‑based mechanism appears in media creation, career advancement, education and social media, and warns of the pitfalls of over‑optimizing external rewards while offering practical ways to balance intrinsic motivation and reflective thinking.

Career Development · Motivation · artificial intelligence
0 likes · 12 min read
PaperAgent
Dec 1, 2025 · Artificial Intelligence

How Deep Research Turns LLMs into Autonomous AI Scientists

This article surveys the emerging Deep Research (DR) paradigm that upgrades large language models into research agents capable of autonomous planning, multi‑source evidence gathering, memory management, and verifiable long‑form report generation, outlining its stages, core components, training pipeline, and evaluation benchmarks.

AI agents · AI research automation · Deep Research
0 likes · 6 min read
Data Party THU
Nov 29, 2025 · Artificial Intelligence

Unlocking AI Agents: From Fundamentals to Building Your First LLM‑Powered Agent

This comprehensive guide explores the concept of AI agents, detailing their definitions, classifications, and core interaction loops, then walks you through building a functional LLM‑driven travel assistant with step‑by‑step code, tool integration, and practical insights on agent versus workflow paradigms.

AI agents · Agent Architecture · LLM
0 likes · 39 min read
Bighead's Algorithm Notes
Nov 28, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Nov 22‑28, 2025)

This digest summarizes five recent arXiv papers on AI-driven portfolio optimization and financial time‑series forecasting, covering G‑Learning with GIRL, transfer‑learning strategies, hybrid LSTM‑PPO frameworks, time‑series foundation models, and a KAN versus LSTM performance comparison, highlighting their methods, datasets, and reported Sharpe improvements.

Financial AI · portfolio optimization · reinforcement learning
0 likes · 9 min read
Tencent Advertising Technology
Nov 28, 2025 · Artificial Intelligence

How Retrv-R1 Redefines Universal Multimodal Retrieval with Reasoning‑Driven MLLM

Retrv‑R1, a reasoning‑driven multimodal large language model framework, tackles the precision‑efficiency dilemma of universal multimodal retrieval by introducing a two‑stage coarse‑to‑fine pipeline, an information‑compression module, a detail‑inspection mechanism, and a three‑stage training strategy, achieving SOTA performance across accuracy, efficiency, and generalization benchmarks.

Generalization · MLLM · Multimodal Retrieval
0 likes · 21 min read
Alimama Tech
Nov 26, 2025 · Artificial Intelligence

How Alibaba’s ROCK & ROLL Enable Scalable Agentic AI Training

Alibaba’s open‑source ROCK environment sandbox and the ROLL reinforcement‑learning engine together provide a standardized, high‑throughput training loop that lets developers scale Agentic AI from a single machine to thousands of parallel instances while simplifying debugging and resource management.

Agentic AI · Infrastructure · Scalable Training
0 likes · 12 min read
ITPUB
Nov 24, 2025 · Artificial Intelligence

Why Memory, Not Size, Is the Next Bottleneck for Large Language Models

In a detailed interview, the CTO of Memory Tensor (Shanghai) explains how limited memory capacity hampers large models, outlines the MemOS memory operating system, discusses information‑theoretic metrics, multimodal extensions, and reinforcement‑learning strategies for scalable, secure, and explainable AI memory management.

AI Architecture · Multimodal AI · information theory
0 likes · 23 min read
Data Party THU
Nov 23, 2025 · Artificial Intelligence

Can a Drone Learn to Land Itself? A Deep Reinforcement Learning Walkthrough

This article walks through the fundamentals of reinforcement learning, builds a custom drone‑landing simulation, defines state and action spaces, designs reward functions, implements a neural‑network policy with Bernoulli sampling, and trains it using REINFORCE with a baseline, while exposing common pitfalls such as reward hacking.

OpenAI Gym · Python · Reward Shaping
0 likes · 22 min read
AntTech
Nov 21, 2025 · Artificial Intelligence

How Awex Enables Sub‑Second TB‑Scale Weight Sync for Trillion‑Parameter RL Models

Awex is a high‑performance Python framework that synchronizes training and inference weights for trillion‑parameter reinforcement‑learning models in seconds, using unified conversion, metadata management, and NCCL/RDMA transfer plans, dramatically reducing RL training latency and supporting diverse parallel strategies.

Distributed Training · High‑performance computing · Python
0 likes · 17 min read
Xiaohongshu Tech REDtech
Nov 20, 2025 · Artificial Intelligence

How DeepAgent Achieves End‑to‑End Reasoning with 16,000+ Scalable Tools

DeepAgent is a new end‑to‑end reasoning agent that unifies autonomous thinking, dynamic tool search, and execution, handling over 16,000 real APIs, supporting embodied environments and research assistance, and achieving state‑of‑the‑art results across multiple benchmarks through its unified reasoning core, memory‑folding mechanisms, structured memory, and the ToolPO training framework.

AI agents · General AI · Tool integration
0 likes · 14 min read
360 Zhihui Cloud Developer
Nov 20, 2025 · Artificial Intelligence

How DeepAgent Redefines AI Agents with Memory Folding and ToolPO

This article breaks down the DeepAgent paper, explaining its novel "main model + auxiliary model" architecture, the memory‑folding mechanism that compresses long‑context reasoning, and the ToolPO reinforcement strategy that enables efficient tool discovery and usage.

AI agents · ToolPO · large language models
0 likes · 8 min read
Baobao Algorithm Notes
Nov 20, 2025 · Artificial Intelligence

Why Reinforcement Learning Preserves LLM Generality Better Than Supervised Fine‑Tuning

The article analyzes why reinforcement learning (RL) fine‑tuning retains a large language model's general abilities better than supervised fine‑tuning (SFT), explaining the off‑policy distribution shift of SFT and the on‑policy data consistency, KL penalty, and trust‑region mechanisms that give RL its anti‑forgetting properties.

Catastrophic Forgetting · LLM · On-Policy Data
0 likes · 8 min read
Instant Consumer Technology Team
Nov 19, 2025 · Artificial Intelligence

How We Built an AI‑Powered Automated Video Editing Pipeline for Short‑Form Marketing

This article details the end‑to‑end AIGC video automation system we created—from raw material ingestion and multimodal content understanding to script generation, AI‑driven editing, rendering, and multi‑channel distribution—highlighting architecture, key modules, technical choices, performance results, and lessons learned.

AIGC · Multimodal AI · Script Generation
0 likes · 16 min read
AI Tech Publishing
Nov 17, 2025 · Artificial Intelligence

Frontier AI Models in RL Environments Reveal an Agent Capability Hierarchy

The article evaluates nine cutting‑edge AI models on 150 simulated workplace tasks, showing that even the strongest models complete fewer than 40% of tasks, and uses these results to propose a hierarchical framework of agentic capabilities ranging from tool use to common‑sense reasoning.

AI model evaluation · Tool Use · agentic capabilities
0 likes · 19 min read
Data Party THU
Nov 15, 2025 · Artificial Intelligence

How Reinforcement Learning Powers Intelligent AI Agents and LangGraph Workflows

This article explains how reinforcement learning (RL) underpins intelligent AI agents, covering the Markov Decision Process fundamentals, key RL components, multi‑hop reasoning on knowledge graphs, and a step‑by‑step LangGraph example that integrates an RL‑driven tutoring policy with Python code.

AI agents · Knowledge Graph · LangGraph
0 likes · 17 min read
Kuaishou Tech
Nov 14, 2025 · Artificial Intelligence

How GRPO‑Guard Stops Over‑Optimization in Flow‑Based Visual Generators

This article explains the over‑optimization problem in GRPO‑based flow models, analyzes why importance‑ratio clipping fails, and introduces GRPO‑Guard with RatioNorm and cross‑step gradient balancing, showing through extensive experiments that it stabilizes training and improves image quality across multiple diffusion backbones and tasks.

GRPO-Guard · flow matching · generative AI
0 likes · 9 min read
Bighead's Algorithm Notes
Nov 13, 2025 · Artificial Intelligence

Paper Review: AlphaGAT’s Two‑Stage Learning for Adaptive Portfolio Selection

AlphaGAT introduces a two‑stage learning framework that first extracts robust alpha factors with a CATimeMixer model and a novel loss, then dynamically weights these factors via reinforcement learning (PPO) and a graph attention network, achieving superior portfolio performance across DJIA, HSI, CSI‑100 and crypto markets despite noisy data and distribution shifts.

AlphaGAT · Financial AI · Time Series
0 likes · 14 min read
Alimama Tech
Nov 11, 2025 · Artificial Intelligence

Accelerating LLM RL with Async Training, Mini‑Critics, and Attention Rewards

This article introduces the 3A collaborative framework—Async architecture, Asymmetric PPO mini‑critics, and an attention‑based reasoning rhythm—demonstrating how decoupled, fine‑grained parallel training and structure‑aware reward allocation dramatically improve efficiency, scalability, and interpretability of reinforcement learning for large language models.

asynchronous training · attention mechanisms · large language models
0 likes · 23 min read
DataFunTalk
Nov 7, 2025 · Artificial Intelligence

Training-Free GRPO: Low‑Cost Reinforcement Learning for Large Language Models

Training-Free GRPO, proposed by Tencent Youtu Lab, eliminates parameter updates by iteratively building an experience knowledge base, enabling cost‑effective reinforcement learning for large language models. It cuts training expenses from thousands of dollars to under $20 while maintaining strong performance on math reasoning and web search tasks.

AI · Cost reduction · reinforcement learning
0 likes · 6 min read