Tagged articles
658 articles
Architect's Guide
Nov 7, 2025 · Artificial Intelligence

Why Multi-Agent Communication Protocols Are Crucial for Next-Gen AI

The article examines the need for Multi‑Agent Communication Protocols (MCP), outlines the limitations of single‑agent and centralized systems, compares MCP with other interaction methods, reviews current research and industrial applications, and highlights future directions such as hardware integration, bio‑inspired mechanisms, and blockchain convergence.

Blockchain · communication protocols · decentralized AI
0 likes · 9 min read
Kuaishou Tech
Nov 5, 2025 · Artificial Intelligence

How HiPO Gives LLMs a Smart Thinking Switch to Cut Costs and Boost Accuracy

This article explains the overthinking problem of large language models, introduces the HiPO framework with hybrid data cold‑start and reinforcement‑learning reward mechanisms that let models decide when to think deeply or answer directly, and shows experimental results demonstrating significant efficiency gains and accuracy improvements across multiple benchmarks.

Hybrid Policy Optimization · LLM · adaptive inference
0 likes · 13 min read
Network Intelligence Research Center (NIRC)
Nov 4, 2025 · Artificial Intelligence

SEAgent: A Self‑Evolving Computer Agent that Learns Software Use Autonomously

SEAgent introduces a self‑evolving framework that enables a GUI agent to master unfamiliar software through autonomous exploration and experience learning, leveraging a curriculum generator, a world‑state model, and GRPO‑based reinforcement with adversarial imitation, achieving state‑of‑the‑art performance on OSWorld.

GUI automation · SEAgent · autonomous learning
0 likes · 6 min read
Bilibili Tech
Oct 31, 2025 · Artificial Intelligence

RIVAL: Adversarial RL Framework Elevates Conversational Subtitle Translation

RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization) introduces an adversarial game between a reward model and a translation LLM, combining qualitative preference rewards with quantitative metrics like BLEU, to overcome distribution shift in RLHF and achieve superior performance on conversational subtitle and WMT translation tasks.

BLEU · LLM · Reward Modeling
0 likes · 13 min read
Baobao Algorithm Notes
Oct 31, 2025 · Artificial Intelligence

How Risk‑Sensitive Reinforcement Learning Improves LLM Pass@K Performance

This article analyzes why standard reinforcement learning can degrade Pass@K metrics after fine‑tuning large language models, introduces a risk‑sensitive RL objective that reshapes the advantage estimator, and demonstrates through bandit and mathematical‑reasoning experiments that the RS‑GRPO method consistently boosts diversity and overall Pass@K scores across multiple LLMs.

Exploration-Exploitation · LLM fine-tuning · RS-GRPO
0 likes · 12 min read
Baobao Algorithm Notes
Oct 31, 2025 · Artificial Intelligence

Unlocking LLM RL Scaling: The Best Practices from Meta’s New Study

Meta’s recent paper reports a sigmoid‑shaped scaling law for LLM reinforcement learning, backed by more than 400,000 GPU‑hours of experiments, compares RL designs such as PPO‑off‑policy‑k and Pipeline‑RL‑k, and distills the findings into a practical “ScaleRL” recipe that improves performance and efficiency.

LLM · RL Optimization · reinforcement learning
0 likes · 10 min read
DataFunTalk
Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while cutting compute cost by up to 10×, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

On-Policy Distillation · knowledge distillation · model efficiency
0 likes · 15 min read
Baobao Algorithm Notes
Oct 30, 2025 · Artificial Intelligence

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.

AI · LLM · Reward Modeling
0 likes · 11 min read
Instant Consumer Technology Team
Oct 28, 2025 · Artificial Intelligence

How 7B AgentFlow Beats 200B GPT-4o: Small Models, Big Wins

AgentFlow, a Stanford-led multi‑agent system built on a 7B model, outperforms massive models like GPT‑4o across ten benchmarks by leveraging modular agents, on‑policy learning, and a novel Flow‑GRPO training engine that solves sparse‑reward, long‑horizon challenges.

AgentFlow · Small Model Performance · Tool Use
0 likes · 12 min read
Data Party THU
Oct 24, 2025 · Artificial Intelligence

BREEZE: Enhancing Zero‑Shot Reinforcement Learning with Behavioral Regularization

The paper introduces BREEZE, a behavior‑regularized zero‑shot RL framework that improves stability, policy extraction, and representation quality by combining in‑sample learning, task‑conditioned diffusion models, and expressive attention‑based architectures, achieving near‑state‑of‑the‑art performance on benchmarks like ExORL and D4RL Kitchen.

behavioral regularization · diffusion model · offline RL
0 likes · 3 min read
Data Party THU
Oct 22, 2025 · Artificial Intelligence

Demystifying Large‑Model Reinforcement Learning: From MDP Basics to Bellman and Advantage Functions

This article provides a comprehensive introduction to reinforcement learning for large language models, covering the Markov Decision Process formulation, the four core elements of RL, state‑value and action‑value functions, Bellman equations, and the advantage function that underpins modern policy‑gradient algorithms.

AI fundamentals · Bellman equation · MDP
0 likes · 13 min read
Data Party THU
Oct 21, 2025 · Artificial Intelligence

Why DQN Overestimates Q‑Values and How Double DQN Fixes It

The article explains how DQN’s use of the max operator introduces a maximization bias that leads to overestimated Q‑values, and shows how Double DQN separates action selection from value evaluation to eliminate this bias, improving stability and performance in Atari benchmarks.

DQN · Double DQN · algorithm analysis
0 likes · 7 min read
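
The separation of action selection from value evaluation described in the summary above can be sketched as follows. This is a minimal illustration, not code from the article: the function names, the discount factor `gamma`, and the Q-value arrays are hypothetical, and the transition is assumed to be non-terminal.

```python
import numpy as np

def dqn_target(reward, next_q_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    return reward + gamma * float(np.max(next_q_target))

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    best_action = int(np.argmax(next_q_online))
    return reward + gamma * float(next_q_target[best_action])

# Hypothetical Q-value estimates for one next state with three actions.
next_q_online = np.array([1.0, 2.0, 0.5])
next_q_target = np.array([1.5, 0.8, 3.0])

y_dqn = dqn_target(0.0, next_q_target)
y_double = double_dqn_target(0.0, next_q_online, next_q_target)
print(y_dqn, y_double)
```

Because the Double DQN target reads the target network's value at the online network's preferred action, it can never exceed the plain max over the target network's estimates, which is why it dampens the maximization bias.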
Data Thinking Notes
Oct 19, 2025 · Artificial Intelligence

How GSPO Improves Stability in Large Language Model Training

GSPO (Group Sequence Policy Optimization) is a reinforcement‑learning algorithm for LLMs that replaces token‑level GRPO with sequence‑level optimization, addressing instability in ultra‑large model training, especially for long‑sequence and MoE architectures, by aligning reward granularity and reducing variance.

GRPO · GSPO · large language models
0 likes · 11 min read
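
To make the token-level versus sequence-level distinction in the summary above concrete, here is a small sketch. It assumes the sequence-level importance ratio is the length-normalized likelihood ratio, that is, the exponential of the mean per-token log-probability difference; the function names and log-probabilities are hypothetical.

```python
import numpy as np

def token_level_ratios(logp_new, logp_old):
    """One importance ratio per token, as in token-level objectives."""
    return np.exp(np.asarray(logp_new) - np.asarray(logp_old))

def sequence_level_ratio(logp_new, logp_old):
    """A single length-normalized ratio for the whole response (assumed form)."""
    diff = np.asarray(logp_new) - np.asarray(logp_old)
    return float(np.exp(diff.mean()))

# Hypothetical per-token log-probabilities of one sampled response.
logp_new = [-1.0, -2.0]
logp_old = [-1.5, -1.5]

tok = token_level_ratios(logp_new, logp_old)
seq = sequence_level_ratio(logp_new, logp_old)
print(tok, seq)
```

In this toy case the per-token ratios swing above and below 1 while the sequence-level ratio stays at exactly 1, which illustrates how averaging over the sequence can reduce variance.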
Xiaohe Frontend Team
Oct 15, 2025 · Artificial Intelligence

REFRAG: Using Tiny Models to Compress RAG for Faster, Smarter AI

Meta’s new REFRAG framework lets a lightweight encoder compress retrieved text into semantic tags, enabling large language models to answer queries with far fewer tokens, lower latency, and higher throughput, while preserving core meaning and allowing flexible placement of compressed information within prompts.

LLM efficiency · RAG · model compression
0 likes · 8 min read
Meituan Technology Team
Oct 15, 2025 · Artificial Intelligence

What’s New in Large Model Research? Top Meituan AI Papers Up to Oct 2025

This curated list showcases Meituan’s latest large‑model breakthroughs and academic papers up to October 2025, spanning LLM system optimizations, multimodal generation, evaluation benchmarks, quantization techniques, and reinforcement‑learning‑driven improvements, offering researchers valuable insights and resources across the AI landscape.

AI research · Benchmarking · Multimodal AI
0 likes · 10 min read
Data Party THU
Oct 15, 2025 · Artificial Intelligence

Designing Safe, Sample-Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models

This paper proposes a reinforcement‑learning framework that simultaneously ensures safety, sample efficiency, and robustness, applying a contextual‑bandit perspective to ranking/recommendation systems and text‑to‑image diffusion models, and introduces novel algorithms for safe deployment, variance‑reduced off‑policy estimation, and a LOOP method for generative RL.

Robustness · Safety · contextual bandits
0 likes · 5 min read
Alibaba Cloud Developer
Oct 15, 2025 · Artificial Intelligence

Mastering Structured Output in Large Language Models: Techniques, Challenges, and Future Trends

Large language models are evolving from free‑form text generators to reliable data providers by mastering structured output through prompt engineering, validation frameworks, constrained decoding, supervised fine‑tuning, reinforcement learning, and API‑level capabilities, enabling seamless integration with software systems while addressing hallucinations and format reliability.

API · LLM · Prompt engineering
0 likes · 28 min read
Volcano Engine Developer Services
Oct 14, 2025 · Artificial Intelligence

How CollabLLM Redefines LLM Collaboration with Multi‑Turn Training

CollabLLM tackles the limitations of large language models in everyday multi‑turn dialogues by introducing a user‑centric, multi‑turn training framework that leverages simulated interactions, multi‑round reward modeling, and veRL toolchain support, achieving superior performance over single‑turn baselines.

LLM · collaborative training · multi-turn dialogue
0 likes · 13 min read
Shopee Tech Team
Oct 14, 2025 · Artificial Intelligence

How SPEC‑RL Boosts On‑Policy Reinforcement Learning Speed by Up to 3×

SPEC‑RL introduces speculative rollouts that reuse verified historical rollouts as prefixes, cutting rollout time by 2–3× while maintaining or improving performance across various math and reasoning benchmarks, and works seamlessly with PPO, GRPO, DAPO and other on‑policy algorithms.

AI efficiency · Training Acceleration · large language models
0 likes · 8 min read
AntTech
Oct 14, 2025 · Artificial Intelligence

How Ring-1T Achieves Trillion-Scale Deep Thinking and Competitive Benchmarks

The Ring-1T model, a trillion-parameter AI system released as open source, leverages advanced reinforcement learning techniques, extensive benchmark evaluations, and custom training frameworks to deliver balanced performance across math, code, reasoning, and creative tasks while highlighting current limitations and future development plans.

AI model · benchmark evaluation · deep reasoning
0 likes · 8 min read
Data Party THU
Oct 13, 2025 · Artificial Intelligence

How BranchGRPO Accelerates and Stabilizes Diffusion Model Alignment

BranchGRPO introduces a tree‑structured branching, reward‑fusion, and lightweight pruning framework that dramatically speeds up diffusion and flow model training while delivering denser, more stable reward signals, achieving up to five‑fold faster convergence and higher alignment scores on image and video generation benchmarks.

BranchGRPO · RLHF · diffusion models
0 likes · 10 min read
Bighead's Algorithm Notes
Oct 12, 2025 · Artificial Intelligence

Trading-R1: Open-Source LLM Framework for Explainable Financial Trading

This article reviews Trading‑R1, an open‑source LLM reasoning framework that integrates multimodal financial data, three‑stage supervised fine‑tuning, and reinforcement learning to generate structured investment theses and risk‑adjusted trade decisions, achieving superior Sharpe ratio and drawdown performance on real‑world stock and ETF tests.

Dataset · Financial Trading · LLM
0 likes · 11 min read
Kuaishou Large Model
Oct 11, 2025 · Artificial Intelligence

How Large-Scale Reinforcement Learning Boosted KAT-Dev-72B-Exp to 74.6% on SWE‑Bench

The KwaiPilot team introduced KAT-Dev-72B-Exp, an open‑source LLM trained with large‑scale reinforcement learning that achieved a record‑breaking 74.6% score on SWE‑Bench Verified, thanks to innovations like Trie Packing, entropy‑aware advantage scaling, and a decoupled data‑environment architecture.

KAT-Dev-72B-Exp · Trie Packing · entropy scaling
0 likes · 6 min read
Kuaishou Tech
Oct 11, 2025 · Artificial Intelligence

How KAT-Dev-72B-Exp Sets a New Record in Large‑Scale RL for Code Generation

The KAT‑Dev‑72B‑Exp model, an experimental reinforcement‑learning version of KAT‑Coder, scores 74.6% on the SWE‑Bench Verified benchmark, introduces Trie Packing and entropy‑aware advantage scaling, and showcases a decoupled training architecture that dramatically speeds up large‑scale agentic RL training.

AI · Code Generation · agentic training
0 likes · 9 min read
Data Party THU
Oct 10, 2025 · Artificial Intelligence

Can Language Models Self‑Train Without Data? Inside the Language Self‑Play Framework

This article examines the Language Self‑Play (LSP) approach for data‑free training of large language models, detailing its challenger‑solver game formulation, advantage calculations, loss functions, self‑reward extension, experimental setup on AlpacaEval, and results that show LSP can match or surpass data‑driven baselines.

LLM · data-free training · large language models
0 likes · 14 min read
DataFunTalk
Oct 9, 2025 · Artificial Intelligence

From Physics to DeepMind: How a Tsinghua Star Is Shaping AI Research

Google DeepMind has hired Shunyu Yao, a Tsinghua physics prodigy and former Anthropic researcher. His rapid move from theoretical physics to AI highlights the intense workload, clashes over values, and accelerating pace of large‑model research.

AI research · DeepMind · Physics
0 likes · 9 min read
Model Perspective
Oct 8, 2025 · Artificial Intelligence

How Mathematical Models Reveal the Hidden Dynamics of Addiction

This article explores how differential equations, SIR-like population models, and reinforcement‑learning frameworks can quantitatively describe the onset, persistence, and spread of addictive behaviors, offering insights into feedback loops, neural adaptation, and optimal intervention strategies.

addiction modeling · dynamical systems · intervention optimization
0 likes · 10 min read
DataFunSummit
Oct 7, 2025 · Artificial Intelligence

Deep Thinking in Large Language Models: Overcoming Domain Challenges

This presentation explores how large language models can transcend their general knowledge limits by developing domain‑specific deep thinking abilities, addressing challenges such as complex instruction execution, expert reasoning gaps, and tool integration, and proposes reinforcement‑learning‑driven frameworks, structured thinking pipelines, and tool‑calling mechanisms to achieve rational intelligence.

Tool integration · deep reasoning · domain adaptation
0 likes · 27 min read
DataFunTalk
Oct 7, 2025 · Artificial Intelligence

Can Reinforcement Learning Spot Hallucinations in LLMs? Introducing RL4HS

Apple’s new paper presents RL4HS, a reinforcement‑learning framework that uses span‑level rewards and class‑aware policy optimization to detect hallucinated text spans in large language models, outperforming GPT‑5 and other baselines and offering more precise, auditable error identification.

RL4HS · hallucination detection · reinforcement learning
0 likes · 9 min read
Amap Tech
Oct 3, 2025 · Artificial Intelligence

How FantasyHSI Enables Autonomous 3D Human Interaction in Any Scene

FantasyHSI introduces a graph‑based multi‑agent framework that combines visual‑language models and video‑generation diffusion to let digital humans perceive, plan, and interact autonomously in any 3D scene, producing physically plausible, long‑duration actions for animation creation and embodied‑AI simulation.

3D synthesis · Graph Modeling · Video Generation
0 likes · 12 min read
AI2ML AI to Machine Learning
Oct 1, 2025 · Artificial Intelligence

2025 Large Model Engineering Breakthroughs: Cutting Costs, Boosting Performance, and Extending Context

The 2025 open‑source reports reveal major advances in large‑model engineering, including drastic cost cuts such as DeepSeek‑V3 training for $5.57 M, performance gains where Gemma 3 4B matches Gemma 2 27B, memory efficiencies like 85 % KV‑cache reduction, and a suite of new techniques—from loss‑free MoE balancing to multi‑token prediction—that together push context lengths to one million tokens and enable multimodal, aligned, and industry‑specific models.

Cost reduction · Multimodal AI · attention mechanisms
0 likes · 13 min read
Data Party THU
Sep 28, 2025 · Artificial Intelligence

Can the OaK Architecture Unlock General AI? A Deep Dive into Continuous Learning and Planning

The article presents Richard Sutton’s OaK architecture—a domain‑general, empirical, open‑ended framework that equips agents with continuously learnable components, meta‑learned step‑sizes, and a five‑stage FC‑STOMP pipeline to build world models, generate sub‑problems, learn options, and plan at run‑time.

AI Architecture · World Models · continual learning
0 likes · 22 min read
HyperAI Super Neural
Sep 28, 2025 · Artificial Intelligence

Weekly AI Paper Digest: Vision‑Language Models for Safety, Unstable Singularities, and RL‑Driven Reasoning

This week’s AI paper roundup highlights five recent studies: a construction‑site vision‑language dataset and safety inspection tasks, a deep CORAL method for unsupervised domain adaptation, the discovery of a new family of unstable singularities in nonlinear PDEs, a reinforcement‑learning approach that boosts reasoning in large language models, and the PANORAMA architecture for omnidirectional vision in embodied AI.

Construction Safety · Omnidirectional Vision · PDE Research
0 likes · 6 min read
Huawei Cloud Developer Alliance
Sep 28, 2025 · Artificial Intelligence

Essential AI Reading List: Must‑Read Books Across AI, ML, DL, and Ethics

This curated list presents the most influential AI books, covering foundational theory, machine learning, deep learning, reinforcement learning, computer vision, and AI ethics, with editorial insights and author biographies to guide readers through the evolving landscape of artificial intelligence.

AI ethics · artificial intelligence · reinforcement learning
0 likes · 25 min read
HyperAI Super Neural
Sep 26, 2025 · Artificial Intelligence

Nvidia’s ReaSyn Uses Chain‑of‑Reaction Reasoning to Boost Molecule Reconstruction and Path Diversity

ReaSyn, a new framework from Nvidia’s research team, treats synthesis pathways as chain‑of‑thought reasoning using a novel Chain‑of‑Reaction representation, achieving the highest reconstruction rates and path diversity in molecule synthesis tasks, and outperforming prior methods across multiple benchmark optimizations.

AI drug discovery · ReaSyn · chain-of-reaction
0 likes · 14 min read
Bighead's Algorithm Notes
Sep 25, 2025 · Artificial Intelligence

How MARS Uses Risk‑Aware Multi‑Agent RL to Master Portfolio Management

This article reviews the MARS framework, a risk‑aware multi‑agent reinforcement‑learning system for automated portfolio management that tackles market non‑stationarity and proactive risk control, detailing its hierarchical architecture, formal MDP formulation, training process, and superior experimental results on DJIA and HSI benchmarks.

Deep Learning · Multi-Agent · Portfolio Management
0 likes · 13 min read
Fun with Large Models
Sep 24, 2025 · Artificial Intelligence

Interview Guide: Core Differences Between PPO and GRPO Algorithms for Large Model Fine‑Tuning

The article explains the fundamental principles of PPO and GRPO reinforcement‑learning algorithms, compares their architectures and training workflows, highlights why GRPO is gaining traction in large‑model fine‑tuning, discusses associated risks, and offers practical guidance on group size selection for engineers preparing for interviews.

GRPO · PPO · RLHF
0 likes · 9 min read
Data Party THU
Sep 20, 2025 · Artificial Intelligence

How DeepSeek Trained a $30M LLM for Just $294K – Inside the R1 Model

The article reports that DeepSeek’s R1 large language model, detailed in a peer‑reviewed Nature paper, was trained for roughly $294,000 (about $300 k) on Nvidia H800 chips with pure reinforcement‑learning techniques, achieving competitive performance while remaining open‑source.

DeepSeek · Nvidia H800 · Peer Review
0 likes · 9 min read
Bighead's Algorithm Notes
Sep 20, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Sep 13‑19, 2025)

This digest summarizes seven recent arXiv papers that apply reinforcement learning, multi‑agent frameworks, dynamic factor models, high‑frequency trading LLMs, quantum GANs, multi‑LLM sentiment analysis, and context‑aware language models to advance quantitative finance and AI‑driven market prediction.

Quantitative Finance · Quantum Machine Learning · large language models
0 likes · 12 min read
Data Party THU
Sep 19, 2025 · Artificial Intelligence

How DeepSeek R1 Redefines AI Reasoning with Pure Reinforcement Learning

DeepSeek R1 replaces traditional supervised fine‑tuning with a pure reinforcement‑learning pipeline, introducing the GRPO algorithm and a four‑stage training regime that dramatically lowers cost, boosts reasoning and code‑generation performance, and raises important ethical, privacy, and societal considerations for large language models.

AI reasoning · DeepSeek · GRPO
0 likes · 14 min read
HyperAI Super Neural
Sep 19, 2025 · Artificial Intelligence

Weekly AI Paper Roundup: RL Advances, Tree‑Structured QA, and GraphRAG Breakthroughs

This article surveys five recent AI papers, covering reinforcement learning for large reasoning models, a tree‑structured table QA framework (ST‑Raptor), visual representation alignment for multimodal LLMs, GraphRAG‑based generation, and an LLM‑driven cryptographic vulnerability detector, each with key insights and links.

cryptographic vulnerability detection · graph retrieval · large language models
0 likes · 5 min read
DataFunSummit
Sep 18, 2025 · Artificial Intelligence

Boosting LLM Function Call: Data, Training, and Agent Optimization Strategies

This presentation by Yao Yitong of China Telecom AI Research Institute explains why Function Call is essential for LLM deployment, outlines data‑centric and training‑centric optimization methods, discusses common pitfalls and reward‑function design for reinforcement learning, and showcases practical Agent application patterns for real‑world tasks.

Agent · LLM · Training Optimization
0 likes · 36 min read
Bighead's Algorithm Notes
Sep 14, 2025 · Artificial Intelligence

How MM‑DREX Uses Multimodal LLMs for Dynamic Expert Routing in Financial Trading

The article reviews the MM‑DREX framework, which tackles the non‑stationarity of financial markets by modeling trading as a POMDP, employing a vision‑language model‑driven dynamic router to allocate four heterogeneous experts, and demonstrates superior returns, Sharpe ratios, and drawdown control across stocks, futures, and crypto compared with 15 strong baselines.

LLM · POMDP · dynamic routing
0 likes · 13 min read
Fighter's World
Sep 12, 2025 · Artificial Intelligence

Why Are Production‑Grade AI Agents So Hard to Build?

The article analyses why production‑grade AI agents remain unreliable, pinpointing the scarcity of high‑quality task‑action data, the limits of static benchmarks, and the need for massive data‑generation engines, simulation sandboxes, sophisticated RL reward design, and efficient context engineering.

AI Agent · Context Engineering · Data Generation
0 likes · 21 min read
DataFunTalk
Sep 12, 2025 · Artificial Intelligence

Key Takeaways from AI Leaders at the 2025 Inclusion·Bund Conference

The 2025 Inclusion·Bund conference gathered top AI pioneers, including Turing laureate Richard Sutton, Alibaba Cloud founder Wang Jian, HKU professor Ma Yi, Unitree Robotics CEO Wang Xingxing, and historian Yuval Harari, to discuss the limits of intelligence, the shift toward open‑source resources, embodied AI, and the societal implications of rapid AI advancement.

AI · artificial intelligence · reinforcement learning
0 likes · 15 min read
Bighead's Algorithm Notes
Sep 11, 2025 · Artificial Intelligence

Fin-PRM: Alibaba’s Dianjin Team Introduces a Domain-Specific Process Reward Model for Financial Reasoning

Fin‑PRM, a domain‑specific process reward model for financial reasoning introduced by Alibaba’s Dianjin team, employs dual‑level step and trajectory rewards to provide fine‑grained supervision, achieving up to 12.9% accuracy gains in supervised fine‑tuning and 5.1% improvements in Best‑of‑N inference on benchmarks such as CFLUE and FinQA.

CFLUE · Fin-PRM · FinQA
0 likes · 11 min read
Instant Consumer Technology Team
Sep 11, 2025 · Artificial Intelligence

How REFRAG Cuts LLM Decoding Time by 30×: A New Efficient RAG Framework

REFRAG (REpresentation For RAG) introduces a novel decoding framework that compresses, senses, and expands context using precomputed chunk embeddings, achieving up to 30.85× faster first-token generation and 16× larger context windows without sacrificing perplexity, as validated across diverse long‑context tasks.

LLM · RAG · chunk embeddings
0 likes · 18 min read
Sohu Tech Products
Sep 10, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

This article explains the GRPO algorithm, an improvement over PPO for large language model training that eliminates the value network, uses group‑relative advantage estimation, and offers flexible supervision, resulting in higher efficiency, stability, and performance on tasks such as mathematical reasoning.

AI Optimization · GRPO · LLM training
0 likes · 16 min read
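
The group-relative advantage estimation mentioned in the summary above can be sketched in a few lines. This is an illustrative sketch only: it assumes advantages are obtained by standardizing each response's reward against the mean and standard deviation of its own sampling group, and the function name, `eps` constant, and reward values are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    No value network is needed: the group mean acts as the baseline.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four sampled answers to one math prompt
# (1.0 = correct, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers receive positive advantages and incorrect ones negative, and the advantages within a group always sum to roughly zero.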
Bighead's Algorithm Notes
Sep 5, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Aug 30 – Sep 5, 2025)

This digest reviews four recent AI‑driven finance papers: a robust MCVaR portfolio optimizer with ellipsoidal support and RKHS uncertainty, a PPO‑based adaptive weighting system for LLM‑generated alphas, an empirical comparison of price‑based, GICS‑based, and LLM‑embedding stock clustering, and a diffusion‑model approach that generates future financial chart images from current charts and text prompts.

Quantitative Finance · diffusion models · large language models
0 likes · 9 min read
Data Party THU
Sep 4, 2025 · Artificial Intelligence

Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive

This article provides a comprehensive technical analysis of PPO‑based reinforcement learning methods for large language models, detailing the evolution from the original PPO algorithm through GRPO, DAPO, and GSPO, and explaining their motivations, mathematical formulations, advantages, and practical challenges such as entropy collapse and importance‑sampling variance.

DAPO · GRPO · GSPO
0 likes · 30 min read
Sohu Tech Products
Sep 3, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF for Large Language Models

This article explains the motivation, mathematical foundations, implementation details, advantages, experimental results, and future directions of Group Relative Policy Optimization (GRPO), a novel reinforcement‑learning algorithm that replaces PPO’s value network with efficient group‑wise relative evaluation for large language models.

GRPO · LLM · PPO
0 likes · 17 min read
Bighead's Algorithm Notes
Sep 3, 2025 · Artificial Intelligence

Decoding TINs: Reconstructing Classic Technical Analysis with Neural Networks

The paper introduces Technical Indicator Networks (TINs), a framework that maps traditional technical analysis formulas to neural‑network topologies, initializes weights to preserve indicator behavior, and uses reinforcement learning for dynamic optimization, achieving significantly higher Sharpe and Sortino ratios and cumulative returns on US30 component stocks than conventional MACD approaches.

Algorithmic Trading · Deep Learning · Financial AI
0 likes · 9 min read
Baobao Algorithm Notes
Sep 3, 2025 · Artificial Intelligence

How Atom-Searcher Boosts LLM Reasoning with Atomic Thought Rewards

Atom-Searcher introduces an atomic‑thought reinforcement‑learning framework that decomposes complex reasoning into fine‑grained units, uses a Reasoning Reward Model to assign step‑wise rewards, dynamically balances process and result incentives, and achieves state‑of‑the‑art performance on multiple LLM benchmarks.

Agentic Research · Atomic Thought · LLM
0 likes · 12 min read
Data STUDIO
Sep 2, 2025 · Artificial Intelligence

Understanding NAS: Core Algorithms and Python Implementations

This article reviews Neural Architecture Search (NAS), explains its bi‑level optimization formulation, compares three major search strategies—reinforcement learning, evolutionary algorithms, and differentiable gradient‑based methods—provides complete Python code for each, and analyzes experimental results highlighting performance trade‑offs and remaining challenges.

Deep Learning · Differentiable Architecture Search · Evolutionary Algorithms
0 likes · 25 min read
Data Party THU
Aug 30, 2025 · Artificial Intelligence

Understanding Multi‑Armed Bandits: Balancing Exploration and Exploitation in Reinforcement Learning

Multi‑armed bandit models illustrate the core exploration‑exploitation dilemma in reinforcement learning, covering greedy, ε‑greedy, and optimistic‑initial‑value strategies, as well as sample‑average and incremental Q‑value estimation methods with practical examples and visual illustrations.

Q-value estimation · exploration vs exploitation · greedy
0 likes · 15 min read
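The ε‑greedy strategy and incremental Q‑value estimation named in the summary above can be sketched as follows. This is a minimal illustration under stated assumptions: Gaussian rewards with unit variance, a fixed exploration rate, and hypothetical names (`run_epsilon_greedy`, `true_means`) that do not come from the article.

```python
import random


def run_epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on a Gaussian multi-armed bandit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    q = [0.0] * n_arms     # current value estimate for each arm
    counts = [0] * n_arms  # number of times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                   # explore
        else:
            arm = max(range(n_arms), key=lambda a: q[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)          # assumed reward noise
        counts[arm] += 1
        # Incremental sample-average update: Q <- Q + (R - Q) / N
        q[arm] += (reward - q[arm]) / counts[arm]
    return q, counts


q, counts = run_epsilon_greedy([0.2, 0.5, 1.5])
```

The incremental update gives the same estimate as averaging all rewards seen for an arm, but stores only the running value and the pull count.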
Bighead's Algorithm Notes
Aug 29, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Aug 23‑29, 2025)

This digest summarizes nine recent arXiv papers covering quantum portfolio optimization, thematic investing with semantic stock representations, multi‑indicator reinforcement learning for trading, attention‑based asset pricing, ESG variable selection, deep neural networks for return distribution forecasting, a foundation model for financial time‑series, a multi‑agent trading system with self‑reflection, and dynamic weighting machine‑learning stock selection strategies.

Deep Learning · ESG · Quantitative Finance
0 likes · 17 min read
Network Intelligence Research Center (NIRC)
Aug 27, 2025 · Artificial Intelligence

Perception‑R1: A Rule‑Based RL Method that Elevates Multimodal Model Vision

Perception‑R1, a post‑training framework that applies rule‑based reinforcement learning to existing multimodal LLMs, dramatically improves visual perception tasks such as grounding, OCR, counting and object detection, as demonstrated by extensive benchmarks and ablation studies.

GRPO · Perception Policy · Reward Modeling
0 likes · 10 min read
Kuaishou Tech
Aug 23, 2025 · Artificial Intelligence

How Thyme Enables Models to Think Beyond Images with Code‑Driven Multimodal Reasoning

The Kwai Keye team presents Thyme, a novel multimodal reasoning framework that lets large language models generate and safely execute Python code for image manipulation and complex calculations, achieving significant performance gains over existing vision‑language models across perception, reasoning, and hallucination‑reduction benchmarks.

AI research · Code Generation · large language model
0 likes · 12 min read
Architect's Must-Have
Aug 22, 2025 · Artificial Intelligence

Why Multi-Agent Communication Protocols Are the Future of AI Collaboration

This article examines the limitations of single-agent AI, explains how Multi-Agent Communication Protocols (MCP) address challenges such as incomplete perception, decision conflicts, and scalability, and outlines current research, industrial applications, and future directions including edge integration and blockchain synergy.

Blockchain · Edge Computing · communication protocols
0 likes · 8 min read
Data Thinking Notes
Aug 21, 2025 · Artificial Intelligence

Why Intermediate Tokens Matter: Denny Zhou’s Deep Insights into LLM Reasoning

This article distills Denny Zhou’s Stanford CS25 lecture, explaining how large language models achieve reasoning through intermediate token generation, chain‑of‑thought prompting, self‑consistency, reinforcement‑learning fine‑tuning, and answer aggregation, while highlighting theoretical foundations and practical breakthroughs.

LLM · chain-of-thought · reasoning
0 likes · 18 min read
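The self‑consistency and answer‑aggregation ideas named in the summary above can be sketched as a majority vote over the final answers of several independently sampled reasoning chains. This is a minimal illustration; the function name `self_consistent_answer` and the sample answers are hypothetical, and the sampling of chains from a model is assumed to happen elsewhere.

```python
from collections import Counter


def self_consistent_answer(final_answers):
    """Return the most frequent final answer and its share of the votes.

    final_answers: final answers extracted from independently sampled
    reasoning chains for the same question (extraction is assumed done).
    """
    votes = Counter(answer.strip() for answer in final_answers)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(final_answers)


# Final answers from four sampled chains; the intermediate tokens differ,
# but only the final answers are aggregated.
answer, agreement = self_consistent_answer(["18", "18", "26", "18 "])
```

The vote marginalizes over different reasoning paths: chains that reach the same answer by different routes reinforce each other.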
Kuaishou Tech
Aug 21, 2025 · Artificial Intelligence

How SeamlessFlow Doubles RL Training Throughput and Cuts Time by 62%

SeamlessFlow, an industrial‑scale reinforcement‑learning training framework released by Kwaipilot, decouples the trainer from agents via a novel data plane, introduces a tag‑based resource scheduler, and eliminates pipeline bubbles, achieving up to a 100% boost in token throughput and a 62% reduction in overall training time across large‑model RL workloads.

Distributed Training · pipeline optimization · reinforcement learning
0 likes · 13 min read
Kuaishou Large Model
Aug 19, 2025 · Artificial Intelligence

How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO

Klear-Reasoner, built on Qwen3‑8B‑Base, introduces the Gradient‑Preserving Clipping Policy Optimization (GPPO) algorithm to overcome the limitations of traditional clipping, achieving state‑of‑the‑art performance on AIME2024/2025 and LiveCodeBench while providing detailed experimental analysis and data‑quality insights.

GPPO · code reasoning · gradient clipping
0 likes · 11 min read
AntTech
Aug 19, 2025 · Artificial Intelligence

How UI‑Venus Achieves SOTA in Multimodal GUI Agent Benchmarks

Ant Group's open‑source native GUI agent UI‑Venus leverages multimodal large‑model and reinforcement‑learning techniques to outperform prior models on grounding and navigation benchmarks, while using a high‑quality data pipeline and a self‑evolving alignment mechanism to push the limits of GUI automation.

Benchmark · GUI Agent · Multimodal AI
0 likes · 7 min read
AI Info Trend
Aug 19, 2025 · Industry Insights

What’s Driving the AI Revolution in 2025? Key Trends and Insights

The 2025 H1 AI Core Achievements and Trends report reveals how agents are reshaping productivity, models are gaining inference power and becoming smaller, reinforcement learning is overtaking pre‑training, and industry competition is intensifying, with China and the US narrowing their technology gap.

AI · China · Model Trends
0 likes · 10 min read
Kuaishou Tech
Aug 18, 2025 · Artificial Intelligence

How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO Optimization

The Klear‑Reasoner model, built on Qwen3‑8B‑Base and powered by the novel Gradient‑Preserving Clipping Policy Optimization (GPPO) algorithm, surpasses same‑size open‑source baselines on challenging math (AIME) and code (LiveCodeBench) benchmarks, while revealing key insights on data quality, reward design, and clipping strategies for large‑language‑model reasoning.

GPPO · LLM · code reasoning
0 likes · 11 min read
Baobao Algorithm Notes
Aug 15, 2025 · Artificial Intelligence

Unlocking LLM Performance: Classic Deep RL Tricks Reimagined for Modern Training

This article systematically adapts classic deep reinforcement‑learning techniques—such as multi‑step returns, TD(λ)/GAE, V‑trace corrections, uncertainty‑aware weighting, safety constraints, distribution‑robust optimization, and value‑guided decoding—to improve large language model training and inference, providing concrete formulas, implementation tips, and empirical results.

Deep RL · GAE · LLM
0 likes · 17 min read
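Of the techniques listed in the summary above, TD(λ)/GAE is compact enough to sketch. The following is a minimal illustration of the standard generalized advantage estimation recursion, assuming per-step rewards and value estimates are already available; the function name `gae_advantages` and the default `gamma` and `lam` values are assumptions, not taken from the article.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute generalized advantage estimates for one trajectory.

    rewards: reward at each step t = 0..T-1.
    values:  value estimates for steps 0..T, where the last entry is the
             bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# With gamma = lam = 1 and zero values, GAE reduces to the reward-to-go.
adv = gae_advantages([1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
```

Setting `lam=0` recovers the one-step TD error, while `lam=1` recovers the Monte Carlo return minus the value baseline; intermediate values trade bias against variance.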
Baobao Algorithm Notes
Aug 14, 2025 · Artificial Intelligence

Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

The article analyzes the poor generalization of supervised fine‑tuning (SFT) for large language models, reveals its gradient as a high‑variance inverse‑probability policy gradient, proposes a one‑line Dynamic Fine‑Tuning correction, and shows substantial gains on challenging math and offline RL benchmarks.

Dynamic Fine-Tuning · Generalization · LLM alignment
0 likes · 7 min read
AIWalker
Aug 13, 2025 · Artificial Intelligence

Look-Back Triggers Visual Reflection in Qwen-2.5-VL, +6.3% Perception

Look-Back is an implicit training paradigm that enables the Qwen‑2.5‑VL‑7B multimodal LLM to autonomously re‑focus on visual inputs during reasoning, achieving a 6.3% boost in perception tasks, outperforming prior baselines while requiring no extra image tokens or model architecture changes.

Look-Back · Qwen-2.5-VL · implicit training
0 likes · 26 min read
Kuaishou Tech
Aug 6, 2025 · Artificial Intelligence

How Supervised Learning‑Enhanced Multi‑Group Actor‑Critic Boosts Live Stream Allocation in Short‑Video Feeds

This article presents the SL‑MGAC framework, a supervised‑learning‑enhanced multi‑group Actor‑Critic algorithm that improves live‑stream insertion decisions in mixed short‑video and live‑stream recommendation systems, achieving higher stability and better long‑term user engagement while satisfying platform constraints, as validated by extensive offline and online experiments.

KDD 2025 · actor-critic · live stream recommendation
0 likes · 9 min read
AIWalker
Aug 5, 2025 · Artificial Intelligence

Perception‑R1: RL Gives Visual Insight Without Chain‑of‑Thought, Beats Four Tasks

The paper introduces Perception‑R1, a rule‑based reinforcement‑learning framework that trains multimodal large language models for visual perception tasks without relying on chain‑of‑thought reasoning, and demonstrates up to 17.9% performance gains on RefCOCO+, PixMo‑Count, PageOCR and COCO2017, while analyzing the key roles of perception confusion and reward design.

Benchmark · RLHF · Visual Perception
0 likes · 24 min read
AI Info Trend
Aug 4, 2025 · Industry Insights

How AI Agents and Small Models Are Redefining Productivity in 2025 H1

The report analyzes first‑half‑2025 AI breakthroughs, covering the rise of general‑purpose agents, rapid inference improvements, small‑model proliferation, reinforcement‑learning compute dominance, evolving transformer architectures, and shifting industry dynamics, offering actionable insights for researchers, product leaders, and decision‑makers.

AI · Agent · Trend
0 likes · 9 min read
JD Tech
Jul 29, 2025 · Artificial Intelligence

How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

This article describes a QCon talk that combines causal inference with large language models to build a retrieval‑augmented generation pricing system for e‑commerce, detailing the three‑step algorithm, LLM‑driven modeling challenges, process‑reward tree search, reinforcement‑learning fine‑tuning, and experimental gains in accuracy and speed.

Retrieval Augmented Generation · causal inference · e‑commerce pricing
0 likes · 17 min read
AI Algorithm Path
Jul 27, 2025 · Artificial Intelligence

Understanding RLHF: How Human Feedback Trains Modern LLMs

This article explains the RLHF (Reinforcement Learning from Human Feedback) pipeline that powers ChatGPT and other large language models, covering the limitations of traditional fine‑tuning, the creation of human‑feedback datasets, reward‑model training, loss design, and the final PPO‑based fine‑tuning step.

ChatGPT · Human Feedback · PPO
0 likes · 8 min read
AI2ML AI to Machine Learning
Jul 24, 2025 · Artificial Intelligence

Exploring Recent Large‑Model Agent Papers: Insights and Analyses

This article reviews a series of recent research papers on large‑model agents, covering topics such as reinforcement‑learning‑driven ML agents, premise‑critique ability of LLMs, long‑term tool‑augmented LLM evaluation, agentic RAG, set‑based retrieval for multi‑hop QA, mobile VLM agents, and broader surveys of LLM applications, summarizing each work’s problem statement, prior approaches, novel contributions, experimental results, limitations, and future directions.

Agentic AI · Benchmark · LLM evaluation
0 likes · 46 min read
Fun with Large Models
Jul 23, 2025 · Artificial Intelligence

Why ChatGPT Agent Sets the Benchmark for Future Large‑Model AI Agents

The article analyzes OpenAI's ChatGPT Agent—its launch, performance metrics, all‑in‑one tool integration, real‑world use cases, pricing tiers, core capabilities, and how it surpasses competing agents like Manus, highlighting its significance for the next generation of AI agents.

AI Agent · ChatGPT Agent · Use Cases
0 likes · 11 min read
JD Tech Talk
Jul 23, 2025 · Artificial Intelligence

Causal Inference + LLMs: Transforming E‑Commerce Pricing Strategies

This article describes how integrating causal inference with large language models and Retrieval‑Augmented Generation can automate and explain e‑commerce product pricing, detailing the three‑step workflow, reinforcement‑learning rewards, experimental results, and future directions for end‑to‑end RAG‑LLM training.

RAG · causal inference · e‑commerce pricing
0 likes · 15 min read
JD Cloud Developers
Jul 23, 2025 · Artificial Intelligence

How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

At QCon 2025, the author presented a novel approach that integrates causal inference with large language models using Retrieval‑Augmented Generation, process rewards, and tree‑search to generate explainable, accurate e‑commerce pricing recommendations, dramatically improving accuracy from 44% to 74% while cutting inference time to seconds.

causal inference · e‑commerce pricing · reinforcement learning
0 likes · 14 min read
DataFunTalk
Jul 23, 2025 · Artificial Intelligence

Qwen3‑Coder: Open‑Source AI Programming Agent That Beats the Competition

Alibaba’s Tongyi team unveiled the open‑source Qwen3‑Coder, a massive 480‑billion‑parameter programming model that outperforms leading closed‑source solutions, supports up to 1 M token context, offers a free CLI tool, and demonstrates impressive code generation capabilities across animations, games, and real‑world tasks.

AI programming · Code Generation · Qwen3-Coder
0 likes · 5 min read
Kuaishou Tech
Jul 21, 2025 · Artificial Intelligence

Can AI Models Think on Demand? Inside KAT‑V1 AutoThink’s Dynamic Reasoning

The article introduces KAT‑V1 AutoThink, a dual‑mode large language model that automatically switches between thinking and non‑thinking modes based on problem difficulty, details its novel training paradigm, reinforcement‑learning enhancements, performance benchmarks against leading open‑source models, and provides open‑source resources for further research.

auto-think · knowledge distillation · large language model
0 likes · 14 min read
JD Retail Technology
Jul 21, 2025 · Artificial Intelligence

How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

This article presents a comprehensive approach that combines causal inference, large language models, and retrieval‑augmented generation to automate e‑commerce price recommendation, detailing the three‑step workflow, challenges across product categories, the RAG architecture, process‑reward‑guided tree search, reinforcement learning refinements, and experimental results showing significant accuracy and speed improvements.

causal inference · chain-of-thought · e‑commerce pricing
0 likes · 16 min read
Alimama Tech
Jul 17, 2025 · Artificial Intelligence

How to Build a High‑Scoring AI Werewolf Agent: Strategies, Prompt Engineering, and Code

This article details the author's experience designing a top‑performing AI Werewolf agent for the Taotian Group's AI Werewolf Challenge, covering game rules, core challenges, prompt engineering, caching, concurrent requests, model selection, reinforcement‑learning‑style tuning, and tactical strategies for each role, with code examples.

AI Agent · LLM · Prompt engineering
0 likes · 25 min read
DataFunTalk
Jul 16, 2025 · Artificial Intelligence

MiniMax-M1 Revealed: Hybrid Attention, RL Training, and 1M Token Context

MiniMax’s latest M1 model, unveiled after a $300 million funding round, showcases a 456‑billion‑parameter hybrid mixture‑of‑experts architecture with lightning attention, supporting contexts of up to one million tokens, and leverages reinforcement‑learning techniques to enhance long‑context handling, inference efficiency, and system‑2 reasoning capabilities.

AI scaling · Model architecture · hybrid attention
0 likes · 16 min read
AI Algorithm Path
Jul 14, 2025 · Artificial Intelligence

The Most Powerful Open‑Source Agent Model: Kimi K2

Kimi K2, an open‑source trillion‑parameter AI model released by Moonshot AI, offers Base and Instruct variants, achieves leading scores on benchmarks such as SWE‑bench, LiveCodeBench and AceBench, and introduces a novel post‑training autonomous‑exploration stage with MuonClip optimization to enable robust tool use and reinforcement‑learning‑driven self‑improvement.

Autonomous Agents · Kimi K2 · Tool Use
0 likes · 8 min read
AI Frontier Lectures
Jul 14, 2025 · Artificial Intelligence

Can Language Models Self‑Edit? Inside SEAL’s Self‑Adapting LLM Framework

The article surveys recent AI self‑evolution research, highlights the SEAL self‑adapting language model framework, explains its reinforcement‑learning based self‑editing mechanism, and presents experimental results on few‑shot learning and knowledge integration, while noting limitations and providing links to the paper and code.

AI self-improvement · Meta Learning · SEAL
0 likes · 12 min read
Python Programming Learning Circle
Jul 10, 2025 · Artificial Intelligence

Build a DQN Autonomous Driving Agent with gym and highway‑env

This tutorial walks through installing gym and highway‑env, configuring six driving scenarios, processing observations (kinematics, images, occupancy grids), defining actions and rewards, constructing a DQN network, training it with a reinforcement‑learning loop, and analyzing collision, time, and reward metrics.

DQN · autonomous driving · gym
0 likes · 10 min read
Data Thinking Notes
Jul 8, 2025 · Artificial Intelligence

How Xiaohongshu Leverages Large Models to Revolutionize Content Recommendation

This article details Xiaohongshu's multi‑stage recommendation pipeline—using massive multi‑modal pre‑training, long‑sequence modeling, real‑time context features, reinforcement learning and online deep learning—to precisely surface valuable content, address cold‑start challenges, and break information bubbles for hundreds of millions of users.

Multimodal Learning · large language model · online deep learning
0 likes · 16 min read
DataFunSummit
Jul 5, 2025 · Artificial Intelligence

Boosting Large Model Training: Optimizing Performance with the Verl Framework

Join the DataFun Summit 2025 on July 12 to hear Tencent FinTech senior researcher Gong Dihong discuss how redesigning the Verl training system, integrating Megatron and SGLang, and applying new synchronization and offloading techniques dramatically speeds up large‑model reinforcement‑learning training.

AI Performance · Megatron · Training Optimization
0 likes · 4 min read
AI Frontier Lectures
Jul 2, 2025 · Artificial Intelligence

Can Language Models Self‑Edit? Inside the SEAL Framework for Self‑Adapting LLMs

This article reviews recent AI self‑evolution research and provides an in‑depth analysis of the SEAL (Self‑Adapting Language Models) framework, which enables large language models to generate and learn from their own synthetic data through a nested reinforcement‑learning and fine‑tuning loop, with experimental results on few‑shot and knowledge‑integration tasks.

Few‑Shot Learning · Meta Learning · SEAL
0 likes · 11 min read
DataFunTalk
Jul 2, 2025 · Artificial Intelligence

How GLM-4.1V-Thinking Sets New Standards in Multimodal AI Reasoning

Zhipu AI unveiled the GLM-4.1V-Thinking series, an open‑source multimodal model that outperforms larger rivals on visual‑language tasks, supports video analysis, GUI agents, and advanced scientific reasoning, while introducing a curriculum‑sampling reinforcement‑learning framework and a new Agent application platform.

AI agents · GLM-4.1V · Multimodal AI
0 likes · 10 min read
Baobao Algorithm Notes
Jun 30, 2025 · Artificial Intelligence

How End‑to‑End Reinforcement Learning Powers the Kimi‑Researcher AI Agent

The article examines Kimi‑Researcher, an AI research agent built with end‑to‑end reinforcement learning, detailing its technical motivations, advantages over traditional workflow‑based and SFT methods, performance breakthroughs on benchmark exams, and diverse real‑world use cases ranging from literature reviews to legal analysis.

AI Agent · End-to-End RL · Kimi Researcher
0 likes · 10 min read
Fighter's World
Jun 28, 2025 · Artificial Intelligence

What Is the Generator‑Verifier Gap and Why It Matters for LLM Reasoning

The article explains the Generator‑Verifier Gap (GVG)—the asymmetry where verifying a solution is far cheaper than generating it—covers its origin, its impact on test‑time scaling for large language models, reinforcement‑learning approaches, and how the concept can shape agent architectures and AI product strategy.

Agent Architecture · Generator-Verifier Gap · LLM
0 likes · 21 min read