Tagged articles
13 articles
Machine Heart
Apr 4, 2026 · Artificial Intelligence

SFT Scores Don’t Predict RL Potential: Adaptive Early‑Stop Loss for LLMs

The authors show that high SFT accuracy does not guarantee strong RL performance because over‑fitting reduces output diversity, and they propose Adaptive Early‑Stop Loss (AESL), a diversity‑aware early‑stopping objective that dynamically weights token and subsequence losses, yielding consistently better RL results on multiple LLMs and math benchmarks.

AESL · Diversity · LLM
0 likes · 11 min read
PaperAgent
Jan 19, 2026 · Artificial Intelligence

How Reinforcement Learning Can Boost LLM Reasoning by Shaping Token Distributions

Recent research shows that applying reinforcement learning to large language models can dramatically improve reasoning performance, but that its effectiveness depends on the token distribution produced during pre‑training. This motivates a novel rewrite of cross‑entropy as a single‑step policy gradient with controllable entropy parameters.

LLM · Model Optimization · RL
0 likes · 6 min read
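The link between cross‑entropy and a single‑step policy gradient mentioned in the summary above can be checked numerically. The sketch below is not the article's formulation (its entropy parameters are not reproduced here); it only assumes a softmax policy over a toy three‑token vocabulary and a reward of 1 for the target token, with all names chosen for illustration. Under those assumptions, the exact policy gradient equals the negative cross‑entropy gradient scaled by the probability of the target token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def neg_cross_entropy_grad(logits, target):
    """Gradient of log pi(target) w.r.t. the logits: onehot(target) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == target else 0.0) - p for i, p in enumerate(probs)]

def policy_gradient(logits, reward):
    """Exact single-step policy gradient:
    sum over actions a of pi(a) * reward[a] * d log pi(a) / d logits."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for i in range(n):
            dlogp = (1.0 if i == a else 0.0) - probs[i]
            grad[i] += probs[a] * reward[a] * dlogp
    return grad

# Hypothetical toy setup: three tokens, token 0 is the supervised target.
logits = [2.0, 0.5, -1.0]
target = 0
reward = [1.0 if a == target else 0.0 for a in range(len(logits))]

sft_grad = neg_cross_entropy_grad(logits, target)
pg_grad = policy_gradient(logits, reward)
p_target = softmax(logits)[target]

# Each policy-gradient component is the SFT component times pi(target).
for s, g in zip(sft_grad, pg_grad):
    print(f"sft={s:+.4f}  pg={g:+.4f}  pg/pi(target)={g / p_target:+.4f}")
```

The scaling factor is one way to see why the pre‑trained token distribution matters: when the policy assigns little probability to the rewarded token, the policy‑gradient signal for it is correspondingly small.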
Data Party THU
Sep 15, 2025 · Artificial Intelligence

Why Merge SFT and RL? Exploring Unified Fine‑Tuning Strategies for LLMs

This article examines the necessity of integrating Supervised Fine‑Tuning (SFT) with Reinforcement Learning (RL) for large language models, surveys alternating, sample‑reuse, simultaneous, and hint‑guided fusion methods, presents the underlying loss functions, and discusses practical trade‑offs such as entropy collapse and importance‑sampling corrections.

LLM · RL · SFT
0 likes · 14 min read
AI2ML AI to Machine Learning
Aug 25, 2025 · Artificial Intelligence

Decoding OpenAI’s Multi‑Level AGI Roadmap

The article analyzes OpenAI’s five‑level AGI roadmap, compares it with DeepMind’s ECEVS framework, and examines the technical progress from L1 to L5, including RL‑enhanced chain‑of‑thought, ReAct agents, deep research, and upcoming innovations, while highlighting the commercial implications of each stage.

AGI · Artificial Intelligence · Chain-of-Thought
0 likes · 7 min read
Data Party THU
Aug 19, 2025 · Artificial Intelligence

Why RL Fine‑Tuning Fails to Extend LLM Reasoning Limits: Entropy Collapse Explained

This article examines how reinforcement learning fine‑tuning influences large language model reasoning, revealing that RL primarily amplifies pre‑trained capabilities, suffers from entropy collapse, and fails to push the model’s reasoning boundary, supported by extensive experiments on scaling laws, entropy analysis, and mitigation techniques.

LLM · RL · RLVR
0 likes · 24 min read
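Entropy collapse, as named in the summary above, refers to the policy's next‑token distribution becoming sharply peaked. A minimal sketch of how that can be measured, assuming Shannon entropy over a toy four‑token distribution; the logits and the sharpening factor are made up for illustration and are not taken from the article's experiments.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-token logits before fine-tuning.
base_logits = [1.0, 0.5, 0.0, -0.5]

# Stand-in for a policy that has sharpened around its preferred token:
# the same logits scaled up by an arbitrary factor of 4.
sharp_logits = [4.0 * z for z in base_logits]

base_entropy = entropy(softmax(base_logits))
sharp_entropy = entropy(softmax(sharp_logits))

print(f"entropy before sharpening: {base_entropy:.4f} nats")
print(f"entropy after sharpening:  {sharp_entropy:.4f} nats")
```

Tracking this quantity averaged over generated tokens during training is one simple way to observe whether a policy is losing output diversity.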
Data Party THU
Aug 7, 2025 · Artificial Intelligence

Why GRPO Fails on Large LLMs and How GSPO Restores Training Stability

The paper argues that GRPO’s token‑level importance weighting introduces high‑variance noise that destabilizes RL training of large language models. It proposes GSPO, a sequence‑level importance‑sampling method whose weighting matches the sequence‑level granularity at which rewards are defined, improving gradient stability, training efficiency, and benchmark performance.

GRPO · GSPO · RL
0 likes · 8 min read
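The contrast between token‑level and sequence‑level importance weighting in the summary above can be sketched as follows. This is an illustration rather than the paper's training code: it assumes per‑token log‑probabilities of one sampled response under the current and old policies are available, uses made‑up values, and takes the sequence‑level ratio in its length‑normalized form, i.e. the geometric mean of the token ratios.

```python
import math

def token_ratios(logp_new, logp_old):
    """Token-level importance ratios: one ratio per generated token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence-level ratio:
    exp of the mean per-token log-ratio over the whole response."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

# Hypothetical per-token log-probabilities for a four-token response.
logp_new = [-1.0, -0.2, -2.5, -0.7]
logp_old = [-1.2, -0.1, -1.5, -0.9]

tok = token_ratios(logp_new, logp_old)
seq = sequence_ratio(logp_new, logp_old)

print("token-level ratios:  ", [round(r, 3) for r in tok])
print("sequence-level ratio:", round(seq, 3))
```

In this toy case a single token with a large log‑probability shift produces an outlying token‑level ratio, while the sequence‑level ratio applies one averaged weight to the whole response.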
Bighead's Algorithm Notes
Feb 15, 2025 · Artificial Intelligence

FinRL‑DeepSeek: How Integrating DeepSeek with RL Improves Portfolio Returns (Code Open‑Source)

This article reviews a new risk‑sensitive trading agent that combines reinforcement learning with large language models to extract stock recommendations and news‑based risk scores, describes the extended CVaR‑PPO algorithm, presents extensive experiments on the FNSPID dataset, and discusses the resulting performance gains and future work.

Algorithmic Trading · CVaR · DeepSeek
0 likes · 10 min read
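The CVaR named in the summary above is a tail‑risk measure: the average outcome over the worst fraction of cases. A minimal sketch of an empirical estimator, assuming a plain list of per‑period returns; the `cvar` helper and the return values are hypothetical illustrations, and the code does not reproduce the article's CVaR‑PPO algorithm.

```python
import math

def cvar(returns, alpha):
    """Empirical CVaR at level alpha: mean of the worst ceil(alpha * n) returns."""
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)  # ascending, so the worst returns come first
    k = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / len(tail)

# Made-up daily returns, for illustration only.
daily_returns = [0.012, -0.034, 0.005, -0.051, 0.020,
                 -0.008, 0.015, -0.022, 0.003, 0.009]

tail_risk = cvar(daily_returns, 0.2)   # average of the worst 20% of days
mean_return = cvar(daily_returns, 1.0) # alpha = 1 reduces to the plain mean

print(f"CVaR(20%): {tail_risk:.4f}")
print(f"mean:      {mean_return:.4f}")
```

A risk‑sensitive agent would typically penalize or constrain this tail average rather than optimize the mean return alone.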
Baobao Algorithm Notes
Sep 25, 2024 · Industry Insights

Decoding OpenAI o1: How RL and LLM Fuse to Power Hidden Chain‑of‑Thought

This article analytically reconstructs OpenAI o1’s architecture, training pipeline, and inference workflow, exploring its reinforcement‑learning‑enhanced hidden chain‑of‑thought, multi‑model composition, scaling laws, reward modeling, and potential implications for future AI safety and small‑model strategies.

AI Safety · Hidden COT · LLM
0 likes · 43 min read
Ctrip Technology
Jun 19, 2019 · Artificial Intelligence

Applying Reinforcement Learning to Hotel Ranking at Ctrip: Challenges, Solutions, and Preliminary Results

This article examines the limitations of traditional learning‑to‑rank for Ctrip hotel sorting, introduces reinforcement learning as a remedy, outlines three progressive implementation plans (A, B, C) with algorithm choices and engineering trade‑offs, and presents early experimental findings that demonstrate RL's potential to improve conversion rates.

Ctrip · RL · Reinforcement Learning
0 likes · 15 min read