Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.

291 articles · 0 likes · 2 views · 0 comments
Recent Articles

Latest from Baobao Algorithm Notes
Nov 11, 2025 · Artificial Intelligence

Why Redesign the Training Stack? Inside Olmo‑Thinking’s Open‑Source RL Journey

This article provides a detailed technical analysis of the Olmo‑Thinking project, covering why a new open‑source LLM was built, the challenges of reinforcement learning at scale, data‑mix optimization, architectural bottlenecks such as missing GQA and QK‑Norm, and the post‑training techniques used to improve reasoning and long‑context capabilities.

RLVR · data selection · open-source models
0 likes · 20 min read
Nov 7, 2025 · Artificial Intelligence

Kimi K2-Thinking: 1T‑Parameter Agent Model Beats GPT‑5 on Humanity’s Last Exam

Kimi's open‑source K2‑Thinking model, a 1‑trillion‑parameter agent with native INT4 quantization and 256k context, achieves SOTA performance on benchmarks like Humanity’s Last Exam, BrowseComp, and SEAL‑0, outperforms GPT‑5 and Grok‑4, and demonstrates complex tool‑driven reasoning through real‑world examples.

AI · Agent Model · K2-Thinking
0 likes · 6 min read
Oct 31, 2025 · Artificial Intelligence

How Risk‑Sensitive Reinforcement Learning Improves LLM Pass@K Performance

This article analyzes why standard reinforcement learning can degrade Pass@K metrics after fine‑tuning large language models, introduces a risk‑sensitive RL objective that reshapes the advantage estimator, and demonstrates through bandit and mathematical‑reasoning experiments that the RS‑GRPO method consistently boosts diversity and overall Pass@K scores across multiple LLMs.

Exploration-Exploitation · LLM fine-tuning · Policy Gradient
0 likes · 12 min read
Oct 31, 2025 · Artificial Intelligence

Unlocking LLM RL Scaling: The Best Practices from Meta’s New Study

Meta’s recent paper reveals a sigmoid‑shaped scaling law for LLM reinforcement learning, presents extensive 40‑k GPU‑hour experiments, compares various RL designs such as PPO‑off‑policy‑k and Pipeline‑RL‑k, and distills the findings into a practical “ScaleRL” recipe that improves performance and efficiency.

LLM · RL Optimization · Scaling Law
0 likes · 10 min read
Oct 30, 2025 · Artificial Intelligence

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.

AI · LLM · SFT
0 likes · 11 min read
Oct 28, 2025 · Artificial Intelligence

Why Entropy Collapse Limits LLM Reinforcement Learning and How to Fix It

The article explains how information entropy, cross‑entropy, and KL‑divergence shape reinforcement learning for large language models, describes the phenomenon of entropy collapse, compares token‑level and policy‑level entropy, and reviews recent methods like Clip‑Cov and KL‑Cov that mitigate this issue.

cross-entropy · entropy · policy entropy
0 likes · 11 min read
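The three quantities this entry mentions are tied together by one identity: KL(p‖q) = H(p, q) − H(p). A minimal sketch of that relationship and of why "entropy collapse" means a near one‑hot policy (the function names are illustrative, not from the article):

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats."""
    return -sum(x * math.log(y) for x, y in zip(p, q) if x > 0)

def kl(p, q):
    """KL divergence via the identity KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

# A uniform policy over n tokens has the maximum entropy log(n);
# a collapsed (near one-hot) policy has entropy near 0.
uniform = [0.25] * 4
peaked = [0.97, 0.01, 0.01, 0.01]
```

A collapsed policy also sits far (in KL) from the uniform one, which is why methods like KL‑Cov regularize against it.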
Sep 28, 2025 · Artificial Intelligence

How Much GPU Memory Do LLMs Really Need? A Deep Dive into Training & Inference

This article breaks down the GPU memory requirements of large language models during training and inference, detailing the contributions of model weights, optimizer states, activations, KV cache, and activation recomputation, and provides concrete formulas, examples, and scaling insights for models like Qwen3 and DeepSeek V3.

GPU memory · KV cache · LLM
0 likes · 18 min read
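A rough back‑of‑envelope sketch of the two memory terms this entry names, assuming the common mixed‑precision Adam accounting (bf16 weights and gradients plus fp32 master weights and two optimizer moments, about 16 bytes per parameter) and the standard KV‑cache formula; the numbers below are illustrative, not taken from the article:

```python
def train_mem_gb(params_b: float) -> float:
    """Training memory estimate (GB) for mixed-precision Adam:
    bf16 weights (2) + bf16 grads (2) + fp32 master weights (4)
    + fp32 Adam m and v (4 + 4) = 16 bytes per parameter."""
    return params_b * 16.0  # 16 GB per billion parameters

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_el=2):
    """KV cache size (GB): 2 (K and V) x layers x kv_heads x head_dim
    x sequence length x batch, at bytes_per_el per element (bf16 = 2)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el / 1e9

# e.g. a 7B model needs roughly 112 GB for weights, gradients, and
# optimizer states alone, before activations; a 32-layer model with
# 8 KV heads of dim 128 at 4k context, batch 1, adds ~0.5 GB of KV cache.
```

Activations and recomputation, which the article also covers, are workload‑dependent and not captured by these two terms.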
Sep 23, 2025 · Artificial Intelligence

How LongCat-Flash-Thinking Sets New SOTA in Open‑Source AI Inference

LongCat-Flash-Thinking, the latest open‑source model from Meituan, introduces domain‑parallel RL training, a high‑throughput DORA infrastructure, and a dual‑path inference framework that together achieve state‑of‑the‑art performance on logical, mathematical, coding, and agentic tasks while maintaining top‑tier speed.

Inference · LongCat · RL training
0 likes · 10 min read