Author

Baobao Algorithm Notes

Author of the book "BaiMian Large Model", sharing technology and industry insights.

291 articles

Latest from Baobao Algorithm Notes

Baobao Algorithm Notes

Sep 22, 2025 · Artificial Intelligence

How to Add Special Tokens to LLMs Without Losing Performance

This guide explains why naïvely adding special tokens during supervised fine‑tuning can destabilize a large language model, and provides step‑by‑step strategies—including tokenizer updates, embedding resizing, smart initialization, and LoRA‑based PEFT—to integrate new tokens while preserving the model's original capabilities.

LLM · LoRA · special tokens

0 likes · 9 min read

Baobao Algorithm Notes

Sep 10, 2025 · Artificial Intelligence

Qwen3-Next Unveiled: Sparse MoE, Hybrid Attention & Multi‑Token Prediction

A recent Hugging Face pull request reveals Alibaba’s upcoming Qwen3‑Next series, highlighting its extreme‑context, parameter‑efficient design that combines a 1:50 high‑sparsity MoE, a hybrid attention architecture mixing gated attention with Gated DeltaNet, and a Multi‑Token Prediction technique, promising ten‑fold throughput gains for 32K‑plus token contexts.

AI Architecture · Hybrid attention · Multi‑Token Prediction

0 likes · 8 min read

Baobao Algorithm Notes

Sep 9, 2025 · Artificial Intelligence

Why Do Language Models Hallucinate? Roots, Risks, and a New Evaluation Approach

The article analyzes OpenAI's study on language‑model hallucinations, explaining how statistical limits in pre‑training and flawed binary evaluation incentives cause false answers, and proposes a confidence‑threshold scoring system that rewards honest "I don’t know" responses to improve reliability.

AI safety · Language Models · Model Alignment

0 likes · 8 min read

Baobao Algorithm Notes

Sep 3, 2025 · Artificial Intelligence

How Atom-Searcher Boosts LLM Reasoning with Atomic Thought Rewards

Atom-Searcher introduces an atomic‑thought reinforcement‑learning framework that decomposes complex reasoning into fine‑grained units, uses a Reasoning Reward Model to assign step‑wise rewards, dynamically balances process and result incentives, and achieves state‑of‑the‑art performance on multiple LLM benchmarks.

Agentic Research · Atomic Thought · LLM

0 likes · 12 min read

Baobao Algorithm Notes

Sep 2, 2025 · Artificial Intelligence

How LongCat‑Flash Achieves Record Speed and Efficiency for a 560B MoE Model

LongCat‑Flash is a 560‑billion‑parameter Mixture‑of‑Experts LLM that combines a dynamic zero‑computation expert design, shortcut‑connected MoE communication, variance‑aligned scaling, and a three‑stage agent‑centric pre‑training pipeline, delivering over 100 tokens per second (TPS) on H800 GPUs at a cost of $0.70 per million tokens.

Artificial Intelligence · Large Language Model · LongCat-Flash

0 likes · 23 min read

Baobao Algorithm Notes

Aug 17, 2025 · Artificial Intelligence

Boost 7B LLM Math Reasoning Beyond GPT‑4o with a Simple Pass@k Reward

By replacing the traditional Pass@1 reward with a Pass@k formulation and a lightweight advantage computation, a 7B language model can dramatically improve its performance on math reasoning benchmarks, surpassing GPT‑4o while adding only a few lines of code and minimal training overhead.

Python · RLHF · reward engineering

0 likes · 7 min read

Baobao Algorithm Notes

Aug 15, 2025 · Artificial Intelligence

Unlocking LLM Performance: Classic Deep RL Tricks Reimagined for Modern Training

This article systematically adapts classic deep reinforcement‑learning techniques—such as multi‑step returns, TD(λ)/GAE, V‑trace corrections, uncertainty‑aware weighting, safety constraints, distribution‑robust optimization, and value‑guided decoding—to improve large language model training and inference, providing concrete formulas, implementation tips, and empirical results.

Deep RL · GAE · LLM

0 likes · 17 min read

Baobao Algorithm Notes

Aug 14, 2025 · Artificial Intelligence

Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

The article analyzes the poor generalization of supervised fine‑tuning (SFT) for large language models, shows that the SFT gradient is equivalent to a policy gradient with a high‑variance inverse‑probability weight, proposes a one‑line Dynamic Fine‑Tuning correction, and reports substantial gains on challenging math and offline RL benchmarks.

Dynamic Fine-Tuning · Generalization · LLM alignment

0 likes · 7 min read

Baobao Algorithm Notes

Aug 11, 2025 · Industry Insights

Why AI Infrastructure Must Be Close to Models and Hardware – Insights from Zhu Yibo

In a WAIC 2025 interview, Zhu Yibo, co‑founder of StepFun (Jieyue Xingchen), shares deep insights on AI infrastructure, covering its evolution, the need for tight model‑hardware co‑design, cost‑efficiency metrics, industry challenges, and future directions for large‑scale AI systems.

AI infrastructure · Hardware Optimization · industry insights

0 likes · 36 min read

Baobao Algorithm Notes

Aug 4, 2025 · Artificial Intelligence

Why GPT‑OSS Chooses a 64‑Dimensional Attention Head and 2880 Hidden Size

This article analyzes the surprising design choices of the rumored GPT‑OSS 120B model, explaining the rationale behind a 64‑dimensional attention head, the equal hidden and intermediate sizes, and other quirky parameters such as MLP bias and KV‑sink SWA, backed by theoretical formulas and empirical benchmarks.

Attention Head · GPT-OSS · MLP Ratio

0 likes · 13 min read
