DeepSeek’s MCTS Failure: The ‘Roast Chicken and Baijiu’ Dilemma in LLM Training

The article examines why DeepSeek’s large‑model training cannot yet leverage Monte‑Carlo Tree Search (MCTS). It details DeepSeek’s reliance on SFT, GRPO‑driven CoT activation and rejection sampling, contrasts this with Google’s PRM‑based approaches, and proposes an MCTS‑powered data‑generation pipeline to overcome the “roast chicken and baijiu” training dilemma.

DataFunSummit

May 4, 2026

Algorithm engineers often wonder why their in‑house models lag behind products like Doubao even when they inject massive domain knowledge (e.g., tens of thousands of Q&A pairs) via RAG or SFT. The author explains that conventional SFT merely learns a direct mapping between question and answer, whereas the missing link is a robust Chain‑of‑Thought (CoT) stage that extracts reasoning summaries before producing the final answer.

1. The “wind‑blown” beginning

DeepSeek‑R1 is presented as the first open‑source large model to publicly demonstrate top‑tier long‑CoT capabilities. According to the author, it builds on the DeepSeek V3 Base SFT model, which can generate CoT but lacks sufficient “enthusiasm” for doing so. To address this, DeepSeek applies GRPO (Group Relative Policy Optimization), a reinforcement‑learning method that scores each sampled answer relative to the other answers drawn for the same question, for 1,000 steps, pushing the model to think more before answering. After GRPO training, the model consistently produces detailed reasoning traces while retaining the knowledge learned from prior SFT.

However, the GRPO‑trained CoT model (DeepSeek‑R1‑Zero) sometimes generates logically incoherent chains. For example, DeepSeekMath’s GRPO rewards are based solely on final‑answer correctness. A chain whose reasoning is wrong but which happens to land on the correct answer (a false positive) receives the same reward as a sound one, so it can be mistakenly reinforced, teaching the model to “guess” rather than truly reason.
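The false‑positive problem can be seen in a minimal sketch of group‑relative advantages. This is an illustration rather than DeepSeek’s implementation: the sample group, the `reasoning_ok` flag, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four sampled answers to the same question.
samples = [
    {"reasoning_ok": True,  "answer_ok": True},
    {"reasoning_ok": False, "answer_ok": True},   # false positive
    {"reasoning_ok": False, "answer_ok": False},
    {"reasoning_ok": False, "answer_ok": False},
]

# Outcome-only reward: 1 if the final answer is right, else 0.
# The reward never looks at reasoning_ok.
rewards = [1.0 if s["answer_ok"] else 0.0 for s in samples]
advantages = group_relative_advantages(rewards)

for s, a in zip(samples, advantages):
    print(s, round(a, 3))
```

The sound chain and the false positive end up with the same positive advantage, so a policy update pushes probability mass toward both.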

2. Inherent shortcomings of current reinforcement approaches

Reinforcement Learning with Verifiable Rewards (RLVR) has become essential for high‑level reasoning, yet after thousands of optimization steps the performance gains plateau while computational costs keep rising. The bottleneck stems from RLVR’s sparse exploration: models rely on limited back‑tracking and miss critical reasoning paths.

DeepSeek’s current pipeline mixes RL with SFT: 600k high‑quality CoT examples are generated, then 200k non‑reasoning samples are mixed in for SFT to prevent collapse. This still leaves a “guessing” risk because the process lacks explicit step‑level verification. The author notes that Monte‑Carlo Tree Search (MCTS) could provide self‑verification, but DeepSeek admits in its paper that MCTS “failed to deliver results in general reasoning tasks”.
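A minimal sketch of answer‑level rejection sampling shows where the “guessing” risk comes from. The `toy_generate` function is a hypothetical stand‑in for a model call, and the question format is invented for the example; nothing here is taken from DeepSeek’s code.

```python
import random

def rejection_sample(question, reference, generate, n_candidates=8, seed=0):
    """Sample several candidates and keep those whose final answer matches."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_candidates):
        cot, answer = generate(question, rng)
        # The filter inspects only the final answer, never the chain itself.
        if answer == reference:
            kept.append({"question": question, "cot": cot, "answer": answer})
    return kept

def toy_generate(question, rng):
    """Hypothetical model call: adds two integers, sometimes off by one."""
    a, b = question
    answer = a + b + rng.choice([0, 0, 1, -1])
    return f"{a} + {b} = {answer}", answer

kept = rejection_sample((17, 25), 42, toy_generate)
print(len(kept), "of 8 candidates kept")
```

Because the filter compares final answers only, a chain with a flawed intermediate step would pass as long as it ends on the reference answer.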

Google’s research, however, shows that process‑reward models (PRM) that score each reasoning step can accurately identify process errors, dramatically reducing false‑positive answers. In the “Improve Mathematical Reasoning in Language Models by Automated Process Supervision” paper, applying MCTS‑based data generation to Gemini Pro raised MATH benchmark accuracy from 51 % to 69.4 % (≈36 % relative gain) and GSM8K from 86.4 % to 93.6 %.
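The relative gains follow directly from the reported accuracies. The short check below uses only the four figures quoted above; the helper name is ours.

```python
def relative_gain(before, after):
    """Improvement expressed as a fraction of the starting accuracy."""
    return (after - before) / before

math_gain = relative_gain(51.0, 69.4)
gsm8k_gain = relative_gain(86.4, 93.6)

print(f"MATH:  {math_gain:.1%}")   # 36.1%
print(f"GSM8K: {gsm8k_gain:.1%}")  # 8.3%
```

The GSM8K gain is much smaller in relative terms because the baseline was already high, leaving less headroom.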

3. A temporary workaround

Google’s OmegaPRM algorithm treats MCTS as an offline data‑generation engine, producing process‑supervision data for training a PRM. The pipeline consists of:

Binary search on a CoT chain to locate the first logical error.

Balancing positive and negative samples in the search tree to ensure high‑quality training data.

Efficiently collecting >150 k process‑supervision annotations automatically.
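The binary‑search step can be sketched as follows. This is a simplified illustration, not Google’s code: a prefix is treated as still correct if at least one of `k` sampled completions reaches the right answer, and `toy_rollout` with its hidden correctness flags is a hypothetical stand‑in for sampling from a real model.

```python
import random

def prefix_is_recoverable(prefix, rollout, k, rng):
    """Monte Carlo check: does any of k completions from this prefix succeed?"""
    return any(rollout(prefix, rng) for _ in range(k))

def first_error_index(steps, rollout, k=16, seed=0):
    """Binary-search the index of the first step after which no completion
    succeeds. Assumes the full chain ends in a wrong answer."""
    rng = random.Random(seed)
    lo, hi = 0, len(steps) - 1  # invariant: the first error lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_is_recoverable(steps[:mid + 1], rollout, k, rng):
            lo = mid + 1  # prefix through mid is fine, so the error is later
        else:
            hi = mid      # the error is at mid or earlier
    return lo

# Each step carries a hidden flag that only the simulated rollout can see.
steps = [
    ("x + 3 = 7", True),
    ("x = 7 - 3", True),
    ("x = 4", True),
    ("2x = 6", False),    # first wrong step
    ("answer: 6", False),
]

def toy_rollout(prefix, rng):
    """Hypothetical sampler: a clean prefix succeeds 90% of the time,
    a prefix containing an error never does."""
    if not all(ok for _, ok in prefix):
        return False
    return rng.random() < 0.9

print(first_error_index(steps, toy_rollout))
```

The search needs about log2(n) prefix checks instead of one per step, which is what makes automatic annotation at this scale affordable.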

In this mode, MCTS is not used for online inference (as in AlphaGo) but to explore diverse reasoning paths, prune low‑value branches, and keep high‑value ones—mirroring the spirit of rejection sampling.

4. Path‑fusion example

To explore a concrete MCTS success case, the author recommends downloading the open‑source KataGo code, which reproduces AlphaGo’s MCTS core. By integrating KataGo’s self‑play training with DeepSeek R1 Dev‑2, one can generate multi‑step CoT paths without human annotation. The suggested workflow is:

Run 2 epochs of SFT on the base model.

Use the resulting DeepSeek R1 Dev‑2 as a “brain” to outline solution steps.

Apply MCTS to verify each step; if a step fails, backtrack and regenerate, producing multiple correct CoT paths.

Collect all generated CoT data and train a GRPO model to obtain the final DeepSeek R1.
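The verify‑and‑backtrack step of this workflow can be illustrated on a toy problem. The sketch below uses plain depth‑first search with backtracking as a simplified stand‑in for MCTS (it keeps no visit counts or value estimates), and the proposer, verifier, and arithmetic task are all hypothetical.

```python
def propose_steps(value):
    """Hypothetical proposer: candidate next steps from the current state."""
    return [("+3", value + 3), ("*2", value * 2)]

def step_is_valid(value, target):
    """Hypothetical verifier: both operations only increase the value,
    so overshooting the target can never be repaired."""
    return value <= target

def collect_paths(start, target, max_depth):
    """Return every verified step sequence that turns start into target."""
    paths = []

    def search(value, path):
        if value == target:
            paths.append(list(path))
            return
        if len(path) == max_depth:
            return
        for label, nxt in propose_steps(value):
            if not step_is_valid(nxt, target):
                continue  # reject the failed step and try a sibling
            path.append(label)
            search(nxt, path)
            path.pop()    # backtrack

    search(start, [])
    return paths

paths = collect_paths(start=1, target=10, max_depth=4)
for p in paths:
    print(" ".join(p))
```

On this toy task the search returns three distinct verified paths, mirroring the goal of producing multiple correct CoT paths for a single question without manual labels.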

This approach satisfies two goals: (1) “model ruminates” – the model trains on data it has itself reasoned about, and (2) no manual labeling – the data are generated automatically via MCTS.

In summary, DeepSeek demonstrates that a combination of SFT + GRPO + rejection sampling can work, but the author argues that replacing the handcrafted rejection‑sampling stage with an MCTS‑driven pipeline would yield more reliable, step‑wise verified reasoning and avoid the entropy‑collapse problem observed in long‑CoT training.

Thank you for reading.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
