DeepSeek vs MCTS: Decoding the ‘Chicken & Liquor’ Dilemma in LLM Training

The article analyzes why DeepSeek’s large‑model training struggles with Monte‑Carlo Tree Search, explains its use of Chain‑of‑Thought prompting, GRPO entropy‑boosting and rejection‑sampling fine‑tuning, compares these methods with Google’s OmegaPRM and PRM approaches, and proposes a concrete MCTS‑driven data‑generation pipeline to overcome the “chicken and liquor” trade‑off.

DataFunTalk

Background and the core question – Algorithm engineers often wonder why a model trained on extensive domain data (e.g., tens of thousands of Q&A pairs) still underperforms competitors such as Doubao. The article attributes the gap to a lack of effective CoT (Chain‑of‑Thought) training and asks how to build the best CoT model.

DeepSeek’s first stage: basic SFT and CoT – The article describes DeepSeek‑V3‑Base as an SFT model that can generate CoT reasoning but does not do so readily enough. To make the model reason more, DeepSeek applies GRPO (Group Relative Policy Optimization, which the article characterizes as entropy‑increasing training) for 1,000 steps; after GRPO the model consistently outputs reasoning traces, even when the downstream SFT model is unchanged. The resulting DeepSeek‑R1‑Zero can produce CoT answers, yet its reasoning chains often lack logical consistency.
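As a rough illustration of the "group relative" part of GRPO, the sketch below normalizes the rewards of several answers sampled for the same prompt against the group's own mean and standard deviation. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` guard are assumptions made for this example, not details taken from DeepSeek's training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of answers to the same prompt.

    Answers that beat the group average get a positive advantage,
    answers below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt: two judged correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline comes from the group itself, this step needs no separate value model; when every answer in a group receives the same reward, all advantages are zero and the group contributes no learning signal.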

Limitations of pure reward‑based reinforcement (RLVR) – Reinforcement Learning with Verifiable Rewards (RLVR) suffers from a training bottleneck after thousands of optimization steps: computational cost rises while performance gains plateau. RLVR’s sparse exploration means the model frequently misses critical reasoning paths, leading to “guess‑and‑hit” behavior.
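To make the sparsity concrete, here is a minimal sketch of a verifiable reward that only inspects the final answer. The `Answer:` marker and the function name `verifiable_reward` are hypothetical conventions chosen for this example; real pipelines use task-specific checkers.

```python
import re

def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the text after the first 'Answer:' marker equals the reference."""
    match = re.search(r"Answer:\s*(.+)", completion)
    if match is None:
        return 0.0  # no parsable final answer
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

r_right = verifiable_reward("7 * 8 = 56, then 56 + 2 = 58.\nAnswer: 58", "58")
r_wrong = verifiable_reward("7 * 8 = 54, then 54 + 2 = 56.\nAnswer: 56", "58")
r_missing = verifiable_reward("I think it is around sixty.", "58")
print(r_right, r_wrong, r_missing)
```

The whole reasoning chain collapses into a single number, so the signal says nothing about which intermediate step went wrong.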

Why DeepSeek does not use MCTS – DeepSeek admits that Monte‑Carlo Tree Search (MCTS) “failed to deliver results in general reasoning tasks.” The difficulty lies in building a reliable value model for token‑level search in an LLM. In board games the set of legal moves at each position is bounded, whereas in LLM reasoning the number of possible token sequences grows exponentially with their length, making systematic coverage impractical.
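The scale of that growth can be checked with a back-of-the-envelope calculation: with a vocabulary of V tokens and a sequence of T tokens there are V**T possible sequences. The vocabulary size and length below are hypothetical round numbers chosen only for illustration.

```python
import math

def log10_num_sequences(vocab_size: int, length: int) -> float:
    """Return log10(vocab_size ** length), the order of magnitude of the sequence count."""
    return length * math.log10(vocab_size)

# Hypothetical values: a 100,000-token vocabulary and a 200-token reasoning chain.
digits = log10_num_sequences(vocab_size=100_000, length=200)
print(f"about 10^{digits:.0f} candidate sequences")
```

Working in log space avoids building an integer with hundreds of digits while still showing why exhaustive token-level search is out of reach.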

Google’s alternative: OmegaPRM and PRM – Google’s OmegaPRM treats MCTS as an offline data‑generation engine: it produces process‑supervision data used to train a process reward model (PRM), which scores each reasoning step. This approach yields over 1.5 million process‑supervision annotations and improves Gemini Pro’s accuracy on MATH500 from 51% to 69.4% (≈36% relative gain) and on GSM8K from 86.4% to 93.6%.
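The relative gains follow directly from the accuracy figures quoted above. A small check, where `relative_gain` is a helper name introduced for this example:

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction of `before`."""
    return (after - before) / before

math_gain = relative_gain(51.0, 69.4)    # MATH accuracy figures quoted above
gsm8k_gain = relative_gain(86.4, 93.6)   # GSM8K accuracy figures quoted above
print(f"MATH: {math_gain:.1%}, GSM8K: {gsm8k_gain:.1%}")
```

This gives roughly 36.1% for MATH and 8.3% for GSM8K; the smaller relative gain on GSM8K reflects how little headroom remained at 86.4%.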

Rejection‑sampling fine‑tuning (RFT) in DeepSeek – DeepSeek uses a “filter‑sampling” pipeline: generate N answers per question, verify correctness (usually by final answer), discard wrong ones, and keep M correct, logically‑reasonable answers as training data. This method, however, can reinforce false‑positive chains when a reasoning step is wrong but the final answer happens to be correct.
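The filtering loop described above can be sketched as follows. The `generate` and `is_correct` callables are hypothetical stand-ins for the model and the verifier, and the canned answers exist only to make the example runnable.

```python
import itertools
from typing import Callable, List

def rejection_sample(
    question: str,
    generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    n_samples: int = 8,
    keep: int = 2,
) -> List[str]:
    """Draw up to n_samples answers and keep at most `keep` that pass the check."""
    accepted: List[str] = []
    for _ in range(n_samples):
        answer = generate(question)
        if is_correct(question, answer):
            accepted.append(answer)
        if len(accepted) == keep:
            break
    return accepted

# Stub "model" that cycles through canned answers (illustration only).
canned = itertools.cycle([
    "2 + 2 = 5, so the answer is 4",  # wrong step, right final answer
    "2 + 2 = 4, so the answer is 4",
    "2 + 2 = 3, so the answer is 3",
])
kept = rejection_sample(
    "What is 2 + 2?",
    generate=lambda q: next(canned),
    is_correct=lambda q, a: a.endswith("answer is 4"),  # checks the final answer only
)
print(kept)
```

Because the check looks only at the final answer, the first canned chain is accepted despite its faulty step, which is the false-positive risk noted above.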

Proposed MCTS‑driven data pipeline – The article suggests replacing traditional rejection sampling with an MCTS‑based generator:

For each query, MCTS explores multiple reasoning paths and records each branch’s intermediate steps and final result.

A binary search over the reasoning chain quickly locates the first logical error.

Low‑value branches are pruned and high‑value ones retained, mirroring the rejection‑sampling philosophy.

Use the retained paths as ground‑truth CoT data, enabling the model to learn multi‑step, self‑verified reasoning.
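The binary-search step in the list above can be sketched as below. It assumes a hypothetical `prefix_ok` check that says whether a prefix of the chain can still lead to a correct answer (in practice this might be estimated with rollouts from that prefix), and it assumes the check is monotone: once a prefix is bad, every longer prefix is bad too.

```python
from typing import Callable, List, Optional

def first_error_index(
    steps: List[str],
    prefix_ok: Callable[[List[str]], bool],
) -> Optional[int]:
    """Return the index of the first faulty step, or None if the whole chain passes."""
    if not steps or prefix_ok(steps):
        return None
    lo, hi = 0, len(steps)  # invariant: steps[:lo] passes, steps[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(steps[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1  # steps[:hi - 1] passes, so step hi - 1 introduced the error

# Toy chain in which the third step is marked as faulty (illustration only).
chain = ["step a", "step b", "BAD step", "step d", "step e"]
bad_at = first_error_index(chain, lambda prefix: "BAD step" not in prefix)
print(bad_at)
```

For a chain of n steps this needs about log2(n) prefix checks instead of n, which matters when each check costs several model rollouts.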

Concrete workflow – The recommended workflow combines DeepSeek‑R1 Dev‑2 (a language‑consistent SFT model) with MCTS:

Run 2 epochs of SFT on the base model.

Apply GRPO for 1,000 steps to elicit active reasoning.

Use the GRPO‑enhanced model as the “brain” to propose solution steps.

Run MCTS on those steps to verify each sub‑answer; backtrack on failures and generate alternative paths.

Collect all verified CoT paths as training data and run a last round of GRPO fine‑tuning to obtain the final DeepSeek‑R1 model.
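The MCTS step in this workflow has to decide which partial reasoning path to expand next. The sketch below uses the standard UCT rule for that choice; this is an assumption for illustration, since the article does not specify a selection rule, and the `Node` class, the constant `c`, and the toy statistics are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    step: str                      # one reasoning step proposed by the model
    visits: int = 0
    total_value: float = 0.0       # sum of verification outcomes seen below this node
    children: List["Node"] = field(default_factory=list)

def uct_score(parent: Node, child: Node, c: float = 1.4) -> float:
    """Average value plus an exploration bonus that shrinks as the child is visited."""
    if child.visits == 0:
        return math.inf            # always try unvisited steps first
    exploit = child.total_value / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def select_child(parent: Node) -> Node:
    return max(parent.children, key=lambda ch: uct_score(parent, ch))

def backpropagate(path: List[Node], value: float) -> None:
    """Credit a verification result (e.g. 1.0 = verified) to every node on the path."""
    for node in path:
        node.visits += 1
        node.total_value += value

root = Node("question", visits=10)
root.children = [Node("A", visits=8, total_value=6.0), Node("B", visits=2, total_value=1.0)]
chosen = select_child(root)
print(chosen.step)
```

Here the less-visited branch B is chosen even though A has the higher average value, which is how the search keeps generating alternative paths instead of repeating the first one that worked.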

Take‑away – DeepSeek demonstrates that rejection sampling + SFT + GRPO is feasible, but incorporating MCTS for data generation can provide richer, self‑validated reasoning traces, reduce reliance on manual annotation, and potentially break the “chicken‑without‑liquor / liquor‑without‑chicken” stalemate in large‑model training.

