How Marco‑o1 Merges Chain‑of‑Thought Fine‑Tuning with Monte‑Carlo Tree Search for Superior Reasoning

This article introduces Marco‑o1, an open‑source LLM that strengthens complex reasoning by fine‑tuning on Chain‑of‑Thought data, integrating Monte‑Carlo Tree Search, and adding mini‑step actions and a reflection mechanism. It also summarizes the model's evaluation on a multilingual math benchmark and on translation tasks.

Baobao Algorithm Notes

Introduction

OpenAI’s groundbreaking o1 model demonstrated exceptional reasoning on platforms such as AIME and CodeForces, inspiring Alibaba’s research team to push the limits of large language models (LLMs). They released the open‑source Marco‑o1 model, aiming to extend reasoning capabilities to domains lacking clear standards and quantifiable rewards.

Key Features of the Paper

CoT‑based fine‑tuning : Full‑parameter fine‑tuning on a combination of open‑source Chain‑of‑Thought (CoT) datasets and synthetically generated data, resulting in the Marco‑o1‑CoT model.

MCTS‑augmented solution space : Integration of Monte‑Carlo Tree Search (MCTS) with LLM outputs to guide search using confidence scores, expanding the reachable solution space.

Reasoning action strategy : Novel reasoning action strategies that vary the granularity of each search step inside MCTS (Marco‑o1‑MCTS mini‑step), combined with a reflection mechanism that prompts the model to re‑examine its own reasoning.

Application to machine translation : First use of a large reasoning model (LRM) for translation tasks, exploring how inference‑time scaling behaves in multilingual and translation settings.

CoT Data Fine‑Tuning

The authors fine‑tune the base model (Qwen2‑7B‑Instruct) on three datasets to obtain Marco‑o1‑CoT:

Open‑O1 CoT Dataset (Filtered) : A refined version of the Open‑O1 CoT data, filtered through heuristic quality checks.

Marco‑o1 CoT Dataset (Synthetic) : Synthetic CoT data generated via MCTS, providing richer reasoning paths.

Marco Instruction Dataset : Instruction‑following data to improve task generalization.

Monte‑Carlo Tree Search (MCTS) Overview

MCTS is a heuristic search algorithm that builds a decision tree through random simulations, suitable for large search spaces such as games or combinatorial problems. It consists of four repeated steps:

Selection : Traverse the tree from the root using a policy such as Upper Confidence Bounds applied to Trees (UCT) until an unexpanded node is reached.

Expansion : Add one or more child nodes by generating candidate actions from the LLM.

Simulation (Rollout) : Run a random or policy‑guided simulation from the new node to a terminal state, recording token‑level confidence scores.

Backpropagation : Propagate the simulation outcome (reward) up the tree, updating visit counts and average rewards.
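The four steps above can be sketched as a small, generic MCTS loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `uct_score`, `mcts`, `propose`, `rollout`, and `is_terminal` are hypothetical, and the toy task (building a 4‑bit tuple whose reward is the fraction of ones) merely stands in for LLM‑generated reasoning steps and confidence‑based rewards.

```python
import math
import random


class Node:
    """One search-tree node holding a state and its running statistics."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def mean_reward(self):
        return self.total_reward / self.visits if self.visits else 0.0


def uct_score(child, parent_visits, c=1.4):
    """UCT: average reward plus an exploration bonus for rarely visited nodes."""
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return child.mean_reward() + bonus


def mcts(root_state, propose, rollout, is_terminal, iterations, rng, c=1.4):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend by UCT until reaching a node without children.
        while node.children:
            parent = node
            node = max(parent.children,
                       key=lambda ch: uct_score(ch, parent.visits, c))
        # 2. Expansion: attach candidate next states and pick one to simulate.
        if not is_terminal(node.state):
            node.children = [Node(s, parent=node) for s in propose(node.state)]
            node = rng.choice(node.children)
        # 3. Simulation: play out to a terminal state and score it.
        reward = rollout(node.state, rng)
        # 4. Backpropagation: update visit counts and rewards up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return root


# Toy stand-ins (assumptions for illustration only).
DEPTH = 4


def propose(state):
    return [state + (0,), state + (1,)]


def is_terminal(state):
    return len(state) == DEPTH


def rollout(state, rng):
    # Complete the state at random, then reward the fraction of ones.
    while len(state) < DEPTH:
        state = state + (rng.choice((0, 1)),)
    return sum(state) / DEPTH


if __name__ == "__main__":
    root = mcts((), propose, rollout, is_terminal,
                iterations=400, rng=random.Random(0))
    best = max(root.children, key=lambda n: n.visits)
    print("most visited first action:", best.state)
```

In an LLM setting, `propose` would plausibly sample candidate continuations from the model and `rollout` would return a reward derived from the generated text; both are left as plain callables here so the search loop itself stays easy to read.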

Combining MCTS with CoT to Extend Reasoning Paths

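In Marco‑o1, each MCTS node represents a reasoning state, and actions correspond to LLM‑generated tokens or token groups (steps or mini‑steps). During rollouts, each token's confidence score is computed by applying a softmax to its log‑probability against the log‑probabilities of the top‑5 candidate tokens:

$$c_i = \frac{\exp\big(p(t_i)\big)}{\sum_{k=1}^{5} \exp\big(p(t_k)\big)}$$

where $p(t_i)$ is the log‑probability of the $i$‑th generated token and $p(t_k)$, $k = 1, \dots, 5$, are the log‑probabilities of the top‑5 candidate tokens at that position. The average confidence across a rollout of $n$ tokens, $v = \frac{1}{n} \sum_{i=1}^{n} c_i$, yields a reward signal that guides the selection of promising branches.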
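As a rough sketch of the reward described above, the snippet below treats each token's confidence as a softmax of its log‑probability against the top‑5 candidates, then averages over the rollout. The function names `token_confidence` and `rollout_reward` are hypothetical, and the code assumes the chosen token's log‑probability and the top‑5 log‑probabilities are already available from the model's output.

```python
import math


def token_confidence(chosen_logprob, top_logprobs):
    """Softmax-normalise the chosen token's log-prob against the top candidates.

    Assumption: `top_logprobs` holds the log-probabilities of the top-5
    candidate tokens at this position (normally including the chosen one).
    """
    denom = sum(math.exp(lp) for lp in top_logprobs)
    return math.exp(chosen_logprob) / denom


def rollout_reward(steps):
    """Average per-token confidence over a rollout.

    `steps` is a list of (chosen_logprob, top_logprobs) pairs, one per token.
    An empty rollout gets a reward of 0.0 by convention in this sketch.
    """
    if not steps:
        return 0.0
    return sum(token_confidence(c, t) for c, t in steps) / len(steps)


if __name__ == "__main__":
    # Two tokens, each with illustrative (made-up) log-probabilities.
    top_a = [math.log(p) for p in (0.5, 0.2, 0.1, 0.1, 0.1)]
    top_b = [math.log(p) for p in (0.4, 0.3, 0.1, 0.1, 0.1)]
    steps = [(math.log(0.5), top_a), (math.log(0.4), top_b)]
    print(rollout_reward(steps))
```

A confident token (one that dominates its top‑5 alternatives) scores close to 1, while a token chosen among near‑equal alternatives scores close to 0.2, so the averaged value favours rollouts in which the model was consistently sure of its next token.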

Reasoning Action Strategies

The authors experimented with two granularity levels for actions:

Step‑level actions : Each action is a full reasoning step, offering efficient exploration but potentially missing fine‑grained pathways.

Mini‑step actions : Groups of 32 or 64 tokens form a mini‑step, providing a finer search granularity that improves performance on complex tasks, albeit at higher computational cost.
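To make the mini‑step idea concrete, the following sketch splits a token sequence into fixed‑size groups of 32 or 64 tokens, each of which would act as one search action. The helper name `split_into_mini_steps` is hypothetical; how the actual system delimits mini‑steps during generation is not specified in this summary.

```python
def split_into_mini_steps(token_ids, mini_step_size=32):
    """Group a token sequence into consecutive mini-steps of a fixed size.

    The final mini-step may be shorter if the sequence length is not a
    multiple of `mini_step_size`.
    """
    if mini_step_size <= 0:
        raise ValueError("mini_step_size must be positive")
    return [
        token_ids[i:i + mini_step_size]
        for i in range(0, len(token_ids), mini_step_size)
    ]


if __name__ == "__main__":
    tokens = list(range(70))  # placeholder token ids
    print([len(s) for s in split_into_mini_steps(tokens, 32)])
    print([len(s) for s in split_into_mini_steps(tokens, 64)])
```

Smaller mini‑steps give the search more branching points per answer, which is where the extra computational cost comes from.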

Reflection Mechanism

A self‑reflection prompt (“…maybe I made a mistake! I need to rethink from the beginning.”) is appended at the end of each reasoning pass, encouraging the model to critique and revise its own output. According to the authors, with this prompt roughly half of the difficult problems that the model initially answered incorrectly are then solved correctly.
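One way to picture the reflection step is a two‑pass wrapper around a text generator, sketched below. Everything here is an assumption for illustration: `generate` is a hypothetical callable standing in for the model, `reason_with_reflection` is an invented name, and the prompt string reuses only the portion of the wording quoted above.

```python
REFLECTION_PROMPT = "maybe I made a mistake! I need to rethink from the beginning."


def reason_with_reflection(question, generate, reflection_prompt=REFLECTION_PROMPT):
    """Run one reasoning pass, append the reflection prompt, and run a second pass.

    `generate` is any callable mapping a prompt string to generated text.
    Returns both the first-pass reasoning and the revised output.
    """
    first_pass = generate(question)
    # Feed the original question, the first attempt, and the reflection cue back in.
    second_input = f"{question}\n{first_pass}\n{reflection_prompt}\n"
    revised = generate(second_input)
    return first_pass, revised


if __name__ == "__main__":
    # Stub generator so the sketch runs without a model.
    def fake_generate(prompt):
        return "revised answer" if REFLECTION_PROMPT in prompt else "draft answer"

    print(reason_with_reflection("What is 17 * 3?", fake_generate))
```

Keeping the first pass in the second prompt lets the model see what it is being asked to reconsider, rather than simply answering the question again from a blank slate.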

Experimental Results

MGSM Multilingual Math Benchmark

Starting from Qwen2‑7B‑Instruct as the base model, the authors fine‑tuned it to obtain Marco‑o1‑CoT and then added MCTS with three action granularities (step, 64‑token mini‑step, 32‑token mini‑step). On the English MGSM set, Marco‑o1‑CoT shows a clear advantage over the base model. On the Chinese MGSM set its performance drops relative to the base model, which is attributed to the CoT fine‑tuning data being English‑only.

Machine Translation

Marco‑o1 demonstrates improved contextual understanding and nuanced translation quality, producing more natural and accurate translations compared with baseline models.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: artificial intelligence, LLM, model fine-tuning, reasoning, chain-of-thought, machine translation, Monte Carlo Tree Search
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
