Can Multi‑Round Thinking Boost LLM Accuracy Without Extra Training?

A new study from the a‑m‑team introduces “Think Twice”, a test‑time multi‑round reasoning technique that, without additional training or model changes, repeatedly prompts large language models to self‑correct, yielding notable accuracy gains across benchmarks such as AIME, MATH‑500, GPQA‑Diamond and LiveCodeBench, while also producing shorter, more confident answers.

AI Frontier Lectures

Background

Think Twice is a test‑time inference strategy that improves reasoning of large language models (LLMs) without any additional training or architectural changes.

Method: Multi‑round test‑time thinking

The model first generates an answer to a question. The original question is then restated together with that final answer, with the intermediate reasoning discarded, and the model is asked to answer again; the same step can be repeated for further rounds. Because each round sees only the previous round's answer rather than its reasoning trace, the model can “re‑answer” independently and correct earlier mistakes. This result‑driven self‑correction mitigates “cognitive inertia”, where the model sticks to an initial reasoning path.
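The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `generate` callable, the `think_twice` function name, and the exact wording of `REANSWER_TEMPLATE` are assumptions made for this example. The sketch also assumes that `generate` returns only the final answer (any reasoning trace already stripped) and that the question is restated in every round.

```python
from typing import Callable, List

# Hypothetical re-answer template: restates the question and feeds back only
# the previous final answer. The wording is an assumption for illustration.
REANSWER_TEMPLATE = (
    "{question}\n\n"
    "The assistant's previous answer is: <answer>{previous}</answer>. "
    "Please re-answer."
)


def think_twice(
    question: str,
    generate: Callable[[str], str],
    rounds: int = 2,
) -> List[str]:
    """Run `rounds` rounds of test-time thinking and return every round's answer.

    `generate` is any function mapping a prompt to a final answer string
    (assumed here to have the reasoning trace already removed).
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")

    # Round 1: answer the question directly.
    answers = [generate(question)]

    # Rounds 2..N: re-prompt with the question plus the previous answer only.
    for _ in range(rounds - 1):
        prompt = REANSWER_TEMPLATE.format(question=question, previous=answers[-1])
        answers.append(generate(prompt))
    return answers


def toy_model(prompt: str) -> str:
    """Stand-in for a real LLM call: wrong at first, right once re-prompted."""
    return "42" if "<answer>" in prompt else "41"


if __name__ == "__main__":
    print(think_twice("What is 6 * 7?", toy_model, rounds=3))
```

In practice `toy_model` would be replaced by a call to whatever inference client is already deployed; nothing else in the loop depends on the model.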

Evaluation datasets

AIME 2024 (American Invitational Mathematics Examination)

MATH‑500 (a 500‑problem subset of the MATH benchmark)

GPQA‑Diamond (graduate‑level science questions in biology, physics, and chemistry)

LiveCodeBench (programming tasks)

Results

Across four benchmarks, several state‑of‑the‑art models show consistent accuracy gains when using 2‑4 thinking rounds.

DeepSeek‑R1 on AIME: 79.7 % → 82.0 %

QwQ‑32B on AIME: 80.3 % → 83.1 %

Additional rounds further increase accuracy, indicating improved stability and reflective capability.
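To put the two AIME figures quoted above in perspective, the short sketch below converts each before/after pair into a percentage‑point gain and a relative reduction in error rate. The helper name `gain_summary` is made up for this example, and the inputs are simply the numbers reported in this article.

```python
from typing import Tuple


def gain_summary(before: float, after: float) -> Tuple[float, float]:
    """Return (percentage-point gain, relative error reduction in %).

    Both inputs are accuracies in percent, e.g. 79.7 for 79.7 %.
    """
    point_gain = after - before
    # Share of the remaining errors that the extra rounds removed.
    error_reduction = point_gain / (100.0 - before) * 100.0
    return round(point_gain, 1), round(error_reduction, 1)


if __name__ == "__main__":
    # Accuracies as quoted in the article (AIME).
    for name, before, after in [
        ("DeepSeek-R1", 79.7, 82.0),
        ("QwQ-32B", 80.3, 83.1),
    ]:
        points, reduction = gain_summary(before, after)
        print(f"{name}: +{points} points, {reduction}% fewer errors")
```

Seen this way, a gain of two to three points corresponds to removing roughly a tenth or more of the remaining errors on this benchmark.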

Language style analysis

Frequency analysis of discourse markers shows a reduction of uncertainty words (“but”, “maybe”, “wait”) and an increase of transitional terms (“therefore”) in later rounds, especially when the model corrects an error. Answers become shorter, more confident, and more logically structured.
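A frequency analysis of this kind can be approximated with a simple word count. The sketch below is an assumption about how such a count might be done, not the authors' script: the `marker_counts` helper and the two sample responses are invented for illustration, and only the four marker words come from the text above.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Marker words named in the article.
MARKERS = ("but", "maybe", "wait", "therefore")


def marker_counts(text: str, markers: Iterable[str] = MARKERS) -> Dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each marker."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {marker: counts[marker] for marker in markers}


if __name__ == "__main__":
    # Invented sample responses, for illustration only.
    round_1 = "Wait, maybe x is 3. But wait, that fails. Maybe x is 4."
    round_2 = "The check gives x = 4, therefore the answer is 4."
    print("round 1:", marker_counts(round_1))
    print("round 2:", marker_counts(round_2))
```

Averaging these counts over all responses in each round would give a per‑round profile comparable to the one the article describes.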

[Figure: Word frequency changes across reasoning rounds]

Practical advantages

The technique operates entirely at inference time: it requires no extra training resources and can be applied as a plug‑and‑play wrapper around deployed models. The authors also explored using multi‑round outputs as supervision for further fine‑tuning; early experiments show modest improvements, suggesting a path toward combining reflection at training time and at inference time.

Conclusion

Think Twice demonstrates that a simple multi‑round reflection loop can measurably improve LLM accuracy, by two to three percentage points on AIME in the results above, and produce more concise, confident answers without any model modification. It offers an immediate, lightweight optimization for deployed systems and opens research directions for integrated multi‑round reasoning mechanisms.

Paper: https://arxiv.org/abs/2503.19855

Code repository: https://github.com/a-m-team/a-m-models
