Why Longer Token Chains Don't Mean Better Reasoning: Google's Deep Thinking Ratio

Google's recent study shows that the length of a model's token chain is negatively correlated with inference accuracy. It introduces the Deep Thinking Ratio (DTR), a metric that identifies the tokens carrying genuine reasoning, and builds on it a Think@n strategy that roughly halves compute cost without sacrificing performance.


Token length vs inference quality

Google researchers evaluated eight large-language-model variants, including GPT-OSS, DeepSeek-R1, and Qwen-3, on four inference benchmarks (AIME 2024/2025, HMMT 2025, GPQA-Diamond). The Pearson correlation between total generated token count and answer accuracy was r = –0.54, indicating that longer output sequences tend to coincide with lower accuracy.
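To make the statistic concrete, here is a minimal sketch of how such a correlation is computed; the per-run token counts and accuracies are illustrative placeholders, not the study's data.

```python
# Minimal sketch: Pearson correlation between generated-token counts and
# accuracy. The numbers are illustrative placeholders, not the study's data.
import numpy as np
from scipy.stats import pearsonr

token_counts = np.array([12_000, 18_500, 25_000, 31_000, 40_000, 55_000])
accuracies = np.array([0.78, 0.74, 0.70, 0.66, 0.61, 0.55])

r, p_value = pearsonr(token_counts, accuracies)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
# A clearly negative r, like the paper's reported -0.54, means longer
# outputs tend to come with lower accuracy.
```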

Functional vs deep‑thinking tokens

Generated tokens are divided into two groups:

Functional tokens – e.g., “and”, “is”, “the”, which are resolved quickly in shallow layers.

Deep‑thinking tokens – e.g., “the result is 10”, “option A”, whose prediction distributions continue to change in deeper layers.

The study measures Jensen-Shannon Divergence (JSD) between the next-token distributions read out at successive model layers; a token is labeled deep-thinking if its distribution stabilizes only in the deeper layers. The Deep-Thinking Ratio (DTR) is then defined as the proportion of deep-thinking tokens in the complete generated sequence.
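A minimal sketch of this layer-wise test, assuming logit-lens-style access to a next-token distribution at every layer; the depth threshold (0.75) and JSD cutoff (0.05) are hypothetical parameters, not values from the paper.

```python
# Minimal sketch, assuming per-layer next-token distributions are available
# (e.g., via a logit-lens readout). Thresholds are hypothetical.
import numpy as np
from scipy.spatial.distance import jensenshannon

def is_deep_thinking(layer_dists, depth_frac=0.75, jsd_eps=0.05):
    """layer_dists: array of shape (num_layers, vocab_size), one next-token
    distribution per layer for a single generated token."""
    final = layer_dists[-1]
    # jensenshannon returns the JS *distance* (sqrt of the divergence).
    jsds = np.array([jensenshannon(d, final) ** 2 for d in layer_dists])
    # First layer at which the distribution has settled near the final one.
    settled = np.nonzero(jsds < jsd_eps)[0]
    first_settled = settled[0] if settled.size else len(layer_dists)
    # Deep-thinking: the prediction only stabilizes deep in the stack.
    return first_settled >= depth_frac * len(layer_dists)

def deep_thinking_ratio(per_token_layer_dists):
    """DTR: fraction of deep-thinking tokens in a generated sequence."""
    flags = [is_deep_thinking(d) for d in per_token_layer_dists]
    return sum(flags) / max(1, len(flags))
```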

DTR as a quality indicator

On the four test sets, DTR correlates positively with inference accuracy (Pearson r = 0.82), a stark contrast to the –0.54 correlation observed for raw token length.

Think@n inference strategy

For each query, multiple inference samples are generated. DTR is quickly estimated from the first 50 tokens of each sample. The top 50 % high‑DTR samples are kept for full decoding and majority voting, while low‑DTR samples are discarded early.
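A minimal sketch of this pipeline; `generate`, `estimate_dtr`, and `extract_answer` are hypothetical stand-ins for the model call, a DTR estimator like the one above, and answer parsing, and the paper's actual implementation may differ.

```python
# Minimal sketch of Think@n. `generate`, `estimate_dtr`, and
# `extract_answer` are hypothetical stand-ins, not the paper's code.
from collections import Counter

def think_at_n(prompt, generate, estimate_dtr, extract_answer,
               n=8, prefix_tokens=50, keep_frac=0.5):
    # 1. Draw n samples but decode only a short prefix of each.
    prefixes = [generate(prompt, max_tokens=prefix_tokens) for _ in range(n)]

    # 2. Score each prefix by its estimated DTR and keep the top half.
    ranked = sorted(prefixes, key=estimate_dtr, reverse=True)
    survivors = ranked[: max(1, int(n * keep_frac))]

    # 3. Fully decode only the survivors, then majority-vote their answers.
    answers = [extract_answer(generate(prompt, resume_from=p))
               for p in survivors]
    return Counter(answers).most_common(1)[0][0]
```

The early cutoff is where the savings come from: low-DTR samples never pay the cost of full decoding.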

This early-stage filtering cuts total token consumption from 355.6 k to 181.9 k (≈ 49 % reduction) without sacrificing accuracy. For example, GPT-OSS-120B-medium reaches 94.7 % accuracy on AIME 2025, compared with 92.7 % using the conventional approach.

Paper: https://arxiv.org/abs/2602.13517

Tags: LLM, Inference, Token Efficiency, Deep Thinking Ratio, Think@n
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.