Can Large Models Reason Deeply with Only a Few Thinking Tokens?

The paper introduces Heima, a framework for multimodal large models that compresses chain‑of‑thought reasoning into a small set of abstract “thinking tokens”, sharply reducing the number of generated tokens while preserving reasoning ability. It also provides an adaptive interpreter that reconstructs human‑readable reasoning from those tokens for analysis.

Machine Heart

May 18, 2026

Background

Chain‑of‑Thought (CoT) reasoning improves complex problem solving in large language models (LLMs) and multimodal large language models (MLLMs), but it produces long intermediate text, which raises token count, latency, memory use, and compute cost.

Core Question

Can multimodal large models replace explicit CoT text with a few implicit “thinking tokens” when reasoning?

Method – Heima

Heima compresses CoT into a small set of abstract thinking tokens and carries out the reasoning in hidden space. It has three design components:

Thinking tokens replace verbose CoT – The model emits special tokens such as <Thinking_of_Summary>, <Thinking_of_Caption>, <Thinking_of_Reasoning> instead of step‑by‑step natural‑language reasoning. The hidden states of these tokens encode the corresponding reasoning stage.

Progressive distillation – CoT stages are distilled into thinking tokens gradually, stage by stage, rather than compressing the entire chain at once, which smooths the transition and preserves performance.

Adaptive interpreter – A separate LLM‑based interpreter maps thinking tokens back to variable‑length text, reconstructing human‑readable reasoning and allowing measurement of information loss.
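The stage-by-stage replacement described under progressive distillation can be sketched as a schedule of training targets. This is a minimal illustration, not the authors' training code: the `CoTExample` container, the `build_target` and `progressive_schedule` helpers, and the choice to compress stages in the order Summary, Caption, Reasoning are all assumptions made for the example. Only the three token names come from the article.

```python
from dataclasses import dataclass

# Stage names taken from the thinking tokens named in the article.
# The compression order used below is an assumption for illustration.
STAGES = ("Summary", "Caption", "Reasoning")


@dataclass
class CoTExample:
    """Hypothetical container for one training sample."""
    question: str
    stages: dict  # stage name -> natural-language CoT text for that stage
    answer: str


def thinking_token(stage: str) -> str:
    """Return the special token that stands in for a whole CoT stage."""
    return f"<Thinking_of_{stage}>"


def build_target(example: CoTExample, num_compressed: int) -> str:
    """Build a target in which the first `num_compressed` stages are
    replaced by thinking tokens and the rest stay as explicit text."""
    if not 0 <= num_compressed <= len(STAGES):
        raise ValueError("num_compressed must be between 0 and len(STAGES)")
    parts = []
    for i, stage in enumerate(STAGES):
        if i < num_compressed:
            parts.append(thinking_token(stage))
        else:
            parts.append(example.stages[stage])
    parts.append(example.answer)
    return " ".join(parts)


def progressive_schedule(example: CoTExample) -> list:
    """One target per distillation step, from full CoT to fully compressed."""
    return [build_target(example, k) for k in range(len(STAGES) + 1)]


if __name__ == "__main__":
    sample = CoTExample(
        question="What brand is the car?",
        stages={
            "Summary": "[summary text]",
            "Caption": "[caption text]",
            "Reasoning": "[reasoning text]",
        },
        answer="[answer]",
    )
    for step, target in enumerate(progressive_schedule(sample)):
        print(step, target)
```

Each step compresses one more stage than the previous one, so the model never has to absorb the whole chain into tokens in a single jump. In real training the hidden states at the thinking-token positions would be learned; this sketch only covers how the text targets change.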

Illustrative Example

For an image of a black car with a distinctive badge, a traditional CoT might generate:

“This image shows a black car. The front has a special badge. The badge corresponds to BMW. Therefore the answer is BMW.”

Heima replaces the verbose text with:

<Thinking_of_Summary> <Thinking_of_Caption> <Thinking_of_Reasoning>, conclusion: the image depicts a black BMW M3 on the road.

Theoretical Analysis

Let the original CoT be C, the input question X, the answer Y, and the compressed thinking tokens T = f(X, C). Because T is computed from X and C, the data‑processing inequality implies that, given X, T cannot carry more information about Y than C does: I(T;Y|X) ≤ I(C;Y|X). If the conditional mutual information I(T;Y|X) remains high, the compressed representation retains the essential reasoning. The gap I(C;Y|X,T) quantifies the information lost by compression; a small gap indicates that T captures the critical reasoning information.
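One way to see why I(C;Y|X,T) is exactly the loss is to expand I(C,T;Y|X) with the chain rule in both orders. The derivation below is a sketch that assumes f is deterministic, so that T adds nothing once X and C are known, i.e. I(T;Y|X,C) = 0. The article does not state this assumption explicitly.

```latex
\begin{align}
I(C, T; Y \mid X) &= I(C; Y \mid X) + \underbrace{I(T; Y \mid X, C)}_{=\,0 \text{ since } T = f(X, C)} = I(C; Y \mid X), \\
I(C, T; Y \mid X) &= I(T; Y \mid X) + I(C; Y \mid X, T), \\
\Rightarrow \quad I(C; Y \mid X) &= I(T; Y \mid X) + I(C; Y \mid X, T) \;\ge\; I(T; Y \mid X).
\end{align}
```

Under this assumption, the information the full CoT carries about the answer splits into the part kept by the thinking tokens and the gap term. The inequality in the last line follows because conditional mutual information is non-negative.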

Experimental Evaluation

Heima was tested on several multimodal reasoning benchmarks. Compared with full CoT, Heima consistently reduced generated token counts while maintaining or slightly improving accuracy. The adaptive interpreter successfully reconstructed reasoning for summary, caption, and reasoning stages, demonstrating that the thinking tokens preserve usable information. Code and model checkpoints are available at https://github.com/shawnricecake/Heima.

Conclusion

Heima shows that multimodal large models can achieve efficient inference by compressing CoT into a few hidden‑space tokens without sacrificing performance, and that an adaptive interpreter provides a window into the latent reasoning process.

