Can Rendering Thought Chains as Images Speed Up LLM Reasoning?

This article introduces Render‑of‑Thought (RoT), a novel paradigm that compresses chain‑of‑thought reasoning into visual embeddings using frozen vision encoders, achieving 3‑4× token reduction, faster inference, and improved interpretability while requiring minimal pre‑training.

Tencent Technical Engineering

Introduction

Chain‑of‑Thought (CoT) has become the standard for complex reasoning with large language models, but its reliance on generating long textual intermediate steps leads to higher latency and excessive KV‑cache memory consumption. Existing explicit CoT compression methods are limited by discrete text representations, while early implicit CoT approaches (e.g., Coconut, CoLaR) suffer from unstable training and lack of interpretability.

Render‑of‑Thought (RoT) Overview

RoT proposes a new paradigm that renders each reasoning step as an image and aligns the LLM’s hidden states with the visual embeddings produced by a frozen visual encoder from a vision‑language model (VLM). By leveraging the high information density of visual tokens, RoT achieves a 3‑4× token compression and makes the otherwise opaque implicit reasoning process analyzable.
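The rendering step itself is easy to prototype. Below is a minimal sketch of the idea, assuming an off-the-shelf CLIP vision tower as a stand-in for the Qwen3-VL encoder the article pairs RoT with; the canvas size, font handling, and model choice are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch: rasterize one chain-of-thought step and encode it with a
# frozen vision encoder. CLIP ViT stands in for the Qwen3-VL vision tower.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPVisionModel, CLIPImageProcessor

def render_step(text: str, size=(224, 224)) -> Image.Image:
    """Draw one reasoning step as black text on a white canvas."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).multiline_text((8, 8), text, fill="black")
    return img

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
for p in encoder.parameters():          # the encoder stays frozen throughout
    p.requires_grad_(False)

step = "48 / 2 = 24; 24 + 15 = 39"
pixels = processor(images=render_step(step), return_tensors="pt").pixel_values
with torch.no_grad():
    vision_tokens = encoder(pixels).last_hidden_state   # (1, patches + 1, d)
# However long the textual step is, it now occupies a fixed grid of dense
# visual tokens; this density is the source of the reported compression.
```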

Method

Phase 1: Visual Alignment

This stage freezes both the LLM and the visual encoder, training only a lightweight visual projection head that maps the LLM’s latent states into the feature space of rendered CoT images. The alignment loss is computed between the projected latent embedding and the target vision embedding.
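The article does not give the exact form of this objective, so the formula below is a plausible reconstruction rather than the authors’ equation: a cosine alignment between the projected hidden state $W_p h_t$ and the frozen encoder’s embedding $v_t$ of the rendered step (an L2 distance would play the same role):

$$
\mathcal{L}_{\text{align}} = \frac{1}{T}\sum_{t=1}^{T}\Big(1 - \cos\big(W_p h_t,\ v_t\big)\Big)
$$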

Additionally, a cross‑entropy loss is applied to the special token <|img_end|> and the answer token.

The overall Phase 1 loss is a weighted sum of the alignment loss and the special‑token cross‑entropy.
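A compact sketch of that combined objective, assuming the cosine alignment term above and a hypothetical weighting coefficient lambda_ce (neither the symbols nor the weighting scheme are specified in the article):

```python
import torch
import torch.nn.functional as F

def phase1_loss(hidden_states, proj_head, vision_targets,
                logits, supervised_ids, lambda_ce=1.0):
    # Alignment: pull projected LLM states toward the frozen encoder's
    # embeddings of the rendered CoT images.
    projected = proj_head(hidden_states)                  # (B, T, d_vision)
    align = (1 - F.cosine_similarity(projected, vision_targets, dim=-1)).mean()
    # Cross-entropy only on supervised positions (<|img_end|> and the answer);
    # all other positions are masked with -100.
    ce = F.cross_entropy(logits.flatten(0, 1), supervised_ids.flatten(),
                         ignore_index=-100)
    return align + lambda_ce * ce
```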

Phase 2: Latent Supervised Fine‑Tuning

After alignment, LoRA fine‑tuning is applied to the LLM while keeping the visual projection head frozen. The model no longer generates textual tokens; instead, it autoregressively produces a sequence of latent visual tokens that simulate the output of the visual encoder, which are then decoded into the final answer.
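The article publishes no reference code, but the inference loop it describes can be sketched as follows. This assumes, for simplicity, that the projection head’s output dimension matches the LLM’s input embedding size so latents can be fed straight back in, and that llm is a headless causal LM that accepts inputs_embeds:

```python
import torch

@torch.no_grad()
def latent_rollout(llm, proj_head, input_embeds, max_latent_steps=64):
    """Autoregressively emit latent 'visual' tokens instead of text tokens."""
    embeds = input_embeds                                   # (1, T, d_model)
    for _ in range(max_latent_steps):
        hidden = llm(inputs_embeds=embeds).last_hidden_state[:, -1:, :]
        latent = proj_head(hidden)        # simulate the vision encoder's output
        embeds = torch.cat([embeds, latent], dim=1)         # feed latent back in
    return embeds   # the latent prefix is later decoded into the text answer
```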

Inference and Decoding Strategies

Two termination strategies are explored:

Dynamic termination with a special token: inference stops at the first timestep at which the termination token becomes the most probable output.

Static termination with a fixed token budget: once a predetermined number of latent tokens has been produced (e.g., 32 for GSM8k‑Aug, 64 for MATH), <|img_end|> is inserted to switch from latent inference to text generation.

Empirically, the static budget strategy outperforms the dynamic one, likely due to instability in the latent space’s self‑regulating stop signal.
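A sketch of the better-performing static strategy, reusing the hypothetical latent_rollout helper from the Phase 2 sketch; lm is assumed to be a Hugging Face causal LM whose generate() accepts inputs_embeds, and lm.model its underlying base transformer:

```python
import torch

@torch.no_grad()
def decode_with_static_budget(lm, tokenizer, proj_head, prompt_embeds, budget=32):
    # 1. Exactly `budget` latent steps (32 for GSM8k-Aug, 64 for MATH).
    embeds = latent_rollout(lm.model, proj_head, prompt_embeds,
                            max_latent_steps=budget)
    # 2. Force the mode switch by appending the <|img_end|> embedding.
    img_end_id = tokenizer.convert_tokens_to_ids("<|img_end|>")
    img_end = lm.get_input_embeddings()(torch.tensor([[img_end_id]]))
    embeds = torch.cat([embeds, img_end], dim=1)
    # 3. Ordinary text decoding for the final answer.
    out = lm.generate(inputs_embeds=embeds, max_new_tokens=32)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```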

Experiments

Compression and Speedup: RoT achieves 3‑4× token compression compared with explicit CoT, delivering markedly higher Pass@1/L scores on Qwen3‑VL‑4B.

Superiority over Existing Implicit Methods: On the MultiArith benchmark, RoT (Qwen3‑VL‑4B) reaches 97.2% accuracy, surpassing Coconut and CoLaR.

Interpretability of Latent Reasoning: Aligning hidden states to visual space enables heat‑map visualizations that reveal structured token‑similarity patterns, confirming logical latent reasoning rather than random vector generation.
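The probe behind such heat maps is simple to reproduce: a pairwise cosine-similarity matrix over the generated latent tokens, where block structure (rather than noise) indicates ordered reasoning. A minimal version, with illustrative names:

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def latent_similarity_heatmap(latents: torch.Tensor, path="rot_heatmap.png"):
    """latents: (T, d) latent visual tokens collected during inference."""
    normed = F.normalize(latents, dim=-1)
    sim = normed @ normed.T                    # (T, T) cosine similarities
    plt.imshow(sim.detach().cpu().numpy(), cmap="viridis")
    plt.colorbar(label="cosine similarity")
    plt.xlabel("latent token index")
    plt.ylabel("latent token index")
    plt.savefig(path, dpi=150)
```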

Ablation Studies

Removing Phase 1 drops MATH accuracy from 33.2% to 22.2%, highlighting the importance of visual alignment for preventing representation collapse. Omitting Phase 2 also degrades performance, showing that latent‑space decoding is essential for producing correct answers.

Future Work

Adaptive inference length and stopping mechanisms to replace the fixed token budget.

Full visualization and reverse decoding of latent variables to further open the remaining black box.

Broader evaluation across commonsense reasoning, code generation, multilingual and multimodal tasks to verify RoT’s generality.

Preliminary attempts to apply RoT in real‑world content‑understanding scenarios show promising trade‑offs between accuracy and inference cost.

Tags: inference optimization, multimodal, chain of thought, latent space, vision-language, token compression