Why Is Visual Latent Reasoning Unstable? Uncovering the Feature‑Space Gap

The paper identifies a feature‑space mismatch that makes visual latent reasoning unstable, proposes the Granular Alignment Paradigm (GAP) with data, feature, and model‑capacity alignment, and demonstrates through extensive experiments that GAP improves both visual perception and multimodal reasoning performance.

Machine Heart

Visual latent reasoning aims to let multimodal models generate continuous visual latent tokens as intermediate evidence for downstream tasks, but these tokens often fall outside the model's familiar visual input space, leading to instability.

Feature‑Space Mismatch

Many existing methods adopt an "output‑as‑input" paradigm, feeding decoder hidden states directly as latent token embeddings. In pre‑norm Transformers, decoder hidden states grow in norm across layers, diverging from the scale of input text and visual embeddings. Measurements on Monet‑7B (based on Qwen2.5‑VL 7B) show that output text hidden states have an L2 norm ~546.4× larger than input text embeddings, and visual hidden states ~8.7× larger than visual input embeddings. This norm growth persists after latent fine‑tuning, indicating a systematic feature‑space misalignment.
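
The norm comparison above can be illustrated with a minimal sketch. It assumes the ratio is taken between mean L2 norms of output hidden states and input embeddings; the article does not spell out the exact measurement procedure, and the function names and toy vectors are hypothetical.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) norm of a single embedding vector."""
    return math.sqrt(sum(x * x for x in vec))

def mean_norm(vectors):
    """Average L2 norm over a batch of vectors."""
    return sum(l2_norm(v) for v in vectors) / len(vectors)

def norm_ratio(output_hidden, input_embeds):
    """How many times larger the output hidden states are, on average."""
    return mean_norm(output_hidden) / mean_norm(input_embeds)

# Toy vectors (hypothetical values, not measurements from the paper).
input_embeds = [[0.3, 0.4], [0.6, 0.8]]
output_hidden = [[3.0, 4.0], [6.0, 8.0]]

ratio = norm_ratio(output_hidden, input_embeds)
print(f"output/input norm ratio: {ratio:.1f}x")
```

A ratio far above 1, as reported for Monet‑7B, means that a hidden state fed back as an input embedding sits at a scale the input layers rarely see.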

Granular Alignment Paradigm (GAP)

GAP addresses the mismatch through three simultaneous alignments:

Data Alignment: Each training sample pairs a continuous latent target with a readable visual description (<parser>) in the teacher response, making the latent supervision both visual and semantic.

Feature Alignment: A PCA‑aligned latent head predicts low‑rank PCA coefficients instead of full‑dimensional embeddings. This reduces the parameter space from D×D to D×d (d ≪ D) while preserving 95% of the variance, constraining generated latents to the visual subspace the model already knows.

Model‑Capacity Alignment: Latent supervision is applied only to samples where the base model fails in all eight sampling attempts, which avoids injecting unnecessary noise on easy examples.

Norm Calibration Experiment

Using Monet‑7B as a baseline, the authors apply an EMA‑based norm calibration at inference time, rescaling predicted latents to match the norm of input visual embeddings without any training. This simple intervention raises HRBench4K from 70.75 to 71.63 and MathVista from 61.30 to 63.30, improving the average score from 66.03 to 67.46.
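
One way such a training-free calibration could look is sketched below. The article only states that an EMA is used and that latents are rescaled to the input visual embedding norm; the class name, the momentum value, and the update rule are assumptions made for illustration.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) norm of a single vector."""
    return math.sqrt(sum(x * x for x in vec))

class EmaNormCalibrator:
    """Tracks an EMA of input visual embedding norms (hypothetical helper)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum   # assumed value, not taken from the paper
        self.target_norm = None

    def observe(self, embedding):
        """Update the running target norm with one input visual embedding."""
        norm = l2_norm(embedding)
        if self.target_norm is None:
            self.target_norm = norm
        else:
            self.target_norm = (self.momentum * self.target_norm
                                + (1.0 - self.momentum) * norm)

    def calibrate(self, latent, eps=1e-8):
        """Rescale a predicted latent to the target norm, keeping its direction."""
        if self.target_norm is None:
            return list(latent)
        scale = self.target_norm / (l2_norm(latent) + eps)
        return [x * scale for x in latent]

calibrator = EmaNormCalibrator()
calibrator.observe([3.0, 4.0])    # norm 5
calibrator.observe([6.0, 8.0])    # norm 10
calibrated = calibrator.calibrate([30.0, 40.0])   # oversized latent, norm 50
print(l2_norm(calibrated))
```

Only the scale of the latent changes; its direction is untouched, which is why the intervention needs no retraining.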

Feature Alignment via PCA Head

The PCA head predicts 629 coefficients (95% variance) instead of the full 3584‑dimensional embedding, reducing dimensionality to 17.6% of the original. This constrains latents to the principal directions of visual embeddings, mitigating the feature‑space gap.
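
The two steps described here, choosing the number of components from a variance threshold and mapping coefficients back to embedding space, can be sketched as follows. The eigenvalue spectrum, basis vectors, and function names are toy assumptions; in practice they would come from a PCA over the model's visual input embeddings, which the article does not detail.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading components reaching the variance threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(ordered)

def reconstruct(coeffs, components, mean):
    """Map d PCA coefficients to a D-dim embedding: mean + sum_k c_k * u_k."""
    out = list(mean)
    for coeff, direction in zip(coeffs, components):
        for i, u in enumerate(direction):
            out[i] += coeff * u
    return out

# Toy spectrum (hypothetical): the first three directions carry 96% of variance.
eigenvalues = [6.0, 3.0, 0.6, 0.3, 0.1]
d = components_for_variance(eigenvalues)

# Toy 3-dimensional space with two orthonormal principal directions.
components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mean = [0.5, 0.5, 0.5]
latent = reconstruct([2.0, -1.0], components, mean)

# Figures quoted in the article: 629 coefficients out of 3584 dimensions.
kept_percent = round(100 * 629 / 3584, 1)
print(d, latent, kept_percent)
```

Because every reconstructed latent is a combination of the retained directions plus the mean, it cannot leave the subspace spanned by the principal directions of the visual embeddings.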

Model‑Capacity Alignment Details

During training, each question is sampled eight times with Qwen2.5‑VL 7B. Only samples where all eight attempts fail receive latent supervision; others are trained with pure text. This difficulty‑aware allocation ensures latent tokens are used where they provide the most benefit.
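
The allocation rule can be sketched as a simple filter. The data layout (a dict of per-question correctness flags) and the function names are hypothetical; only the rule itself, eight attempts with latent supervision reserved for all-fail cases, comes from the article.

```python
def needs_latent_supervision(attempt_correct):
    """True only when every sampled attempt was wrong."""
    return not any(attempt_correct)

def split_by_difficulty(results, num_attempts=8):
    """Partition question ids into latent-supervised and text-only groups."""
    latent_ids, text_only_ids = [], []
    for question_id, attempts in results.items():
        if len(attempts) != num_attempts:
            raise ValueError(f"{question_id}: expected {num_attempts} attempts")
        if needs_latent_supervision(attempts):
            latent_ids.append(question_id)
        else:
            text_only_ids.append(question_id)
    return latent_ids, text_only_ids

# Toy correctness flags for eight base-model samples per question (hypothetical).
results = {
    "q1": [False] * 8,            # base model always fails: latent supervision
    "q2": [False] * 7 + [True],   # solved once: pure text training
    "q3": [True] * 8,             # easy: pure text training
}
latent_ids, text_only_ids = split_by_difficulty(results)
print(latent_ids, text_only_ids)
```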

Main Results

Under the authors' evaluation protocol, GAP simultaneously improves average perception (Avg‑P) and average reasoning (Avg‑R) scores. On HRBench4K, MMStar, and MME‑RealWorld‑Lite, GAP achieves an Avg‑P of 61.32, surpassing the base model (57.66) and other baselines. On MathVista and WeMath, GAP reaches an Avg‑R of 53.97, the highest among all methods.

Component Analysis

Comparisons among full latent models, selective latent supervision (49K curated samples), and difficulty‑aware versions show that both clean supervision and proper allocation contribute to GAP's gains. Dimensionality reduction experiments reveal that a 95% variance PCA head outperforms a full‑dimensional head, confirming the value of low‑rank priors.

Latent Token Budget

The authors explore token budgets organized as square grids (e.g., 4 tokens → 2×2). Results indicate non‑monotonic behavior: 36 tokens achieve the best Avg‑3 score (69.22), while larger budgets (64, 144) do not yield further improvements. The optimal budget depends on image resolution and task granularity.
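
As a small illustration of the square-grid constraint, the sketch below maps each budget mentioned above to its grid side. The helper name is hypothetical, and rejecting non-square budgets is an assumption about how such a layout would be validated.

```python
import math

def grid_side(token_budget):
    """Side length of the square grid for a latent token budget."""
    side = math.isqrt(token_budget)
    if side * side != token_budget:
        raise ValueError(f"{token_budget} is not a perfect square")
    return side

# Budgets mentioned in the article.
sides = {budget: grid_side(budget) for budget in (4, 36, 64, 144)}
for budget, side in sides.items():
    print(f"{budget} tokens -> {side}x{side} grid")
```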

Conclusion

GAP demonstrates that the core issue in visual latent reasoning is a gap between output and input spaces. By aligning data, features, and model capacity, GAP closes this gap, delivering consistent improvements in both visual perception and multimodal reasoning.

Figure: Illustration of GAP
Figure: Motivation for intermediate visual evidence
<think> text reasoning context
 → <latent> visual latent tokens </latent>
 → <parser>the auxiliary visual evidence this latent is expected to express</parser>
 → continue text reasoning
</think>
→ <answer>final answer</answer>
Figure: Norm growth in Monet‑7B
Figure: Effect of EMA norm calibration
Figure: GAP performance table
Figure: Reasoning performance table
Figure: Latent generation vs. noise vs. disabled
Figure: Token budget grid
Figure: Summary illustration