Why Is Visual Latent Reasoning Unstable? Uncovering the Feature‑Space Gap
The paper identifies a feature‑space mismatch that makes visual latent reasoning unstable and proposes the Granular Alignment Paradigm (GAP), which aligns data, features, and model capacity. Experiments show that GAP improves both visual perception and multimodal reasoning performance.
Visual latent reasoning aims to let multimodal models generate continuous visual latent tokens as intermediate evidence for downstream tasks, but these tokens often fall outside the model's familiar visual input space, leading to instability.
Feature‑Space Mismatch
Many existing methods adopt an "output‑as‑input" paradigm, feeding decoder hidden states directly as latent token embeddings. In pre‑norm Transformers, decoder hidden states grow in norm across layers, diverging from the scale of input text and visual embeddings. Measurements on Monet‑7B (based on Qwen2.5‑VL 7B) show that output text hidden states have an L2 norm ~546.4× larger than input text embeddings, and visual hidden states ~8.7× larger than visual input embeddings. This norm growth persists after latent fine‑tuning, indicating a systematic feature‑space misalignment.
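The norm comparison above can be illustrated with a minimal sketch. It assumes the gap is measured as a ratio of mean per-token L2 norms; the function names and the synthetic arrays are illustrative stand-ins, not the authors' measurement code or real Monet‑7B activations.

```python
import numpy as np

def mean_l2_norm(tokens):
    """Mean per-token L2 norm of an array shaped (num_tokens, hidden_dim)."""
    return float(np.linalg.norm(tokens, axis=-1).mean())

def norm_ratio(output_hidden, input_embeds):
    """How many times larger the output hidden states are than the input embeddings."""
    return mean_l2_norm(output_hidden) / mean_l2_norm(input_embeds)

# Synthetic data only: the output states are built as 8.7x the inputs to mimic
# the visual-token ratio quoted in the text.
rng = np.random.default_rng(0)
input_embeds = rng.normal(scale=0.02, size=(16, 64))
output_hidden = 8.7 * input_embeds

ratio = norm_ratio(output_hidden, input_embeds)
print(f"output/input norm ratio: {ratio:.2f}")
```

On a real model, the two arrays would come from the embedding layer and the final decoder layer for the same token positions.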
Granular Alignment Paradigm (GAP)
GAP addresses the mismatch through three simultaneous alignments:
Data Alignment: Each training sample pairs a continuous latent target with a readable visual description (<parser>) in the teacher response, making the latent supervision both visual and semantic.
Feature Alignment: A PCA‑aligned latent head predicts low‑rank PCA coefficients instead of full‑dimensional embeddings, reducing the parameter space from D×D to D×d (d≪D) while preserving 95% of the variance, thus constraining generated latents to the visual subspace the model already knows.
Model‑Capacity Alignment: Latent supervision is applied only to samples where the base model fails in all eight sampling attempts, preventing unnecessary noise on easy examples.
Norm Calibration Experiment
Using Monet‑7B as a baseline, the authors apply an EMA‑based norm calibration at inference time, rescaling predicted latents to match the norm of input visual embeddings without any training. This simple intervention raises HRBench4K from 70.75 to 71.63 and MathVista from 61.30 to 63.30, improving the average score from 66.03 to 67.46.
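One way such a training-free calibration could look is sketched below. The summary does not give the exact update rule, so the class name, the momentum value, and the choice to track a single scalar target norm are all assumptions made for illustration.

```python
import numpy as np

class EmaNormCalibrator:
    """Hypothetical helper: tracks an EMA of the input visual embedding norm
    and rescales predicted latents to that norm."""

    def __init__(self, momentum=0.9, eps=1e-6):
        self.momentum = momentum  # assumed value, not from the source
        self.eps = eps
        self.target_norm = None

    def update(self, visual_input_embeds):
        """Fold the mean per-token norm of a batch of input embeddings into the EMA."""
        batch_norm = float(np.linalg.norm(visual_input_embeds, axis=-1).mean())
        if self.target_norm is None:
            self.target_norm = batch_norm
        else:
            self.target_norm = (self.momentum * self.target_norm
                                + (1.0 - self.momentum) * batch_norm)
        return self.target_norm

    def calibrate(self, latents):
        """Rescale each latent to the tracked norm, keeping its direction."""
        if self.target_norm is None:
            raise ValueError("call update() with input visual embeddings first")
        norms = np.linalg.norm(latents, axis=-1, keepdims=True)
        return latents * (self.target_norm / (norms + self.eps))

# Synthetic demonstration: predicted latents are much larger than the inputs.
rng = np.random.default_rng(0)
calibrator = EmaNormCalibrator(momentum=0.9)
for _ in range(5):
    calibrator.update(rng.normal(scale=1.0, size=(32, 64)))

predicted_latents = rng.normal(scale=8.7, size=(4, 64))
calibrated = calibrator.calibrate(predicted_latents)
```

Only the scale of each latent changes here; its direction is preserved, which is why no retraining is involved.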
Feature Alignment via PCA Head
The PCA head predicts 629 coefficients (95% variance) instead of the full 3584‑dimensional embedding, reducing dimensionality to 17.6% of the original. This constrains latents to the principal directions of visual embeddings, mitigating the feature‑space gap.
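The idea of predicting PCA coefficients can be sketched as follows, assuming the basis is fit on a bank of input visual embeddings and that latents are decoded as mean plus a linear combination of the retained components. The function names and the synthetic low-rank data are hypothetical; only the figures 629 and 3584 come from the text.

```python
import numpy as np

def fit_pca_basis(visual_embeds, variance_kept=0.95):
    """Return the mean and the fewest principal directions reaching variance_kept."""
    mean = visual_embeds.mean(axis=0)
    centered = visual_embeds - mean
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    d = int(np.searchsorted(np.cumsum(explained), variance_kept)) + 1
    d = min(d, len(singular_values))
    return mean, vt[:d]  # shapes (D,) and (d, D)

def decode_coefficients(coeffs, mean, components):
    """Map d predicted coefficients back to a D-dimensional latent embedding."""
    return coeffs @ components + mean

# Synthetic embeddings with low-rank structure plus a little noise.
rng = np.random.default_rng(0)
D, true_rank = 64, 8
visual_embeds = rng.normal(size=(500, true_rank)) @ rng.normal(size=(true_rank, D))
visual_embeds += 0.01 * rng.normal(size=visual_embeds.shape)

mean, components = fit_pca_basis(visual_embeds)
d = components.shape[0]

# Stand-in for the latent head's output: any coefficient vector decodes to a
# point inside the retained subspace.
coeffs = rng.normal(size=(4, d))
latents = decode_coefficients(coeffs, mean, components)
centered_latents = latents - mean
residual = centered_latents - (centered_latents @ components.T) @ components

# Arithmetic from the text: 629 of 3584 dimensions, and D x D versus D x d weights.
paper_fraction = 629 / 3584
full_params, low_rank_params = 3584 * 3584, 3584 * 629
print(f"kept {d} of {D} synthetic dims; paper fraction {100 * paper_fraction:.1f}%")
```

Because decoding is a linear combination of the retained components, the generated latent cannot leave the subspace spanned by them, whatever the head predicts.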
Model‑Capacity Alignment Details
During training, each question is sampled eight times with Qwen2.5‑VL 7B. Only samples where all eight attempts fail receive latent supervision; others are trained with pure text. This difficulty‑aware allocation ensures latent tokens are used where they provide the most benefit.
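The allocation rule described above can be sketched as a small routing function. The sampler and the correctness check are passed in as callables because the summary does not specify them; the stub sampler, the exact-match check, and the route labels are hypothetical.

```python
from typing import Callable

def needs_latent_supervision(
    question: str,
    reference: str,
    sample_answer: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    num_attempts: int = 8,
) -> bool:
    """True only if every one of num_attempts sampled answers is wrong."""
    for _ in range(num_attempts):
        if is_correct(sample_answer(question), reference):
            return False  # one success is enough to treat the sample as easy
    return True

# Hypothetical stand-ins for the base model and the answer checker.
canned_answers = {"easy question": "42", "hard question": "wrong"}

def stub_sampler(question: str) -> str:
    return canned_answers[question]

def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip() == reference.strip()

dataset = [("easy question", "42"), ("hard question", "7")]
routes = {
    question: "latent" if needs_latent_supervision(
        question, reference, stub_sampler, exact_match) else "text"
    for question, reference in dataset
}
print(routes)
```

Returning early on the first correct attempt is equivalent to checking that all eight fail, and it avoids sampling more than necessary for easy questions.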
Main Results
Under the authors' evaluation protocol, GAP simultaneously improves average perception (Avg‑P) and average reasoning (Avg‑R) scores. On HRBench4K, MMStar, and MME‑RealWorld‑Lite, GAP achieves an Avg‑P of 61.32, surpassing the base model (57.66) and other baselines. On MathVista and WeMath, GAP reaches an Avg‑R of 53.97, the highest among all methods.
Component Analysis
Comparisons among full latent models, selective latent supervision (49K curated samples), and difficulty‑aware versions show that both clean supervision and proper allocation contribute to GAP's gains. Dimensionality reduction experiments reveal that a 95% variance PCA head outperforms a full‑dimensional head, confirming the value of low‑rank priors.
Latent Token Budget
The authors explore token budgets organized as square grids (e.g., 4 tokens → 2×2). Results indicate non‑monotonic behavior: 36 tokens achieve the best Avg‑3 score (69.22), while larger budgets (64, 144) do not yield further improvements. The optimal budget depends on image resolution and task granularity.
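As a small illustration of the square-grid layout, the sketch below maps each budget mentioned above to its grid side. The helper name is hypothetical, and rejecting non-square budgets is an assumption about how such a layout would be validated.

```python
import math

def grid_side(token_budget: int) -> int:
    """Side length of the square grid for a budget, e.g. 4 tokens -> 2."""
    side = math.isqrt(token_budget)
    if side * side != token_budget:
        raise ValueError(f"{token_budget} is not a perfect square")
    return side

grids = {budget: (grid_side(budget), grid_side(budget))
         for budget in (4, 36, 64, 144)}
for budget, (rows, cols) in grids.items():
    print(f"{budget} tokens -> {rows}x{cols}")
```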
Conclusion
GAP demonstrates that the core issue in visual latent reasoning is a gap between output and input spaces. By aligning data, features, and model capacity, GAP closes this gap, delivering consistent improvements in both visual perception and multimodal reasoning.
Teacher response format used for data alignment:
<think> textual reasoning context
→ <latent> visual latent tokens </latent>
→ <parser>the auxiliary visual evidence this latent is expected to express</parser>
→ continue textual reasoning
</think>
→ <answer>final answer</answer>
