ViT³: Vision Test‑Time Training Architecture Breaking Transformer Complexity (CVPR 2026 Oral)
The paper systematically studies Test‑Time Training (TTT) for vision and derives six design principles. Building on them, it introduces ViT³, a pure TTT architecture that uses full‑batch internal training, an internal learning rate of 1.0, and a lightweight internal model built from a SwiGLU MLP and a depthwise convolution. ViT³ achieves state‑of‑the‑art results among linear‑complexity models across classification, detection, segmentation, and generation tasks.
Introduction
Sequence modeling is fundamental to large language models and computer vision. Standard Transformers have quadratic complexity with respect to sequence length, which limits their scalability on long‑sequence tasks. Test‑Time Training (TTT) redefines attention as an online learning process that builds a lightweight internal model from key‑value pairs at inference time, opening a design space with linear complexity.
Reinterpreting Attention
Both softmax and linear attention can be viewed as constructing a small model from keys and values: softmax attention builds a two‑layer MLP without compression, while linear attention compresses keys and values into a d×d matrix, reducing cost but hurting performance. The core question is whether compression can be achieved without sacrificing accuracy.
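The two views can be made concrete with a minimal NumPy sketch. It assumes single‑head attention, the usual 1/sqrt(d) scaling for softmax attention, and an unnormalized linear attention with no feature map; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_attention(q, k, v):
    # Read as a two-layer MLP applied to each query:
    # first-layer weights are K^T (hidden width N, one unit per token),
    # softmax is the activation, and second-layer weights are V.
    hidden = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # shape (M, N)
    return hidden @ v                                  # shape (M, d)

def linear_attention(q, k, v):
    # Keys and values are compressed into one d x d state
    # whose size does not depend on the number of tokens N.
    state = k.T @ v        # shape (d, d)
    return q @ state       # shape (M, d)
```

In this reading, the softmax variant keeps all N tokens as hidden units, which is where the quadratic cost comes from, while the linear variant pays a fixed d×d memory regardless of sequence length.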
TTT Mechanism
TTT treats the set of keys and values as a tiny dataset. During each inference step, the internal model is trained to reconstruct values from keys using a self‑supervised loss (e.g., L2). After a few gradient updates, the updated internal model processes the queries in a single forward pass. Because each update and the final forward pass visit every token once, the total cost grows linearly with the number of tokens and is proportional to the cost of one pass through the internal model.
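As a minimal sketch of this mechanism, assume the simplest possible internal model: a single linear map W initialized to zero and trained with a summed L2 loss on all key‑value pairs. These choices, and the name `ttt_linear`, are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def ttt_linear(q, k, v, lr=1.0, steps=1):
    """Train a linear internal model on (K, V), then apply it to Q.

    Assumed inner loss: 0.5 * ||K W - V||_F^2, summed over all N tokens.
    """
    d_in, d_out = k.shape[1], v.shape[1]
    w = np.zeros((d_in, d_out))          # internal model starts from zero
    for _ in range(steps):
        grad = k.T @ (k @ w - v)         # gradient of the inner loss w.r.t. W
        w = w - lr * grad                # full-batch gradient descent step
    return q @ w                         # single forward pass on the queries
```

Under these assumptions, one step from zero gives W = lr · KᵀV, so the output equals unnormalized linear attention scaled by the learning rate. Richer internal models and more steps are what move TTT beyond that special case.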
Design Observations for Vision TTT
1. Second‑order mixed derivatives: Inner loss functions whose second‑order mixed derivative vanishes (e.g., MAE/L1) lead to near‑zero external gradients and cannot train TTT effectively.
2. Full‑batch internal training: Using all N key‑value pairs in one batch (B=N) yields better performance than mini‑batch updates, which introduce unnecessary causal dependencies for vision tasks.
3. Large internal learning rate: An internal learning rate around 1.0 balances rapid weight updates and training stability; rates that are too low or too high degrade performance.
4. Increasing internal model capacity: Expanding the width ratio (hidden dimension divided by input dimension) of a two‑layer MLP with SiLU activation consistently improves accuracy, demonstrating that larger internal models enhance sequence‑modeling ability.
5. Depth vs. optimization: Deeper internal models (e.g., a three‑layer MLP) exhibit both higher training loss and higher test loss, which points to optimization difficulty rather than over‑fitting; shallow models train more effectively.
6. Convolutional internal models: Implementing the internal model as a small convolution (3×3 or depthwise) yields significant gains, combining local and global information in a way that suits vision.
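The first observation can be checked numerically. The sketch below assumes the mixed derivative is taken with respect to the internal model's prediction and the target value, which this summary does not spell out; the helper names are illustrative. The intuition is that the outer gradient reaching the values passes through the derivative of the inner gradient with respect to the target, and for L1 that inner gradient is a sign function that does not change with small changes of the target.

```python
def l2_loss(pred, target):
    return 0.5 * (pred - target) ** 2

def l1_loss(pred, target):
    return abs(pred - target)

def mixed_derivative(loss, pred, target, eps=1e-3):
    # Central finite-difference estimate of d^2 loss / (d pred d target).
    return (
        loss(pred + eps, target + eps)
        - loss(pred + eps, target - eps)
        - loss(pred - eps, target + eps)
        + loss(pred - eps, target - eps)
    ) / (4.0 * eps * eps)

# Evaluate away from the kink of L1 at pred == target.
l2_mixed = mixed_derivative(l2_loss, 0.7, 0.2)
l1_mixed = mixed_derivative(l1_loss, 0.7, 0.2)
```

Here `l2_mixed` comes out at approximately -1 while `l1_mixed` comes out at approximately 0, matching the claim that L1 gives the outer optimization nothing to work with through this path.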
ViT³ Architecture
Guided by the observations, the authors propose ViT³, a pure TTT vision model that adopts:
Full‑batch internal gradient descent with learning rate 1.0.
Point‑wise loss for reconstruction.
An internal model combining a simplified SwiGLU MLP and a depthwise convolution.
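A rough sketch of one full‑batch inner step for the SwiGLU part is given here. The exact form of the "simplified SwiGLU" is an assumption: a gated product (x W1) ⊙ SiLU(x W2) with no output projection, trained with a mean L2 loss. The depthwise convolution branch is omitted, and all function names are hypothetical.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def silu_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))

def swiglu_forward(x, w1, w2):
    # Assumed simplified SwiGLU: gated product, no output projection.
    return (x @ w1) * silu(x @ w2)

def inner_loss(k, v, w1, w2):
    # Point-wise L2 reconstruction loss, averaged over the N tokens.
    diff = swiglu_forward(k, w1, w2) - v
    return 0.5 * np.sum(diff ** 2) / k.shape[0]

def inner_grads(k, v, w1, w2):
    # Hand-derived gradients of inner_loss with respect to w1 and w2.
    a, b = k @ w1, k @ w2
    d_out = (a * silu(b) - v) / k.shape[0]
    grad_w1 = k.T @ (d_out * silu(b))
    grad_w2 = k.T @ (d_out * a * silu_grad(b))
    return grad_w1, grad_w2

def ttt_swiglu(q, k, v, w1, w2, lr=1.0):
    # One full-batch gradient step on all (key, value) pairs,
    # then a single forward pass of the updated model on the queries.
    g1, g2 = inner_grads(k, v, w1, w2)
    return swiglu_forward(q, w1 - lr * g1, w2 - lr * g2)
```

In a real layer, `w1` and `w2` would be learned initial weights and the inner gradients would be part of the outer computation graph; this sketch only shows the inner update with the learning rate of 1.0 as its default.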
ViT³ can replace the attention block in any vision Transformer backbone (e.g., DeiT‑S) and is evaluated on ImageNet‑1K classification, high‑resolution object detection, segmentation, and image generation.
Experimental Results
Across all tasks, ViT³ surpasses existing linear‑complexity designs such as linear attention and visual Mamba models, confirming the effectiveness of the TTT paradigm. Detailed results (see Figures 7‑10) show higher accuracy and competitive throughput while maintaining linear computational cost.
Conclusion and Outlook
The study systematically maps the design space of visual Test‑Time Training, distills six practical principles, and delivers a strong baseline (ViT³) for efficient, high‑expressivity sequence modeling in computer vision. Future work includes addressing optimization challenges of deeper internal models and exploring richer internal architectures.
