Tagged articles

Vision Transformers

13 articles · Page 1 of 1

May 17, 2026 · Artificial Intelligence

ViT³: Vision Test‑Time Training Architecture Breaking Transformer Complexity (CVPR 2026 Oral)

The paper systematically studies Test‑Time Training (TTT) for vision, derives six design principles, and introduces ViT³—a pure TTT architecture that uses full‑batch internal training, a learning rate of 1.0, and lightweight SwiGLU‑Depthwise convolution modules, achieving state‑of‑the‑art linear‑complexity performance across classification, detection, segmentation and generation tasks.

Linear ComplexityTest-Time TrainingVision Transformers

0 likes · 14 min read

ViT³: Vision Test‑Time Training Architecture Breaking Transformer Complexity (CVPR 2026 Oral)

Machine Heart

May 7, 2026 · Artificial Intelligence

OrthoReg: Simple Orthogonal Regularization to Eliminate Model Merging Conflicts

The paper introduces OrthoReg, a lightweight orthogonal regularization added during fine‑tuning that provably enforces weight orthogonality, thereby resolving conflicts in model merging and providing a theoretical explanation for the success of task arithmetic.

Deep LearningModel MergingOrthoReg

0 likes · 12 min read

OrthoReg: Simple Orthogonal Regularization to Eliminate Model Merging Conflicts

AIWalker

Mar 18, 2026 · Artificial Intelligence

7× Faster Inference: Tsinghua’s Huang‑Gao Team Redesigns Vision‑Transformer Attention via Fourier Transforms

The AAAI 2026 paper by Tsinghua’s Huang‑Gao team shows that modeling Vision‑Transformer attention as a Block‑Circulant matrix and computing it with FFT reduces the quadratic complexity to O(N log N), delivering up to seven‑fold real‑world speedups without sacrificing accuracy.

AAAI 2026Circulant MatricesEfficiency

0 likes · 15 min read

7× Faster Inference: Tsinghua’s Huang‑Gao Team Redesigns Vision‑Transformer Attention via Fourier Transforms

SuanNi

Feb 26, 2026 · Artificial Intelligence

How BitDance’s 2.6B‑Parameter Model Beats 14B Counterparts with 8.7× Speedup

BitDance’s new multimodal AI model achieves an 8.7‑fold inference acceleration using only 2.6 billion parameters, surpasses 14‑billion‑parameter state‑of‑the‑art architectures in image generation quality, and introduces binary visual tokens, a binary diffusion head, and next‑block diffusion for efficient parallel autoregressive prediction.

.aiBinary TokenizationModel Efficiency

0 likes · 11 min read

How BitDance’s 2.6B‑Parameter Model Beats 14B Counterparts with 8.7× Speedup

AI Frontier Lectures

Jul 8, 2025 · Artificial Intelligence

How LaVin-DiT Unifies Vision Tasks with a Large Diffusion Transformer

The LaVin-DiT paper presents a large vision diffusion transformer that integrates a spatio‑temporal variational auto‑encoder, a joint diffusion transformer with full‑sequence joint attention, and 3D rotary position encoding to enable unified, efficient multi‑task generation for images and videos, and details its training via flow‑matching and experimental results.

3D RoPEJoint Diffusion TransformerST-VAE

0 likes · 12 min read

How LaVin-DiT Unifies Vision Tasks with a Large Diffusion Transformer

AIWalker

Jun 24, 2025 · Artificial Intelligence

Mamba-Adaptor Merges Adaptor‑T and Adaptor‑S to Revolutionize Vision Tasks with State‑of‑the‑Art Benchmarks

The paper introduces Mamba-Adaptor, a plug‑and‑play module combining Adaptor‑T and Adaptor‑S to overcome causal computation, long‑range forgetting, and spatial modeling limits of visual Mamba models, delivering top‑ranked results on ImageNet and COCO across multiple downstream tasks.

AdaptorMambaState Space Model

0 likes · 25 min read

Mamba-Adaptor Merges Adaptor‑T and Adaptor‑S to Revolutionize Vision Tasks with State‑of‑the‑Art Benchmarks

AI Frontier Lectures

Jun 14, 2025 · Industry Insights

CVPR 2025 Awards Unveiled: Breakthrough Papers and Rising Stars

The CVPR 2025 awards spotlight groundbreaking research, honoring young scholars and top papers such as VGGT, Neural Inverse Rendering, and several honorable mentions, while summarizing each work's core contributions, methodologies, and potential impact on computer vision and related fields.

2025CVPRPaper Awards

0 likes · 13 min read

CVPR 2025 Awards Unveiled: Breakthrough Papers and Rising Stars

AI Frontier Lectures

Mar 14, 2025 · Artificial Intelligence

Do Vision Models Really Need Mamba? A Deep Dive into MambaOut

This article critically examines the MambaOut paper, analyzing whether state‑space‑based Mamba token mixers are necessary for vision tasks, presenting two hypotheses, describing the construction of MambaOut models without SSM, and reporting extensive ImageNet, COCO and ADE20K experiments that reveal when Mamba is beneficial.

Deep LearningMambaState Space Model

0 likes · 17 min read

Do Vision Models Really Need Mamba? A Deep Dive into MambaOut

AIWalker

Feb 26, 2025 · Artificial Intelligence

Why Linear Attention Lags Behind Softmax and How Two Simple Tweaks Close the Gap

The paper analytically identifies injectivity and local modeling as the two key factors causing the performance gap between linear and Softmax attention, proposes the InLine attention modifications to restore these properties, and demonstrates through extensive Vision Transformer experiments that the enhanced linear attention matches or surpasses Softmax while retaining linear computational cost.

Attention MechanismEfficient TransformersLinear Attention

0 likes · 24 min read

Why Linear Attention Lags Behind Softmax and How Two Simple Tweaks Close the Gap

Laiye Technology Team

Apr 1, 2022 · Artificial Intelligence

Self‑Supervised Learning: Contrastive Methods and the MoCo Series (V1‑V3)

This article introduces the four types of machine learning, explains self‑supervised learning, details generative and contrastive approaches, and provides an in‑depth overview of the MoCo series (V1‑V3), including their architectures, training strategies, and experimental results on document image classification and text‑line detection tasks.

MoCoVision Transformerscontrastive learning

0 likes · 12 min read

Self‑Supervised Learning: Contrastive Methods and the MoCo Series (V1‑V3)

Meituan Technology Team

Mar 24, 2022 · Artificial Intelligence

Twins: Efficient Visual Attention Models for Vision Transformers

The Twins series, a collaboration between Meituan and the University of Adelaide, introduces conditional positional encoding and spatially separable self‑attention to improve efficiency and performance of vision transformers, achieving state‑of‑the‑art results on ImageNet, ADE20K, COCO and high‑precision map segmentation.

ADE20KCOCOConditional Positional Encoding

0 likes · 20 min read

Twins: Efficient Visual Attention Models for Vision Transformers

Baobao Algorithm Notes

Jan 28, 2022 · Artificial Intelligence

How Masked Autoencoders Revolutionize Vision Pre‑Training: A Deep Dive

This article provides a detailed technical walkthrough of Masked Autoencoders (MAE) for computer vision, covering its BERT‑inspired masking strategy, asymmetric encoder‑decoder design, implementation specifics, experimental findings on mask ratios and decoder depth, and the resulting performance gains over supervised ViT models.

MAEMasked ModelingPyTorch

0 likes · 11 min read

How Masked Autoencoders Revolutionize Vision Pre‑Training: A Deep Dive

Code DAO

Dec 8, 2021 · Artificial Intelligence

Understanding Compact Transformers: Build and Train Vision & NLP Models on a Personal PC

This article walks through the design of Compact Transformers, explaining scaled dot‑product self‑attention, positional embeddings, multi‑head attention, and Vision Transformer architecture, and provides full PyTorch code so readers can train lightweight CV and NLP classifiers on a single PC.

Compact TransformersMulti-Head AttentionPatch Embedding

0 likes · 19 min read

Understanding Compact Transformers: Build and Train Vision & NLP Models on a Personal PC