CAS-ViT: The Fastest, Strongest Vision Transformer for Mobile Image Classification & Detection
CAS‑ViT introduces a convolutional additive self‑attention mechanism that dramatically reduces the computational cost of Vision Transformers, achieving state‑of‑the‑art accuracy on image classification, object detection, and segmentation while being deployable on mobile devices.
Problem
Standard Vision Transformers (ViTs) rely on multi‑head self‑attention (MSA) whose dot‑product and softmax operations have quadratic complexity with respect to the token count. This makes inference on resource‑constrained platforms (e.g., smartphones) prohibitively expensive.
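To make the quadratic growth concrete, the following minimal sketch counts only the multiply‑accumulates tied to the N×N attention matrix. The patch size of 16 and the channel width of 64 are hypothetical values chosen for illustration, not figures from the paper.

```python
def attention_matrix_macs(num_tokens: int, channels: int) -> int:
    """Multiply-accumulates spent on the N x N attention matrix."""
    # Q @ K^T builds the N x N matrix:   N * N * C MACs.
    # attn @ V consumes it once more:    N * N * C MACs.
    return 2 * num_tokens * num_tokens * channels


def tokens_for_image(side: int, patch: int) -> int:
    """Token count when a square image is cut into non-overlapping patches."""
    return (side // patch) ** 2


# Hypothetical setting: 16x16 patches, 64 channels.
small = attention_matrix_macs(tokens_for_image(224, 16), 64)
large = attention_matrix_macs(tokens_for_image(448, 16), 64)
print(small, large, large // small)  # → 4917248 78675968 16
```

Doubling the image side quadruples the token count and multiplies this term by 16, which is the growth a linear token mixer is meant to avoid.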
Design Goal
Replace the matrix‑based token mixer with a lightweight operation that preserves global context while scaling linearly with the input size.
Convolutional Additive Token Mixer (CATM)
The authors observe that effective token mixing stems from multiple information interactions across the spatial and channel dimensions. They therefore define an additive similarity function that sums two context scores, one computed from the query (Q) and one from the key (K), with no dot product or softmax.
Formulation:
Sim(Q, K) = Φ(Q) + Φ(K), with Φ(Q) = C(S(Q))
where S(·) is a sigmoid‑based spatial attention and C(·) is a sigmoid‑based channel attention, both built from convolutions. The output of the CATM module is O = Γ(Φ(Q) + Φ(K)) ⊙ V: the summed context scores pass through a linear transformation Γ and then weight the value tensor V element‑wise.
Because every operation is a convolution or an element‑wise product, the computational cost grows linearly with the token count N (for a fixed channel width C), eliminating the O(N²) term that conventional self‑attention incurs when it forms the N×N attention matrix from Q and K.
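A toy, pure‑Python sketch of an additive mixer may help show why no N×N matrix is needed. It follows the additive reading of the similarity (one context score per branch, summed) and makes several simplifying assumptions that are not from the paper's code: the learned convolutions are replaced by plain averages, the final integrating transformation is taken as the identity, and all function names are hypothetical.

```python
import math
from typing import List

Tokens = List[List[float]]  # N tokens, each with C channels


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def spatial_gate(x: Tokens) -> Tokens:
    """Scale each token by one sigmoid score (stand-in for spatial attention)."""
    out = []
    for token in x:
        # Assumption: a channel mean replaces the learned convolution.
        score = sigmoid(sum(token) / len(token))
        out.append([score * value for value in token])
    return out


def channel_gate(x: Tokens) -> Tokens:
    """Scale each channel by one sigmoid score (stand-in for channel attention)."""
    n, channels = len(x), len(x[0])
    # Global average over tokens gives every channel a view of all positions.
    scores = [sigmoid(sum(tok[c] for tok in x) / n) for c in range(channels)]
    return [[scores[c] * tok[c] for c in range(channels)] for tok in x]


def context(x: Tokens) -> Tokens:
    """Context score: channel gate applied after spatial gate."""
    return channel_gate(spatial_gate(x))


def additive_mixer(q: Tokens, k: Tokens, v: Tokens) -> Tokens:
    """Sum the two context scores, then weight V element-wise."""
    phi_q, phi_k = context(q), context(k)
    return [
        [(a + b) * c for a, b, c in zip(tq, tk, tv)]
        for tq, tk, tv in zip(phi_q, phi_k, v)
    ]


q = [[1.0, -1.0], [0.5, 2.0], [0.0, 0.3]]
k = [[0.2, 0.4], [-0.5, 1.0], [1.5, -0.7]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = additive_mixer(q, k, v)
print(len(out), len(out[0]))  # → 3 2
```

Every function touches each of the N·C entries a constant number of times, so the work in this toy version grows linearly with the number of tokens.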
Comparison with prior efficient attention:
Separable self‑attention (Mehta & Rastegari, 2022) separates the Q and K branches but collapses the features to a score vector, losing information. CATM retains the full‑dimensional features in each branch.
Efficient additive attention (e.g., SwiftFormer) applies a sigmoid‑based attention only at a single stage. CATM inserts the additive mixer into every transformer layer, providing consistent global context throughout the network.
Network Architecture
The backbone follows a four‑stage encoder design (Figure 3). An input image is first down‑sampled 4× by two consecutive stride‑2 convolutions, reducing spatial resolution while increasing channel width. Each stage contains a stack of identical blocks; the number of blocks and the channel widths define the model variants (XS, S, etc.).
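As a rough illustration of what the four stages mean for token counts, the sketch below traces the feature‑map size for a 224×224 input. The 4× stem reduction follows from the two stride‑2 convolutions; the 2× down‑sampling between later stages is an assumption based on common hierarchical designs, not something stated in this summary.

```python
from typing import List, Tuple


def stage_token_counts(image_side: int = 224, num_stages: int = 4) -> List[Tuple[int, int]]:
    """Return (feature-map side, token count) for each stage."""
    side = image_side // 4  # two stride-2 convolutions: 4x reduction
    counts = []
    for _ in range(num_stages):
        counts.append((side, side * side))
        side //= 2  # assumption: stride-2 down-sampling between stages
    return counts


for stage, (side, tokens) in enumerate(stage_token_counts(), start=1):
    print(f"stage {stage}: {side}x{side} feature map, {tokens} tokens")
# → stage 1: 56x56 feature map, 3136 tokens
# → stage 2: 28x28 feature map, 784 tokens
# → stage 3: 14x14 feature map, 196 tokens
# → stage 4: 7x7 feature map, 49 tokens
```

Under these assumptions the first stage carries 3136 tokens, which is where a token mixer with quadratic cost would be most expensive.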
Each block consists of three sub‑modules:
Fusion subnet: three depthwise convolutions with ReLU activation that mix local information.
CATM module (described above) that provides global token mixing.
MLP: a point‑wise feed‑forward network.
The design mirrors hybrid CNN‑Transformer approaches (e.g., EfficientViT, EdgeViT) but replaces the self‑attention head with CATM, yielding a uniform architecture across all stages.
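The ordering of the three sub‑modules can be sketched as a simple composition. The residual shortcuts around each sub‑module are an assumption here (they are standard in transformer blocks but are not described in this summary), and the stand‑in functions are purely illustrative.

```python
from typing import Callable, List

Features = List[float]
SubModule = Callable[[Features], Features]


def with_residual(fn: SubModule, x: Features) -> Features:
    """Return x + fn(x), element-wise (residual shortcut; an assumption)."""
    return [xi + yi for xi, yi in zip(x, fn(x))]


def cas_vit_block(x: Features, fusion: SubModule, catm: SubModule, mlp: SubModule) -> Features:
    """Apply local fusion, global token mixing, then the feed-forward network."""
    x = with_residual(fusion, x)
    x = with_residual(catm, x)
    x = with_residual(mlp, x)
    return x


# Hypothetical stand-ins for the real learned sub-modules.
def halve(feats: Features) -> Features:
    return [0.5 * f for f in feats]


def zero(feats: Features) -> Features:
    return [0.0 for _ in feats]


print(cas_vit_block([1.0, 2.0], halve, zero, zero))  # → [1.5, 3.0]
```

Because the same three‑part block is used in every stage, a model variant is fully described by how many blocks each stage stacks and how wide its channels are.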
Training Procedure
Training follows the EdgeNeXt recipe on ImageNet‑1K without any external data or pre‑training.
Input resolution: 224×224.
Optimizer: AdamW.
Learning‑rate schedule: cosine decay with a 20‑epoch warm‑up.
Initial learning rate: 6×10⁻³.
Batch size: 2048.
Total epochs: 300.
Label smoothing: 0.1.
Data augmentations: random resized cropping, horizontal flip, RandAugment, multi‑scale sampling.
EMA momentum: 0.9995.
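The learning‑rate schedule listed above can be sketched as a plain function of the epoch. The peak rate, warm‑up length, and total epochs come from the recipe; the linear shape of the warm‑up, its start at zero, and the final rate of zero are assumptions, and the function name is hypothetical.

```python
import math


def learning_rate(
    epoch: float,
    base_lr: float = 6e-3,
    warmup_epochs: int = 20,
    total_epochs: int = 300,
    min_lr: float = 0.0,  # assumption: decay all the way to zero
) -> float:
    """Cosine decay preceded by a linear warm-up."""
    if epoch < warmup_epochs:
        # Assumption: warm-up rises linearly from 0 to base_lr.
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 10, 20, 160, 300):
    print(epoch, f"{learning_rate(epoch):.5f}")
```

The rate peaks at epoch 20 and falls to half of the peak at epoch 160, the midpoint of the decay phase.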
After the 300‑epoch training run at 224×224, models are fine‑tuned for 30 epochs at a higher resolution (384×384) with a learning rate of 1×10⁻³ and a batch size of 64.
Implementation details:
Framework: PyTorch 1.14 with TIMM 1.
Hardware: 16 × NVIDIA V100 GPUs.
Export: ONNX conversion for cross‑platform inference.
Mobile deployment: CoreML compilation and throughput measurement on iPhone X Neural Engine.
Experimental Evaluation
Benchmarks cover three major vision tasks:
ImageNet‑1K classification.
COCO object detection.
ADE20K semantic segmentation.
Table 1 (referenced in the paper) shows that every CAS‑ViT variant achieves a higher Top‑1 accuracy than MobileNetV3 and other efficient ViTs while using fewer parameters and fewer FLOPs. For example, the XS model attains the best accuracy‑efficiency trade‑off among models at the million‑parameter scale.
Throughput measurements:
GPU (V100) inference at batch size 64.
ONNX runtime on Intel Xeon Gold CPU @ 3.00 GHz (batch size 64).
ANE (iPhone X) real‑time inference, confirming suitability for mobile deployment.
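Throughput figures of this kind are usually obtained by timing repeated forward passes over a fixed batch. The sketch below shows one generic way to do that; it is not the authors' benchmarking script, and the dummy model, warm‑up count, and iteration count are hypothetical.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    run_batch: Callable[[Sequence], object],
    batch: Sequence,
    warmup: int = 3,
    iters: int = 10,
) -> float:
    """Return processed items per second for run_batch on a fixed batch."""
    for _ in range(warmup):
        run_batch(batch)  # untimed passes to let caches and lazy setup settle
    start = time.perf_counter()
    for _ in range(iters):
        run_batch(batch)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return iters * len(batch) / elapsed


def dummy_model(batch: Sequence) -> list:
    """Hypothetical stand-in for a real inference call."""
    return [sum(item) for item in batch]


batch = [[0.0] * 1000 for _ in range(64)]  # batch size 64, as in the benchmarks
images_per_second = measure_throughput(dummy_model, batch)
print(f"{images_per_second:.0f} images/s")
```

On a GPU, kernels run asynchronously, so a real measurement would also need to synchronize with the device before reading the clock.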
The authors also report that the additive mixer introduces negligible latency compared with standard MSA, yet consistently improves accuracy across all three tasks.
Conclusion
By substituting the quadratic dot‑product‑softmax pipeline with a convolution‑based additive similarity, CAS‑ViT delivers a linear‑complexity token mixer that retains full‑dimensional feature interactions. The resulting models achieve state‑of‑the‑art accuracy on classification, detection, and segmentation while meeting the strict latency and memory budgets of mobile devices.
Code and pretrained checkpoints are publicly available at https://github.com/Tianfang-Zhang/CAS-ViT.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
