Multi-View Transformer (MVFormer) Sets New Accuracy Records Across Classification, Detection, and Segmentation
The paper proposes MVFormer, a Vision Transformer that combines a Multi‑View Normalization (MVN) module and a Multi‑View Token Mixer (MVTM) to diversify feature learning. It achieves state‑of‑the‑art Top‑1 accuracies of 83.4%–84.6% on ImageNet‑1K and superior performance on COCO detection and ADE20K segmentation while using comparable or fewer parameters and MACs.
Introduction
Vision Transformers (ViTs) achieve strong performance but most work concentrates on improving the token‑mixing operator, while the impact of normalization is rarely explored. The authors address this gap by introducing two complementary modules:
Multi‑View Normalization (MVN): a learnable weighted sum of BatchNorm (BN), LayerNorm (LN), and InstanceNorm (IN) features.
Multi‑View Token Mixer (MVTM): a depth‑wise separable convolutional mixer that processes three channel groups with distinct receptive fields (local, intermediate, global) and adapts the kernel sizes per stage.
Both modules are inserted into a MetaFormer‑style ViT backbone, yielding the MVFormer family.
Method
3.1 Preliminaries
MetaFormer abstracts a ViT as a sequence of Norm → TokenMixer → MLP blocks, leaving the token mixer unspecified. The baseline adopts ConvFormer’s depth‑wise separable convolution as the token mixer.
3.2 Multi‑View Transformer
3.2.1 Multi‑View Normalization (MVN)
Given an input tensor X ∈ ℝ^{B×N×C}, three parallel normalizations produce X_{BN}, X_{LN}, and X_{IN}. Learnable per‑channel weight vectors w_{BN}, w_{LN}, w_{IN} ∈ ℝ^{C} scale each view, and the results are summed:

Y = w_{BN} ⊙ X_{BN} + w_{LN} ⊙ X_{LN} + w_{IN} ⊙ X_{IN}

The weights are trained jointly with the rest of the network, adding only 3 × C parameters and negligible FLOPs. This design lets the model simultaneously exploit batch‑level, channel‑level, and sample‑level statistics.
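As a concrete illustration, the three views and their weighted combination can be sketched in NumPy. The shapes, eps value, and the omission of per-view affine parameters are simplifying assumptions, not the paper's exact implementation:

```python
import numpy as np

def mvn(x, w_bn, w_ln, w_in, eps=1e-5):
    """Multi-View Normalization sketch: x has shape (B, N, C),
    where N is the token count and C the embedding dimension."""
    # BatchNorm view: statistics over batch and tokens, per channel
    x_bn = (x - x.mean(axis=(0, 1), keepdims=True)) / np.sqrt(x.var(axis=(0, 1), keepdims=True) + eps)
    # LayerNorm view: statistics over channels, per token
    x_ln = (x - x.mean(axis=2, keepdims=True)) / np.sqrt(x.var(axis=2, keepdims=True) + eps)
    # InstanceNorm view: statistics over tokens, per sample and channel
    x_in = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)
    # Learnable per-channel weight vectors (shape (C,)) combine the views
    return w_bn * x_bn + w_ln * x_ln + w_in * x_in
```

Setting one weight vector to ones and the others to zeros recovers the corresponding single normalization, which is why the learned weights can be read as a soft selection among the three views.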
3.2.2 Multi‑View Token Mixer (MVTM)
MVTM splits the channel dimension into three groups (C_{loc}, C_{mid}, C_{glo}) satisfying C_{loc} + C_{mid} + C_{glo} = C. Each group passes through a depth‑wise convolution with a different kernel size:
Local filter: small kernel (e.g., 3×3) for fine‑grained patterns.
Intermediate filter: medium kernel (e.g., 5×5) to bridge local and global contexts.
Global filter: large kernel (e.g., 7×7) for long‑range interactions.
After depth‑wise convolution, a point‑wise 1×1 convolution mixes the channels, and the three outputs are concatenated. Stage‑specific configurations adjust the channel ratios and the global kernel size; early stages allocate more channels to the local filter, while later stages increase the global filter’s share and shrink its kernel, matching the decreasing spatial resolution of the feature map.
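The grouped mixing can be sketched as follows. The specific kernel sizes, channel split, and the concatenate-then-pointwise ordering are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """Per-channel 2D convolution with 'same' zero padding.
    x: (C, H, W), kernels: (C, k, k). Naive loops, for clarity only."""
    C, H, W = x.shape
    k = kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * kernels[c])
    return out

def mvtm(x, c_loc, c_mid, k_loc, k_mid, k_glo, w_point):
    """Multi-View Token Mixer sketch: split channels into three groups,
    apply depthwise convs with different receptive fields, concatenate,
    then mix all channels with a pointwise (1x1) convolution.
    x: (C, H, W); w_point: (C, C)."""
    loc = depthwise_conv2d(x[:c_loc], k_loc)                   # small kernel
    mid = depthwise_conv2d(x[c_loc:c_loc + c_mid], k_mid)      # medium kernel
    glo = depthwise_conv2d(x[c_loc + c_mid:], k_glo)           # large kernel
    y = np.concatenate([loc, mid, glo], axis=0)
    C, H, W = y.shape
    # 1x1 convolution is a matrix multiply over the channel dimension
    return (w_point @ y.reshape(C, -1)).reshape(C, H, W)
```

With identity (delta) kernels and an identity pointwise matrix, the mixer reduces to a pass-through, which makes the role of each component easy to verify in isolation.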
3.2.3 MVFormer Block
The MVFormer block places MVN before the token mixer and again inside the MLP sub‑block, encouraging synergistic interaction between normalization and mixing. Both the token mixer and the MLP use the StarReLU activation function.
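StarReLU, introduced with the MetaFormer baselines, squares the ReLU output and applies a learnable scale and bias. A minimal sketch follows; the default constants shown are the commonly used initialization and should be treated as assumptions here:

```python
import numpy as np

def star_relu(x, s=0.8944, b=-0.4472):
    """StarReLU sketch: s * ReLU(x)**2 + b.
    s and b are learnable scalars in the actual network; the defaults
    here are the usual initialization values (assumed, not verified
    against the paper's code)."""
    return s * np.maximum(x, 0.0) ** 2 + b
```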
3.2.4 Overall Architecture
Four model sizes are defined:
MVFormer‑xT (tiny)
MVFormer‑T (tiny)
MVFormer‑S (small)
MVFormer‑B (base)
All share the same MetaFormer backbone augmented with MVN and MVTM; parameter counts and MACs increase from xT to B.
Experiments
4.1 Image Classification
Training on ImageNet‑1K (1.28 M images, 1 K classes) follows the recipe:
300 epochs, batch size 4096
AdamW optimizer, weight decay 0.05, base LR 4e‑3
Cosine‑annealing LR with 20‑epoch warm‑up
Data augmentations: RandAugment, Random Erasing, Mixup, CutMix, label smoothing
Stochastic depth probabilities: 0.2 – 0.4, depending on variant
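The warm-up plus cosine-annealing schedule above can be sketched as a function of the epoch; the linear warm-up shape and zero minimum LR are assumptions, since the recipe does not spell them out:

```python
import math

def lr_at(epoch, total=300, warmup=20, base_lr=4e-3, min_lr=0.0):
    """Cosine-annealed learning rate with linear warm-up, matching the
    recipe above (300 epochs, 20-epoch warm-up, base LR 4e-3)."""
    if epoch < warmup:
        # linear ramp from base_lr/warmup up to base_lr
        return base_lr * (epoch + 1) / warmup
    # cosine decay from base_lr down to min_lr over the remaining epochs
    t = (epoch - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```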
Results (Top‑1 accuracy):
MVFormer‑T = 83.4 %
MVFormer‑S = 84.3 %
MVFormer‑B = 84.6 %
Each variant outperforms the strongest convolution‑based baselines (ConvFormer‑S18, S36, M36) by 0.1‑0.4 % while using comparable or fewer MACs.
4.2 Object Detection & Instance Segmentation
On COCO 2017 (118 K train, 5 K val), the authors fine‑tune MVFormer‑T and MVFormer‑S as backbones for Mask R‑CNN and RetinaNet (implemented with mmdetection). Training uses a single‑scale 800‑pixel short side, with learning rates of 2e‑4 (Mask R‑CNN) and 1e‑4 (RetinaNet) and step decays at epochs 8/11 and 27/33, respectively. Both variants achieve the highest mean Average Precision (mAP) among ViT‑based detectors while requiring fewer parameters than competing models.
4.3 Semantic Segmentation
On ADE20K (20 K train, 2 K val), MVFormer‑T and MVFormer‑S serve as backbones for Semantic FPN (mmsegmentation). Training runs for 40 K iterations with batch size 32, using AdamW and a cosine LR schedule. The models improve mIoU by 0.4 points (T) and 0.7 points (S) over the best convolution‑based ViTs (VAN‑B2/B3) at similar efficiency.
4.4 Ablation Studies
4.4.1 Individual Modules
Using the MVFormer‑xT backbone on ImageNet‑1K:
Adding MVN alone yields +0.53 % Top‑1.
Adding MVTM alone yields +0.17 % Top‑1.
Combining both reaches 81.30 % Top‑1, confirming complementary benefits.
4.4.2 Normalization Combinations
All three normalizations together (BN + LN + IN) outperform any pair. IN alone degrades performance, but its inclusion with BN/LN provides a synergistic gain, supporting the hypothesis that batch‑level, channel‑level and sample‑level statistics jointly enrich feature diversity.
4.4.3 MVN on Existing Architectures
Replacing LN with MVN in Swin‑T, ConvFormer‑S18, ConvNeXt‑T and PoolFormer‑S36, and replacing BN with MVN in ResNet‑50, consistently raises Top‑1 accuracy by ≈0.2 % across all five models, demonstrating MVN’s broad applicability.
4.4.4 MVTM Design
Removing the smallest (local) filter markedly reduces accuracy, indicating that fine‑grained filters are essential for covering diverse visual patterns. The stage‑specific channel ratios and global‑kernel sizes (listed in Table 1 of the original paper) are crucial for balancing local detail and global context.
4.4.5 Learned MVN Weights
Analysis of the learned weight vectors shows a consistent trend: BN and IN dominate early stages, while LN receives the highest proportion in later stages. This suggests early layers benefit from batch‑level stability and sample‑level spatial statistics, while later layers rely more on channel‑level statistics, consistent with the feature diversity hypothesis behind MVN.
Conclusion
Integrating multi‑view normalization and a stage‑aware multi‑scale token mixer yields consistent accuracy improvements across classification, detection and segmentation tasks without increasing model size or computational cost. Extensive ablations confirm that each component contributes uniquely, and the approach can be transplanted into existing ViT and CNN backbones.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
