Multi-View Transformer (MVFormer) Sets New Accuracy Records Across Classification, Detection, and Segmentation
The paper proposes MVFormer, a Vision Transformer that combines a Multi‑View Normalization (MVN) module and a Multi‑View Token Mixer (MVTM) to diversify feature learning. It achieves state‑of‑the‑art Top‑1 accuracies of 83.4%–84.6% on ImageNet‑1K and superior performance on COCO detection and ADE20K segmentation while using comparable or fewer parameters and MACs.
Introduction
Vision Transformers (ViTs) achieve strong performance but most work concentrates on improving the token‑mixing operator, while the impact of normalization is rarely explored. The authors address this gap by introducing two complementary modules:
Multi‑View Normalization (MVN): a learnable weighted sum of BatchNorm (BN), LayerNorm (LN), and InstanceNorm (IN) features.
Multi‑View Token Mixer (MVTM): a depth‑wise separable convolutional mixer that processes three channel groups with distinct receptive fields (local, intermediate, global) and adapts the kernel sizes per stage.
Both modules are inserted into a MetaFormer‑style ViT backbone, yielding the MVFormer family.
Method
3.1 Preliminaries
MetaFormer abstracts a ViT as a sequence of Norm → TokenMixer → MLP blocks, leaving the token mixer unspecified. The baseline adopts ConvFormer’s depth‑wise separable convolution as the token mixer.
3.2 Multi‑View Transformer
3.2.1 Multi‑View Normalization (MVN)
Given an input tensor X ∈ ℝ^{B×N×C}, three parallel normalizations produce X_{BN}, X_{LN}, and X_{IN}. Learnable per‑channel weight vectors w_{BN}, w_{LN}, w_{IN} ∈ ℝ^{C} scale each view, and the results are summed:

Y = w_{BN} ⊙ X_{BN} + w_{LN} ⊙ X_{LN} + w_{IN} ⊙ X_{IN}

The weights are trained jointly with the rest of the network, adding only 3 × C parameters and negligible FLOPs. This design lets the model simultaneously exploit batch‑level, channel‑level, and sample‑level statistics.
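As a concrete illustration, the three views and their weighted combination can be sketched in NumPy. The shapes, eps value, and the omission of per-view affine parameters are simplifying assumptions, not the paper's exact implementation:

```python
import numpy as np

def mvn(x, w_bn, w_ln, w_in, eps=1e-5):
    """Multi-View Normalization sketch: x has shape (B, N, C),
    where N is the token count and C the embedding dimension."""
    # BatchNorm view: statistics over batch and tokens, per channel
    x_bn = (x - x.mean(axis=(0, 1), keepdims=True)) / np.sqrt(x.var(axis=(0, 1), keepdims=True) + eps)
    # LayerNorm view: statistics over channels, per token
    x_ln = (x - x.mean(axis=2, keepdims=True)) / np.sqrt(x.var(axis=2, keepdims=True) + eps)
    # InstanceNorm view: statistics over tokens, per sample and channel
    x_in = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)
    # Learnable per-channel weight vectors (shape (C,)) combine the views
    return w_bn * x_bn + w_ln * x_ln + w_in * x_in
```

Setting one weight vector to ones and the others to zeros recovers the corresponding single normalization, which is why the learned weights can be read as a soft selection among the three views.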
3.2.2 Multi‑View Token Mixer (MVTM)
MVTM splits the channel dimension into three groups (C_{loc}, C_{mid}, C_{glo}) satisfying C_{loc} + C_{mid} + C_{glo} = C. Each group passes through a depth‑wise convolution with a different kernel size:
Local filter: small kernel (e.g., 3×3) for fine‑grained patterns.
Intermediate filter: medium kernel (e.g., 5×5) to bridge local and global contexts.
Global filter: large kernel (e.g., 7×7) for long‑range interactions.
After depth‑wise convolution, a point‑wise 1×1 convolution mixes the channels, and the three outputs are concatenated. Stage‑specific configurations adjust the channel ratios and the global kernel size; early stages allocate more channels to the local filter, while later stages increase the global filter’s share and shrink its kernel, matching the decreasing spatial resolution of the feature map.
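The grouped mixing can be sketched as follows. The specific kernel sizes, channel split, and the concatenate-then-pointwise ordering are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """Per-channel 2D convolution with 'same' zero padding.
    x: (C, H, W), kernels: (C, k, k). Naive loops, for clarity only."""
    C, H, W = x.shape
    k = kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * kernels[c])
    return out

def mvtm(x, c_loc, c_mid, k_loc, k_mid, k_glo, w_point):
    """Multi-View Token Mixer sketch: split channels into three groups,
    apply depthwise convs with different receptive fields, concatenate,
    then mix all channels with a pointwise (1x1) convolution.
    x: (C, H, W); w_point: (C, C)."""
    loc = depthwise_conv2d(x[:c_loc], k_loc)                   # small kernel
    mid = depthwise_conv2d(x[c_loc:c_loc + c_mid], k_mid)      # medium kernel
    glo = depthwise_conv2d(x[c_loc + c_mid:], k_glo)           # large kernel
    y = np.concatenate([loc, mid, glo], axis=0)
    C, H, W = y.shape
    # 1x1 convolution is a matrix multiply over the channel dimension
    return (w_point @ y.reshape(C, -1)).reshape(C, H, W)
```

With identity (delta) kernels and an identity pointwise matrix, the mixer reduces to a pass-through, which makes the role of each component easy to verify in isolation.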
3.2.3 MVFormer Block
The MVFormer block places MVN before the token mixer and again inside the MLP sub‑block, encouraging synergistic interaction between normalization and mixing. Both the token mixer and the MLP use the StarReLU activation function.
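StarReLU, introduced with the MetaFormer baselines, squares the ReLU output and applies a learnable scale and bias. A minimal sketch follows; the default constants shown are the commonly used initialization and should be treated as assumptions here:

```python
import numpy as np

def star_relu(x, s=0.8944, b=-0.4472):
    """StarReLU sketch: s * ReLU(x)**2 + b.
    s and b are learnable scalars in the actual network; the defaults
    here are the usual initialization values (assumed, not verified
    against the paper's code)."""
    return s * np.maximum(x, 0.0) ** 2 + b
```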
3.2.4 Overall Architecture
Four model sizes are defined:
MVFormer‑xT (tiny)
MVFormer‑T (tiny)
MVFormer‑S (small)
MVFormer‑B (base)
All share the same MetaFormer backbone augmented with MVN and MVTM; parameter counts and MACs increase from xT to B.
Experiments
4.1 Image Classification
Training on ImageNet‑1K (1.28 M images, 1 K classes) follows the recipe:
300 epochs, batch size 4096
AdamW optimizer, weight decay 0.05, base LR 4e‑3
Cosine‑annealing LR with 20‑epoch warm‑up
Data augmentations: RandAugment, Random Erasing, Mixup, CutMix, label smoothing
Stochastic depth probabilities: 0.2 – 0.4, depending on variant
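The warm-up plus cosine-annealing schedule above can be sketched as a function of the epoch; the linear warm-up shape and zero minimum LR are assumptions, since the recipe does not spell them out:

```python
import math

def lr_at(epoch, total=300, warmup=20, base_lr=4e-3, min_lr=0.0):
    """Cosine-annealed learning rate with linear warm-up, matching the
    recipe above (300 epochs, 20-epoch warm-up, base LR 4e-3)."""
    if epoch < warmup:
        # linear ramp from base_lr/warmup up to base_lr
        return base_lr * (epoch + 1) / warmup
    # cosine decay from base_lr down to min_lr over the remaining epochs
    t = (epoch - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```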
Results (Top‑1 accuracy):
MVFormer‑T = 83.4 %
MVFormer‑S = 84.3 %
MVFormer‑B = 84.6 %
Each variant outperforms the strongest convolution‑based baselines (ConvFormer‑S18, S36, M36) by 0.1‑0.4 % while using comparable or fewer MACs.
4.2 Object Detection & Instance Segmentation
On COCO 2017 (118 K train, 5 K val), the authors fine‑tune MVFormer‑T and MVFormer‑S as backbones for Mask R‑CNN and RetinaNet (implemented with mmdetection). Training uses a single‑scale 800‑pixel short side, with learning rates of 2e‑4 (Mask R‑CNN) and 1e‑4 (RetinaNet) and step decays at epochs 8/11 and 27/33, respectively. Both variants achieve the highest mean Average Precision (mAP) among ViT‑based detectors while requiring fewer parameters than competing models.
4.3 Semantic Segmentation
On ADE20K (20 K train, 2 K val), MVFormer‑T and MVFormer‑S serve as backbones for Semantic FPN (mmsegmentation). Training runs for 40 K iterations with batch size 32, using AdamW and a cosine LR schedule. The models improve mIoU by 0.4 points (T) and 0.7 points (S) over the best convolution‑based ViTs (VAN‑B2/B3) at similar efficiency.
4.4 Ablation Studies
4.4.1 Individual Modules
Using the MVFormer‑xT backbone on ImageNet‑1K:
Adding MVN alone yields +0.53 % Top‑1.
Adding MVTM alone yields +0.17 % Top‑1.
Combining both reaches 81.30 % Top‑1, confirming complementary benefits.
4.4.2 Normalization Combinations
All three normalizations together (BN + LN + IN) outperform any pair. IN alone degrades performance, but its inclusion with BN/LN provides a synergistic gain, supporting the hypothesis that batch‑level, channel‑level and sample‑level statistics jointly enrich feature diversity.
4.4.3 MVN on Existing Architectures
Replacing LN with MVN in Swin‑T, ConvFormer‑S18, ConvNeXt‑T and PoolFormer‑S36, and replacing BN with MVN in ResNet‑50, consistently raises Top‑1 accuracy by ≈0.2 % across all five models, demonstrating MVN’s broad applicability.
4.4.4 MVTM Design
Removing the smallest (local) filter markedly reduces accuracy, indicating that fine‑grained filters are essential for covering diverse visual patterns. The stage‑specific channel ratios and global‑kernel sizes (listed in Table 1 of the original paper) are crucial for balancing local detail and global context.
4.4.5 Learned MVN Weights
Analysis of the learned weight vectors shows a consistent trend: BN and IN dominate early stages, while LN receives the highest proportion in later stages. This suggests early layers benefit from batch‑level stability and sample‑level spatial statistics, while later layers rely more on channel‑level statistics, consistent with the feature diversity hypothesis behind MVN.
Conclusion
Integrating multi‑view normalization and a stage‑aware multi‑scale token mixer yields consistent accuracy improvements across classification, detection and segmentation tasks without increasing model size or computational cost. Extensive ablations confirm that each component contributes uniquely, and the approach can be transplanted into existing ViT and CNN backbones.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
