
Robust Input Visualization Methods for Vision Transformers

The paper proposes a robust Grad‑CAM‑inspired visualization for Vision Transformers that combines attention weights and gradients to generate class‑specific saliency maps, demonstrates superior alignment with discriminative regions across ViT, Swin Transformer, and Volo models, and shows a 76.2% false‑positive reduction in Baidu's porn‑content risk control system.

Baidu Geek Talk

The article introduces a robust method for visualizing the input relevance of Vision Transformers (ViT) and demonstrates its practical impact in Baidu's content understanding and risk control systems.

1. Meaning and Significance of Visualization – Since the rise of AlexNet, CNNs have dominated visual tasks, but both CNNs and Transformers are largely black‑box models. Visualizing internal activations or gradients as saliency maps helps researchers understand what pixels or regions drive a model’s decision.

2. Differences Between CNN and Transformer Input Visualization – CNNs retain spatial correspondence between feature maps and the input image, enabling methods such as Class Activation Map (CAM) and Grad‑CAM to up‑sample activation or gradient maps to the input resolution. In contrast, ViT splits an image into patches, embeds each patch into a token, and processes them with self‑attention; there is no direct spatial feature map, so traditional CAM‑based techniques cannot be applied directly.
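The CNN case can be made concrete with a minimal Grad‑CAM sketch (a generic NumPy illustration, not the article's code): each channel of a convolutional feature map is weighted by the spatial mean of its gradient, the weighted maps are summed, and only positive evidence is kept before up‑sampling to the input resolution.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM over one CNN layer.

    feature_maps, gradients: arrays of shape (C, H, W) -- activations of a
    convolutional layer and the gradients of the target-class score with
    respect to them (hypothetical inputs captured via framework hooks).
    """
    weights = gradients.mean(axis=(1, 2))          # (C,) per-channel importance
    cam = np.einsum("c,chw->hw", weights, feature_maps)  # weighted channel sum
    cam = np.maximum(cam, 0)                       # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                      # normalize to [0, 1]
    return cam                                     # (H, W); up-sample to input size
```

This works precisely because the (H, W) grid of a CNN feature map still corresponds spatially to the input image, which is the property ViT's token sequence lacks.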

3. Existing Transformer Visualization Approaches – Current methods (e.g., rollout, LRP, Partial‑LRP) rely on self‑attention weights or gradient‑based relevance propagation but have limitations such as class‑agnostic outputs or poor performance on small objects.
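Attention rollout [1], the simplest of these baselines, can be sketched as follows (my own NumPy rendering of the published idea): head‑averaged attention matrices are mixed with the identity to account for residual connections, then multiplied through the layers. Note that nothing in this computation depends on the target class, which is exactly the class‑agnostic limitation the article points out.

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout (Abnar & Zuidema, 2020).

    attentions: list of per-layer arrays of shape (heads, tokens, tokens).
    Returns a (tokens, tokens) map; if token 0 is the class token, row 0
    gives its accumulated attention to every patch token.
    """
    result = np.eye(attentions[0].shape[-1])
    for attn in attentions:
        a = attn.mean(axis=0)                       # average over heads
        a = 0.5 * a + 0.5 * np.eye(a.shape[-1])     # mix in residual connection
        a = a / a.sum(axis=-1, keepdims=True)       # re-normalize rows
        result = a @ result                         # propagate through layers
    return result
```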

4. Proposed Robust Transformer Input Visualization Method – Inspired by Grad‑CAM, the method combines self‑attention activations and their gradients. For each Transformer block, the class token’s attention weights to all patch tokens are multiplied by the corresponding gradients (obtained by back‑propagating a one‑hot target logit). The per‑block saliency maps are averaged across heads, then multiplied across all blocks (with a small epsilon added to avoid zero‑value collapse) to obtain a final patch‑level importance score. These scores are reshaped, up‑sampled to the original image size, and normalized to produce the input saliency map.
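The procedure above can be sketched in a few lines of NumPy. This is my reading of the description, not the authors' code: the token layout (class token at index 0), the clamping of negative attention‑gradient products, and the min‑max normalization are all assumptions on my part.

```python
import numpy as np

def transformer_grad_cam(attentions, attn_grads, grid=(14, 14), eps=1e-6):
    """Sketch of the article's Grad-CAM-style ViT visualization.

    attentions, attn_grads: per-block arrays of shape (heads, tokens, tokens)
    -- self-attention weights and their gradients from back-propagating a
    one-hot target logit. Token 0 is assumed to be the class token.
    """
    n_patches = grid[0] * grid[1]
    saliency = np.ones(n_patches)
    for attn, grad in zip(attentions, attn_grads):
        block = (attn * grad).mean(axis=0)[0, 1:]  # class-token row, head-averaged
        block = np.maximum(block, 0) + eps         # epsilon avoids zero-value collapse
        saliency = saliency * block                # combine blocks multiplicatively
    saliency = saliency.reshape(grid)
    rng = saliency.max() - saliency.min()
    saliency = (saliency - saliency.min()) / (rng + 1e-12)  # normalize to [0, 1]
    return saliency                                # up-sample to image size for display
```

Because the gradients are taken with respect to a chosen class logit, re‑running the backward pass with a different one‑hot target yields a different map, which is what makes the explanation class‑specific.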

5. Experimental Results – Visualizations on ImageNet‑trained ViT, Swin‑Transformer, and Volo models show that the highlighted regions align well with the true discriminative pixels for both top‑1 and top‑2 predicted classes. The method is robust across architectures and yields class‑specific explanations.

6. Business Application Case – In Baidu’s image‑search porn‑content risk control, a Volo‑based classifier suffered high false‑positive rates. By applying the proposed visualization to mis‑detected samples, 19 problematic visual patterns (e.g., dark shorts, crossed arms) were identified and used to augment negative samples for fine‑tuning. The false‑positive rate dropped by 76.2% while maintaining recall.

7. Conclusion – The presented robust input visualization technique effectively explains Vision Transformer decisions and has been deployed in real‑world Baidu services. Future work may incorporate the value (V) component of self‑attention for even richer explanations.

References

[1] Samira Abnar & Willem Zuidema, “Quantifying attention flow in transformers,” arXiv:2005.00928, 2020.

[2] Sebastian Bach et al., “On pixel‑wise explanations for non‑linear classifier decisions by layer‑wise relevance propagation,” PLOS ONE, 2015.

[3] Elena Voita et al., “Analyzing multi‑head self‑attention: Specialized heads do the heavy lifting, the rest can be pruned,” ACL 2019.

Tags: deep learning, self-attention, model interpretability, Vision Transformer, Grad-CAM, input visualization