Understanding Slide-Transformer: An Efficient Local Attention Module for Vision Transformers
This article explains the Slide-Transformer paper, describing how the proposed Slide Attention replaces inefficient Im2Col‑based local attention with depthwise convolutions and a deformable shift module, achieving high efficiency, flexibility, and hardware‑agnostic performance for Vision Transformers.
The paper introduces Slide Attention, a novel local attention module for Vision Transformers (ViTs) that leverages standard convolution operations to achieve high efficiency, flexibility, and broad hardware compatibility.
Prior efficient attention designs rely on sparse global attention or shifted-window attention, while existing local attention implementations use the Im2Col function to unfold local patches; this unfolding incurs high computational cost and offers limited flexibility, especially on devices without CUDA support.
Slide Attention reinterprets the Im2Col operation from a row-wise perspective: each row of the unfolded matrix is simply the feature map shifted by a fixed offset, so the costly data slicing can be replaced by efficient depthwise convolutions that realize the shift without explicit matrix unfolding.
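The equivalence between a feature shift and a depthwise convolution can be seen with a tiny NumPy sketch. This is an illustration of the idea, not the paper's implementation; the function and variable names (`dwconv`, `w`, `dy`, `dx`) are ours:

```python
import numpy as np

def dwconv(x, w):
    """Naive zero-padded depthwise cross-correlation.
    x is (C, H, W), w is (C, k, k); illustrative only."""
    C, H, W = x.shape
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * w[c])
    return out

# A 3x3 kernel that is one-hot at offset (dy, dx) from the center
# reproduces a pure feature shift -- the row-wise view of Im2Col.
C, H, W, k = 2, 4, 4, 3
dy, dx = 1, 0
w = np.zeros((C, k, k))
w[:, k // 2 + dy, k // 2 + dx] = 1.0

x = np.arange(C * H * W, dtype=float).reshape(C, H, W)
shifted = dwconv(x, w)
# Identical to slicing the zero-padded map at the same offset:
assert np.allclose(shifted[:, :H - dy, :], x[:, dy:, :])
```

Each of the k×k shifted copies that Im2Col would materialize can thus be produced by one such one-hot depthwise kernel, which maps directly onto hardware-friendly convolution primitives.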
To further enhance flexibility, the authors add a deformable shift module that combines a designed convolution path with a learnable parallel path; during inference the two paths are merged via re‑parameterization, increasing model capacity while preserving inference speed.
Extensive experiments show that the Slide Attention module integrates seamlessly into various ViT architectures and runs efficiently on diverse hardware, including Metal Performance Shaders (MPS) backends and iPhone devices, delivering consistent performance gains.
In summary, by substituting the inefficient Im2Col with depthwise convolutions and augmenting it with a deformable shift module, Slide Attention provides an efficient, flexible, and universal solution for local attention in vision models.
Paper: https://arxiv.org/pdf/2304.04237.pdf
Code: https://github.com/LeapLabTHU/Slide-Transformer
Rare Earth Juejin Tech Community