Twins: Efficient Visual Attention Models for Vision Transformers

The Twins series, a collaboration between Meituan and the University of Adelaide, introduces conditional positional encoding and spatially separable self‑attention to improve efficiency and performance of vision transformers, achieving state‑of‑the‑art results on ImageNet, ADE20K, COCO and high‑precision map segmentation.

Meituan Technology Team

Mar 24, 2022

Overview

Twins is a visual attention model jointly proposed by Meituan and the University of Adelaide. The paper, accepted at NeurIPS 2021, details the challenges addressed, the design of two model families (Twins‑PCPVT and Twins‑SVT), and extensive experiments on several vision benchmarks.

Background

Vision Transformers (ViT) demonstrated that Transformer architectures could surpass convolutional networks on image classification, but they suffer from high computational cost and adapt poorly to dense prediction tasks such as detection and segmentation. Pyramid Vision Transformer (PVT) introduced a pyramid structure to generate multi‑scale features, yet its fixed positional encoding copes poorly with inputs of varying size, and its global self‑attention remains costly.

Subsequent works like Swin Transformer reduced computation by window‑based attention, at the expense of weaker global interactions.

Twins Model Design

To meet three goals (efficiency, flexible attention, and suitability for downstream tasks), Twins combines insights from PVT and CPVT and proposes two architectures:

Twins‑PCPVT: Replaces the fixed positional encoding in PVT with the Conditional Positional Encoding (CPE) from CPVT, which provides translation equivariance and handles variable‑size inputs. The CPE is generated by a Positional Encoding Generator (PEG) placed after the first Transformer encoder of each stage.

Twins‑SVT: Introduces Spatially Separable Self‑Attention (SSSA), which partitions the spatial dimension into groups, computes local self‑attention (LSA) within each group, and then fuses the results with a global self‑attention (GSA). This design mirrors depthwise separable convolutions and costs less than full self‑attention: for a fixed window size the local step scales linearly with the number of tokens, and the global step attends only to a sub‑sampled set of keys and values, so global context is preserved.
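
As a rough illustration of where the savings come from, the sketch below counts the multiply‑adds of the two attention matrix products for full attention, for window‑restricted attention, and for attention over sub‑sampled keys and values. The function names and the sizes (a 56×56 token map, window size 7, sub‑sampling ratio 8) are assumptions chosen for illustration, not values taken from the article, and projections and constant factors are ignored.

```python
def full_attention_cost(h, w, dim):
    """Multiply-adds of the two attention matmuls (QK^T and attn @ V) over all h*w tokens."""
    n = h * w
    return 2 * n * n * dim


def local_attention_cost(h, w, dim, ws):
    """Same count when every token attends only to the ws*ws tokens of its own window."""
    n = h * w
    return 2 * n * (ws * ws) * dim


def global_subsampled_cost(h, w, dim, sr):
    """Same count when keys/values come from a map down-sampled by sr along each side."""
    n = h * w
    n_kv = (h // sr) * (w // sr)
    return 2 * n * n_kv * dim


# Hypothetical sizes, chosen for illustration only.
h = w = 56
dim, ws, sr = 64, 7, 8
full = full_attention_cost(h, w, dim)
separable = local_attention_cost(h, w, dim, ws) + global_subsampled_cost(h, w, dim, sr)
print(full // separable)  # 32
```

Under these assumed sizes the local and global steps together need 32 times fewer multiply‑adds than full attention. Doubling the number of tokens doubles the local cost, whereas the full‑attention cost quadruples.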

Conditional Positional Encoding

The PEG module transforms the token tensor (shape B×N×C) into a feature map, applies a depthwise convolution, and adds the result back to the original tokens, producing position‑aware embeddings that match the input size. The implementation is straightforward:

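import torch.nn as nn

class PEG(nn.Module):
    def __init__(self, in_chans, embed_dim):
        super().__init__()
        # Depthwise 3x3 convolution (stride 1, padding 1) keeps the spatial size.
        # The residual add in forward() needs in_chans == embed_dim.
        self.peg = nn.Conv2d(in_chans, embed_dim, 3, 1, 1, bias=True, groups=embed_dim)

    def forward(self, feat_token, H, W):
        B, N, C = feat_token.shape  # N == H * W
        # Tokens (B, N, C) -> feature map (B, C, H, W)
        cnn_feat = feat_token.transpose(1, 2).view(B, C, H, W)
        # Add the convolution output back to the input
        x = self.peg(cnn_feat) + cnn_feat
        # Feature map -> tokens (B, N, C)
        x = x.flatten(2).transpose(1, 2)
        return x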
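
To make the mechanism concrete without any framework, here is a dependency‑free sketch of the same token → feature map → 3×3 convolution → residual add → token round trip for a single channel. The helper names and the all‑ones kernel are hypothetical; the real PEG uses learned per‑channel kernels with a bias. It illustrates two properties: the output always has as many tokens as the input, whatever H and W are, and with zero padding identical input tokens receive different outputs depending on where they sit in the map.

```python
def conv3x3_zero_pad(feat, kernel):
    """3x3 convolution, stride 1, zero padding 1, on a single-channel 2D list."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:  # positions outside the map count as zero
                        acc += kernel[di + 1][dj + 1] * feat[y][x]
            out[i][j] = acc
    return out


def peg_single_channel(tokens, h, w, kernel):
    """tokens: flat row-major list of length h*w -> position-aware tokens of the same length."""
    feat = [tokens[r * w:(r + 1) * w] for r in range(h)]  # tokens -> 2D map
    conv = conv3x3_zero_pad(feat, kernel)
    # Residual add, then flatten back to a token list
    return [tokens[r * w + c] + conv[r][c] for r in range(h) for c in range(w)]


ones_kernel = [[1.0] * 3 for _ in range(3)]  # hypothetical kernel, for illustration only
print(peg_single_channel([1.0] * 9, 3, 3, ones_kernel))
# [5.0, 7.0, 5.0, 7.0, 10.0, 7.0, 5.0, 7.0, 5.0]
```

Corner, edge, and centre tokens end up with different values even though the inputs are identical, and calling the same function with a 2×4 map returns 8 tokens without any change to the kernel.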

Spatially Separable Self‑Attention (SSSA)

SSSA consists of two complementary modules:

Local Self‑Attention (LSA): Splits the feature map into non‑overlapping windows, computes attention within each window, and reshapes the result back to the original token sequence.

Global Self‑Attention (GSA): Computes keys and values from a down‑sampled copy of the feature map while the queries keep every token, so each token still sees full‑image context at reduced cost.

Key code fragments are shown below (operations follow standard PyTorch conventions):

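class LSA(nn.Module):
    # Fragment: self.ws (window size), self.qkv, self.num_heads, self.scale, self.attn_drop,
    # self.proj and self.proj_drop are assumed to be set up in __init__, which is omitted here.
    def forward(self, x, H, W):
        B, N, C = x.shape
        h_group, w_group = H // self.ws, W // self.ws  # number of windows along H and W
        total_groups = h_group * w_group
        # Partition tokens into windows: (B, h_group, w_group, ws, ws, C)
        x = x.reshape(B, h_group, self.ws, w_group, self.ws, C).transpose(2, 3)
        qkv = self.qkv(x).reshape(B, total_groups, -1, 3, self.num_heads, C // self.num_heads)
        # (3, B, total_groups, num_heads, ws*ws, head_dim): tokens must be second to last
        qkv = qkv.permute(3, 0, 1, 4, 2, 5)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q @ k.transpose(-2, -1)) * self.scale  # attention inside each window
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        # Weight v, then undo the window partition to restore the original token order
        attn = (attn @ v).transpose(2, 3).reshape(B, h_group, w_group, self.ws, self.ws, C)
        x = attn.transpose(2, 3).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x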

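class GSA(nn.Module):
    # Fragment: self.q, self.kv, self.sr (spatial reduction), self.norm, self.num_heads, self.scale,
    # self.attn_drop, self.proj and self.proj_drop are assumed to be set up in __init__ (omitted).
    def forward(self, x, H, W):
        B, N, C = x.shape
        # Queries keep every token: (B, num_heads, N, head_dim)
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)
        # Down-sample the feature map, then flatten it back to tokens: (B, N', C)
        x_ = self.sr(x.permute(0, 2, 1).reshape(B, C, H, W)).reshape(B, C, -1).permute(0, 2, 1)
        x_ = self.norm(x_)
        # Keys and values come from the reduced tokens: (2, B, num_heads, N', head_dim)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, num_heads, N, N')
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x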
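
The index bookkeeping behind window‑based attention is easy to get wrong, so the following framework‑free sketch shows which row‑major token indices land in which window and how the original order is restored afterwards. The helper names are hypothetical and the 4×4 map with 2×2 windows is an assumed toy size. The grouping follows the same (window row, window column, row in window, column in window) ordering that a reshape to (h_group, ws, w_group, ws) followed by swapping the two middle axes produces.

```python
def window_partition(h, w, ws):
    """Group row-major token indices of an h x w map into non-overlapping ws x ws windows."""
    groups = []
    for gh in range(h // ws):          # window row
        for gw in range(w // ws):      # window column
            groups.append([(gh * ws + i) * w + (gw * ws + j)
                           for i in range(ws) for j in range(ws)])
    return groups


def window_reverse(groups, h, w, ws):
    """Scatter per-window values back into row-major token order."""
    out = [None] * (h * w)
    w_group = w // ws
    for g, members in enumerate(groups):
        gh, gw = divmod(g, w_group)
        for k, value in enumerate(members):
            i, j = divmod(k, ws)
            out[(gh * ws + i) * w + (gw * ws + j)] = value
    return out


groups = window_partition(4, 4, 2)
print(groups)
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
print(window_reverse(groups, 4, 4, 2) == list(range(16)))  # True
```

Attention computed per window only mixes the tokens listed in one inner list. Skipping the reverse step would leave the output tokens in window order rather than in the row‑major order that the next layer expects.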

Experiments

ImageNet‑1k Classification: Twins‑PCPVT and Twins‑SVT achieve state‑of‑the‑art accuracy with higher throughput. With NVIDIA TensorRT 7.0, Twins‑SVT‑S runs 1.6× faster, with throughput rising from 1059 to 1732 images per second.

ADE20K Semantic Segmentation: When combined with FPN or UperNet back‑ends, Twins outperforms PVT and Swin, delivering better mean IoU scores (see Table 2).

COCO Object Detection: Under both the RetinaNet and Mask‑RCNN frameworks, Twins models surpass PVT and are comparable to Swin models of similar scale (see Tables 3 and 4). Longer training (3×) further stabilizes performance.

Application to High‑Precision Map Semantic Segmentation

High‑precision maps are critical for autonomous driving. Twins serves as the backbone for multi‑element semantic segmentation. The heavier FPN/UperNet heads are replaced with a lightweight linear‑scaling and concatenation pipeline, which yields finer edge details (e.g., lane markings, road signs), as illustrated in Figure 10.

Conclusion

Visual attention models have become a research focus, yet efficiency remains a challenge. Twins addresses this by introducing conditional positional encoding and spatially separable self‑attention, reducing computation while improving performance across classification, detection, and segmentation tasks. Its deployment in Meituan’s high‑precision mapping pipeline demonstrates practical impact, and the authors plan to continue exploring efficient attention designs for broader business scenarios.

References

[1] Twins: Revisiting the Design of Spatial Attention in Vision Transformers, NeurIPS 2021.

[2] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions.

[3] Conditional Positional Encodings for Vision Transformers.

[4] An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.

[5] Attention Is All You Need.

[6] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

[7] Training Data‑Efficient Image Transformers & Distillation through Attention.

[8] Encoder‑Decoder with Atrous Separable Convolution for Semantic Image Segmentation.

[9] Panoptic Feature Pyramid Networks.

[10] Unified Perceptual Parsing for Scene Understanding.
