Understanding Stand-Alone Axial-Attention for Panoptic Segmentation
The paper proposes a stand‑alone axial‑attention mechanism that factorizes 2‑D attention into 1‑D attention to lower computational cost while preserving global context, introduces position‑sensitive self‑attention, integrates both into Axial‑ResNet and Axial‑DeepLab, and demonstrates strong results on four large‑scale datasets spanning classification and segmentation.
Abstract
Convolutional operators rely on locality to improve efficiency, while attention mechanisms handle long‑range dependencies. Recent work shows that stacking self‑attention layers with local constraints can form a fully attentional network. This paper removes the locality restriction by factorizing 2‑D self‑attention into two 1‑D attentions applied along the height and width axes, which reduces computational complexity and lets the network attend over larger, even global, regions. A position‑sensitive self‑attention design is also introduced, and the resulting models are evaluated on four large datasets covering classification, panoptic segmentation, instance segmentation, and semantic segmentation.
Introduction
Convolution is a core module in computer vision because of translation equivariance and locality, which reduce parameter count and M‑Adds. However, these properties make modeling long‑range dependencies difficult.
Attention provides the ability to model long‑range dependencies in language modeling, speech recognition, and neural captioning, and shows great potential for vision tasks such as image classification, object detection, semantic segmentation, video classification, and adversarial defense.
Although stacking attention layers is promising, the architecture is computationally expensive. Prior work that applies local constraints reduces cost but also limits model capacity.
This study proposes axial attention, which sequentially decomposes 2‑D attention into two 1‑D attentions along the height and width axes, keeping computation efficient while recovering a large (global) receptive field.
Method
Position‑Sensitive Self‑Attention
Given an input feature map, the output at position o = (i, j) is a weighted pooling of the projected input over all positions, with weights computed from query‑key affinities. This lets the operation gather relevant but non‑local context from the entire feature map, which a convolution with a fixed local kernel cannot do.
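In the paper's notation (restated here as a sketch), global self‑attention computes, for every output position o, a softmax‑weighted sum of value projections, where queries, keys, and values are 1×1‑convolution projections of the input x:

```latex
% Global (vanilla) self-attention over the set N of all spatial positions;
% q = W_Q x,  k = W_K x,  v = W_V x  are learned projections of the input.
y_o = \sum_{p \in \mathcal{N}} \operatorname{softmax}_p\!\left(q_o^{\top} k_p\right) v_p
```

Because the sum runs over every position p, the cost is quadratic in the number of pixels, which leads to the two drawbacks discussed next.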
Two drawbacks of vanilla self‑attention are identified: (1) high computational cost, limiting its use to high‑level CNN feature maps or small images; (2) global pooling discards positional information, which is crucial for vision. By adding local constraints and positional encodings to self‑attention, both issues are mitigated.
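Concretely, prior stand‑alone self‑attention work restricts the sum to an m × m window around o and adds a learned relative‑position term to the affinities, roughly:

```latex
% Locally constrained self-attention with a query-dependent relative
% positional bias r_{p-o}, summed over the m x m window around o.
y_o = \sum_{p \in \mathcal{N}_{m \times m}(o)}
      \operatorname{softmax}_p\!\left(q_o^{\top} k_p + q_o^{\top} r_{p-o}\right) v_p
```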
Position Sensitivity Details
In earlier formulations, the positional bias term depends only on the query pixel, not on the key pixel. The new design adds a key‑dependent positional bias and a positional term on the values, enabling the attention weights to capture precise spatial relationships.
This design, called position‑sensitive self‑attention, captures long‑range dependencies with exact positional information.
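In the position‑sensitive form, learned relative‑position encodings are attached to the queries, the keys, and the values:

```latex
% Position-sensitive self-attention: r^q, r^k, r^v are relative positional
% encodings for queries, keys and values respectively.
y_o = \sum_{p \in \mathcal{N}_{m \times m}(o)}
      \operatorname{softmax}_p\!\left(q_o^{\top} k_p + q_o^{\top} r^{q}_{p-o} + k_p^{\top} r^{k}_{p-o}\right)
      \left(v_p + r^{v}_{p-o}\right)
```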
Axial Attention
Local constraints dramatically lower computational cost while still permitting a fully self‑attentional model. However, operating within a local square region keeps the complexity quadratic in the region's span and introduces a hyper‑parameter that trades performance against cost.
To keep global connectivity while computing efficiently, the paper instead introduces stand‑alone axial attention, in which self‑attention is applied along a single spatial axis at a time. A sketch of one axial‑attention layer is given below.
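The following is a minimal PyTorch sketch of one axial‑attention layer, not the authors' implementation: it folds the non‑attended axis into the batch dimension and runs multi‑head 1‑D attention along the chosen axis, omitting the position‑sensitive r^q, r^k, r^v terms above for brevity.

```python
# Minimal axial-attention sketch (illustrative only, not the official code).
import torch
import torch.nn as nn


class AxialAttention(nn.Module):
    def __init__(self, dim, heads=8, axis="h"):
        super().__init__()
        assert dim % heads == 0 and axis in ("h", "w")
        self.heads, self.axis = heads, axis
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        self.to_out = nn.Conv2d(dim, dim, kernel_size=1, bias=False)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        qkv = self.to_qkv(x)                       # (B, 3C, H, W)
        if self.axis == "w":                       # attend along the width axis
            qkv = qkv.permute(0, 2, 1, 3).reshape(b * h, 3 * c, w)
        else:                                      # attend along the height axis
            qkv = qkv.permute(0, 3, 1, 2).reshape(b * w, 3 * c, h)
        q, k, v = qkv.chunk(3, dim=1)              # each: (B*, C, L)
        q = q.reshape(-1, self.heads, c // self.heads, q.shape[-1])
        k = k.reshape(-1, self.heads, c // self.heads, k.shape[-1])
        v = v.reshape(-1, self.heads, c // self.heads, v.shape[-1])
        attn = torch.einsum("bhdi,bhdj->bhij", q * self.scale, k)
        attn = attn.softmax(dim=-1)                # softmax over the 1-D axis
        out = torch.einsum("bhij,bhdj->bhdi", attn, v)
        out = out.reshape(-1, c, out.shape[-1])    # merge heads back: (B*, C, L)
        if self.axis == "w":
            out = out.reshape(b, h, c, w).permute(0, 2, 1, 3)
        else:
            out = out.reshape(b, w, c, h).permute(0, 2, 3, 1)
        return self.to_out(out)                    # (B, C, H, W)


# Usage: a height-axis layer followed by a width-axis layer covers the plane.
layer = AxialAttention(dim=64, heads=8, axis="h")
y = layer(torch.randn(2, 64, 32, 32))              # -> (2, 64, 32, 32)
```

Stacking a height‑axis layer followed by a width‑axis layer propagates information across the whole plane, so every output position sees global context after two layers while each layer only attends along one axis.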
Axial‑ResNet
In the residual bottleneck block, the 3×3 convolution is replaced by two multi‑head axial‑attention layers (one along the height axis, then one along the width axis), while the two 1×1 convolutions are retained to shuffle features. This conversion yields the Axial‑ResNet architecture.
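Continuing the sketch above (and again only illustrative: the real block also uses the position‑sensitive terms, batch normalization around the attention projections, and optional striding), the converted bottleneck might look like:

```python
class AxialBottleneck(nn.Module):
    """Residual bottleneck with the 3x3 convolution replaced by
    height-axis and width-axis attention (illustrative sketch)."""
    def __init__(self, in_dim, mid_dim, heads=8):
        super().__init__()
        self.down = nn.Sequential(                  # 1x1 conv to reduce channels
            nn.Conv2d(in_dim, mid_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_dim), nn.ReLU(inplace=True))
        self.height_attn = AxialAttention(mid_dim, heads=heads, axis="h")
        self.width_attn = AxialAttention(mid_dim, heads=heads, axis="w")
        self.up = nn.Sequential(                    # 1x1 conv to restore channels
            nn.Conv2d(mid_dim, in_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(in_dim))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.down(x)
        out = self.width_attn(self.height_attn(out))   # 1-D attention along each axis
        out = self.up(out)
        return self.relu(out + x)                       # residual connection
```

As in the original bottleneck, the 1×1 convolutions shrink and then restore the channel dimension, so the comparatively expensive attention operates on the reduced number of channels.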
Axial‑DeepLab for Segmentation
To adapt Axial‑ResNet for segmentation, several changes are made to form Axial‑DeepLab:
DeepLab modifies the stride and dilation of the last one or two stages of ResNet; here, the stride of the final stage is simply removed, and no "atrous" attention module is implemented.
The Atrous Spatial Pyramid Pooling (ASPP) module is not used; experiments show Axial‑DeepLab works with or without ASPP.
Panoptic‑DeepLab adopts the same three‑convolution stem, dual decoder, and prediction heads.
Conclusion
This paper can be viewed as one of the early attempts to completely abandon convolutions and deploy an attention‑only model. Although Axial attention retains the same number of M‑Adds as convolution, it currently runs slower because dedicated kernels for such operations are scarce on existing accelerators.