How Transformers Revolutionize Vision: From DETR to GCNet
This article explores how Transformer architectures, originally designed for NLP, are adapted for visual tasks, detailing pioneering models such as DETR, CBAM, NLNet, SENet, and GCNet, and explains their structures, attention mechanisms, advantages, and experimental findings for image processing.
The previous chapter explained the basic principles of Transformers in NLP; this section shows how the same architecture can be applied to visual tasks, where the encoder‑decoder design must handle 2‑D image features.
DETR: Transformer for Object Detection
DETR (Detection Transformer) brings Transformers to object detection, eliminating the need for hand‑crafted components such as anchor generation and non‑maximum suppression. Its input‑output diagram is shown below.
DETR generates N box predictions in a single pass, where N is a preset number larger than the actual object count. It also introduces a bipartite matching loss that aligns predicted boxes with ground‑truth boxes.
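As a rough illustration of that matching step, the sketch below pairs the N predictions with the ground‑truth boxes using the Hungarian algorithm. The cost terms, weights, and tensor names are simplified assumptions for illustration, not DETR's exact loss.

```python
# Minimal sketch of DETR-style bipartite matching (not the official implementation).
# Assumes: pred_logits (N, num_classes), pred_boxes (N, 4),
#          gt_labels (M,) integer tensor, gt_boxes (M, 4). All names are illustrative.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # Classification cost: negative probability assigned to the ground-truth class.
    prob = pred_logits.softmax(-1)                      # (N, num_classes)
    cost_cls = -prob[:, gt_labels]                      # (N, M)
    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M)
    cost = cost_cls + cost_box                          # term weights omitted for brevity
    # Hungarian algorithm: one-to-one assignment minimizing total cost.
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx                             # matched (prediction, ground-truth) pairs
```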
The overall architecture consists of four parts: backbone, encoder, decoder, and feed‑forward network (FFN), illustrated below.
Backbone
Typical backbones such as ResNet process an input image of shape B×H×W×3 and output a lower‑resolution feature map of shape B×H/32×W/32×C, where C is the backbone's final channel count (2048 for ResNet‑50).
Encoder
The encoder first compresses the channel dimension to 256 with a 1×1 convolution, flattens the H/32×W/32 spatial grid into a sequence, adds positional encodings, and then applies multi‑head self‑attention to produce attention‑enriched embeddings.
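A minimal sketch of this pipeline, assuming a learned positional embedding and PyTorch's built‑in Transformer layers (the actual DETR implementation uses sinusoidal 2‑D encodings added inside each attention layer):

```python
# Sketch of the DETR encoder input pipeline: 1x1 conv to reduce channels,
# flatten the spatial grid into a sequence, add positional encoding, self-attend.
# Shapes and the learned positional embedding are illustrative assumptions.
import torch
import torch.nn as nn

class SimpleDETREncoder(nn.Module):
    def __init__(self, in_channels=2048, d_model=256, nhead=8, num_layers=6, max_hw=50 * 50):
        super().__init__()
        self.input_proj = nn.Conv2d(in_channels, d_model, kernel_size=1)  # channel compression
        self.pos_embed = nn.Parameter(torch.randn(max_hw, d_model))       # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feat):                      # feat: (B, C, H/32, W/32) from the backbone
        x = self.input_proj(feat)                 # (B, 256, H/32, W/32)
        b, c, h, w = x.shape
        x = x.flatten(2).permute(2, 0, 1)         # (HW, B, 256): spatial grid as a sequence
        x = x + self.pos_embed[: h * w].unsqueeze(1)
        return self.encoder(x)                    # (HW, B, 256) attention-enriched embeddings
```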
Decoder
The decoder processes all object queries in parallel, unlike the auto‑regressive decoder in the original Transformer. It receives two inputs: the encoder embeddings combined with positional encodings, and a set of learnable object queries (shape 100×B×256). Through self‑attention among the queries and cross‑attention to the encoder output, these queries model global relationships among objects, enabling more accurate predictions.
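A simplified sketch of the decoder side under the same assumptions; the prediction heads and query handling are reduced to the bare minimum (in the actual model the queries act as learned positional encodings rather than direct decoder inputs):

```python
# Sketch of the DETR decoder side: N learnable object queries attend to the
# encoder memory in parallel (non-autoregressive). Heads and sizes are illustrative.
import torch
import torch.nn as nn

class SimpleDETRDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6, num_queries=100, num_classes=91):
        super().__init__()
        self.query_embed = nn.Parameter(torch.randn(num_queries, d_model))  # object queries
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for the "no object" class
        self.box_head = nn.Linear(d_model, 4)                  # (cx, cy, w, h)

    def forward(self, memory):                                 # memory: (HW, B, 256) from the encoder
        b = memory.size(1)
        queries = self.query_embed.unsqueeze(1).repeat(1, b, 1)   # (100, B, 256)
        hs = self.decoder(queries, memory)                         # all queries decoded in one pass
        return self.class_head(hs), self.box_head(hs).sigmoid()    # class logits and normalized boxes
```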
CBAM: Convolutional Block Attention Module
CBAM first computes channel attention: it applies global max‑pooling and average‑pooling over the spatial dimensions, passes both descriptors through a shared MLP, sums the two outputs, and applies a sigmoid to obtain per‑channel weights. It then computes spatial attention by max‑pooling and average‑pooling across the channel dimension, concatenating the two maps, and applying a 7×7 convolution, batch normalization, and a sigmoid. The overall flow is shown below.
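A compact PyTorch sketch of this two‑stage attention; the reduction ratio and layer details are illustrative rather than the paper's exact configuration:

```python
# Minimal CBAM-style block: channel attention (shared MLP over max- and avg-pooled
# descriptors), then spatial attention (7x7 conv over channel-pooled maps).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP for channel attention
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        # Channel attention: sum max- and avg-pooled descriptors, apply sigmoid.
        max_desc = self.mlp(x.amax(dim=(2, 3)))
        avg_desc = self.mlp(x.mean(dim=(2, 3)))
        x = x * torch.sigmoid(max_desc + avg_desc).view(b, c, 1, 1)
        # Spatial attention: pool across channels, concatenate, 7x7 conv, BN, sigmoid.
        spatial = torch.cat([x.amax(dim=1, keepdim=True), x.mean(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.bn(self.spatial_conv(spatial)))
```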
NLNet: Non‑local Neural Networks
NLNet captures long‑range dependencies by applying a self‑attention‑like mechanism to image features. Its architecture is illustrated below.
After extracting backbone features, NLNet reduces dimensionality with three parallel 1×1 convolutions (producing query, key, and value embeddings), reshapes each to an HW×512 matrix, computes pairwise similarity between the query and key tensors, normalizes the result, aggregates the value tensor with these weights, and finally restores the channel dimension with a 1×1 convolution followed by a residual connection. The core computation is expressed as:

\[
\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{x})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j)
\]
Three forms of the pairwise function f(x_i, x_j) are supported: Gaussian, Embedded Gaussian, and Dot product, each with its own formula, listed below.
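For reference, the three forms from the Non‑local Neural Networks paper are (θ and φ denote learned 1×1‑convolution embeddings):

\[
f(\mathbf{x}_i, \mathbf{x}_j) =
\begin{cases}
e^{\mathbf{x}_i^\top \mathbf{x}_j} & \text{Gaussian} \\
e^{\theta(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)} & \text{Embedded Gaussian} \\
\theta(\mathbf{x}_i)^\top \phi(\mathbf{x}_j) & \text{Dot product}
\end{cases}
\]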
Although NLNet excels at modeling global context, its computational cost grows quadratically with the number of spatial positions, i.e., O((HW)²), roughly H⁴ for a square feature map, leading to high memory usage and slow inference.
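The sketch below shows an Embedded Gaussian non‑local block in PyTorch; the (HW)×(HW) similarity matrix it materializes is exactly the source of that cost. Channel sizes are illustrative.

```python
# Sketch of an Embedded Gaussian non-local block. The (HW) x (HW) similarity
# matrix computed here is where the quadratic memory and compute cost comes from.
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels, inner_channels=None):
        super().__init__()
        inner = inner_channels or channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)   # query embedding
        self.phi = nn.Conv2d(channels, inner, 1)     # key embedding
        self.g = nn.Conv2d(channels, inner, 1)       # value embedding
        self.out = nn.Conv2d(inner, channels, 1)     # restore the channel dimension

    def forward(self, x):                                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        phi = self.phi(x).flatten(2)                           # (B, C', HW)
        g = self.g(x).flatten(2).transpose(1, 2)               # (B, HW, C')
        attn = torch.softmax(theta @ phi, dim=-1)              # (B, HW, HW) pairwise similarity
        y = (attn @ g).transpose(1, 2).reshape(b, -1, h, w)    # aggregate values
        return x + self.out(y)                                  # residual connection
```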
SENet: Squeeze‑and‑Excitation Networks
SENet avoids the heavy pairwise attention of NLNet by modeling only channel‑wise relationships, using a small bottleneck (fully connected layers, equivalent to 1×1 convolutions) to keep overhead low. It consists of three steps: Squeeze, Excitation, and Scale. The workflow is illustrated below.
The Squeeze operation aggregates spatial information into a per‑channel descriptor via global average pooling; Excitation passes this descriptor through a small bottleneck of learned fully connected layers with a sigmoid to generate channel‑wise weights; Scale multiplies the original features by these weights, effectively re‑calibrating channel importance.
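A minimal SE block sketch, with the reduction ratio chosen purely for illustration:

```python
# Minimal Squeeze-and-Excitation block: global average pooling (Squeeze),
# a bottleneck MLP producing per-channel weights (Excitation), then scaling.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                 # Squeeze: (B, C) channel descriptor
        w = self.fc(s).view(b, c, 1, 1)        # Excitation: per-channel weights in (0, 1)
        return x * w                           # Scale: re-calibrate channel importance
```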
GCNet: Combining Non‑local and Squeeze‑Excitation
GCNet merges the global context modeling of NLNet with the efficient channel re‑weighting of SENet. Observing that the NL block's attention maps are nearly query‑independent, it adopts a query‑independent simplification of the NL block and unifies it with the SE‑style bottleneck transform, forming the Global Context (GC) block. The architecture comparison is shown below.
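A sketch of a GC block along these lines, assuming a standard reduction ratio; the exact normalization and placement follow the paper only approximately:

```python
# Sketch of a Global Context (GC) block: query-independent attention pooling
# (simplified non-local) followed by an SE-style bottleneck transform, fused by addition.
import torch
import torch.nn as nn

class GCBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)    # query-independent attention map
        bottleneck = channels // reduction
        self.transform = nn.Sequential(                       # SE-style channel transform
            nn.Conv2d(channels, bottleneck, 1),
            nn.LayerNorm([bottleneck, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1),
        )

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        weights = torch.softmax(self.attn(x).view(b, 1, h * w), dim=-1)              # (B, 1, HW)
        context = (x.view(b, c, h * w) @ weights.transpose(1, 2)).view(b, c, 1, 1)   # pooled global context
        return x + self.transform(context)                    # broadcast-add the transformed context
```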
Experiments reveal that the simplified NL block achieves comparable performance to the full NL block with fewer parameters, and that inserting the GC block after the addition operation in a residual block yields the best results, improving baseline accuracy by 1‑3%.
Conclusion
Recent years have seen rapid development of Transformer‑based models for vision tasks. The key challenge lies in converting 2‑D image features into sequences and designing effective encoder‑decoder attention. Most state‑of‑the‑art methods combine channel and spatial attention (e.g., CBAM, SENet) with global context modules (e.g., NLNet, GCNet) to balance performance and computational cost. Future work will continue to pursue simpler structures with low overhead that capture both local and global information.
TiPaiPai Technical Team
At TiPaiPai, we focus on building engineering teams and culture, cultivating technical insights and practice, and fostering sharing, growth, and connection.