How Transformers Revolutionize Vision: From DETR to GCNet

This article explores how Transformer architectures, originally designed for NLP, are adapted to visual tasks. It covers DETR together with the convolutional attention modules CBAM, NLNet, SENet, and GCNet, explaining their structures, attention mechanisms, advantages, and experimental findings for image processing.

TiPaiPai Technical Team

The previous article explained the basic principles of Transformers in NLP; this one shows how the same architecture can be applied to visual tasks, where the encoder‑decoder design must handle 2‑D image features.

DETR: Transformer for Object Detection

DETR (Detection Transformer) brings Transformers to object detection, eliminating the need for hand‑crafted components like anchors. Its input‑output diagram is shown below.

DETR input‑output diagram

DETR generates N box predictions in a single pass, where N is a preset number larger than the actual object count. It also introduces a bipartite matching loss that aligns predicted boxes with ground‑truth boxes.
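To make the matching concrete, here is a minimal sketch over a hypothetical cost matrix. DETR itself solves this with the Hungarian algorithm on a cost that combines classification and box (L1 + GIoU) terms; the brute-force search and the cost values below are purely illustrative.

```python
import itertools

# Hypothetical matching costs: cost[i][j] = cost of assigning prediction i
# to ground-truth box j (in DETR this mixes class and box terms).
cost = [
    [0.9, 0.1, 0.8],   # prediction 0
    [0.2, 0.7, 0.6],   # prediction 1
    [0.5, 0.4, 0.05],  # prediction 2
    [0.3, 0.9, 0.7],   # prediction 3 (N > object count: one stays unmatched)
]
n_gt = len(cost[0])

def match(cost):
    """Brute-force bipartite matching: assign each ground-truth box to a
    distinct prediction so the total cost is minimal."""
    best, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(cost)), n_gt):
        total = sum(cost[p][j] for j, p in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    # best_perm[j] = index of the prediction matched to ground truth j
    return best, best_perm

total_cost, assignment = match(cost)
```

Predictions left unmatched are supervised toward the "no object" class, which is how DETR avoids duplicate detections without non-maximum suppression.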

The overall architecture consists of four parts: backbone, encoder, decoder, and feed‑forward network (FFN), illustrated below.

DETR overall structure

Backbone

A typical backbone such as ResNet‑50 processes an input image of shape B×3×H×W and outputs a feature map of shape B×C×H/32×W/32, where C is 2048 for the full ResNet‑50 (or 1024 if the last stage is dropped).

Encoder

The encoder first compresses the channel dimension with a 1×1 convolution, reshapes the spatial dimensions into a sequence, adds positional encoding, and then applies self‑attention to produce an attention map.
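The encoder steps above can be sketched in NumPy. Shapes are illustrative; a single attention head stands in for DETR's multi-head attention, and the random projection matrix `W_proj` is a stand-in for the learned 1×1 convolution.

```python
import numpy as np

B, C, H, W = 1, 2048, 8, 8          # backbone output (illustrative)
d = 256                              # d_model in DETR

rng = np.random.default_rng(0)
feat = rng.standard_normal((B, C, H, W))

# A 1x1 convolution is a per-position linear projection over channels
W_proj = rng.standard_normal((C, d)) / np.sqrt(C)
x = np.einsum("bchw,cd->bhwd", feat, W_proj)

# Flatten the spatial dims into a sequence of HW tokens, add positional encoding
seq = x.reshape(B, H * W, d)
pos = rng.standard_normal((1, H * W, d)) * 0.1   # stand-in for sine encodings
seq = seq + pos

def self_attention(q, k, v):
    """Single-head scaled dot-product self-attention over the HW tokens."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # softmax over key positions
    return weights @ v

out = self_attention(seq, seq, seq)              # (B, HW, d)
```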

Decoder

The decoder processes all object queries at once, unlike the auto‑regressive decoder in the original Transformer. It receives two inputs: the sum of encoder embeddings and positional encodings, and a set of learnable object queries (shape 100×B×256). These queries model global relationships among objects, enabling more accurate predictions.
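The parallel cross-attention between object queries and encoder output can be sketched as follows (single head, no layer norm or FFN; all weights are random stand-ins):

```python
import numpy as np

# N = 100 learnable object queries attend to the HW encoder tokens in parallel.
rng = np.random.default_rng(1)
B, HW, d, N = 1, 64, 256, 100

memory = rng.standard_normal((B, HW, d))    # encoder output (+ pos encoding)
queries = rng.standard_normal((B, N, d))    # learnable object queries

# Cross-attention: each query gathers evidence from every spatial position
scores = queries @ memory.transpose(0, 2, 1) / np.sqrt(d)   # (B, N, HW)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                   # softmax over HW
decoded = weights @ memory                                  # (B, N, d)

# Each of the N decoded embeddings is then fed to the FFN heads to predict
# one box and one class (including a "no object" class).
```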

CBAM: Convolutional Block Attention Module

CBAM first computes channel attention: global max‑pooling and average‑pooling produce two channel descriptors, each passes through a shared MLP, and the two outputs are summed and passed through a sigmoid. It then computes spatial attention: max‑pooling and average‑pooling across the channel axis yield two 2‑D maps, which are concatenated and processed by a 7×7 convolution, batch normalization, and a sigmoid. The overall flow is shown below.

CBAM structure
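The two attention paths can be sketched in NumPy. This is a deliberately minimal version: a single random matrix stands in for CBAM's two-layer shared MLP, a per-channel weight vector stands in for the 7×7 convolution, and batch normalization is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
B, C, H, W = 1, 16, 8, 8
x = rng.standard_normal((B, C, H, W))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# --- Channel attention: max- and avg-pool, shared MLP, sum, sigmoid ---
W_mlp = rng.standard_normal((C, C)) / np.sqrt(C)   # stand-in for the shared MLP
avg = x.mean(axis=(2, 3))                          # (B, C)
mx = x.max(axis=(2, 3))                            # (B, C)
ch_att = sigmoid(avg @ W_mlp + mx @ W_mlp)         # (B, C), values in (0, 1)
x = x * ch_att[:, :, None, None]                   # re-weight channels

# --- Spatial attention: pool across channels, combine, sigmoid ---
sp = np.stack([x.mean(axis=1), x.max(axis=1)], axis=1)   # (B, 2, H, W)
w_conv = rng.standard_normal(2) * 0.5              # stand-in for the 7x7 conv
sp_att = sigmoid(np.einsum("bchw,c->bhw", sp, w_conv))   # (B, H, W)
y = x * sp_att[:, None, :, :]                      # re-weight positions
```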

NLNet: Non‑local Neural Networks

NLNet captures long‑range dependencies by applying a self‑attention‑like mechanism to image features. Its architecture is illustrated below.

NLNet structure

After extracting backbone features, NLNet reduces the channel dimension with three parallel 1×1 convolutions, reshapes each output to HW×512, computes pairwise similarity between two of the reshaped tensors, normalizes the result, aggregates the third tensor with these weights, and finally restores the channel dimension with a 1×1 convolution and adds a residual connection. The core similarity computation is expressed as:

NLNet similarity formula

Three forms of the pairwise function f(x_i,x_j) are supported: Gaussian, Embedded Gaussian, and Dot product, each with its own formula (shown in the following images).

Gaussian and Embedded Gaussian
Dot product
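For reference, the three pairwise functions from the Non-local Neural Networks paper can be written out, where θ and φ are the 1×1-convolution embeddings:

```latex
\begin{aligned}
\text{Gaussian:} \quad & f(x_i, x_j) = e^{x_i^{\top} x_j} \\
\text{Embedded Gaussian:} \quad & f(x_i, x_j) = e^{\theta(x_i)^{\top} \phi(x_j)} \\
\text{Dot product:} \quad & f(x_i, x_j) = \theta(x_i)^{\top} \phi(x_j)
\end{aligned}
```

With the Embedded Gaussian form and softmax normalization, the block is exactly self-attention applied to spatial positions.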

Although NLNet excels at modeling global context, its attention map relates every pair of spatial positions, so the computational cost grows as O((HW)²) — roughly H⁴ for square inputs — leading to high memory usage and slow inference.
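Putting the pieces together, here is a NumPy sketch of an Embedded-Gaussian non-local block. The 1×1 convolutions become channel-wise matrix multiplies, all weights are random stand-ins, and the channel counts (1024 reduced to 512) follow the description above.

```python
import numpy as np

rng = np.random.default_rng(3)
B, C, H, W = 1, 1024, 4, 4
Ci = C // 2                                       # reduced channels (512)

x = rng.standard_normal((B, C, H, W))
tokens = x.reshape(B, C, H * W).transpose(0, 2, 1)    # (B, HW, C)

# Stand-ins for the theta / phi / g 1x1 convolutions and the output projection
W_theta = rng.standard_normal((C, Ci)) / np.sqrt(C)
W_phi = rng.standard_normal((C, Ci)) / np.sqrt(C)
W_g = rng.standard_normal((C, Ci)) / np.sqrt(C)
W_out = rng.standard_normal((Ci, C)) / np.sqrt(Ci)

theta, phi, g = tokens @ W_theta, tokens @ W_phi, tokens @ W_g

# Pairwise similarity between all HW positions, softmax-normalized
sim = theta @ phi.transpose(0, 2, 1)                  # (B, HW, HW)
att = np.exp(sim - sim.max(-1, keepdims=True))
att /= att.sum(-1, keepdims=True)

# Aggregate, project back to C channels, and add the residual connection
y = tokens + (att @ g) @ W_out                        # (B, HW, C)
out = y.transpose(0, 2, 1).reshape(B, C, H, W)
```

The (HW)×(HW) matrix `att` is the source of the quadratic cost: at 64×64 resolution it already holds over 16 million entries per image.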

SENet: Squeeze‑and‑Excitation Networks

SENet sidesteps the cost of pairwise attention entirely: instead of relating every spatial position to every other, it models only inter‑channel dependencies. It consists of three steps: Squeeze, Excitation, and Scale. The workflow is illustrated below.

SENet workflow

The Squeeze operation aggregates spatial information into a per‑channel descriptor via global average pooling; Excitation passes this descriptor through a small bottleneck MLP and a sigmoid to produce channel‑wise weights; Scale multiplies the original features by these weights, effectively re‑calibrating channel importance.
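The three steps can be sketched in NumPy. The bottleneck MLP (two fully connected layers with reduction ratio r) follows the paper's design; the weights here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)
B, C, H, W = 1, 32, 8, 8
r = 4                                            # reduction ratio

x = rng.standard_normal((B, C, H, W))
W1 = rng.standard_normal((C, C // r)) / np.sqrt(C)
W2 = rng.standard_normal((C // r, C)) / np.sqrt(C // r)

# Squeeze: global average pooling -> one descriptor per channel
z = x.mean(axis=(2, 3))                          # (B, C)

# Excitation: bottleneck MLP (ReLU between layers) then sigmoid -> (0, 1) weights
s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ W1, 0) @ W2)))   # (B, C)

# Scale: re-weight each channel of the original feature map
y = x * s[:, :, None, None]
```

Note the cost is linear in HW, which is why SE blocks can be dropped into every residual block of a deep network at negligible overhead.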

GCNet: Combining Non‑local and Squeeze‑Excitation

GCNet merges the global context modeling of NLNet with the efficient channel re‑weighting of SENet. It adopts a query‑independent simplification of the NL block and shares this structure with the SE block, forming a Global Context (GC) block. The architecture comparison is shown below.

GCNet vs NLNet vs SENet

Experiments reveal that the simplified NL block achieves comparable performance to the full NL block with fewer parameters, and that inserting the GC block after the addition operation in a residual block yields the best results, improving baseline accuracy by 1‑3%.
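A NumPy sketch of the GC block shows the combination: a query-independent context vector (the simplified NL block) followed by an SE-style bottleneck transform and a broadcast residual add. Weights are random stand-ins, and the layer normalization used inside the real transform is omitted.

```python
import numpy as np

rng = np.random.default_rng(5)
B, C, H, W = 1, 64, 8, 8
r = 4                                            # bottleneck ratio

x = rng.standard_normal((B, C, H, W))
tokens = x.reshape(B, C, H * W)                  # (B, C, HW)

# Context modeling: ONE shared attention map over all HW positions
# (query-independent, unlike NLNet's per-position attention maps)
w_k = rng.standard_normal(C) / np.sqrt(C)        # stand-in for the 1x1 conv
logits = np.einsum("c,bcn->bn", w_k, tokens)     # (B, HW)
att = np.exp(logits - logits.max(-1, keepdims=True))
att /= att.sum(-1, keepdims=True)
context = np.einsum("bcn,bn->bc", tokens, att)   # (B, C) global context vector

# Transform: SE-style bottleneck MLP on the context vector
W1 = rng.standard_normal((C, C // r)) / np.sqrt(C)
W2 = rng.standard_normal((C // r, C)) / np.sqrt(C // r)
delta = np.maximum(context @ W1, 0) @ W2         # (B, C)

# Fusion: broadcast-add the same context vector to every position
y = x + delta[:, :, None, None]
```

Because only one attention map is computed per image instead of one per position, the cost drops from O((HW)²) to O(HW) while the global receptive field is retained.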

Conclusion

Recent years have seen rapid development of Transformer‑based models for vision tasks. The key challenge lies in converting 2‑D image features into sequences and designing effective encoder‑decoder attention. Most state‑of‑the‑art methods combine channel and spatial attention (e.g., CBAM, SENet) with global context modules (e.g., NLNet, GCNet) to balance performance and computational cost. Future work will continue to pursue simpler structures with low overhead that capture both local and global information.
