How DETR and Its Successors Evolve: A Deep Dive into the DETR Series for Object Detection
This article reviews the original DETR model, analyzes its strengths and weaknesses, and then examines two major follow‑up works, Deformable‑DETR and DAB‑DETR, explaining how they rework the attention mechanism and the decoder queries, introducing deformable attention and dynamic anchor boxes to accelerate convergence and improve small‑object detection.
DETR, published at ECCV 2020, has amassed 6,287 citations (as of 2023‑06‑02). Its novelty lies in formulating object detection as a set‑prediction problem, matching predictions to ground‑truth objects one‑to‑one via bipartite matching with the Hungarian algorithm, and in providing an end‑to‑end pipeline that eliminates hand‑crafted components such as anchor boxes and non‑maximum suppression.
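As a minimal sketch of this matching step (the cost matrix below is random; in DETR it combines classification and box costs), using SciPy's Hungarian solver:

import torch
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows are predictions (queries), columns are
# ground-truth objects; entries are matching costs.
cost = torch.rand(5, 3)

# Hungarian algorithm: one-to-one assignment minimizing total cost.
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())

# Predictions left unmatched are supervised as "no object".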
Experimental results in the DETR paper show that its average precision (AP) on large and medium objects surpasses that of traditional detectors such as Faster R‑CNN, and its self‑attention maps can be visualized to show which image regions each prediction focuses on.
However, DETR also has clear drawbacks: it converges much more slowly than Faster R‑CNN, and its performance on small objects lags behind the best CNN‑based detectors.
Deformable‑DETR (ICLR 2021) addresses these issues by borrowing the idea of deformable convolution. Deformable convolution adds a learned offset to each sampling point of a regular 3×3 kernel, turning the fixed square sampling grid into a flexible, input‑dependent one.
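As a rough illustration using torchvision's built‑in operator (the shapes are arbitrary, and in practice the offsets are predicted from the input by a small convolution rather than drawn at random):

import torch
from torchvision.ops import deform_conv2d

x = torch.rand(1, 16, 32, 32)       # input feature map
weight = torch.rand(16, 16, 3, 3)   # a regular 3x3 kernel
# Two offset channels (vertical, horizontal) per kernel tap:
# 2 * 3 * 3 = 18; random here purely for illustration.
offset = torch.rand(1, 18, 32, 32)
y = deform_conv2d(x, offset, weight, padding=1)  # (1, 16, 32, 32)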
The core change in Deformable‑DETR is the replacement of standard self‑attention with a deformable attention mechanism. Key vectors are image features from a preceding CNN layer plus positional encoding; query vectors are those same features in the encoder and learnable object queries in the decoder; value vectors are copies of the feature map. Instead of computing attention between every query and all keys, each query samples only a small set of locations (a reference point plus learned offsets) and attends to the corresponding values; the attention weights themselves are predicted directly from the query by a linear layer rather than computed as query-key dot products.
This design dramatically reduces computational cost and, because the sampled locations provide a sparse, deformable receptive field, it improves performance on small objects.
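Below is a minimal single‑head, single‑scale sketch of this idea (the real module is multi‑head and multi‑scale and scales its offsets by the feature‑map size; the class and variable names here are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformAttnSketch(nn.Module):
    # Minimal single-head, single-scale sketch of deformable attention.
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.sampling_offsets = nn.Linear(dim, 2 * n_points)  # learned offsets
        self.attn_weights = nn.Linear(dim, n_points)          # from the query alone
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, value):
        # query: (B, Q, C); ref_points: (B, Q, 2) in [-1, 1]; value: (B, C, H, W)
        B, Q, _ = query.shape
        offsets = self.sampling_offsets(query).view(B, Q, self.n_points, 2)
        locs = ref_points[:, :, None, :] + offsets                  # sampling locations
        sampled = F.grid_sample(value, locs, align_corners=False)   # (B, C, Q, P)
        w = self.attn_weights(query).softmax(-1)                    # (B, Q, P)
        out = (sampled * w[:, None]).sum(-1)                        # weighted sum over points
        return self.out_proj(out.transpose(1, 2))                   # (B, Q, C)

Each query thus touches only n_points sampled locations instead of all H×W positions, which is exactly where the computational savings and the sparse, deformable receptive field come from.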
Nevertheless, the Deformable‑DETR paper does not present a concrete analysis of why the original DETR converges slowly or struggles with small objects; the improvements are justified mainly by empirical gains.
DAB‑DETR (ICLR 2022) seeks a deeper explanation. The authors argue that the root cause of DETR’s slow convergence and poor small‑object performance lies in the queries used in the cross‑attention module of the decoder.
They distinguish two types of queries: decoder embeddings, which are initialized to zero and serve as carriers of image semantics, and learnable queries, which are trainable embeddings that act as a set of questions asking the image features "what is here?". The following code snippet illustrates their definitions:
import torch

num_queries, hidden_dim = 100, 256  # DETR's defaults

# Learnable Queries
query_embed = torch.nn.Embedding(num_queries, hidden_dim)

# Decoder Embeddings: zeros of the same shape as the query weights
tgt = torch.zeros_like(query_embed.weight)

Experiments that freeze the learned query_embed while training the rest of the model show only a modest early‑epoch speedup, ruling out the hypothesis that the difficulty of optimizing the queries themselves causes slow convergence.
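In such an experiment, freezing amounts to disabling gradients on the query embedding (a one‑line sketch, not the authors' exact training script):

# Freeze the learnable queries; all other parameters keep training.
query_embed.weight.requires_grad_(False)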
Visualization of attention maps reveals that vanilla DETR queries often attend to multiple circular regions, indicating a lack of positional priors. Conditional‑DETR addresses this by encoding queries with the same sinusoidal position encoding used in the encoder, yielding Gaussian‑like attention maps. DAB‑DETR goes further by introducing dynamic anchor boxes, learnable 4‑dimensional vectors (x, y, w, h), as positional priors for each query, effectively bringing back the classic anchor‑box concept while keeping the model end‑to‑end.
These dynamic anchor boxes allow each query to focus on a specific region with an adaptable shape and size, which improves both convergence speed and detection accuracy, especially for small objects.
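A minimal sketch of how a 4‑dimensional anchor can be turned into a positional query via sinusoidal encoding (loosely modeled on DAB‑DETR; the function name and simplified frequency scaling are ours, and the paper additionally uses the anchor's w and h to modulate the attention map):

import math
import torch

def sine_embed(coord, num_feats=128, temperature=10000):
    # Sinusoidal encoding of a normalized scalar coordinate, as in DETR.
    dim_t = temperature ** (2 * torch.arange(num_feats // 2) / num_feats)
    pos = coord[..., None] * 2 * math.pi / dim_t
    return torch.cat([pos.sin(), pos.cos()], dim=-1)  # (..., num_feats)

# Each query owns a learnable anchor (x, y, w, h) in normalized coordinates;
# its positional query is the concatenation of the four encodings.
anchors = torch.rand(100, 4)  # 100 queries
pos_query = torch.cat([sine_embed(anchors[:, i]) for i in range(4)], dim=-1)
# pos_query: (100, 512); the anchors themselves are refined layer by layer.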
In summary, the DETR series illustrates a spiral‑like evolution: DETR discards traditional hand‑crafted components, Deformable‑DETR borrows the sparse sampling of deformable convolution to cut computation and speed up convergence, and DAB‑DETR re‑introduces learnable anchor boxes through query design, achieving faster training and better performance.
Network Intelligence Research Center (NIRC)
NIRC is based at the National Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains, namely intelligent cloud networking, natural language processing, computer vision, and machine learning systems, dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing to the rapid advancement of China's network technology.
