Artificial Intelligence · 17 min read

Overview of Object Detection Algorithms: Two‑Stage and One‑Stage Methods

This article reviews the evolution of visual object detection, explaining traditional region‑based approaches, the rise of deep‑learning two‑stage frameworks such as R‑CNN, Fast R‑CNN and Faster R‑CNN, and the faster one‑stage models like Overfeat, YOLO, SSD and RetinaNet, together with their design choices, training strategies and loss functions.


Object detection is a core problem in computer vision, aiming to locate and classify objects such as cars, ships, and pedestrians in images. Early methods relied on region selection (e.g., segmentation, binarization) followed by handcrafted features, or sliding‑window classifiers like Haar/LBP/HOG combined with SVM.
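The sliding-window stage of those classical pipelines is simple to sketch. The snippet below (an illustrative sketch, not from the article; the 64×128 window echoes the classic HOG pedestrian-detector setup) enumerates every window position that a handcrafted-feature classifier such as HOG+SVM would be evaluated on:

```python
def sliding_windows(image_h, image_w, win_h, win_w, stride):
    """Yield the (top, left) corner of every window position on a regular grid."""
    for top in range(0, image_h - win_h + 1, stride):
        for left in range(0, image_w - win_w + 1, stride):
            yield top, left

# A 640x480 image scanned with a 64x128 pedestrian window and stride 16:
positions = list(sliding_windows(480, 640, 128, 64, stride=16))
print(len(positions))  # 851 windows at a single scale
```

Because the window count grows with image size, stride, and the number of scales scanned, classifier evaluations quickly number in the thousands per image, which is the inefficiency that region proposals were introduced to avoid.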

Since the introduction of R‑CNN, deep learning has become the dominant paradigm. Detection frameworks are broadly divided into two categories:

Two‑stage methods first generate candidate regions and then perform classification and bounding‑box regression. Representative models include the R‑CNN series (R‑CNN, Fast R‑CNN, Faster R‑CNN). R‑CNN uses selective search to propose ~2000 regions, extracts 4096‑dimensional features with a CNN, and classifies them with an SVM. Fast R‑CNN shares convolutional features across all proposals, adds an ROI‑pooling layer, and jointly optimizes classification (softmax loss) and bounding‑box regression (smooth L1 loss). Faster R‑CNN replaces external region proposal with a Region Proposal Network (RPN) that predicts objectness scores and box offsets directly from shared convolutional maps, enabling end‑to‑end training.
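The smooth L1 loss that Fast R-CNN uses for bounding-box regression fits in a few lines. This NumPy sketch uses the common formulation with a transition point `beta` (set to 1 here, as in the original paper); it is quadratic for small residuals and linear for large ones, which makes it less sensitive to outlier boxes than plain L2:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: 0.5*d^2/beta for |d| < beta, else |d| - 0.5*beta."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)

print(smooth_l1(np.array([0.5]), np.array([0.0])))  # [0.125]  (quadratic regime)
print(smooth_l1(np.array([3.0]), np.array([0.0])))  # [2.5]    (linear regime)
```

In the full multi-task objective this term is summed over the four box coordinates of positive proposals and added to the softmax classification loss.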

One‑stage methods predict object locations and categories in a single forward pass, offering higher speed at the cost of some accuracy. Notable examples are OverFeat, YOLO (v1, v2), SSD, and RetinaNet. YOLO treats detection as a regression problem on a fixed‑size grid (e.g., 7×7), outputting bounding‑box coordinates, confidence scores, and class probabilities. SSD adds multi‑scale feature maps and default boxes of various aspect ratios, while RetinaNet introduces a Feature Pyramid Network (FPN) and focal loss to address class imbalance.
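RetinaNet's focal loss is worth seeing concretely. The binary form below (a minimal NumPy sketch with the paper's default `alpha=0.25`, `gamma=2`) multiplies the cross-entropy by `(1 - p_t)^gamma`, so easy, well-classified examples contribute almost nothing and the huge set of easy background anchors no longer swamps training:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy down-weighted on easy examples.

    p: predicted probability of the positive class; y: label in {0, 1}.
    """
    pt = np.where(y == 1, p, 1 - p)           # probability of the true class
    weight = np.where(y == 1, alpha, 1 - alpha)  # class-balance factor
    return -weight * (1 - pt) ** gamma * np.log(pt)

print(float(focal_loss(0.9, 1)))  # easy positive: tiny loss
print(float(focal_loss(0.1, 1)))  # hard positive: large loss
```

With `gamma=0` the modulating factor disappears and the expression reduces to ordinary alpha-weighted cross-entropy, which makes the role of `gamma` easy to verify experimentally.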

All these models share common components: a backbone CNN (often VGG‑16 or ResNet), region or anchor generation, ROI‑pooling or convolutional predictors, and multi‑task loss functions that combine classification (softmax or focal loss) with regression (smooth L1). Training strategies vary from alternating optimization of RPN and detection heads to joint end‑to‑end learning, and from using pretrained ImageNet weights to training from scratch with data augmentation.
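Anchor generation, shared by Faster R-CNN's RPN and the anchor-based one-stage models, is also compact. The sketch below (illustrative; the base stride of 16 and the 3 scales × 3 aspect ratios mirror the Faster R-CNN defaults) builds the set of anchors centred at one feature-map location, which is then translated to every location on the map:

```python
import numpy as np

def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return anchors centred at the origin as (x1, y1, x2, y2) rows.

    For each scale s the anchor area is (base*s)^2; each ratio r = h/w
    reshapes that area into a different aspect ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            area = (base * s) ** 2
            w = np.sqrt(area / r)
            h = w * r
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

anchors = make_anchors()
print(anchors.shape)  # (9, 4): one anchor per scale/ratio pair
```

The square anchor at the smallest scale (`scales[0]=8`, ratio 1) spans 128×128 pixels, i.e. `(-64, -64, 64, 64)`, matching the smallest default anchor in the original RPN configuration.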

The article also discusses practical details such as anchor box design, scale selection, non‑maximum suppression, and the impact of input resolution on speed and accuracy. References to seminal papers (Selective Search, R‑CNN, Fast R‑CNN, Faster R‑CNN, OverFeat, YOLO, SSD, RetinaNet) are provided for further reading.
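Of those post-processing details, non-maximum suppression is the one every detector in this survey relies on. A greedy NumPy sketch (boxes as `(x1, y1, x2, y2)` corners; the 0.5 IoU threshold is a common default, not prescribed by the article):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop overlaps, repeat."""
    order = scores.argsort()[::-1]          # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of box i with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thresh]      # keep only low-overlap boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```

The dense predictors (YOLO, SSD, RetinaNet) emit thousands of raw boxes per image, so this step is what turns their output into a usable set of detections.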

Tags: Computer Vision, deep learning, object detection, SSD, R-CNN, YOLO
Written by HomeTech (HomeTech tech sharing)