
Comprehensive Overview of Object Detection: From Traditional Methods to Modern Deep Learning Models

This article provides a comprehensive overview of object detection, detailing traditional sliding‑window approaches, deep‑learning based two‑stage and one‑stage models such as R‑CNN, Fast R‑CNN, Faster R‑CNN, Mask R‑CNN, and the YOLO family, and discusses current challenges and future research directions.

政采云技术

1. Introduction: Description of Object Detection Tasks

As computing hardware has improved and computer vision has found wide application, real‑time detection and tracking of objects in images has become increasingly popular. Object detection must both classify each object and localize it precisely, usually with a bounding box, which makes it a multi‑task problem. It has broad applications in intelligent transportation, surveillance, military target detection, and medical navigation.

2. Traditional Object Detection Methods

Traditional algorithms typically follow three steps:

Scanning the image with a sliding window (see Figure 1).

Extracting features from each window, commonly using SIFT or HOG.

Classifying the extracted features with a classifier.

Two main issues arise:

Window size: Because image and object sizes vary, the image must be scanned with windows of several scales, which multiplies the computational cost and produces many redundant windows.

Feature design: Hand‑crafted features cannot fully capture rich image information.
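The cost of multi‑scale scanning is easy to see by counting windows. The following is a minimal sketch, assuming square windows and a fixed stride; the function names and parameters are illustrative and do not come from any particular library.

```python
from typing import Iterable, Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def sliding_windows(img_w: int, img_h: int, win: int, stride: int) -> Iterator[Box]:
    """Yield square windows of side `win`, moving `stride` pixels at a time."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, x + win, y + win)


def count_windows(img_w: int, img_h: int, sizes: Iterable[int], stride: int) -> int:
    """Total number of windows over all window sizes (one pass per size)."""
    return sum(1 for s in sizes for _ in sliding_windows(img_w, img_h, s, stride))
```

Every window yielded here would need its own feature extraction and classification, so each additional window size adds a full extra pass over the image.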

3. Deep‑Learning‑Based Object Detection Methods

Rapid advances in deep learning and GPU computing have led to two dominant families of detectors: one‑stage and two‑stage models.

3.1 Two‑stage Methods

Two‑stage models first generate region proposals and then classify each proposal. Representative models include the R‑CNN series.

3.1.1 R‑CNN

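R‑CNN combines region proposals with CNN features. Its pipeline consists of (1) generating around 2,000 category‑agnostic region proposals per image, (2) warping each proposal to a fixed size and passing it through a large CNN for feature extraction, and (3) classifying the features with class‑specific linear SVMs. Because the CNN runs once per proposal, the method is slow.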

3.1.2 Fast R‑CNN

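Fast R‑CNN runs the convolutional network once per image and shares the resulting feature map across all region proposals, instead of running the CNN separately on each proposal as R‑CNN does. An ROI pooling layer converts each proposal's region of the feature map into a fixed‑size feature map, and a single network is trained with a multi‑task loss for classification and bounding‑box regression. This makes both training and inference much faster.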
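ROI pooling can be illustrated on a single‑channel feature map. The sketch below is a simplified pure‑Python version, assuming integer ROI coordinates that are already expressed in feature‑map cells and end‑exclusive; real implementations operate on batched tensors and also map image coordinates to the feature map.

```python
def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the region roi = (x1, y1, x2, y2) of a 2-D feature map
    (a list of rows) into a fixed out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Bin edges: floor for the start, ceil for the end, so every bin
        # covers at least one cell (neighbouring bins may overlap).
        ys = y1 + (i * h) // out_h
        ye = y1 + ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            xs = x1 + (j * w) // out_w
            xe = x1 + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

Whatever the size of the ROI, the output is always `out_h` × `out_w`, which is what lets proposals of different sizes feed the same fully connected layers.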

3.1.3 Faster R‑CNN

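Faster R‑CNN replaces the external proposal algorithm with a Region Proposal Network (RPN) that shares convolutional layers with the detection network. At each feature‑map location the RPN scores a set of reference boxes (anchors) of several scales and aspect ratios as object or background and regresses offsets to them. Proposals therefore come at almost no extra cost, which greatly speeds up inference.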
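The reference boxes an RPN scores can be enumerated in a few lines. This is a minimal sketch, assuming anchors centred on each feature‑map cell and an aspect ratio defined as height divided by width; the function name and arguments are illustrative.

```python
import math


def make_anchors(feat_w, feat_h, stride, scales, ratios):
    """Return anchors (x1, y1, x2, y2) in image pixels, one set per cell."""
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Centre of this feature-map cell in image coordinates.
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride
            for s in scales:
                for r in ratios:  # r = height / width; area stays s * s
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors
```

With three scales and three ratios, as in the original paper, each location carries nine anchors, so a 2×3 feature map yields 54 candidate boxes.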

3.1.4 Mask R‑CNN

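Mask R‑CNN extends Faster R‑CNN with a parallel branch that predicts a segmentation mask for each region of interest, enabling instance segmentation. It also replaces ROI pooling with RoIAlign, which avoids coordinate quantization so that masks stay aligned with the input pixels.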

3.1.5 Cascade R‑CNN

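A detector trained with a single IoU threshold faces a trade‑off: a low threshold admits noisy positives, while a high threshold leaves too few positives and leads to over‑fitting. Cascade R‑CNN addresses this by training a series of detectors with progressively higher IoU thresholds, each operating on the boxes refined by the previous stage, which reduces over‑fitting and improves high‑quality detection.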
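The role of the IoU threshold can be made concrete with a small sketch. It computes IoU for two boxes and reports whether a proposal would count as a positive sample at each stage, assuming the thresholds 0.5, 0.6, and 0.7 used in the Cascade R‑CNN paper. The helper names are illustrative, and the sketch covers only the labelling rule, not the box refinement between stages.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positive_per_stage(proposal, gt, thresholds=(0.5, 0.6, 0.7)):
    """For each stage, is `proposal` a positive sample for ground truth `gt`?"""
    overlap = iou(proposal, gt)
    return [overlap >= t for t in thresholds]
```

A proposal with IoU 0.65 is a positive for the first two stages but not the third, so later stages train only on better‑localized boxes.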

3.2 One‑stage Methods

One‑stage detectors directly predict class probabilities and bounding‑box coordinates, offering higher speed. The YOLO family exemplifies this approach.

3.2.1 YOLOv1

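YOLOv1 divides the image into an S×S grid. Each cell predicts B bounding boxes, each with a confidence score, plus one set of C class probabilities, so the network output is an S×S×(B·5+C) tensor. It is fast, but because each cell predicts only a few boxes and a single set of class probabilities, it struggles with small objects that appear in groups.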
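The grid formulation can be sketched as follows, assuming the PASCAL VOC settings from the original paper (S = 7, B = 2, and C = 20 classes). The function names are illustrative.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Grid cell (row, col) that contains an object's centre (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centres on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centres on the bottom edge
    return row, col


def output_size(S=7, B=2, C=20):
    """Length of the prediction vector: each cell holds B boxes
    (x, y, w, h, confidence) plus C class probabilities."""
    return S * S * (B * 5 + C)
```

With these settings the output is a 7×7×30 tensor, and only the cell containing an object's centre is responsible for predicting that object.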

3.2.2 YOLOv2 (YOLO9000)

YOLOv2 introduces batch normalization, high‑resolution classifier fine‑tuning, anchor boxes, dimension clustering, multi‑scale training, and a new backbone (Darknet‑19), improving both accuracy and speed.
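Dimension clustering chooses anchor shapes by running k‑means on the widths and heights of the training boxes, using 1 - IoU as the distance so that large boxes do not dominate the result. The sketch below is a simplified version that takes the initial centroids as an argument to stay deterministic; the function names are illustrative.

```python
def wh_iou(a, b):
    """IoU of two boxes given as (w, h) and aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def cluster_box_dims(boxes, centroids, iters=20):
    """k-means on (w, h) pairs with distance d = 1 - IoU."""
    centroids = list(centroids)
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for box in boxes:
            # Smallest 1 - IoU distance is the same as largest IoU.
            best = max(range(len(centroids)),
                       key=lambda k: wh_iou(box, centroids[k]))
            groups[best].append(box)
        new = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g))
            if g else c  # keep an empty cluster's centroid unchanged
            for g, c in zip(groups, centroids)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids
```

The returned centroids serve as the anchor widths and heights, so the network starts from priors that already match the dataset's typical box shapes.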

3.2.3 YOLOv3

YOLOv3 predicts boxes at three scales, uses Darknet‑53 as the backbone, and replaces the softmax with independent logistic classifiers trained with binary cross‑entropy, so that a box can carry several labels. It achieves strong performance at moderate IoU thresholds such as 0.5.
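Binary cross‑entropy treats every class as its own yes/no decision, which is what allows overlapping labels. The following is a minimal sketch for a single box, assuming raw logits as input; the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over classes; targets are 0 or 1 and
    several of them may be 1 at the same time."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)  # independent probability for this class
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(logits)
```

Unlike a softmax, the per‑class probabilities here do not have to sum to one, so raising the score of one label does not lower the score of another.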

3.2.4 YOLOv4

YOLOv4 combines many recent tricks (CSP, Mish, Mosaic, CIoU loss, etc.) to achieve state‑of‑the‑art speed‑accuracy trade‑offs on COCO.
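CIoU loss extends IoU with two penalties: the normalized distance between box centres and a term for aspect‑ratio mismatch. The sketch below follows the published formula for a single pair of boxes, assuming corner format (x1, y1, x2, y2) and positive widths and heights; it is not taken from any YOLOv4 code base.

```python
import math


def ciou_loss(pred, gt):
    """Complete-IoU loss: 1 - IoU + rho^2 / c^2 + alpha * v."""
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]

    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter)

    # Squared distance between the two box centres.
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) ** 2
            + ((pred[1] + pred[3]) - (gt[1] + gt[3])) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both.
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1.0 - iou) + v) if v > 0 else 0.0
    return 1.0 - iou + rho2 / c2 + alpha * v
```

Because of the centre‑distance term, the loss still provides a useful signal when the boxes do not overlap at all, where plain IoU loss is constant.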

3.2.5 YOLOv5

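YOLOv5, an open‑source PyTorch implementation from Ultralytics released without an accompanying paper, adds extensive data augmentation (Mosaic, Copy‑Paste, etc.), automatic anchor calculation, learning‑rate warm‑up, a cosine learning‑rate scheduler, an exponential moving average (EMA) of the weights, and mixed‑precision training.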
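Warm‑up followed by cosine decay can be written as a single function of the training step. The sketch below is illustrative and is not YOLOv5's actual scheduler; it assumes a linear warm‑up, and the function name and parameters are hypothetical.

```python
import math


def lr_at(step, total_steps, warmup_steps, base_lr, final_lr=0.0):
    """Learning rate at `step`: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warm-up steps.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Cosine curve from base_lr (progress 0) down to final_lr (progress 1).
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

The warm‑up avoids large, unstable updates while the weights and normalization statistics are still settling, and the cosine phase lowers the rate smoothly toward the end of training.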

3.2.6 YOLOv6

YOLOv6 redesigns the backbone with RepVGG‑style re‑parameterizable blocks, introduces task alignment learning (TAL) for label assignment, and adopts VariFocal loss for classification and SIoU/GIoU loss for box regression.

3.2.7 YOLOv7

YOLOv7 introduces a trainable bag‑of‑freebies, including planned re‑parameterized convolution and dynamic, coarse‑to‑fine label assignment, together with the E‑ELAN architecture, to further boost real‑time detection performance.

4. Technical Status and Future Trends

4.1 Challenges

Data and annotation cost.

Hard examples in production environments.

Small‑object detection.

Complex and variable backgrounds.

4.2 Model Improvement Directions

Richer data augmentation (mosaic, mixup, cutmix, etc.).

Stronger backbones (Darknet, RepVGG, ResNeXt, etc.).

More effective necks (FPN, SPP, PAN, etc.).

Enhanced basic components (convolutions, normalization, activation, loss, IoU/NMS).

Advanced feature extraction (attention, multi‑scale fusion).

Improved training strategies (warm‑up, cosine annealing, label smoothing, self‑adversarial training (SAT)).

Better label assignment (ATSS, OTA, PAA, TOOD).
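Among the basic components listed above, non‑maximum suppression (NMS) is the post‑processing step that removes duplicate detections of the same object. The following is a minimal sketch of the classic greedy variant, assuming boxes in (x1, y1, x2, y2) format; the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it overlaps no already-kept box by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep  # indices into `boxes`, highest score first
```

In multi‑class detection this is usually applied per class, and the threshold trades duplicate removal against the risk of suppressing genuinely adjacent objects.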

4.3 Emerging Trends

Lightweight models for edge and mobile devices.

Domain adaptation to handle distribution shifts.

Unsupervised, semi‑supervised, and weakly‑supervised learning.

Few‑shot detection.

Integration with AutoML for automated architecture search.

