Comprehensive Overview of Object Detection: From Traditional Methods to Modern Deep Learning Models
This article provides a comprehensive overview of object detection, covering traditional sliding‑window approaches; deep‑learning‑based two‑stage and one‑stage models such as R‑CNN, Fast R‑CNN, Faster R‑CNN, Mask R‑CNN, and the YOLO family; and current challenges and future research directions.
1. Introduction: Description of Object Detection Tasks
With the development of computer technology and the widespread application of computer vision, real‑time object detection based on image processing has become increasingly important. Object detection, which requires both classifying objects and localizing them precisely, is a multi‑task problem with broad applications in intelligent transportation, surveillance, military target detection, and medical navigation.
2. Traditional Object Detection Methods
Traditional algorithms typically follow three steps:
Scanning the image with a sliding window (see Figure 1).
Extracting features from each window, commonly using SIFT or HOG.
Classifying the extracted features with a classifier.
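The three steps above can be sketched in a few lines of NumPy. The gradient‑histogram feature below is a deliberately crude stand‑in for SIFT/HOG, and the window size, stride, and random image are arbitrary illustrative choices:

```python
import numpy as np

def sliding_windows(image, window=(64, 64), stride=32):
    """Yield (x, y, patch) for every window position (step 1)."""
    h, w = image.shape[:2]
    win_h, win_w = window
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]

def toy_features(patch):
    """Placeholder for SIFT/HOG (step 2): a crude orientation histogram
    weighted by gradient magnitude, L1-normalized."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)
    hist, _ = np.histogram(ang, bins=9, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

image = np.random.rand(128, 128)
windows = list(sliding_windows(image))
features = [toy_features(p) for _, _, p in windows]
```

In a real pipeline, step 3 would score each feature vector with a trained classifier (e.g. a linear SVM) and keep high‑scoring windows; the multi‑scale issue arises because this whole loop must be repeated for several window sizes.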
Two main issues arise:
Window size: Varying image and object sizes require multi‑scale windows, increasing computational cost and redundancy.
Feature design: Hand‑crafted features cannot fully capture rich image information.
3. Deep‑Learning‑Based Object Detection Methods
Rapid advances in deep learning and GPU computing have led to two dominant families of detectors: one‑stage and two‑stage models.
3.1 Two‑stage Methods
Two‑stage models first generate region proposals and then classify each proposal. Representative models include the R‑CNN series.
3.1.1 R‑CNN
R‑CNN combines region proposals with CNN features. Its pipeline consists of (1) category‑agnostic region proposals, (2) a large CNN for feature extraction, and (3) a linear SVM for classification.
3.1.2 Fast R‑CNN
Fast R‑CNN improves training efficiency by sharing convolutional features between region proposals and classification, using ROI pooling to produce fixed‑size feature maps.
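ROI pooling is the key component here: it turns an arbitrarily sized region of the shared feature map into a fixed‑size grid that a fully connected head can consume. A minimal NumPy sketch, assuming integer ROI coordinates already in feature‑map space (no spatial‑scale handling):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one ROI (x1, y1, x2, y2, feature-map coords) to a fixed grid."""
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    # Split the ROI into out_h x out_w bins and take the max of each bin.
    y_edges = np.linspace(0, h, out_h + 1).astype(int)
    x_edges = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[y_edges[i]:y_edges[i + 1],
                               x_edges[j]:x_edges[j + 1]].max()
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)
pooled = roi_pool(fmap, (0, 0, 4, 4))
```

Because every ROI is pooled to the same shape, proposals of any size share one downstream classifier, which is exactly what lets Fast R‑CNN reuse convolutional features.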
3.1.3 Faster R‑CNN
Faster R‑CNN integrates region proposal generation (RPN) with detection, sharing convolutional layers and adding bounding‑box regression, which greatly speeds up inference.
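The RPN slides over the shared feature map and, at each position, scores k reference boxes (anchors). A hedged sketch of the standard anchor generation, using the commonly cited 3 scales × 3 aspect ratios; the base size and scale values are the usual illustrative choices, not mandated by the architecture:

```python
import numpy as np

def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate the k = len(scales) * len(ratios) anchors for one RPN
    position, as (x1, y1, x2, y2) boxes centred on the origin."""
    anchors = []
    for scale in scales:
        area = (base * scale) ** 2
        for ratio in ratios:            # ratio = height / width
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = make_anchors()
```

At inference, these template boxes are translated to every feature‑map position, and the bounding‑box regression head refines the best‑scoring ones.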
3.1.4 Mask R‑CNN
Mask R‑CNN extends Faster R‑CNN with an additional branch for instance segmentation.
3.1.5 Cascade R‑CNN
Cascade R‑CNN addresses the IoU threshold problem by training a series of detectors with progressively higher IoU thresholds, reducing over‑fitting and improving high‑quality detection.
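The IoU threshold in question decides which proposals count as positives during training. A minimal IoU implementation makes the cascade idea concrete: each successive stage keeps only proposals whose overlap with the ground truth clears a higher bar (e.g. 0.5 → 0.6 → 0.7):

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

Training a single detector directly at IoU 0.7 over‑fits because few proposals qualify as positives; the cascade sidesteps this by letting each stage refine boxes before the next, stricter stage sees them.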
3.2 One‑stage Methods
One‑stage detectors directly predict class probabilities and bounding‑box coordinates, offering higher speed. The YOLO family exemplifies this approach.
3.2.1 YOLOv1
YOLOv1 divides the image into an S×S grid, with each cell predicting B bounding boxes with confidence scores plus class probabilities. It is fast but struggles with small or densely packed objects, since each cell can predict only a fixed, small number of boxes.
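The grid‑assignment rule can be stated in two lines: a ground‑truth box is the responsibility of the single cell containing its centre. A sketch with pixel coordinates and S = 7 as in the paper:

```python
def responsible_cell(box_center, image_size, S=7):
    """Return the (row, col) of the grid cell responsible for a box,
    i.e. the cell containing the box centre (YOLOv1's assignment rule)."""
    cx, cy = box_center
    w, h = image_size
    col = min(int(cx / w * S), S - 1)   # clamp so cx == w stays in-grid
    row = min(int(cy / h * S), S - 1)
    return row, col
```

This one‑cell‑per‑object rule is the source of the limitation above: two objects whose centres fall in the same cell compete for the same B box slots.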
3.2.2 YOLOv2 (YOLO9000)
YOLOv2 introduces batch normalization, high‑resolution classifier fine‑tuning, anchor boxes, dimension clustering, multi‑scale training, and a new backbone (Darknet‑19), improving both accuracy and speed.
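Dimension clustering replaces hand‑picked anchors with k‑means over the training boxes' widths and heights, using the distance d = 1 − IoU rather than Euclidean distance so that large boxes do not dominate. A simplified sketch, with a deterministic initialisation instead of the usual random seeding:

```python
import numpy as np

def wh_iou(boxes, centroids):
    """IoU between (w, h) pairs, treating boxes as sharing one corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=20):
    """k-means on box (w, h) with distance 1 - IoU (dimension clustering)."""
    centroids = boxes[:k].astype(float).copy()   # simple deterministic init
    for _ in range(iters):
        # minimising (1 - IoU) is the same as maximising IoU
        assign = np.argmax(wh_iou(boxes, centroids), axis=1)
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids

boxes = np.array([[10, 10], [11, 10], [10, 11],
                  [100, 100], [101, 99], [99, 101]], dtype=float)
anchor_wh = kmeans_anchors(boxes, 2)
```

On this toy data the two centroids converge to the mean width/height of each cluster of boxes, which is exactly the anchor prior the detector would then use.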
3.2.3 YOLOv3
YOLOv3 adopts multi‑scale predictions, uses Darknet‑53 as backbone, and employs binary cross‑entropy for multi‑label classification, achieving strong performance at moderate IoU thresholds.
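Using independent sigmoids with binary cross‑entropy, instead of a single softmax, lets one box carry several labels at once (e.g. both "woman" and "person"). A minimal sketch of the per‑class loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_multilabel(logits, targets):
    """Independent binary cross-entropy per class: unlike softmax,
    several classes can be 'on' simultaneously for the same box."""
    p = sigmoid(np.asarray(logits, dtype=float))
    t = np.asarray(targets, dtype=float)
    eps = 1e-12                      # guard against log(0)
    return -np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
```

Because each class is an independent binary decision, the loss imposes no "probabilities sum to one" constraint across classes.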
3.2.4 YOLOv4
YOLOv4 combines many recent tricks (CSP, Mish, Mosaic, CIoU loss, etc.) to achieve state‑of‑the‑art speed‑accuracy trade‑offs on COCO.
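CIoU is representative of these tricks: it augments plain IoU with a centre‑distance penalty and an aspect‑ratio consistency term, giving a useful gradient even when boxes do not overlap. A sketch of the loss for a single box pair:

```python
import math

def ciou_loss(a, b):
    """1 - CIoU for two (x1, y1, x2, y2) boxes (a sketch of the CIoU loss)."""
    # plain IoU
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter)
    # squared centre distance over squared diagonal of the enclosing box
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    rho2 = (ax - bx) ** 2 + (ay - by) ** 2
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan((a[2] - a[0]) / (a[3] - a[1]))
                              - math.atan((b[2] - b[0]) / (b[3] - b[1]))) ** 2
    alpha = v / (1 - iou + v) if (1 - iou + v) > 0 else 0.0
    return 1 - (iou - rho2 / c2 - alpha * v)
```

For two disjoint boxes, plain IoU loss is flat at 1, while the centre‑distance term here still pulls the prediction toward the target.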
3.2.5 YOLOv5
YOLOv5, an open‑source implementation, adds extensive data‑augmentation (Mosaic, Copy‑Paste, etc.), auto‑anchor calculation, warm‑up, cosine LR scheduler, EMA, and mixed‑precision training.
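Warm‑up plus a cosine schedule is easy to state precisely: the learning rate ramps up linearly for the first few steps, then decays along a half cosine. A sketch (all hyperparameter values here are illustrative, not YOLOv5's actual defaults):

```python
import math

def lr_at(step, total_steps, base_lr=0.01, warmup_steps=100, final_lr=0.0):
    """Linear warm-up followed by cosine decay to final_lr."""
    if step < warmup_steps:
        # ramp linearly from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

The warm‑up phase keeps early gradients (when batch‑norm statistics and the EMA weights are still unreliable) from destabilising training; the cosine tail anneals smoothly instead of dropping in steps.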
3.2.6 YOLOv6
YOLOv6 redesigns the backbone with RepVGG, introduces task‑alignment learning for label assignment, and adopts VariFocal loss and SIoU/GIoU for regression.
3.2.7 YOLOv7
YOLOv7 introduces bag‑of‑freebies, dynamic label assignment, and the E‑ELAN architecture to further boost real‑time detection performance.
4. Technical Status and Future Trends
4.1 Challenges
Data and annotation cost.
Hard examples in production environments.
Small‑object detection.
Complex and variable backgrounds.
4.2 Model Improvement Directions
Richer data augmentation (mosaic, mixup, cutmix, etc.).
Stronger backbones (Darknet, ResNet, ResNeXt, etc.).
More effective necks (FPN, SPP, PAN, etc.).
Enhanced basic components (convolutions, normalization, activation, loss, IoU/NMS).
Advanced feature extraction (attention, multi‑scale fusion).
Improved training strategies (warm‑up, cosine annealing, label smoothing, SAT).
Better label assignment (ATSS, OTA, PAA, TOOD).
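Several of the basic components listed above (IoU, NMS) are small enough to write out. Greedy NMS, the baseline that variants such as Soft‑NMS and DIoU‑NMS improve on, is about a dozen lines of NumPy:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes overlapping it above iou_thresh, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]         # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the kept box against all remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # suppress heavy overlaps
    return keep

kept = nms([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]],
           [0.9, 0.8, 0.7])
```

The "better label assignment" methods listed (ATSS, OTA, PAA, TOOD) operate on the training side of the same matching problem that NMS solves at inference time.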
4.3 Emerging Trends
Lightweight models for edge and mobile devices.
Domain adaptation to handle distribution shifts.
Unsupervised, semi‑supervised, and weakly‑supervised learning.
Few‑shot detection.
Integration with AutoML for automated architecture search.
5. References
[1] Girshick R, Donahue J, Darrell T, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR, 2014.
[2] Girshick R. Fast R‑CNN. ICCV, 2015.
[3] Ren S, He K, Girshick R, et al. Faster R‑CNN: Towards Real‑Time Object Detection with Region Proposal Networks. IEEE TPAMI, 2017.
[4] He K, Gkioxari G, Dollár P, et al. Mask R‑CNN. ICCV, 2017.
[5] Cai Z, Vasconcelos N. Cascade R‑CNN: Delving into High Quality Object Detection. CVPR, 2018.
[6] Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real‑Time Object Detection. CVPR, 2016.
[7] Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger. CVPR, 2017.
[8] Redmon J, Farhadi A. YOLOv3: An Incremental Improvement. arXiv, 2018.
[9] Bochkovskiy A, Wang C Y, Liao H Y M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv, 2020.
[10] Li C, et al. YOLOv6: A Single‑Stage Object Detection Framework for Industrial Applications. arXiv, 2022.
[11] Wang C Y, Bochkovskiy A, Liao H Y M. YOLOv7: Trainable Bag‑of‑Freebies Sets New State‑of‑the‑Art for Real‑Time Object Detectors. arXiv, 2022.