
Comprehensive Overview of Object Detection: From Traditional Methods to Modern Deep Learning Models

This article provides a comprehensive overview of object detection, covering traditional sliding-window approaches and deep-learning-based two-stage and one-stage models such as R-CNN, Faster R-CNN, and the YOLO series, and discussing current challenges, improvement directions, and future research trends in the field.


1. Introduction: Object Detection Task Description

With the rapid development of computer technology and the widespread application of computer-vision principles, real-time object detection and tracking based on image processing have become increasingly popular. Object detection is more fine-grained than classification: it must separate foreground from background, locate each object, and predict its size, making it a multi-task problem.

2. Traditional Object Detection Methods

Traditional pipelines consist of three steps: (1) scanning the image with a sliding window of multiple scales, (2) extracting handcrafted features such as SIFT or HOG from each window, and (3) classifying each window with a classifier. This approach suffers from two major issues: the need for many scales leads to high computational cost and redundant windows, and handcrafted features cannot capture the full richness of image information.

Figure 1: Traditional sliding‑window scanning
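The three-step pipeline above can be sketched in a few lines of Python. This is an illustrative toy, not production code: `classify` stands in for any window classifier (e.g. an SVM over HOG features), and the resize is a crude nearest-neighbour one to keep the sketch dependency-light.

```python
import numpy as np

def sliding_windows(image, window=(64, 64), stride=32):
    """Yield (top, left, crop) for every window position at one scale."""
    h, w = image.shape[:2]
    for top in range(0, h - window[0] + 1, stride):
        for left in range(0, w - window[1] + 1, stride):
            yield top, left, image[top:top + window[0], left:left + window[1]]

def detect(image, classify, scales=(1.0, 0.75, 0.5)):
    """Run a window classifier at several scales; return hits in original coords."""
    hits = []
    for s in scales:
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        # Nearest-neighbour resize keeps the sketch dependency-free.
        ys = (np.arange(h) / s).astype(int)
        xs = (np.arange(w) / s).astype(int)
        resized = image[ys][:, xs]
        for top, left, crop in sliding_windows(resized):
            score = classify(crop)  # e.g. an SVM score over HOG features
            if score > 0.5:
                hits.append((int(top / s), int(left / s), score))
    return hits
```

Even this toy makes the cost problem visible: every extra scale multiplies the number of windows the classifier must evaluate.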

3. Deep Learning Based Object Detection Methods

The rise of deep learning and GPU acceleration has led to two dominant families of detectors: two‑stage methods (e.g., R‑CNN series) that first generate region proposals and then classify them, and one‑stage methods (e.g., YOLO series) that directly predict class probabilities and bounding boxes.

3.1 Two‑stage

3.1.1 R‑CNN

R‑CNN extracts region proposals, feeds each proposal into a CNN to obtain fixed‑size feature vectors, and classifies them with a linear SVM. The system consists of region proposal generation, feature extraction, and classification.

Figure 2: R‑CNN architecture

3.1.2 Fast R‑CNN

Fast R‑CNN improves R‑CNN by sharing convolutional computation across all proposals and introducing ROI pooling to produce fixed‑size feature maps. It uses selective search for proposals and trains both classification and bounding‑box regression jointly, greatly speeding up training and inference.

Figure 3: Fast R‑CNN architecture
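The key idea, ROI pooling, is easy to sketch: each proposal region on the shared feature map is divided into a fixed grid of bins and max-pooled, so proposals of any size yield the same output shape. A minimal single-channel version (illustrative only, not the paper's implementation):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of a feature map to a fixed output size.

    feature_map: (H, W) array; roi: (y0, x0, y1, x1) in feature-map coords.
    """
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    oh, ow = output_size
    # Split the region into an oh x ow grid of (possibly uneven) bins.
    ys = np.linspace(0, region.shape[0], oh + 1).astype(int)
    xs = np.linspace(0, region.shape[1], ow + 1).astype(int)
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Because pooling happens on the shared feature map, the expensive convolutions run once per image instead of once per proposal.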

3.1.3 Faster R‑CNN

Faster R‑CNN integrates a Region Proposal Network (RPN) into the CNN, enabling end‑to‑end training and significantly higher speed. The RPN predicts objectness scores and bounding‑box refinements for a set of anchors.

Figure 4: Faster R‑CNN architecture
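Anchor generation itself is simple bookkeeping; a sketch follows (the ratios, scales, and stride below mirror common Faster R-CNN defaults, not values stated in this article):

```python
import numpy as np

def make_anchors(base=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate (x0, y0, x1, y1) anchors centred at the origin of one cell."""
    anchors = []
    for r in ratios:
        for s in scales:
            area = (base * s) ** 2
            w = np.sqrt(area / r)  # aspect ratio r = h / w
            h = w * r
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride=16):
    """Tile the base anchors over every position of a feat_h x feat_w map."""
    xs, ys = np.meshgrid(np.arange(feat_w) * stride, np.arange(feat_h) * stride)
    shifts = np.stack([xs, ys, xs, ys], axis=-1).reshape(-1, 1, 4)
    return (anchors[None] + shifts).reshape(-1, 4)
```

The RPN then scores every tiled anchor for objectness and regresses a refinement for it.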

3.1.4 Mask R‑CNN

Mask R‑CNN extends Faster R‑CNN with an additional branch that predicts a binary mask for each detected instance, enabling instance segmentation.

Figure 5: Mask R‑CNN framework

3.1.5 Cascade R‑CNN

Cascade R‑CNN addresses the IoU threshold problem by training a series of detectors with increasing IoU thresholds, allowing each stage to refine the proposals from the previous stage and reducing over‑fitting.

Figure 6: Cascade R‑CNN compared with other frameworks
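The IoU criterion at the heart of this design is worth writing out; the thresholds (0.5, 0.6, 0.7) below follow the paper's typical three-stage setup:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A cascade labels a proposal positive only if it clears its stage's
# threshold, so later stages see progressively better-aligned boxes.
def is_positive(proposal, gt, stage_thresholds=(0.5, 0.6, 0.7), stage=0):
    return iou(proposal, gt) >= stage_thresholds[stage]
```

Training a single detector directly at IoU 0.7 starves it of positives; the cascade avoids this because each stage's output distribution feeds the stricter next stage.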

3.2 One‑stage

One-stage detectors directly predict class probabilities and bounding-box coordinates in a single pass, trading some localization accuracy for much higher inference speed. The YOLO family is the representative series covered below.

3.2.1 YOLOv1

YOLOv1 divides the input image into an S×S grid; each cell predicts B bounding boxes with confidence scores, plus C conditional class probabilities. Its advantages are speed and good generalization; its limitations are the fixed, small number of predictions per cell and poor performance on small or crowded objects.

Figure 7: YOLOv1 network
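The output geometry is concrete enough to compute. With the paper's settings (S=7, B=2, C=20 for PASCAL VOC), each image maps to a 7×7×30 tensor, and the class-specific confidence of a box is the product of its confidence and the cell's class probability:

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes of (x, y, w, h, confidence) plus C
    conditional class probabilities, giving an S x S x (B*5 + C) output."""
    return (S, S, B * 5 + C)

def cell_scores(box_confidences, class_probs):
    """Class-specific confidence per box: Pr(class | object) * Pr(object) * IoU,
    where each box confidence already folds in Pr(object) * IoU."""
    return [[c * p for p in class_probs] for c in box_confidences]
```

The fixed S×S×B budget is exactly why nearby small objects compete for the same cell's predictions.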

3.2.2 YOLOv2

YOLOv2 (YOLO9000) introduces batch normalization, higher‑resolution classifier fine‑tuning, anchor boxes, dimension clustering, multi‑scale training, and a new backbone (Darknet‑19). These changes improve accuracy while maintaining speed.

Figure 8: YOLOv2 improvements over YOLOv1

Batch Normalization: adds BN to all convolutional layers, improving mAP by more than 2%.

High-resolution classifier: fine-tunes the classifier on 448×448 images before detection training.

Anchor boxes: adopts Faster R-CNN-style anchors, increasing recall.

Dimension clusters: k-means clustering of ground-truth box dimensions to choose anchor shapes.

Direct location prediction: constrains box-centre predictions to lie within their grid cells.

Fine-grained features: introduces a passthrough layer to preserve small-object details.

Multi-scale training: randomly varies the input size every 10 batches.
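Dimension clustering is the most self-contained of these ideas: YOLOv2 runs k-means over the training boxes' widths and heights with distance d = 1 − IoU, so large boxes are not penalised more than small ones. A rough NumPy sketch (assuming corner-aligned boxes, so IoU depends only on w and h):

```python
import numpy as np

def kmeans_iou(wh, k=5, iters=50, seed=0):
    """Cluster box (w, h) pairs with distance d = 1 - IoU, as in YOLOv2."""
    rng = np.random.default_rng(seed)
    wh = np.asarray(wh, dtype=float)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Corner-aligned IoU: intersection is min(w)*min(h).
        inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centers[None, :, 1])
        union = wh[:, 0] * wh[:, 1]
        union = union[:, None] + centers[:, 0] * centers[:, 1] - inter
        assign = np.argmax(inter / union, axis=1)  # max IoU = min distance
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers
```

The resulting cluster centres become the anchor shapes, replacing hand-picked aspect ratios.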

3.2.3 YOLOv3

YOLOv3 builds on YOLOv2 by using Darknet‑53 as backbone, multi‑scale predictions, and binary cross‑entropy loss for multi‑label classification. It achieves strong performance at IoU=0.5 but degrades at higher IoU thresholds.

Figure 11: Darknet‑53 backbone
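The switch to binary cross-entropy means each class is an independent sigmoid decision rather than a single softmax, so one box can legitimately carry overlapping labels (e.g. both "person" and "woman"). A sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_bce(logits, targets):
    """Independent per-class binary cross-entropy (YOLOv3-style multi-label).

    Unlike softmax, each class is an independent yes/no decision, so
    overlapping labels can both be 1 for the same box.
    """
    loss = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        loss -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return loss / len(logits)
```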

3.2.4 YOLOv4

YOLOv4 combines many recent tricks (CSP, Mish, Mosaic, CIoU loss, etc.) to reach 43.5% AP on COCO while being trainable on a single 1080Ti/2080Ti GPU.

Figure 12: YOLOv4 compared with other methods

3.2.5 YOLOv5

YOLOv5 (open-source on GitHub) replaces the early Focus layer with an equivalent strided convolution, replaces SPP with the faster but functionally equivalent SPPF, and provides five model sizes (n, s, m, l, x). It adds extensive data augmentation (Mosaic, Copy-Paste, MixUp, etc.) and training tricks such as warm-up, a cosine LR scheduler, EMA, and automatic anchor calculation.

Figure 13: YOLOv5l model

Data augmentation: Mosaic, Copy‑Paste, Random affine, MixUp, Albumentations, HSV augmentation, horizontal flip.

Training strategy: multi‑scale training, automatic anchor clustering, warm‑up, cosine LR, EMA, mixed‑precision, hyper‑parameter evolution.

Loss: BCE for classification/objectness, CIoU for localization with weighted scale‑specific losses.
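CIoU extends plain IoU with two penalties, normalised centre distance and aspect-ratio consistency (Zheng et al., 2020); the regression loss is then 1 − CIoU. A sketch for axis-aligned (x0, y0, x1, y1) boxes:

```python
import math

def ciou(a, b, eps=1e-9):
    """Complete-IoU: IoU - rho^2/c^2 - alpha*v, where rho is the centre
    distance, c the enclosing-box diagonal, and v the aspect-ratio penalty."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    iou = inter / (wa * ha + wb * hb - inter + eps)
    # Squared centre distance over squared enclosing-box diagonal.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    rho2 = ((a[0] + a[2]) - (b[0] + b[2])) ** 2 / 4 + \
           ((a[1] + a[3]) - (b[1] + b[3])) ** 2 / 4
    c2 = cw ** 2 + ch ** 2 + eps
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

# Box-regression loss: 1 - ciou(pred, target).
```

Unlike plain IoU, CIoU still gives a useful gradient when boxes do not overlap at all, since the centre-distance term stays informative.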

3.2.6 YOLOv6

YOLOv6 redesigns the backbone with RepVGG/RepBlock, uses Rep‑PAN in the neck, and introduces the TAL (Task Alignment Learning) label assignment and VariFocal + SIoU/GIoU losses. It also incorporates self‑distillation for industrial deployment.

Figure 15: YOLOv6 framework

Backbone: RepVGG‑style blocks for small models, CSP‑Stack‑Rep for large models.

Neck: Rep‑PAN topology.

Label assignment: TAL dynamic matching.

Losses: VariFocal for classification, SIoU/GIoU for regression.
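The re-parameterisation trick behind RepVGG/RepBlock is that a 3×3 conv, a 1×1 conv, and an identity shortcut can be algebraically folded into a single 3×3 conv for inference. A minimal NumPy sketch of the kernel fusion (BN folding omitted for brevity):

```python
import numpy as np

def fuse_rep_branch(k3, k1, identity, channels):
    """Fold RepVGG's 3x3 + 1x1 + identity branches into one 3x3 kernel.

    k3: (C, C, 3, 3) kernel; k1: (C, C, 1, 1) kernel. The 1x1 kernel is
    zero-padded into the centre of a 3x3 kernel, and the identity branch
    becomes a centred one-hot kernel, so the training-time multi-branch
    block collapses to a single convolution at inference.
    """
    fused = k3.copy()
    fused[:, :, 1, 1] += k1[:, :, 0, 0]  # 1x1 sits at the kernel centre
    if identity:
        for c in range(channels):
            fused[c, c, 1, 1] += 1.0  # identity = centred delta kernel
    return fused
```

This is why the deployed model is a plain single-branch network even though training uses multi-branch blocks.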

3.2.7 YOLOv7

YOLOv7 introduces bag‑of‑freebies, ELAN‑based E‑ELAN expansion, compound model scaling, and advanced label‑assignment strategies (coarse‑fine auxiliary/lead heads). It achieves state‑of‑the‑art speed‑accuracy trade‑offs.

Figure 16: Expanded Efficient Layer Aggregation Network (E‑ELAN)

Bag‑of‑freebies: techniques that improve accuracy without extra inference cost.

E‑ELAN: expand‑shuffle‑merge cardinality to enhance learning capacity.

Compound scaling: jointly scales depth and width of blocks.

Advanced label assignment: auxiliary and lead heads with coarse‑to‑fine supervision.

4. Technical Status and Future Trends

4.1 Challenges

Data and annotation cost: Large labeled datasets are expensive to build and may contain labeling errors.

Difficult samples: Models often underperform on hard examples encountered in production.

Small-object detection: Small objects provide weak features, making models prone to over-fitting on them or missing them entirely.

Complex backgrounds: Occlusion, blur, illumination changes, crowding, and high inter-class similarity hinder detection.

4.2 Model Improvement Directions

Richer data augmentation (mosaic, mixup, cutmix, etc.).

More powerful backbones (Darknet, RepNet, ResNeXt, etc.).

Effective neck designs (FPN, SPP, PAN, etc.).

Enhanced basic components: convolutions, normalization, activations, pooling, regularization, loss functions, IoU/NMS algorithms.

Stronger feature extraction via attention, context, multi‑scale fusion.

Improved training strategies: warm‑up, cosine annealing, genetic algorithms, label smoothing, SAT.

Better label assignment methods: ATSS, OTA, PAA, TOOD, etc.
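Of the components listed above, NMS is compact enough to write out in full; the greedy O(n²) version below keeps the highest-scoring box and discards any remaining box that overlaps it beyond the IoU threshold:

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        ua = ((a[2] - a[0]) * (a[3] - a[1]) +
              (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / ua if ua else 0.0

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

Variants such as Soft-NMS and DIoU-NMS change only the suppression rule in the last loop, which is why they slot into existing detectors so easily.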

4.3 Emerging Trends

Lightweight models: Deployable on mobile/edge devices for real-time inference.

Domain adaptation: Reduce distribution shift between training and deployment data.

Unsupervised / semi-supervised learning: Leverage unlabeled data for detection.

Few-shot detection: Rapidly learn new categories from limited examples.

AutoML for detection: Automate architecture search and hyper-parameter tuning.

5. References

[1] Girshick R, Donahue J, Darrell T, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR, 2014.

[2] Girshick R. Fast R‑CNN. International Conference on Computer Vision, 2015.

[3] Ren S, He K, Girshick R, et al. Faster R‑CNN: Towards Real‑Time Object Detection with Region Proposal Networks. IEEE TPAMI, 2017.

[4] He K, Gkioxari G, Dollar P, et al. Mask R‑CNN. ICCV, 2017.

[5] Z. Cai and N. Vasconcelos, "Cascade R‑CNN: Delving Into High Quality Object Detection," CVPR, 2018.

[6] Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real‑Time Object Detection. CVPR, 2016.

[7] Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger. CVPR, 2017.

[8] Redmon J, Farhadi A. YOLOv3: An Incremental Improvement. arXiv, 2018.

[9] Bochkovskiy A, Wang C Y, Liao H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv, 2020.

[10] Li C, et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv, 2022.

[11] Wang C Y, Bochkovskiy A, Liao H. YOLOv7: Trainable bag‑of‑freebies sets new state‑of‑the‑art for real‑time object detectors. arXiv, 2022.

Tags: Computer Vision, deep learning, object detection, R-CNN, YOLO
Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
