
Survey of Deep Learning Based Object Detection Algorithms

This survey reviews two‑stage and one‑stage deep learning object detection methods—from early R‑CNN and OverFeat to modern Faster R‑CNN, Mask R‑CNN, SSD, and YOLO variants—detailing their architectural advances, training strategies, speed‑accuracy trade‑offs, and benchmark performance for researchers and industry practitioners.

Meitu Technology

Object detection is a fundamental task in computer vision with nearly two decades of research. With the rise of deep learning, detection methods have shifted from hand‑crafted features to deep neural networks, evolving from early R‑CNN and OverFeat to modern two‑stage (e.g., Faster R‑CNN, R‑FCN, Mask R‑CNN) and one‑stage (e.g., SSD, YOLO series, Pelee) approaches.

The purpose of this survey is twofold: to provide newcomers with a concise technical overview of object detection, and to offer industry practitioners references for selecting and adapting detection methods to real‑world scenarios.

Background

Object detection aims to locate and classify objects in images or videos. Challenges include variable object counts, diverse appearances, occlusions, and lighting conditions. Modern methods are broadly categorized into two‑stage pipelines that generate region proposals before classification (e.g., R‑CNN series) and one‑stage pipelines that directly predict class and bounding‑box coordinates (e.g., YOLO, SSD).

Improvements to Two‑Stage and One‑Stage Algorithms

R‑FCN and R‑FCN‑3000 : R‑FCN introduces position‑sensitive score maps, so that each spatial bin of a region is scored by its own set of channels and the detector stays sensitive to object position. R‑FCN‑3000 further reduces computation by grouping classes into super‑classes, which decreases the number of score‑map channels while maintaining accuracy.
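
The position‑sensitive voting described above can be sketched in a few lines. This is a minimal pure‑Python illustration, not the authors' implementation: the function names `ps_roi_pool`, `ps_score_map_channels`, and `quadrant_map`, the end‑exclusive RoI convention, the toy 4x4 feature map, and the choice of 20 super‑classes are all assumptions made for the example. Each of the k x k RoI bins reads only from its own score map, and the bin responses are averaged into one class score.

```python
import math


def ps_score_map_channels(num_classes, k=7):
    """Channel count of position-sensitive score maps: k*k bins per class, plus background."""
    return k * k * (num_classes + 1)


def ps_roi_pool(score_maps, roi, k):
    """Position-sensitive RoI pooling for ONE class.

    score_maps: list of k*k 2-D maps (lists of rows); map i*k+j serves bin (row i, col j).
    roi: (x0, y0, x1, y1) in feature-map cells, end-exclusive (hypothetical convention).
    """
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / k
    bin_h = (y1 - y0) / k
    votes = []
    for i in range(k):
        for j in range(k):
            channel = score_maps[i * k + j]  # only this map is read for bin (i, j)
            ya = y0 + math.floor(i * bin_h)
            yb = y0 + math.ceil((i + 1) * bin_h)
            xa = x0 + math.floor(j * bin_w)
            xb = x0 + math.ceil((j + 1) * bin_w)
            cells = [channel[y][x] for y in range(ya, yb) for x in range(xa, xb)]
            votes.append(sum(cells) / len(cells))
    return sum(votes) / len(votes)  # average voting over the k*k bins


def quadrant_map(row_half, col_half, size=4):
    """Toy score map that fires only in one quadrant of a size x size grid."""
    half = size // 2
    return [[1.0 if (y >= half) == bool(row_half) and (x >= half) == bool(col_half) else 0.0
             for x in range(size)] for y in range(size)]


# Four maps for k = 2: top-left, top-right, bottom-left, bottom-right.
maps = [quadrant_map(i, j) for i in range(2) for j in range(2)]

aligned_score = ps_roi_pool(maps, (0, 0, 4, 4), k=2)  # RoI matches the object layout
shifted_score = ps_roi_pool(maps, (2, 2, 4, 4), k=2)  # RoI covers only one quadrant

per_class_channels = ps_score_map_channels(3000)   # one channel group per fine-grained class
super_class_channels = ps_score_map_channels(20)   # 20 super-classes: illustrative only
```

In this toy setup the aligned RoI scores 1.0 and the shifted one 0.25, which is the sensitivity to position that the score maps are meant to provide. The channel arithmetic also suggests why grouping into super‑classes helps: at k = 7, 3,000 per‑class groups need 147,049 channels, while 20 groups need 1,029.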

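Mask R‑CNN : Extends Faster R‑CNN with a parallel mask branch for pixel‑level segmentation, replaces RoIPooling with RoIAlign to avoid the misalignment introduced by coordinate quantization, and uses a multi‑task loss that combines classification, bounding‑box regression, and mask prediction.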
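
The idea behind RoIAlign can be illustrated with a small sketch: instead of snapping bin boundaries to integer cells, the feature map is sampled at fractional positions with bilinear interpolation and the samples are averaged. The code below is a simplified pure‑Python illustration under stated assumptions, not the reference implementation: the names `bilinear_sample` and `roi_align_bin`, the 2x2 sampling grid, and the convention that `feature[y][x]` sits exactly at integer coordinates are all chosen for the example.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D map (list of rows) at a fractional (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1)
    y0, x0 = math.floor(y), math.floor(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align_bin(feature, y_start, y_end, x_start, x_end, samples=2):
    """Average samples x samples regularly spaced bilinear samples inside one RoI bin."""
    total = 0.0
    for i in range(samples):
        for j in range(samples):
            y = y_start + (i + 0.5) * (y_end - y_start) / samples
            x = x_start + (j + 0.5) * (x_end - x_start) / samples
            total += bilinear_sample(feature, y, x)
    return total / (samples * samples)


# Toy feature map whose value equals its x coordinate, so the exact answer is easy to check.
feature = [[float(x) for x in range(4)] for _ in range(4)]

# A bin with fractional edges: no rounding is applied to 0.6 or 1.6.
bin_value = roi_align_bin(feature, 0.5, 1.5, 0.6, 1.6)
```

Because the toy map is a linear ramp in x, the pooled value equals the mean x position of the sample points (about 1.1 here), so the fractional offset of the bin is preserved rather than rounded away.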

YOLO9000 and YOLOv3 : YOLO9000 improves speed and accuracy by adding batch normalization, higher‑resolution pre‑training, anchor boxes whose shapes are chosen by k‑means clustering, and multi‑scale training. YOLOv3 builds on YOLO9000 with a deeper Darknet‑53 backbone and multi‑scale predictions.
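
The k‑means step for anchor shapes can be sketched as below. YOLO9000 clusters ground‑truth box sizes using the distance 1 - IoU rather than Euclidean distance, so that large boxes do not dominate the error. The code is a minimal pure‑Python sketch under assumptions: the names `iou_wh` and `kmeans_anchors`, the deterministic initialization from area‑sorted boxes, the mean‑based centroid update, and the six toy boxes are all illustrative and not taken from the paper.

```python
def iou_wh(box, anchor):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50):
    """Cluster (width, height) pairs with distance d = 1 - IoU."""
    # Deterministic init (an assumption): evenly spaced picks from the area-sorted boxes.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    anchors = [ordered[int((i + 0.5) * step)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(b[0] for b in members) / len(members)
                mean_h = sum(b[1] for b in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor unchanged
        if new_anchors == anchors:
            break  # assignments have stabilized
        anchors = new_anchors
    return anchors


# Toy ground-truth sizes: three small square boxes and three wide boxes.
toy_boxes = [(10, 10), (12, 12), (11, 11), (100, 50), (110, 55), (105, 52)]
anchors = kmeans_anchors(toy_boxes, k=2)
```

On the toy data the two anchors settle on one small square shape and one wide shape, which is the kind of prior the detector then refines with predicted offsets.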

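Object detection at 200 FPS : Proposes a deep‑but‑narrow network, an objectness‑scaled distillation loss for transferring knowledge from a teacher detector, and FM‑NMS (feature‑map non‑maximum suppression) to reach 200 FPS while preserving high detection rates.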
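
One way to picture a distillation loss for a detector is to weight the teacher's soft targets by how confident the teacher is that a cell contains an object, so that background cells contribute little to the student's training signal. The sketch below is a hedged illustration of that weighting only: the function names, the use of squared error for both terms, and the `distill_weight` parameter are assumptions made for the example and should not be read as the paper's exact formulation.

```python
def squared_error(pred, target):
    """Sum of squared differences between two equal-length score lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def objectness_weighted_class_loss(student, teacher, teacher_objectness,
                                   ground_truth, distill_weight=1.0):
    """Label loss plus a teacher term scaled by the teacher's objectness.

    student, teacher, ground_truth: per-class scores for one grid cell.
    teacher_objectness: teacher confidence in [0, 1] that the cell holds an object.
    """
    hard_loss = squared_error(student, ground_truth)  # supervised term
    soft_loss = squared_error(student, teacher)       # distillation term
    return hard_loss + teacher_objectness * distill_weight * soft_loss


student_scores = [0.5, 0.5]
teacher_scores = [0.9, 0.1]
labels = [1.0, 0.0]

# Background-like cell: the teacher term is switched off entirely.
loss_background = objectness_weighted_class_loss(student_scores, teacher_scores, 0.0, labels)
# Confident object cell: the teacher term contributes in full.
loss_object = objectness_weighted_class_loss(student_scores, teacher_scores, 1.0, labels)
```

With a teacher objectness of 0 the loss reduces to the supervised term alone, and it grows toward the full combined loss as the teacher becomes more confident that an object is present.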

DSSD (Deconvolutional Single Shot Detector) : Enhances SSD with a top‑down deconvolutional module, residual units, and a two‑stage training scheme, achieving strong performance on VOC2007.

DSOD (Deeply Supervised Object Detector) : Trains detectors from scratch without ImageNet pre‑training, using a DenseNet‑style backbone and dense connections in the detection head, resulting in a compact model with competitive mAP.

For each surveyed paper, the survey lists the problem addressed, core ideas, network design, training strategy, and reported performance (e.g., mAP, inference speed) on standard benchmarks such as COCO, PASCAL VOC, and ImageNet.

Overall, the survey highlights the rapid evolution of object detection architectures, the trade‑offs between accuracy and speed, and emerging techniques such as knowledge distillation, multi‑scale feature fusion, and training without external pre‑trained models.
