Overview of Deep Learning Object Detection Methods and Detailed Implementation of Faster R‑CNN
This article reviews major deep‑learning object detection approaches—including one‑stage YOLO and SSD and two‑stage RCNN, Fast RCNN, and Faster RCNN—then provides a step‑by‑step explanation of Faster RCNN’s architecture, region‑proposal network, RoI pooling, loss functions, and sample PyTorch code.
About the author: Liu Ze joined Qunar's international ticket technology team in 2014 and now works in the strategic development data team on big-data and artificial-intelligence projects, focusing mainly on recognition, detection, and classification for images and short videos.
1. Object Detection Methods in Deep Learning
Object detection predicts both the class and the location (x, y, w, h) of each object. Current methods are divided into one‑stage (e.g., YOLO, SSD) and two‑stage (e.g., RCNN, Fast RCNN, Faster RCNN) approaches.
One‑Stage Methods
YOLO (You Only Look Once), 2015: The image is resized to 448×448, run through a single convolutional network in one forward pass, and the resulting predictions are thresholded and filtered. YOLO divides the image into an S×S grid and treats detection as a regression problem, outputting a vector such as [pc, px, py, ph, pw, c1, c2, c3] for each cell, where pc is the objectness confidence, (px, py, ph, pw) encode the box, and c1–c3 are class probabilities.
The network uses 1×1 convolutions to reduce feature dimensions. YOLO runs at roughly 45 fps while maintaining reasonable accuracy.
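To make the grid encoding concrete, here is a minimal sketch that decodes one cell's prediction vector into an absolute box. The parameterization (cell-relative center, image-relative size) follows the general idea of the original YOLO; exact encodings vary across YOLO versions, so treat this as illustrative.

```python
import numpy as np

def decode_cell(pred, row, col, S, img_w, img_h):
    """Decode one cell's [pc, px, py, ph, pw, c1, c2, c3] vector."""
    pc, px, py, ph, pw = pred[:5]
    class_probs = pred[5:]
    cx = (col + px) / S * img_w       # box centre in absolute pixels
    cy = (row + py) / S * img_h
    w, h = pw * img_w, ph * img_h     # box size is relative to the whole image
    x1, y1 = cx - w / 2, cy - h / 2   # convert to corner (x1, y1, x2, y2) format
    return pc, (x1, y1, x1 + w, y1 + h), int(np.argmax(class_probs))

# A confident prediction in cell (row=3, col=2) of a 7x7 grid on a 448x448 image
pred = np.array([0.9, 0.5, 0.5, 0.25, 0.25, 0.1, 0.8, 0.1])
conf, box, cls = decode_cell(pred, row=3, col=2, S=7, img_w=448, img_h=448)
# box is approximately (104, 168, 216, 280); cls is 1
```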
SSD (Single Shot MultiBox Detector) 2016
SSD improves on YOLO by predicting from multi-scale feature maps with default boxes of several aspect ratios, which increases accuracy for objects of varying sizes. SSD reaches up to 59 fps, far faster than Faster RCNN's 7 fps.
Key concepts introduced: IoU (Intersection‑over‑Union) for measuring overlap and NMS (Non‑Maximum Suppression) for filtering redundant boxes.
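IoU is simple to compute directly from corner coordinates; a minimal sketch for boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    xx1 = max(box_a[0], box_b[0])
    yy1 = max(box_a[1], box_b[1])
    xx2 = min(box_a[2], box_b[2])
    yy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, xx2 - xx1) * max(0.0, yy2 - yy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

NMS then repeatedly keeps the highest-scoring box and discards any remaining box whose IoU with it exceeds a threshold; a full implementation appears later in this article.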
Two‑Stage Methods
Two‑stage methods add a region‑proposal step before classification.
RCNN (Regions with CNN) 2013
RCNN extracts features with a deep CNN for each region proposal generated by Selective Search, then classifies with an SVM. This pipeline is slow because each region is processed independently.
Fast RCNN 2015
Fast RCNN introduces RoI pooling to share convolutional features across all proposals, reducing computation. It also replaces the SVM with a fully‑connected classification layer, enabling end‑to‑end training with a multi‑task loss.
Faster RCNN 2015
Faster RCNN replaces Selective Search with a Region Proposal Network (RPN) that predicts objectness scores and bounding‑box offsets directly from the shared feature map, greatly speeding up proposal generation.
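The RPN's anchors are typically generated as a small set of templates (scales × aspect ratios) that are then tiled across every feature-map location. A sketch assuming the common defaults of 3 scales and 3 ratios (9 anchors, as in the Faster R-CNN paper); the specific scale and base-size values here are illustrative:

```python
import numpy as np

def generate_anchor_base(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Return 9 anchor templates (y1, x1, y2, x2) centred at the origin."""
    anchors = np.zeros((len(ratios) * len(scales), 4), dtype=np.float32)
    for i, ratio in enumerate(ratios):
        for j, scale in enumerate(scales):
            # All anchors at one scale share the same area, (base_size * scale)^2;
            # the ratio redistributes that area between height and width.
            h = base_size * scale * np.sqrt(ratio)
            w = base_size * scale / np.sqrt(ratio)
            anchors[i * len(scales) + j] = [-h / 2, -w / 2, h / 2, w / 2]
    return anchors

anchor_base = generate_anchor_base()  # shape (9, 4); shifted to every feature-map cell
```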
2. Implementation of Faster RCNN
Faster RCNN consists of three parts:
Backbone network for feature extraction (e.g., ResNet, VGG‑16).
RPN that generates region proposals.
Top‑level network that performs final classification and bounding‑box regression.
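The three parts can be wired together as a simple pipeline. The skeleton below is purely illustrative; `extractor`, `rpn`, and `head` are placeholder names for the modules described in the following sections, not a fixed API:

```python
class FasterRCNNSketch:
    """Illustrative wiring of the three Faster R-CNN components."""

    def __init__(self, extractor, rpn, head):
        self.extractor = extractor  # backbone: image -> shared feature map
        self.rpn = rpn              # feature map -> region proposals (RoIs)
        self.head = head            # (feature map, RoIs) -> class scores + box offsets

    def predict(self, image):
        features = self.extractor(image)
        rois = self.rpn(features)
        scores, offsets = self.head(features, rois)
        return scores, offsets
```

The key design point is that both the RPN and the head consume the *same* feature map, so the expensive backbone runs only once per image.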
2.1 Backbone Network (Feature Extraction)
Example code using VGG-16 (its first 30 feature-extraction modules as the backbone, its classifier block minus the final layer as the head), freezing the first 10 modules (i.e., the first four convolutional layers):
import torch.nn as nn
from torchvision.models import vgg16

def decom_vgg16():
    # Use VGG-16 as the feature extractor
    model = vgg16(pretrained=False)
    features = list(model.features)[:30]
    # Drop the final classification layer; keep the 4096-d FC layers
    classifier = list(model.classifier)[:-1]
    classifier = nn.Sequential(*classifier)
    # Freeze the first 10 modules (the first four conv layers)
    for layer in features[:10]:
        for p in layer.parameters():
            p.requires_grad = False
    return nn.Sequential(*features), classifier

2.2 Region Proposal Network (RPN)
The RPN consists of a 3×3 convolution followed by two 1×1 convolutions that output objectness scores and bbox regressions:
self.conv1 = nn.Conv2d(in_channels, mid_channels, 3, 1, 1)
self.score = nn.Conv2d(mid_channels, n_anchor * 2, 1, 1, 0) # objectness
self.loc = nn.Conv2d(mid_channels, n_anchor * 4, 1, 1, 0) # bbox offsets

The RPN loss combines a smooth L1 loss for bbox regression and a cross-entropy loss for objectness:
import torch.nn.functional as F
from torch.autograd import Variable  # legacy API; a no-op wrapper in modern PyTorch

def _smooth_l1_loss(x, t, in_weight, sigma):
    sigma2 = sigma ** 2
    diff = in_weight * (x - t)
    abs_diff = diff.abs()
    # Quadratic below the 1/sigma^2 threshold, linear above it
    flag = (abs_diff.data < (1. / sigma2)).float()
    flag = Variable(flag)
    y = (flag * (sigma2 / 2) * (diff ** 2) +
         (1 - flag) * (abs_diff - 0.5 / sigma2))
    return y.sum()

rpn_cls_loss = F.cross_entropy(rpn_score, gt_rpn_label.cuda(), ignore_index=-1)

After obtaining rpn_locs and rpn_scores, proposals are filtered with NMS (implemented in pure Python for illustration):
import numpy as np

def py_cpu_nms(dets, thresh):
    x1 = dets[:, 0]; y1 = dets[:, 1]
    x2 = dets[:, 2]; y2 = dets[:, 3]
    scores = dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]                 # keep the highest-scoring remaining box
        keep.append(i)
        # Intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        # Discard boxes overlapping box i by more than the threshold
        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]
    return keep

2.3 RoI Pooling and Final Classification/Regression
RoI pooling converts each proposal to a fixed‑size feature map. The implementation iterates over proposals, extracts the corresponding region from the shared feature map, and applies max‑pooling to obtain a 7×7 (or other) output.
import torch
import torch.nn as nn
import numpy as np

class RoIPool(nn.Module):
    def __init__(self, pooled_height, pooled_width, spatial_scale):
        super(RoIPool, self).__init__()
        self.pooled_height = int(pooled_height)
        self.pooled_width = int(pooled_width)
        self.spatial_scale = float(spatial_scale)

    def forward(self, features, rois):
        batch_size, num_channels, data_h, data_w = features.size()
        num_rois = rois.size(0)
        # Output tensor on the same device/dtype as the feature map
        outputs = features.new_zeros(num_rois, num_channels,
                                     self.pooled_height, self.pooled_width)
        for roi_ind, roi in enumerate(rois):
            batch_ind = int(roi[0].item())
            # Map the RoI from image coordinates onto the feature map
            roi_start_w, roi_start_h, roi_end_w, roi_end_h = np.round(
                roi[1:].detach().cpu().numpy() * self.spatial_scale).astype(int)
            roi_width = max(roi_end_w - roi_start_w + 1, 1)
            roi_height = max(roi_end_h - roi_start_h + 1, 1)
            bin_size_w = float(roi_width) / self.pooled_width
            bin_size_h = float(roi_height) / self.pooled_height
            for ph in range(self.pooled_height):
                hstart = int(np.floor(ph * bin_size_h))
                hend = int(np.ceil((ph + 1) * bin_size_h))
                hstart = min(data_h, max(0, hstart + roi_start_h))
                hend = min(data_h, max(0, hend + roi_start_h))
                for pw in range(self.pooled_width):
                    wstart = int(np.floor(pw * bin_size_w))
                    wend = int(np.ceil((pw + 1) * bin_size_w))
                    wstart = min(data_w, max(0, wstart + roi_start_w))
                    wend = min(data_w, max(0, wend + roi_start_w))
                    if hend <= hstart or wend <= wstart:
                        outputs[roi_ind, :, ph, pw] = 0
                    else:
                        # Max-pool each bin over both spatial dimensions
                        data = features[batch_ind]
                        outputs[roi_ind, :, ph, pw] = torch.max(
                            torch.max(data[:, hstart:hend, wstart:wend], 1)[0], 1)[0].view(-1)
        return outputs

After RoI pooling, the pooled features are fed into two fully-connected layers (each 4096-dim). One branch predicts class scores, the other predicts bounding-box offsets:
self.cls_loc = nn.Linear(4096, n_class * 4) # bbox regression
self.score = nn.Linear(4096, n_class) # classification
fc7 = self.classifier(pool)
roi_cls_locs = self.cls_loc(fc7)
roi_scores = self.score(fc7)

During inference, an input image (e.g., 375×500) is resized so its short side becomes a fixed length (e.g., 600), anchors are generated (e.g., 9 per spatial location), the RPN keeps roughly 300 proposals after NMS, RoI pooling converts each to a fixed-size feature, and the top-level network outputs the final detections.
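Note that both the RPN outputs and the head's roi_cls_locs are offsets, not absolute coordinates; during inference they are decoded against their anchors or proposals. A sketch of the standard Faster R-CNN decoding (the function name loc2bbox and the (x1, y1, x2, y2) box layout are our conventions, matching common reference implementations):

```python
import numpy as np

def loc2bbox(boxes, locs):
    """Decode (dx, dy, dw, dh) offsets against boxes in (x1, y1, x2, y2) format.

    The centre is shifted proportionally to the box size and the size is
    scaled exponentially, per the standard Faster R-CNN parameterization.
    """
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx, dy, dw, dh = locs[:, 0], locs[:, 1], locs[:, 2], locs[:, 3]
    pred_ctr_x = ctr_x + dx * widths
    pred_ctr_y = ctr_y + dy * heights
    pred_w = widths * np.exp(dw)
    pred_h = heights * np.exp(dh)

    out = np.empty_like(boxes, dtype=np.float64)
    out[:, 0] = pred_ctr_x - 0.5 * pred_w
    out[:, 1] = pred_ctr_y - 0.5 * pred_h
    out[:, 2] = pred_ctr_x + 0.5 * pred_w
    out[:, 3] = pred_ctr_y + 0.5 * pred_h
    return out
```

A zero offset returns the box unchanged, which is why training can regress small corrections rather than absolute coordinates.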
Conclusion
The article introduced major object‑detection techniques—one‑stage YOLO/SSD and two‑stage RCNN variants—and provided a detailed walkthrough of Faster RCNN’s architecture, including backbone selection, RPN design, RoI pooling, loss functions, and sample PyTorch implementations. Object detection remains a vibrant research area with applications in autonomous driving, robotics, and aerial imaging.
References
Felzenszwalb & Huttenlocher, "Efficient Graph‑Based Image Segmentation", IJCV 2004.
Girshick, "Fast R‑CNN", arXiv:1504.08083, 2015.
Ren et al., "Faster R‑CNN: Towards Real‑Time Object Detection with Region Proposal Networks", arXiv:1506.01497, 2015.
Lin et al., "Feature Pyramid Networks for Object Detection", arXiv:1612.03144, 2016.
Dalal & Triggs, "Histograms of Oriented Gradients for Human Detection", CVPR 2005.
He et al., "Mask R‑CNN", arXiv:1703.06870, 2017.
Lowe, "Object recognition from local scale‑invariant features", ICCV 1999.
Uijlings et al., "Selective Search for Object Recognition", IJCV 2013.
He et al., "Deep Residual Learning for Image Recognition", arXiv:1512.03385, 2015.
Krizhevsky et al., "ImageNet classification with deep convolutional neural networks", CACM 2012.
GitHub repositories: jwyang/faster‑rcnn.pytorch, chenyuntc/simple‑faster‑rcnn‑pytorch, facebookresearch/Detectron, rbgirshick/fast‑rcnn, rbgirshick/py‑faster‑rcnn.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.