Recent Advances on Object Detection: R‑FCN, Deformable ConvNets, and Video Object Detection

This article summarizes Jifeng Dai's 2018 AI Pioneer talk on recent object‑detection breakthroughs. It covers R‑FCN and its extensions, Deformable ConvNets, and video object detection, and closes with remarks on the remaining challenges in large‑scale and mobile vision.

DataFunTalk

Oct 10, 2018

This article, compiled by the DataFun community, is based on Lead Researcher Jifeng Dai's presentation at the 2018 AI Pioneer Conference titled "Recent Advances on Object Detection in MSRA".

The talk is organized into four sections: (1) R‑FCN and its extensions, (2) Deformable ConvNets and their extensions, (3) video object detection work, and (4) a concise summary.

1. R‑FCN and its extensions

R‑FCN (Region‑based Fully Convolutional Networks) achieves fast and accurate object detection by running a fully convolutional network over the whole image once and producing score maps that all ROIs share. The per‑ROI computation becomes negligible, and the network can be trained end‑to‑end, reaching accuracy comparable to Faster R‑CNN while being much faster.

The paper explains why earlier designs inserted ROI pooling between convolutional layers: image classification favors translation invariance, whereas detection needs translation variance, and the convolutional layers placed after ROI pooling restore that variance at the cost of running them once per ROI. R‑FCN instead introduces position‑sensitive score maps, which encode relative spatial position for each object class.

R‑FCN uses a position‑sensitive ROI pooling layer in which each spatial bin of an ROI pools from its own group of channels. Averaging the bins then yields compact per‑class scores without any additional learned layers per ROI.
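To make the channel selection concrete, here is a minimal sketch of position‑sensitive ROI pooling in plain Python (nested lists, no dependencies). It is an illustration, not the authors' implementation: the function name `ps_roi_pool`, the channel layout, and the integer bin boundaries are all assumptions made for the example.

```python
def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive ROI pooling followed by average voting over bins.

    score_maps:  nested list indexed [channel][y][x] with k*k*num_classes
                 channels. ASSUMED layout for this sketch:
                 channel = (row * k + col) * num_classes + cls.
    roi:         (x0, y0, x1, y1) in feature-map cells, end-exclusive.
    num_classes: number of categories, including background.
    Returns one score per class.
    """
    x0, y0, x1, y1 = roi
    w, h = x1 - x0, y1 - y0
    if w < k or h < k:
        raise ValueError("this sketch needs an ROI of at least k x k cells")
    scores = [0.0] * num_classes
    for row in range(k):
        ys, ye = y0 + row * h // k, y0 + (row + 1) * h // k
        for col in range(k):
            xs, xe = x0 + col * w // k, x0 + (col + 1) * w // k
            for cls in range(num_classes):
                # Each bin reads only from its own channel group.
                ch = (row * k + col) * num_classes + cls
                cells = [score_maps[ch][y][x]
                         for y in range(ys, ye) for x in range(xs, xe)]
                scores[cls] += sum(cells) / len(cells)
    # Voting: average the k*k bin responses for each class.
    return [s / (k * k) for s in scores]


# Toy input: 2x2 bins, 2 classes, 8 constant-valued 4x4 score maps.
k, num_classes = 2, 2
maps = [[[float(ch)] * 4 for _ in range(4)]
        for ch in range(k * k * num_classes)]
print(ps_roi_pool(maps, (0, 0, 4, 4), k, num_classes))  # [3.0, 4.0]
```

The only per‑ROI work is indexing and averaging, which is why the cost per ROI stays small once the shared score maps have been computed.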

Extensions include Light‑head R‑CNN, which further reduces per‑ROI computation, and R‑FCN‑3000, which runs at 30 fps by decoupling classification from localization so that detection scales to a very large number of classes.

2. Deformable ConvNets and their extensions

Deformable ConvNets model geometric transformations directly within the convolution and ROI‑pooling modules, learning 2‑D offsets for each sampling location without additional supervision. This enhances the network's ability to handle complex shape variations.

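The design adds an extra convolution layer that predicts an x and a y offset for every kernel sampling location at each output position (18 channels for a 3×3 kernel, i.e. 2 × 9). These offsets shift the sampling locations of the regular convolution grid. Because the shifted locations are generally fractional, the input is read with bilinear interpolation. The result is a deformable convolution.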
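The sampling step can be sketched for a single output position as follows. This is a dependency‑free illustration under stated assumptions: the offsets are passed in rather than predicted by a convolution layer, the names `bilinear` and `deform_conv3x3_at` are hypothetical, and the 18 offsets are assumed to be stored as (dy, dx) pairs in row‑major kernel order.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly sample a 2-D nested list at fractional (y, x).

    Positions outside the map contribute 0 (an assumption of this sketch).
    """
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                val += weight * feat[yy][xx]
    return val


def deform_conv3x3_at(feat, kernel, offsets, cy, cx):
    """One output value of a 3x3 deformable convolution centered at (cy, cx).

    kernel:  3x3 nested list of weights.
    offsets: 18 numbers, one (dy, dx) pair per kernel tap, row-major
             (ASSUMED layout). In a real network these would come from
             the extra offset-prediction convolution.
    """
    out = 0.0
    for n in range(9):
        ky, kx = divmod(n, 3)
        dy, dx = offsets[2 * n], offsets[2 * n + 1]
        # Regular grid location (cy + ky - 1, cx + kx - 1) plus learned offset.
        out += kernel[ky][kx] * bilinear(feat, cy + ky - 1 + dy, cx + kx - 1 + dx)
    return out
```

With all offsets set to zero the function reduces to an ordinary 3×3 convolution at that position, which is a convenient sanity check. Bilinear interpolation is also what keeps the output differentiable with respect to the offsets.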

Similarly, Deformable ROI Pooling adds learned offsets to each ROI bin, allowing adaptive spatial sampling while keeping the same input‑output interface as regular ROI pooling.
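A rough sketch of deformable ROI pooling on a single‑channel feature map is given below. Several details are assumptions made for illustration: the offsets are supplied by the caller instead of being predicted from the pooled features, they are expressed as fractions of the ROI size and scaled by a factor `gamma` (defaulting to 0.1 here), and each bin is averaged over a small fixed grid of bilinearly interpolated samples. The names `deform_roi_pool` and `bilinear` are hypothetical.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly sample a 2-D nested list; outside the map counts as 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                val += (1 - abs(y - yy)) * (1 - abs(x - xx)) * feat[yy][xx]
    return val


def deform_roi_pool(feat, roi, k, norm_offsets, gamma=0.1, samples=2):
    """Average-pool an ROI into k x k bins, shifting each bin by its offset.

    roi:          (x0, y0, x1, y1) in feature-map cells, end-exclusive.
    norm_offsets: k x k nested list of (dy, dx), as fractions of ROI size
                  (ASSUMED convention); all zeros gives regular pooling.
    Returns a k x k nested list, the same shape regular ROI pooling returns.
    """
    x0, y0, x1, y1 = roi
    w, h = x1 - x0, y1 - y0
    bin_w, bin_h = w / k, h / k
    out = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            dy = gamma * norm_offsets[i][j][0] * h
            dx = gamma * norm_offsets[i][j][1] * w
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Cell (y, x) is centered at (y + 0.5, x + 0.5), hence -0.5.
                    y = y0 + (i + (sy + 0.5) / samples) * bin_h + dy - 0.5
                    x = x0 + (j + (sx + 0.5) / samples) * bin_w + dx - 0.5
                    total += bilinear(feat, y, x)
            out[i][j] = total / (samples * samples)
    return out
```

Because the output is still a k × k grid for each ROI, a caller that expects regular ROI pooling does not need to change. Only the sampling positions inside the layer move.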

These modules improve modeling capacity for free‑form deformations and can be trained end‑to‑end with gradient flow through the offset prediction.

Visualizations show how deformable convolutions adapt sampling locations on foreground versus background regions, demonstrating the network's ability to allocate larger receptive fields where needed.

3. Video Object Detection

The team extended these ideas to video object detection, achieving top performance in the ImageNet VID 2017 challenge and developing high‑performance models suitable for mobile and embedded devices.

These video models focus on low computational cost while maintaining high accuracy, addressing challenges such as large appearance variations and real‑time inference on mobile platforms.

4. Summary

General object detection remains an open, unsolved problem with significant challenges in handling large appearance variations, mobile deployment, full‑scene understanding, and pixel‑level recognition. Despite progress, practical applications such as robust detection of pedestrians and vehicles still have considerable room for improvement.

Author Introduction: Jifeng Dai is a Lead Researcher in the Visual Computing Group at Microsoft Research Asia. His research focuses on deep learning for high‑level vision, especially instance recognition. He is the first author of R‑FCN and Deformable ConvNets and won the COCO challenge in 2015 and 2016.
