
DeepLabV2: Architecture, Improvements, and Experimental Results

This article introduces DeepLabV2: the challenges it addresses, its architectural enhancements (the ASPP module, a ResNet backbone, and the poly learning-rate policy), and experimental comparisons on several benchmark datasets, offering a concise overview for computer-vision practitioners.

Rare Earth Juejin Tech Community

Hello everyone, I am Xiao Su.

In the previous post we covered the principles of DeepLabV1; this article continues with its sibling, DeepLabV2, whose improvements are straightforward to follow once DeepLabV1 is understood.

Challenges in Semantic Segmentation

The DeepLabV2 paper highlights three main challenges for DCNNs applied to semantic segmentation:

Reduced feature resolution.

Objects appearing at multiple scales.

Decreased localization accuracy due to the built-in spatial invariance of DCNNs.

Advantages of DeepLabV2

Faster inference: using atrous (dilated) convolution, the dense DCNN runs at 8 fps on an NVIDIA Titan X GPU.

Higher accuracy: state‑of‑the‑art results on PASCAL VOC 2012, PASCAL‑Context, PASCAL‑Person‑Part, and Cityscapes.

Simpler model: the system is a cascade of two well-designed modules, a DCNN and a fully connected CRF.

Key Modifications from DeepLabV1 to DeepLabV2

Added an ASPP (atrous spatial pyramid pooling) multi‑scale structure.

Changed the backbone network to ResNet.

Introduced a poly learning‑rate schedule.

DeepLabV2 Network Structure

ASPP Module

The ASPP (atrous spatial pyramid pooling) module applies four parallel branches to the backbone's output feature map, each a 3×3 dilated convolution with a different dilation rate: 6, 12, 18, 24 for the large-field-of-view variant (ASPP-L), or 2, 4, 8, 12 for the smaller ASPP-S variant. The branches capture receptive fields at multiple scales, and their outputs are fused by summation.
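The parallel-branch structure can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact implementation; the class name and channel sizes are my own choices:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Sketch of a DeepLabV2-style ASPP head: parallel 3x3 dilated
    convolutions whose outputs are summed. rates=(6, 12, 18, 24)
    corresponds to ASPP-L; (2, 4, 8, 12) to ASPP-S."""

    def __init__(self, in_channels, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        # padding = dilation rate keeps every branch's spatial size equal.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, num_classes, kernel_size=3,
                      padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        # Fuse the parallel branches by element-wise summation.
        return sum(branch(x) for branch in self.branches)

# Score map for the 21 PASCAL VOC classes from a 2048-channel feature map.
scores = ASPP(2048, 21)(torch.randn(1, 2048, 33, 33))
print(scores.shape)  # torch.Size([1, 21, 33, 33])
```

Because each branch outputs class scores directly, the sum is itself the final score map; no extra classifier layer is needed after the fusion.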

When using a ResNet backbone, the ASPP structure stays the same; the deeper residual stages perform no further spatial down-sampling and instead use dilated convolutions, keeping the overall down-sampling factor at 8× while preserving the receptive field.

Backbone Modification

DeepLabV2 adopts ResNet (e.g., ResNet-101) as its backbone. Up to Layer 2 the architecture matches the original ResNet; after that, the network performs no further spatial down-sampling and replaces the standard convolutions with dilated ones, so the overall down-sampling factor stays at 8×.

The final feature map is fed into the ASPP module, where each dilated convolution has num_classes output filters, so every branch directly produces class scores.

Poly Learning‑Rate Policy

The poly schedule updates the learning rate according to the formula:

lr = base_lr * (1 - step / max_step) ^ power

where power is typically set to 0.9. This schedule yields smoother convergence, as shown by the experimental results.
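The schedule is a one-line function; the base learning rate below is an illustrative value, not necessarily the one used in the paper:

```python
def poly_lr(base_lr, step, max_step, power=0.9):
    """Learning rate at a given iteration under the poly schedule."""
    return base_lr * (1 - step / max_step) ** power

# The rate decays smoothly from base_lr to 0 over max_step iterations.
for step in (0, 10_000, 20_000):
    print(f"step {step:>6}: lr = {poly_lr(2.5e-4, step, 20_000):.6e}")
```

With power = 0.9 the decay is slightly slower than linear for most of training, then drops quickly near the end, which is the behavior the paper credits for the smoother convergence.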

Experimental Comparison

DeepLabV2 was evaluated on four benchmark datasets: PASCAL VOC 2012, PASCAL‑Context, PASCAL‑Person‑Part, and Cityscapes. The paper reports state‑of‑the‑art performance on all of them, with visual results illustrating the improvement over previous methods.

[Result figures: PASCAL VOC 2012 semantic segmentation benchmark, PASCAL-Context, PASCAL-Person-Part, and Cityscapes]

Conclusion

DeepLabV2 builds upon DeepLabV1 with a more powerful ASPP module, a ResNet backbone, and a poly learning‑rate schedule, achieving faster inference and higher accuracy on multiple segmentation benchmarks. The next article will cover DeepLabV3.

Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
