Detailed Explanation of Fully Convolutional Networks (FCN) for Semantic Segmentation
This article provides a comprehensive, beginner‑friendly overview of semantic segmentation, focusing on the pioneering Fully Convolutional Network (FCN) architecture, its variants (FCN‑32s, FCN‑16s, FCN‑8s), underlying concepts, loss computation, and practical tips for working with the VOC dataset.
In previous posts I introduced classic classification networks (VGG, GoogLeNet, ResNet) and object detection methods (YOLO series, R‑CNN series). This article shifts focus to semantic segmentation, specifically the seminal Fully Convolutional Networks (FCN) paper.
What Is Semantic Segmentation?
Semantic segmentation assigns a class label to every pixel in an image. Compared with image classification (assigning a single label) and object detection (bounding boxes), semantic segmentation produces a fine‑grained mask that follows object boundaries, while instance segmentation further distinguishes individual object instances.
Overall FCN Architecture
The input is an RGB image fed into a feature‑extraction backbone (AlexNet in the original FCN, VGG16 in many tutorials). Fully connected layers are replaced by convolutional layers so the network can handle arbitrary image sizes. The final feature map has 21 channels (20 foreground classes + background) for the VOC dataset. This map is up‑sampled to the original image resolution, and the class with the highest score at each pixel is taken as the prediction.
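As a minimal sketch of that final prediction step, the toy NumPy example below (shapes and class values are made up for illustration) takes an h × w × 21 score map and picks the highest-scoring channel at each pixel:

```python
import numpy as np

# Hypothetical final score map for a 4x4 image and 21 VOC classes
# (in a real FCN these scores come from the up-sampled network output)
h, w, num_classes = 4, 4, 21
scores = np.zeros((h, w, num_classes))
scores[..., 0] = 1.0            # background scores highest everywhere...
scores[1:3, 1:3, 15] = 2.0      # ...except a 2x2 patch where class 15 wins

# Per-pixel arg-max over the 21 channels gives the segmentation map
pred = scores.argmax(axis=-1)
print(pred.shape)   # (4, 4)
print(pred[1, 1])   # 15
print(pred[0, 0])   # 0
```

The resulting (h, w) integer array is exactly the format of a VOC-style annotation mask, which is what makes pixel-wise comparison with the ground truth straightforward.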
FCN Variants
Three variants are described in the original paper:
FCN‑32s – up‑samples the final score map directly by a factor of 32, giving the coarsest result.
FCN‑16s – up‑samples the final score map by 2×, adds the scores from the 16×‑down‑sampled (pool4) feature map, and up‑samples the sum by 16.
FCN‑8s – additionally fuses the 8×‑down‑sampled (pool3) stream, yielding the finest spatial resolution of the three.
FCN‑32s Structure
After VGG16 down‑samples the input by 32×, the feature map size is h/32 × w/32 × 512. The three fully‑connected layers are converted to convolutions (a 7×7 convolution for FC6 and 1×1 convolutions for FC7 and the scoring layer in the original paper; many re‑implementations use 1×1 convolutions throughout), producing an h/32 × w/32 × 21 score map. A transposed convolution (commonly initialized from bilinear interpolation) up‑samples it to h × w × 21, and the arg‑max over the 21 channels yields the per‑pixel class.
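The shape arithmetic of the 32× up-sampling can be sketched in NumPy. Nearest-neighbour expansion stands in here for the learned transposed convolution, and the 224×224 input size is only illustrative:

```python
import numpy as np

h, w, stride = 224, 224, 32
# h/32 x w/32 x 21 score map, as produced by the 1x1 scoring convolution
coarse = np.random.rand(h // stride, w // stride, 21)

# Nearest-neighbour stand-in for the transposed convolution in FCN-32s:
# every coarse cell is expanded into a stride x stride block
up = coarse.repeat(stride, axis=0).repeat(stride, axis=1)
print(up.shape)  # (224, 224, 21)
```

A real transposed convolution learns its interpolation weights, but the shape bookkeeping (7×7 score map back to 224×224) is the same.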
FCN‑16s Structure
The pool4 feature map ( h/16 × w/16 × 512 ) is passed through a 1×1 convolution to obtain h/16 × w/16 × 21 scores. The output of the 32× stream is up‑sampled by 2× so that it is also h/16 × w/16 × 21 , the two are added element‑wise, and the sum is up‑sampled by a factor of 16 to obtain the final prediction.
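A 1×1 convolution is just a per-pixel matrix multiplication over the channel dimension, so the skip fusion can be sketched as follows (random arrays stand in for real features and learned weights):

```python
import numpy as np

h16, w16 = 14, 14
pool4 = np.random.rand(h16, w16, 512)      # h/16 x w/16 x 512 feature map
W = np.random.rand(512, 21) * 0.01         # weights of the 1x1 scoring conv

# 1x1 convolution == matrix multiply over the channel axis
score16 = pool4 @ W                        # -> (14, 14, 21)

# 2x up-sampled FCN-32s scores, already on the same h/16 x w/16 grid
score32_up = np.random.rand(h16, w16, 21)

fused = score16 + score32_up               # element-wise skip fusion
print(fused.shape)  # (14, 14, 21)
```

Because both streams carry 21-channel class scores on the same grid, the fusion is a plain element-wise sum, exactly as in the paper's skip connections.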
FCN‑8s Structure
In addition to the 32× and 16× streams, the pool3 feature map ( h/8 × w/8 × 256 ) is transformed to h/8 × w/8 × 21 by a 1×1 convolution. The fused 32×/16× scores are up‑sampled by 2×, added to this stream, and the result is up‑sampled by a factor of 8, delivering the highest‑resolution segmentation of the three variants.
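Putting the three streams together, the whole FCN-8s fusion can be traced at the shape level. Nearest-neighbour up-sampling again stands in for the learned transposed convolutions, and all sizes are illustrative:

```python
import numpy as np

def up2(x):
    # 2x nearest-neighbour up-sampling, a stand-in for the learned
    # transposed convolution used in the paper
    return x.repeat(2, axis=0).repeat(2, axis=1)

h, w = 224, 224
score32 = np.random.rand(h // 32, w // 32, 21)  # from the converted FC layers
score16 = np.random.rand(h // 16, w // 16, 21)  # 1x1 conv on pool4
score8  = np.random.rand(h // 8,  w // 8,  21)  # 1x1 conv on pool3

# Pairwise skip fusion: (32x stream up to 1/16 + pool4), up to 1/8, + pool3
fused = up2(up2(score32) + score16) + score8     # (28, 28, 21)

# Final 8x up-sampling back to the input resolution
final = fused.repeat(8, axis=0).repeat(8, axis=1)
print(final.shape)  # (224, 224, 21)
```

Note that the fusion is pairwise and sequential, not a single three-way sum at one resolution: each addition happens on the grid of the finer stream.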
Loss Computation
The ground‑truth (GT) segmentation mask is a single‑channel P‑mode image of size h × w , in which each pixel stores a class index. The network output is h × w × 21 . A pixel‑wise cross‑entropy loss is computed between the softmax of the output and the GT label, encouraging the predicted distribution to match the true class at each pixel. (In VOC, object‑boundary pixels are marked with index 255 and are usually excluded from the loss.)
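A minimal NumPy sketch of the per-pixel cross-entropy follows; the logits and labels are made up for illustration, and in practice a framework loss (softmax cross-entropy) would be used instead:

```python
import numpy as np

h, w, num_classes = 2, 2, 21
logits = np.zeros((h, w, num_classes))   # network scores (here: all equal)
logits[0, 0, 5] = 10.0                   # pixel (0,0) strongly predicts class 5
target = np.array([[5, 0], [15, 0]])     # GT class index per pixel

# Numerically stable log-softmax over the class channel
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Pick the log-probability of the true class at every pixel, then average
picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], target]
loss = -picked.mean()
print(loss > 0)  # True: cross-entropy is non-negative
```

The confident pixel contributes almost zero loss while the three uniform pixels each contribute log 21, so the mean sits below log 21, which is the loss of an entirely uninformative prediction.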
VOC Dataset Annotation Details (Appendix)
The VOC2012 SegmentationClass folder contains PNG files in P‑mode (palette) format. Each pixel stores an index (0‑255) that maps to a color in the palette, representing a class label. The corresponding RGB images reside in VOC2012/JPEGImages . Example code to load and compare the two formats is provided, demonstrating that the P‑mode image is single‑channel while the RGB image has three channels.
```python
import matplotlib.pyplot as plt
from PIL import Image

img2 = Image.open('.../JPEGImages/2007_000032.jpg')
img3 = Image.open('.../SegmentationClass/2007_000032.png')
plt.imshow(img2)
plt.imshow(img3)   # drawn into the same axes, replacing the first image
plt.show()
print('image2:', img2.mode)  # RGB
print('image3:', img3.mode)  # P
```

Converting the images to NumPy arrays shows shapes (h, w, 3) for RGB and (h, w) for P‑mode, confirming the single‑channel nature of the annotation.
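Because the P-mode array already holds class indices, inspecting the labels is plain array work. The tiny mask below is a synthetic stand-in for `np.array(Image.open(...))` on a real VOC annotation; index 255 marks VOC's void/boundary pixels:

```python
import numpy as np

# Synthetic stand-in for a VOC SegmentationClass annotation: each entry
# is a palette index, i.e. a class label (255 = void/boundary in VOC)
mask = np.array([[0, 15, 15],
                 [0, 255, 15]], dtype=np.uint8)

# Which classes appear in this image, ignoring the void label
classes = np.unique(mask[mask != 255])
print(mask.shape)  # (2, 3) -- single channel, no color axis
print(classes)     # [ 0 15] -- background and class 15
```

The palette only affects how viewers render the PNG; the values the loss function sees are these raw indices.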
Conclusion
The theoretical part of FCN is now covered. The next article will dive into the actual code implementation, especially the cross‑entropy loss function, to solidify understanding of FCN training.
References
Long, J., Shelhamer, E., Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
FCN network structure detailed (semantic segmentation)