Detailed Explanation of Fully Convolutional Networks (FCN) for Semantic Segmentation
This article provides a comprehensive, beginner‑friendly overview of semantic segmentation, focusing on the pioneering Fully Convolutional Network (FCN) architecture, its variants (FCN‑32s, FCN‑16s, FCN‑8s), underlying concepts, loss computation, and practical tips for working with the VOC dataset.
In previous posts I introduced classic classification networks (VGG, GoogLeNet, ResNet) and object detection methods (YOLO series, R‑CNN series). This article shifts focus to semantic segmentation, specifically the seminal Fully Convolutional Networks (FCN) paper.
What Is Semantic Segmentation?
Semantic segmentation assigns a class label to every pixel in an image. Compared with image classification (assigning a single label) and object detection (bounding boxes), semantic segmentation produces a fine‑grained mask that follows object boundaries, while instance segmentation further distinguishes individual object instances.
Overall FCN Architecture
The input is an RGB image fed into a feature‑extraction backbone (AlexNet in the original FCN, VGG16 in many tutorials). Fully connected layers are replaced by convolutional layers so the network can handle arbitrary image sizes. The final feature map has 21 channels (20 foreground classes + background) for the VOC dataset. This map is up‑sampled to the original image resolution, and the class with the highest score at each pixel is taken as the prediction.
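As a minimal sketch of that final prediction step, the toy NumPy example below (shapes and class values are made up for illustration) takes an h × w × 21 score map and picks the highest-scoring channel at each pixel:

```python
import numpy as np

# Hypothetical final score map for a 4x4 image and 21 VOC classes
# (in a real FCN these scores come from the up-sampled network output)
h, w, num_classes = 4, 4, 21
scores = np.zeros((h, w, num_classes))
scores[..., 0] = 1.0            # background scores highest everywhere...
scores[1:3, 1:3, 15] = 2.0      # ...except a 2x2 patch where class 15 wins

# Per-pixel arg-max over the 21 channels gives the segmentation map
pred = scores.argmax(axis=-1)
print(pred.shape)   # (4, 4)
print(pred[1, 1])   # 15
print(pred[0, 0])   # 0
```

The resulting (h, w) integer array is exactly the format of a VOC-style annotation mask, which is what makes pixel-wise comparison with the ground truth straightforward.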
FCN Variants
Three variants are described in the original paper:
FCN‑32s – up‑samples the final score map directly by a factor of 32, giving the coarsest result.
FCN‑16s – up‑samples the final score map by 2×, adds the scores from the 16×‑down‑sampled (pool4) feature map, and up‑samples the sum by 16.
FCN‑8s – additionally fuses the 8×‑down‑sampled (pool3) stream, yielding the finest spatial resolution of the three.
FCN‑32s Structure
After VGG16 down‑samples the input by 32×, the feature map size is h/32 × w/32 × 512. The three fully‑connected layers are converted to convolutions (a 7×7 convolution for FC6 and 1×1 convolutions for FC7 and the scoring layer in the original paper; many re‑implementations use 1×1 convolutions throughout), producing an h/32 × w/32 × 21 score map. A transposed convolution (commonly initialized from bilinear interpolation) up‑samples it to h × w × 21, and the arg‑max over the 21 channels yields the per‑pixel class.
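The shape arithmetic of the 32× up-sampling can be sketched in NumPy. Nearest-neighbour expansion stands in here for the learned transposed convolution, and the 224×224 input size is only illustrative:

```python
import numpy as np

h, w, stride = 224, 224, 32
# h/32 x w/32 x 21 score map, as produced by the 1x1 scoring convolution
coarse = np.random.rand(h // stride, w // stride, 21)

# Nearest-neighbour stand-in for the transposed convolution in FCN-32s:
# every coarse cell is expanded into a stride x stride block
up = coarse.repeat(stride, axis=0).repeat(stride, axis=1)
print(up.shape)  # (224, 224, 21)
```

A real transposed convolution learns its interpolation weights, but the shape bookkeeping (7×7 score map back to 224×224) is the same.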
FCN‑16s Structure
The pool4 feature map ( h/16 × w/16 × 512 ) is passed through a 1×1 convolution to obtain h/16 × w/16 × 21 scores. The output of the 32× stream is up‑sampled by 2× so that it is also h/16 × w/16 × 21 , the two are added element‑wise, and the sum is up‑sampled by a factor of 16 to obtain the final prediction.
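A 1×1 convolution is just a per-pixel matrix multiplication over the channel dimension, so the skip fusion can be sketched as follows (random arrays stand in for real features and learned weights):

```python
import numpy as np

h16, w16 = 14, 14
pool4 = np.random.rand(h16, w16, 512)      # h/16 x w/16 x 512 feature map
W = np.random.rand(512, 21) * 0.01         # weights of the 1x1 scoring conv

# 1x1 convolution == matrix multiply over the channel axis
score16 = pool4 @ W                        # -> (14, 14, 21)

# 2x up-sampled FCN-32s scores, already on the same h/16 x w/16 grid
score32_up = np.random.rand(h16, w16, 21)

fused = score16 + score32_up               # element-wise skip fusion
print(fused.shape)  # (14, 14, 21)
```

Because both streams carry 21-channel class scores on the same grid, the fusion is a plain element-wise sum, exactly as in the paper's skip connections.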
FCN‑8s Structure
In addition to the 32× and 16× streams, the pool3 feature map ( h/8 × w/8 × 256 ) is transformed to h/8 × w/8 × 21 by a 1×1 convolution. The fused 32×/16× scores are up‑sampled by 2×, added to this stream, and the result is up‑sampled by a factor of 8, delivering the highest‑resolution segmentation of the three variants.
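Putting the three streams together, the whole FCN-8s fusion can be traced at the shape level. Nearest-neighbour up-sampling again stands in for the learned transposed convolutions, and all sizes are illustrative:

```python
import numpy as np

def up2(x):
    # 2x nearest-neighbour up-sampling, a stand-in for the learned
    # transposed convolution used in the paper
    return x.repeat(2, axis=0).repeat(2, axis=1)

h, w = 224, 224
score32 = np.random.rand(h // 32, w // 32, 21)  # from the converted FC layers
score16 = np.random.rand(h // 16, w // 16, 21)  # 1x1 conv on pool4
score8  = np.random.rand(h // 8,  w // 8,  21)  # 1x1 conv on pool3

# Pairwise skip fusion: (32x stream up to 1/16 + pool4), up to 1/8, + pool3
fused = up2(up2(score32) + score16) + score8     # (28, 28, 21)

# Final 8x up-sampling back to the input resolution
final = fused.repeat(8, axis=0).repeat(8, axis=1)
print(final.shape)  # (224, 224, 21)
```

Note that the fusion is pairwise and sequential, not a single three-way sum at one resolution: each addition happens on the grid of the finer stream.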
Loss Computation
The ground‑truth (GT) segmentation mask is a single‑channel P‑mode image of size h × w , in which each pixel stores a class index. The network output is h × w × 21 . A pixel‑wise cross‑entropy loss is computed between the softmax of the output and the GT label, encouraging the predicted distribution to match the true class at each pixel. (In VOC, object‑boundary pixels are marked with index 255 and are usually excluded from the loss.)
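A minimal NumPy sketch of the per-pixel cross-entropy follows; the logits and labels are made up for illustration, and in practice a framework loss (softmax cross-entropy) would be used instead:

```python
import numpy as np

h, w, num_classes = 2, 2, 21
logits = np.zeros((h, w, num_classes))   # network scores (here: all equal)
logits[0, 0, 5] = 10.0                   # pixel (0,0) strongly predicts class 5
target = np.array([[5, 0], [15, 0]])     # GT class index per pixel

# Numerically stable log-softmax over the class channel
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Pick the log-probability of the true class at every pixel, then average
picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], target]
loss = -picked.mean()
print(loss > 0)  # True: cross-entropy is non-negative
```

The confident pixel contributes almost zero loss while the three uniform pixels each contribute log 21, so the mean sits below log 21, which is the loss of an entirely uninformative prediction.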
VOC Dataset Annotation Details (Appendix)
The VOC2012 SegmentationClass folder contains PNG files in P‑mode (palette) format. Each pixel stores an index (0‑255) that maps to a color in the palette, representing a class label. The corresponding RGB images reside in VOC2012/JPEGImages . Example code to load and compare the two formats is provided, demonstrating that the P‑mode image is single‑channel while the RGB image has three channels.
```python
import matplotlib.pyplot as plt
from PIL import Image

img2 = Image.open('.../JPEGImages/2007_000032.jpg')
img3 = Image.open('.../SegmentationClass/2007_000032.png')
plt.imshow(img2)
plt.imshow(img3)   # drawn into the same axes, replacing the first image
plt.show()
print('image2:', img2.mode)  # RGB
print('image3:', img3.mode)  # P
```

Converting the images to NumPy arrays shows shapes (h, w, 3) for RGB and (h, w) for P‑mode, confirming the single‑channel nature of the annotation.
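Because the P-mode array already holds class indices, inspecting the labels is plain array work. The tiny mask below is a synthetic stand-in for `np.array(Image.open(...))` on a real VOC annotation; index 255 marks VOC's void/boundary pixels:

```python
import numpy as np

# Synthetic stand-in for a VOC SegmentationClass annotation: each entry
# is a palette index, i.e. a class label (255 = void/boundary in VOC)
mask = np.array([[0, 15, 15],
                 [0, 255, 15]], dtype=np.uint8)

# Which classes appear in this image, ignoring the void label
classes = np.unique(mask[mask != 255])
print(mask.shape)  # (2, 3) -- single channel, no color axis
print(classes)     # [ 0 15] -- background and class 15
```

The palette only affects how viewers render the PNG; the values the loss function sees are these raw indices.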
Conclusion
The theoretical part of FCN is now covered. The next article will dive into the actual code implementation, especially the cross‑entropy loss function, to solidify understanding of FCN training.
References
Long, J., Shelhamer, E., Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
FCN network structure detailed (semantic segmentation)