Why DropBlock Outperforms Dropout as an Image Regularizer

This article demonstrates how to implement DropBlock in PyTorch, explains why Dropout fails on image data, details the gamma calculation and mask generation, and shows visual comparisons that illustrate the superiority of contiguous region dropping over random pixel dropout.

Code DAO

Introduction

Ghiasi et al. introduced DropBlock in their NeurIPS 2018 paper "DropBlock: A regularization method for convolutional networks" as a regularization technique designed specifically for images, and their empirical results show it outperforms standard Dropout on image tasks.

Problems with Dropout on Images

Dropout zeroes individual input elements independently, each with probability p, and rescales the survivors by 1/(1 − p) before passing them to the next layer. While easy to use via nn.Dropout() in PyTorch, dropping isolated units is ineffective on 2‑D feature maps: nearby activations are spatially correlated, so the network can still recover the missing semantic information from a pixel's neighbors.

import torch
import matplotlib.pyplot as plt
from torch import nn

# keeping one channel for better visualisation
x = torch.ones((1, 1, 16, 16))
drop = nn.Dropout()
x_drop = drop(x)

to_plot = lambda x: x.squeeze(0).permute(1,2,0).numpy()
fig, axs = plt.subplots(1, 2)
axs[0].imshow(to_plot(x), cmap='gray')
axs[1].imshow(to_plot(x_drop), cmap='gray')

The figure shows random pixels being dropped, which does not remove semantic information effectively.
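Numerically (a minimal sketch using the same all-ones input), nn.Dropout in training mode zeroes each element independently with probability p and rescales the survivors by 1/(1 − p), so the expected activation is unchanged:

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.ones(1, 1, 16, 16)   # 256 activations, all 1.0
drop = nn.Dropout()            # default p = 0.5

x_drop = drop(x)
zero_frac = (x_drop == 0).float().mean().item()

# Roughly half the elements are zeroed; every survivor becomes
# 1 / (1 - 0.5) = 2.0, which preserves the expected activation.
print(f"fraction zeroed: {zero_frac:.2f}")
```

The rescaling is why the dropped image still looks uniformly bright in the surviving pixels: each one carries twice the original value.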

Visualizing Dropout on Feature Maps

To see how Dropout affects feature maps, the author loads a Baby Yoda image, passes it through a pretrained ResNet‑18, extracts the third‑layer feature map, and applies Dropout followed by ReLU.

import requests
from glasses.models import AutoModel, AutoTransform
from PIL import Image
from io import BytesIO

# get an image of baby yoda
r = requests.get('https://upload.wikimedia.org/wikipedia/en/0/00/The_Child_aka_Baby_Yoda_%28Star_Wars%29.jpg')
img = Image.open(BytesIO(r.content))
x = AutoTransform.from_name('resnet18')(img)  # transform to model input
model = AutoModel.from_pretrained('resnet18').eval()
with torch.no_grad():
    model.encoder.features  # in glasses, accessing .features enables feature collection
    model(x.unsqueeze(0))
    features = model.encoder.features  # per-stage feature maps stored during the forward pass
f = features[2]  # third layer output [1,128,28,28]

f_drop = nn.Sequential(nn.Dropout(), nn.ReLU())(f)
f_l = nn.ReLU()(f)[:,0,:,:]
f_drop_l = f_drop[:,0,:,:]

fig, axs = plt.subplots(1, 2)
axs[0].imshow(f_l.squeeze().numpy())
axs[1].imshow(f_drop_l.squeeze().numpy())

The left panel shows the original activations; the right panel shows activations after Dropout. They remain very similar, indicating that zeroing isolated pixels does not significantly disrupt information flow.

DropBlock Mechanism

DropBlock addresses Dropout's limitation by removing contiguous regions from the feature map instead of isolated units. Because neighboring activations are correlated, an entire region must be zeroed before a feature is actually hidden from the next layer; the principle is illustrated in Figure 1 of the original paper.

Implementation Details

First, define a DropBlock layer with the required parameters.

from torch import nn
import torch
from torch import Tensor

class DropBlock(nn.Module):
    def __init__(self, block_size: int, p: float = 0.5):
        super().__init__()
        self.block_size = block_size
        self.p = p

    def calculate_gamma(self, x: Tensor) -> float:
        """Compute gamma, eq (1) in the paper"""
        invalid = (1 - self.p) / (self.block_size ** 2)
        valid = (x.shape[-1] ** 2) / ((x.shape[-1] - self.block_size + 1) ** 2)
        return invalid * valid

Here block_size is the side length of the square region to drop, and p is the keep probability, i.e. the target fraction of activations to retain. Note that this is the opposite convention from nn.Dropout, where p is the probability of dropping a unit. In this implementation block_size should also be odd, since the padded max‑pool used later only preserves the spatial size for odd kernel sizes.
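To make eq. (1) concrete, here is the same computation as a standalone function, evaluated with hypothetical values (a 28×28 feature map, 7×7 blocks, keep probability 0.9) chosen purely for illustration:

```python
def calculate_gamma(p: float, block_size: int, feat_size: int) -> float:
    # eq (1): scale the drop rate by the block area, then correct for the
    # reduced number of valid seed positions near the border
    invalid = (1 - p) / (block_size ** 2)
    valid = (feat_size ** 2) / ((feat_size - block_size + 1) ** 2)
    return invalid * valid

# Hypothetical values: 28x28 map, 7x7 blocks, keep probability 0.9
gamma = calculate_gamma(0.9, 7, 28)
print(f"{gamma:.4f}")  # a small per-position seed probability (~0.0033)
```

Even though roughly 10% of activations end up dropped, each position is seeded with a far smaller probability, because every seed later expands into a 7×7 block.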

The next step is to sample a mask of the same size as the input from a Bernoulli distribution with the computed gamma.

gamma = self.calculate_gamma(x)
mask = torch.bernoulli(torch.ones_like(x) * gamma)

To turn the binary mask into contiguous blocks, a max‑pooling operation with kernel size equal to block_size and stride 1 is applied. The pooled result is inverted to obtain the final block mask.

mask_block = 1 - F.max_pool2d(
    mask,
    kernel_size=(self.block_size, self.block_size),
    stride=(1, 1),
    padding=(self.block_size // 2, self.block_size // 2),
)
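The effect of this max‑pool trick can be seen on a toy mask with a single seed position (a minimal standalone sketch, not part of the original article):

```python
import torch
import torch.nn.functional as F

block_size = 3
mask = torch.zeros(1, 1, 8, 8)
mask[0, 0, 4, 4] = 1.0  # a single Bernoulli "seed"

# Max-pooling with a 3x3 window spreads the seed into a 3x3 block of ones;
# inverting the result then zeroes that whole contiguous region.
mask_block = 1 - F.max_pool2d(
    mask, kernel_size=block_size, stride=1, padding=block_size // 2
)
print(mask_block[0, 0])        # a 3x3 patch of zeros centred at (4, 4)
print(int(mask_block.sum()))   # 64 - 9 = 55 surviving positions
```

Each seed sampled with probability gamma becomes a block_size × block_size hole, which is exactly why gamma is so much smaller than the overall drop rate.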

The regularized output is then computed by scaling the input with the mask and a normalization factor.

x = mask_block * x * (mask_block.numel() / mask_block.sum())
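The scaling factor compensates for the dropped units so the expected magnitude of the output matches the input. A small sketch, using a hand-placed 2×2 block rather than a sampled one:

```python
import torch

x = torch.ones(1, 1, 4, 4)        # 16 activations, all 1.0
mask_block = torch.ones_like(x)
mask_block[0, 0, 1:3, 1:3] = 0    # drop a 2x2 block (4 units)

out = mask_block * x * (mask_block.numel() / mask_block.sum())

# The 12 survivors are scaled by 16/12, so the total activation is preserved:
print(out.sum().item())  # ~16.0, matching x.sum()
```

This mirrors the 1/(1 − p) rescaling that nn.Dropout applies, but computed from the actual number of surviving positions in the sampled mask.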

The complete forward method integrates these steps.

import torch.nn.functional as F

class DropBlock(nn.Module):
    def __init__(self, block_size: int, p: float = 0.5):
        super().__init__()
        self.block_size = block_size
        self.p = p

    def calculate_gamma(self, x: Tensor) -> float:
        invalid = (1 - self.p) / (self.block_size ** 2)
        valid = (x.shape[-1] ** 2) / ((x.shape[-1] - self.block_size + 1) ** 2)
        return invalid * valid

    def forward(self, x: Tensor) -> Tensor:
        if self.training:
            gamma = self.calculate_gamma(x)
            mask = torch.bernoulli(torch.ones_like(x) * gamma)
            mask_block = 1 - F.max_pool2d(
                mask,
                kernel_size=(self.block_size, self.block_size),
                stride=(1, 1),
                padding=(self.block_size // 2, self.block_size // 2),
            )
            x = mask_block * x * (mask_block.numel() / mask_block.sum())
        return x
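As a quick sanity check, the finished layer can be exercised in both modes (the class definition is repeated here so the snippet runs standalone):

```python
import torch
import torch.nn.functional as F
from torch import nn, Tensor

class DropBlock(nn.Module):
    """Same layer as above, repeated so this snippet is self-contained."""
    def __init__(self, block_size: int, p: float = 0.5):
        super().__init__()
        self.block_size = block_size
        self.p = p

    def calculate_gamma(self, x: Tensor) -> float:
        invalid = (1 - self.p) / (self.block_size ** 2)
        valid = (x.shape[-1] ** 2) / ((x.shape[-1] - self.block_size + 1) ** 2)
        return invalid * valid

    def forward(self, x: Tensor) -> Tensor:
        if self.training:
            gamma = self.calculate_gamma(x)
            mask = torch.bernoulli(torch.ones_like(x) * gamma)
            mask_block = 1 - F.max_pool2d(
                mask,
                kernel_size=(self.block_size, self.block_size),
                stride=(1, 1),
                padding=(self.block_size // 2, self.block_size // 2),
            )
            x = mask_block * x * (mask_block.numel() / mask_block.sum())
        return x

torch.manual_seed(0)
x = torch.randn(1, 3, 28, 28)
layer = DropBlock(block_size=7)

# In eval mode DropBlock is the identity, like every dropout variant.
assert torch.equal(layer.eval()(x), x)

# In training mode contiguous regions are zeroed out.
y = layer.train()(x)
print(int((y == 0).sum()))  # number of zeroed activations
```

Like nn.Dropout, the layer only perturbs its input while self.training is True, so calling model.eval() at inference time disables it automatically.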

Testing DropBlock on Baby Yoda

import torchvision.transforms as T
r = requests.get('https://upload.wikimedia.org/wikipedia/en/0/00/The_Child_aka_Baby_Yoda_%28Star_Wars%29.jpg')
img = Image.open(BytesIO(r.content))
tr = T.Compose([T.Resize((224, 224)), T.ToTensor()])
x = tr(img)

drop_block = DropBlock(block_size=19, p=0.8)
x_drop = drop_block(x)

fig, axs = plt.subplots(1, 2)
axs[0].imshow(to_plot(x))
axs[1].imshow(x_drop[0, :, :].squeeze().numpy())  # show one channel of the dropped image

The result shows contiguous regions being zeroed, confirming that DropBlock removes blocks rather than isolated neurons.

Additional Observations

When block_size = 1, DropBlock behaves exactly like Dropout. When block_size equals the full feature‑map size, it becomes equivalent to Dropout2d (also known as SpatialDropout).
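The first equivalence can be checked directly: with block_size = 1, gamma reduces to (1 − p) · f²/f² = 1 − p, and the 1×1 max-pool with zero padding is the identity, so the block mask is just an inverted Bernoulli mask, i.e. plain element-wise Dropout. A minimal check of the pooling step:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
mask = torch.bernoulli(torch.full((1, 1, 8, 8), 0.3))

# With block_size = 1, kernel_size=1 and padding=0 make the max-pool
# an identity, so each "block" is a single unit -- standard Dropout.
pooled = F.max_pool2d(mask, kernel_size=1, stride=1, padding=0)
print(torch.equal(pooled, mask))  # True
```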

Conclusion

We now know how to implement DropBlock in PyTorch. The original paper reports a series of ImageNet experiments on a vanilla ResNet‑50, progressively adding regularization methods; DropBlock gives the largest gain, improving top‑1 accuracy from 76.51% to 78.13%.
