Understanding and Reproducing MAE (Masked AutoEncoder) for Self‑Supervised Vision Learning with EasyCV
This article introduces the MAE (Masked AutoEncoder) self‑supervised learning method, explains its asymmetric encoder‑decoder design and high masking ratio, evaluates its performance, and provides a step‑by‑step guide to reproduce MAE using Alibaba’s EasyCV framework, including code snippets, training tips, and troubleshooting.
MAE (Masked AutoEncoder) is a self‑supervised learning method that randomly masks a high proportion (75%) of image patches and reconstructs them using an asymmetric encoder‑decoder architecture.
The encoder follows a ViT design, processing only the visible patches, while the decoder is lightweight and receives both encoded visible tokens and learned mask tokens, with positional embeddings added before reconstruction.
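The decoder-side token assembly described above can be sketched as follows. This is a minimal NumPy illustration (not the EasyCV/PyTorch implementation) of how mask tokens are appended to the encoded visible tokens and then unshuffled back to their original patch positions before positional embeddings are added; all shapes and values are toy assumptions:

```python
import numpy as np

# Hypothetical shapes: batch 2, 16 patches total, 4 visible, embedding dim 8.
N, L, D = 2, 16, 8
len_keep = 4

rng = np.random.default_rng(0)
noise = rng.random((N, L))
ids_shuffle = np.argsort(noise, axis=1)   # first len_keep entries = kept patches
ids_restore = np.argsort(ids_shuffle, axis=1)

x_vis = rng.standard_normal((N, len_keep, D))  # stand-in for encoded visible tokens
mask_token = np.zeros((1, 1, D))               # a learned parameter in the real model

# Append one mask token per removed patch, then unshuffle so every token
# returns to its original patch position.
mask_tokens = np.broadcast_to(mask_token, (N, L - len_keep, D))
x_ = np.concatenate([x_vis, mask_tokens], axis=1)
x_full = np.take_along_axis(x_, ids_restore[..., None], axis=1)
```

After this step `x_full` has one token per patch in the original order: visible positions carry their encoded features and masked positions carry the shared mask token, which is what the lightweight decoder reconstructs from.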
Masking is performed by uniformly sampling patches and using a random noise vector to sort and select patches to keep; the binary mask and restore indices are returned for loss computation.
# embed patches
x = self.patch_embed(x)
# add pos embed w/o cls token
x = x + self.pos_embed[:, 1:, :]

def random_masking(self, x, mask_ratio):
    """Perform per-sample random masking by per-sample shuffling."""
    N, L, D = x.shape  # batch, length, dim
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(N, L, device=x.device)  # noise in [0, 1]
    # sort noise for each sample; ascend: small is keep, large is remove
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)
    # keep the first subset
    ids_keep = ids_shuffle[:, :len_keep]
    x_masked = torch.gather(x, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, D))
    # generate the binary mask: 0 is keep, 1 is remove
    mask = torch.ones([N, L], device=x.device)
    mask[:, :len_keep] = 0
    # unshuffle to get the binary mask in the original patch order
    mask = torch.gather(mask, dim=1, index=ids_restore)
    return x_masked, mask, ids_restore

During pre‑training the decoder predicts pixel values for the masked patches; the loss is the mean‑squared error between the predicted and (optionally per‑patch normalized) target patches, computed only on masked locations.
def forward_loss(self, imgs, pred, mask):
    target = self.patchify(imgs)
    if self.norm_pix_loss:
        # normalize each target patch by its own mean and variance
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1.e-6) ** .5
    loss = (pred - target) ** 2
    loss = loss.mean(dim=-1)  # [N, L], mean loss per patch
    loss = (loss * mask).sum() / mask.sum()  # mean loss on masked patches only
    return loss

Experiments show that MAE trains over three times faster and reaches higher accuracy than other self‑supervised methods (e.g., ViT‑Huge reaches 87.8% top‑1 on ImageNet‑1K). Linear probing, partial fine‑tuning, and full fine‑tuning evaluations demonstrate strong representation learning, especially for non‑linear features.
EasyCV, Alibaba’s open‑source PyTorch‑based framework, provides unified APIs for data handling, model training, evaluation, and deployment. Using EasyCV, the MAE model can be reproduced with minimal code changes.
Typical reproduction steps include:
Prepare image data (e.g., ImageNet or a small demo dataset).
Configure the model via a Python config file (e.g., mae_vit_base_patch16_8xb64_1600e.py).
Run distributed training with python -m torch.distributed.launch ….
Fine‑tune the pretrained checkpoint on downstream classification tasks, adjusting the state dict as needed.
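The state‑dict adjustment in the last step can be sketched as below. The key names and the `encoder.` prefix are hypothetical (actual EasyCV checkpoint layouts may differ); the idea is simply to drop decoder weights and rename encoder keys so a classification backbone can load them:

```python
# Hypothetical checkpoint key layout, for illustration only.
ckpt = {
    "encoder.patch_embed.proj.weight": "w0",
    "encoder.blocks.0.attn.qkv.weight": "w1",
    "decoder.blocks.0.attn.qkv.weight": "w2",
    "mask_token": "w3",
}

def adapt_for_finetune(state_dict, prefix="encoder."):
    """Keep only encoder weights and strip the prefix so the
    downstream classification backbone can load them directly."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}

backbone_sd = adapt_for_finetune(ckpt)
```

The decoder and mask token exist only for pre‑training reconstruction, so discarding them before fine‑tuning loses nothing for downstream tasks.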
Key practical tips: use mixup + cutmix augmentation during fine‑tuning, average token features instead of the CLS token, set a sufficiently large weight decay (≈0.05) during pre‑training to avoid gradient explosion, and keep the decoder lightweight (≤10% of encoder computation).
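The tip on averaging token features rather than using the CLS token amounts to the following toy NumPy sketch (the values are arbitrary; in practice the tokens come from the ViT encoder output, with the CLS token at index 0):

```python
import numpy as np

# tokens: [batch, 1 + num_patches, dim]; row 0 is the CLS token (toy values).
tokens = np.arange(2 * 5 * 3, dtype=float).reshape(2, 5, 3)

cls_feature = tokens[:, 0]                 # feature from the CLS token alone
avg_feature = tokens[:, 1:].mean(axis=1)   # average of patch tokens (preferred here)
```

Both produce a `[batch, dim]` feature for the downstream classification head; the averaged patch tokens tend to work better with MAE pre‑trained encoders.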
All code snippets, configuration files, and detailed logs are available in the EasyCV GitHub repository.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.