How Pre‑Training Evolved: From word2vec to MAE Across NLP and CV

This article traces the history of deep‑learning pre‑training techniques, comparing the parallel developments in natural‑language processing and computer vision—from early word2vec and bag‑of‑words models through ELMo and BERT to recent transformer‑based vision models like iGPT, ViT, BEiT and MAE—highlighting key innovations, challenges, and the convergence of the two fields.

Baobao Algorithm Notes

Introduction

The rapid progress of deep learning owes much to the evolution of pre‑training methods, which have shaped both natural‑language processing (NLP) and computer vision (CV) over the past decade. Following the timeline from word2vec in 2013 to MAE in 2021 shows how representation learning has become deeper, more contextual, and increasingly unified under the transformer architecture.

Semantic Representation Evolution

Representation learning can be divided into three stages: (1) feature‑engineering with bag‑of‑words, (2) shallow embeddings such as word2vec, and (3) deep transformer‑based embeddings exemplified by BERT. Each stage improves the richness of semantic encoding, moving from surface‑level token counts to contextualized vectors that capture meaning across sentences.
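
To make the first stage concrete, here is a minimal sketch of a bag‑of‑words encoder. The function name `bag_of_words` and the three‑word vocabulary are assumptions made for illustration; the point is only that two sentences with opposite meanings collapse to the same count vector.

```python
from collections import Counter

def bag_of_words(sentence, vocab):
    """Return a count vector over `vocab`; word order is discarded."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

vocab = ["dog", "bites", "man"]  # toy vocabulary, chosen for illustration
a = bag_of_words("dog bites man", vocab)
b = bag_of_words("man bites dog", vocab)

# Both sentences contain the same words, so their vectors are identical.
print(a, b, a == b)
```

This loss of order and context is the limitation that the later embedding stages address.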

Word2Vec and Language Models

Word2vec popularized distributed word embeddings, addressing the semantic limitation of bag‑of‑words by learning dense vectors that exhibit clustering and linear relationships (e.g., king – queen ≈ man – woman). Each word still receives a single static vector regardless of the sentence it appears in, so the embeddings lack contextual awareness. Word2vec builds on the earlier neural network language model (NNLM), which predicts the next word from the words that precede it using a simple MLP over a shared embedding matrix.

The NNLM, introduced by Bengio et al. in 2003, laid the groundwork for modern self‑supervised training: a lookup table (the embedding matrix) is learned jointly with the prediction network, and its rows become the word vectors.
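
The lookup‑table idea can be sketched as a forward pass in NumPy. This is a simplified illustration under stated assumptions, not the original model: the sizes, the random weights, and the names `C`, `W_h`, `W_o`, and `nnlm_forward` are all hypothetical, and bias terms and training are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, context_size, hidden_dim = 10, 4, 3, 8  # toy sizes

# Lookup table shared by every context position: row i is the vector of word i.
C = rng.normal(size=(vocab_size, embed_dim))
W_h = rng.normal(size=(context_size * embed_dim, hidden_dim))
W_o = rng.normal(size=(hidden_dim, vocab_size))

def nnlm_forward(context_ids):
    """Return a probability distribution over the next word."""
    x = C[context_ids].reshape(-1)        # look up and concatenate embeddings
    h = np.tanh(x @ W_h)                  # single hidden layer
    logits = h @ W_o
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

probs = nnlm_forward(np.array([2, 5, 7]))
word_vector = C[5]                        # the "embedding" of word id 5
print(probs.shape, word_vector.shape)
```

Training would backpropagate the prediction loss into `C` as well as the MLP weights, which is how the rows of the table end up encoding word meaning.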

Word2vec employs two training objectives: CBOW predicts a target word from its surrounding context, while Skip‑Gram predicts the context words from the target. Negative sampling and hierarchical softmax avoid computing a full softmax over the vocabulary, which lets training scale to large corpora.
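
The difference between the two objectives is easiest to see in the training examples each one generates. The sketch below only builds those examples; the helper names `skipgram_pairs` and `cbow_examples` are hypothetical, and negative sampling and hierarchical softmax are not shown.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context, target) examples: the neighbours jointly predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["the", "cat", "sat", "down"]
pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)
print(pairs)
print(examples)
```

Skip‑Gram produces one example per (center, neighbour) pair, whereas CBOW produces one example per position with the whole window as input.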

From NLP to CV Pre‑training

Early CV pre‑training relied on transferring weights from large classification datasets (e.g., ImageNet) to downstream tasks, a straightforward but label‑heavy approach. In contrast, NLP had already embraced self‑supervised objectives that required no manual annotations.

Bridging this gap led to self‑supervised vision methods, the most notable being contrastive learning, which builds representations by pulling together augmented views of the same image while pushing apart different images.
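
The pull‑together, push‑apart idea is commonly expressed as an InfoNCE‑style loss. The following is a simplified one‑directional sketch, not the exact loss of any particular method; the function name `info_nce_loss`, the temperature value, and the random embeddings standing in for encoder outputs are assumptions.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """z1[i] and z2[i] are embeddings of two augmented views of image i."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                  # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                          # stand-in for encoder outputs

# Views that agree give a low loss; unrelated embeddings give a high one.
aligned = info_nce_loss(z, z + 0.01 * rng.normal(size=z.shape))
unrelated = info_nce_loss(z, rng.normal(size=z.shape))
print(aligned, unrelated)
```

Each row treats the matching view as the correct class and every other image in the batch as a negative, so no manual labels are needed.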

iGPT: Generative Pre‑training for Images

iGPT adapts the GPT autoregressive framework to images by flattening pixel grids into sequences. It explores two objectives: (1) pixel‑wise autoregressive prediction and (2) a BERT‑style masked pixel prediction objective, analogous to masked language modeling (MLM). Because self‑attention cost grows quadratically with sequence length, iGPT first down‑samples images to a low resolution and quantizes colors to a 9‑bit palette obtained with k‑means, then trains a large transformer. This still requires massive compute (over 2,500 V100‑days), and the aggressive compression discards information.
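
The preprocessing can be sketched as follows. This is an illustration under loud assumptions: the 32x32 image and the 512‑entry palette are random stand‑ins (iGPT derives its palette with k‑means, which is skipped here for brevity), and the helper name `quantize` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a random low-resolution RGB image and a random palette.
# 512 palette entries correspond to 9 bits per pixel.
image = rng.integers(0, 256, size=(32, 32, 3))
palette = rng.integers(0, 256, size=(512, 3))

def quantize(image, palette):
    """Map every pixel to the id of its nearest palette colour, in raster order."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    dists = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

tokens = quantize(image, palette)            # one discrete token per pixel

# Autoregressive objective: predict token t+1 from tokens up to t.
inputs, targets = tokens[:-1], tokens[1:]

# Self-attention compares every token with every other token.
attention_pairs = len(tokens) ** 2
full_res_pairs = (224 * 224) ** 2            # same count without down-sampling
print(len(tokens), attention_pairs, full_res_pairs)
```

Comparing `attention_pairs` with `full_res_pairs` shows why working at pixel level forces such aggressive down‑sampling.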

Vision Transformer (ViT)

ViT replaces iGPT’s heavy down‑sampling with a patch‑embedding scheme: an image is split into fixed‑size patches, each linearly projected to a token vector, and positional embeddings are added, mirroring BERT’s tokenization. The transformer processes the sequence of patch tokens, and a special [CLS] token aggregates a global representation for classification.
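
The patch‑embedding step can be sketched in NumPy. The 224x224 input, 16x16 patches, and 768‑dimensional tokens are common ViT settings used here as assumptions; the weights are random stand‑ins for learned parameters, and `patchify` is a hypothetical helper name.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)        # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
patch_size, embed_dim = 16, 768

patches = patchify(image, patch_size)                 # (196, 768)

# Random stand-ins for the learned projection, [CLS] token, and position embeddings.
W_proj = rng.normal(size=(patches.shape[1], embed_dim)) * 0.02
cls_token = np.zeros((1, embed_dim))
pos_embed = rng.normal(size=(patches.shape[0] + 1, embed_dim)) * 0.02

tokens = np.concatenate([cls_token, patches @ W_proj], axis=0) + pos_embed
print(patches.shape, tokens.shape)
```

The sequence has 197 tokens instead of the tens of thousands a pixel‑level model would need, which is what makes full‑resolution images tractable.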

Although ViT achieved strong supervised performance, its original pre‑training remained classification‑centric, prompting research into self‑supervised variants.

BEiT: BERT‑style Pre‑training for Vision

BEiT inherits ViT’s patch tokens but changes the pre‑training task to masked patch prediction. A discrete VAE (dVAE) first learns a visual token vocabulary; during pre‑training, random patches are masked and the model predicts the corresponding visual tokens, similar to BERT’s MLM but at the patch level.

Training proceeds in two stages: (1) train the dVAE tokenizer for image reconstruction, and (2) pre‑train the transformer encoder with the masked‑image‑modeling head, using the tokenizer's visual tokens as prediction targets.
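
The masked‑patch objective can be sketched as a cross‑entropy computed only at masked positions. Everything numeric below is an assumption for illustration: the token ids and logits are random stand‑ins for the dVAE tokenizer and the transformer output, the 0.4 mask ratio and vocabulary size are illustrative settings, and uniform random masking replaces whatever masking strategy the real model uses.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, vocab_size, mask_ratio = 196, 8192, 0.4   # illustrative settings

# Hypothetical stand-in for the tokenizer: one visual token id per patch.
visual_tokens = rng.integers(0, vocab_size, size=num_patches)

# Choose which patches are hidden from the encoder.
num_masked = int(mask_ratio * num_patches)
masked_idx = rng.choice(num_patches, size=num_masked, replace=False)

# Hypothetical stand-in for the transformer output: logits over the visual vocabulary.
logits = rng.normal(size=(num_patches, vocab_size))

def masked_cross_entropy(logits, targets, masked_idx):
    """Average cross-entropy over masked patches only."""
    sel = logits[masked_idx]
    sel = sel - sel.max(axis=1, keepdims=True)
    log_probs = sel - np.log(np.exp(sel).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(masked_idx)), targets[masked_idx]].mean()

loss = masked_cross_entropy(logits, visual_tokens, masked_idx)
print(num_masked, loss)
```

Because the targets are discrete ids, the task is a classification problem per masked patch, exactly as MLM is a classification problem per masked word.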

MAE: Masked Autoencoders for Vision

MAE further simplifies BEiT’s pipeline by dropping the separate tokenizer and reconstructing raw pixels with an asymmetric auto‑encoder: the transformer encoder processes only the small subset of visible patches, while a lightweight decoder reconstructs the masked patches. Because the encoder never sees the masked patches, compute drops sharply, allowing fast training while still learning powerful representations even when 75% of the patches are masked.
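
The masking step and the reconstruction loss can be sketched as below. The patch array and the all‑zero "prediction" are random or trivial stand‑ins for real data and decoder output, and the helper names `random_masking` and `masked_mse` are hypothetical.

```python
import numpy as np

def random_masking(patches, mask_ratio, rng):
    """Keep a random subset of patches; return them, their indices, and the mask."""
    n = patches.shape[0]
    num_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:num_keep])
    mask = np.ones(n, dtype=bool)          # True marks a masked (hidden) patch
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

def masked_mse(pred, target, mask):
    """Mean squared pixel error, averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 768))      # stand-in for flattened image patches
visible, keep_idx, mask = random_masking(patches, 0.75, rng)

# Only `visible` would be fed to the encoder; the decoder predicts every patch.
pred = np.zeros_like(patches)              # stand-in for decoder output
loss = masked_mse(pred, patches, mask)
print(visible.shape, int(mask.sum()), loss)
```

With a 0.75 mask ratio the encoder sees 49 of 196 patches, so its attention layers compare 16 times fewer token pairs than they would on the full sequence.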

MAE pre‑training on ImageNet‑1K alone outperforms supervised pre‑training of comparable ViT models, reaching up to 87.8% top‑1 accuracy on ImageNet when fine‑tuned.

Summary of CV Evolution

Across CV, the trajectory mirrors NLP: from CNN backbones pre‑trained on labeled data, to self‑supervised contrastive methods, to transformer‑based models that adopt NLP‑style tokenization and masked‑modeling objectives. The convergence is evident in the shared reliance on large‑scale self‑supervision, token‑like patch embeddings, and the transformer’s ability to scale depth and parallelism.

Conclusion

Pre‑training has transformed both NLP and CV, moving from simple statistical tricks to sophisticated self‑supervised transformers. Understanding this history clarifies why modern models emphasize contextual, high‑dimensional representations and points toward future research that further unifies language and vision learning.


0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.