How Pre‑Training Evolved: From word2vec to MAE Across NLP & Vision

This article traces the evolution of deep‑learning pre‑training techniques, starting with word2vec in NLP, moving through ELMo and BERT, then shifting to computer‑vision models such as iGPT, ViT, BEiT, and MAE, highlighting key innovations, challenges, and the convergence of NLP and CV paradigms.

Baobao Algorithm Notes

Pre‑training in Natural Language Processing (NLP)

Early NLP models relied on shallow feature engineering such as bag‑of‑words. The first major shift was the introduction of distributed word representations.

Word2vec (2013)

Word2vec learns a lookup table of dense vectors by optimizing two self‑supervised objectives on large unlabelled corpora:

Continuous Bag‑of‑Words (CBOW) : predict a target word from its surrounding context.

Skip‑Gram : predict surrounding words given a target word.
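
The two objectives differ only in how training examples are cut from a sentence. The sketch below builds both kinds of examples from a token list; the function name `training_pairs` and the `window` parameter are illustrative choices, not part of any word2vec release.

```python
def training_pairs(tokens, window=2):
    """Build CBOW examples (context -> target) and Skip-Gram pairs (target -> context word)."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))                    # CBOW: many context words, one label
        skipgram.extend((target, c) for c in context)     # Skip-Gram: one pair per context word
    return cbow, skipgram

cbow, skipgram = training_pairs(["the", "cat", "sat", "on", "the", "mat"], window=1)
print(cbow[1])        # context and target for the word "cat"
print(skipgram[:3])   # first few (target, context word) pairs
```

CBOW yields one example per position, while Skip‑Gram yields one example per (position, neighbour) combination, which is why Skip‑Gram sees more updates per sentence.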

Training is accelerated with either negative sampling or a hierarchical softmax over a Huffman tree, yielding word vectors that capture semantic similarity (e.g., king − queen ≈ man − woman). However, the vectors are static: each word gets a single vector regardless of context, so a polysemous word such as “bank” cannot be disambiguated.
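
The analogy property can be illustrated with vector arithmetic followed by a nearest‑neighbour search. The vectors below are hand‑made 2‑D toys chosen so the analogy holds by construction; they are not trained embeddings, and the helper names are hypothetical.

```python
import math

# Hand-made toy vectors (axis 0 ~ "royalty", axis 1 ~ "gender"); NOT trained embeddings.
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman"))
```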

Neural Network Language Model (NNLM, 2003)

A decade before word2vec, Bengio et al. introduced a simple feed‑forward network that shares one embedding matrix across the vocabulary, concatenates the embeddings of the previous n words, passes them through a tanh‑activated hidden layer, and predicts the next word with a softmax over the vocabulary. This architecture showed that an embedding matrix learned as a by‑product of language modelling can serve as a general‑purpose semantic representation.
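
A minimal sketch of that forward pass, under simplifying assumptions: weights are random rather than trained, the sizes are toy values, and the optional direct input‑to‑output connections of the original model are omitted. All names are illustrative.

```python
import math
import random

def nnlm_forward(context_ids, E, W_h, b_h, W_o, b_o):
    """Embed -> concatenate -> tanh hidden layer -> softmax over the vocabulary."""
    x = [v for idx in context_ids for v in E[idx]]             # concatenated embeddings
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)   # hidden layer
         for row, b in zip(W_h, b_h)]
    logits = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W_o, b_o)]
    m = max(logits)                                            # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes (assumptions): vocabulary 10, embedding dim 4, context length 3, hidden units 8.
rng = random.Random(0)
V, d, n, H = 10, 4, 3, 8
rand = lambda rows, cols: [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
E, W_h, W_o = rand(V, d), rand(H, n * d), rand(V, H)
probs = nnlm_forward([1, 5, 7], E, W_h, [0.0] * H, W_o, [0.0] * V)
```

The rows of `E` are the word vectors; every context position reads from the same matrix, which is what "shared embedding" means here.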

ELMo (2018)

ELMo replaces static embeddings with contextual ones computed by a deep bidirectional LSTM. The model is trained with forward and backward language‑modelling objectives, and its hidden states serve as contextual word representations. This establishes a two‑stage paradigm: first pre‑train the language model, then feed its representations (a learned, task‑specific weighting of the layers) into a downstream model as features. The LSTM’s sequential computation, however, limits parallelism and scalability.
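
In the ELMo paper, a downstream model does not read a single LSTM layer; it reads a softmax‑weighted sum of all layers, scaled by a scalar. The sketch below shows that combination for one token, with made‑up layer outputs and the hypothetical name `scalar_mix`.

```python
import math

def scalar_mix(layer_states, s, gamma=1.0):
    """Combine per-layer vectors for one token: gamma * sum_j softmax(s)_j * h_j."""
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_states[0])
    return [gamma * sum(w * h[k] for w, h in zip(weights, layer_states)) for k in range(dim)]

# Three hypothetical layer outputs for one token (token layer + two biLSTM layers).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = scalar_mix(layers, s=[0.0, 0.0, 0.0])   # equal scores give equal weights of 1/3
```

The scores `s` and the scale `gamma` are the only ELMo‑side parameters learned per task; the LSTM itself stays fixed in this feature‑based setup.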

BERT (2019)

BERT adopts the Transformer encoder, enabling full parallel computation and deep stacking. Its pre‑training objectives are:

Masked Language Modeling (MLM) : randomly mask 15 % of tokens and predict them, forcing the model to use both left and right context.

Next Sentence Prediction (NSP) : predict whether two sentences appear consecutively, encouraging inter‑sentence coherence.
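
The MLM corruption step can be sketched as follows. The 80 % / 10 % / 10 % replacement rule comes from the BERT paper rather than from this article, and the function name, the seed handling, and the choice to pick an exact count of positions are assumptions made for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    n_pick = max(1, round(mask_rate * len(tokens)))
    picked = set(rng.sample(range(len(tokens)), n_pick))
    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in picked:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)                        # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            corrupted.append("[MASK]")            # 80%: replace with the mask symbol
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(tok)                 # 10%: keep the token unchanged
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog near the river bank today".split()
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)))
```

Only positions with a non‑`None` label contribute to the loss, so the model cannot simply copy its input at the remaining positions and be rewarded for it.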

Key architectural details include:

WordPiece sub‑word tokenization (a close relative of BPE) to handle out‑of‑vocabulary words.

Learned position embeddings to inject order information.
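
To make the sub‑word idea concrete, here is a sketch of greedy longest‑match‑first segmentation, the procedure WordPiece uses at tokenization time. The vocabulary is a tiny hypothetical one; a real BERT vocabulary has roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]          # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"un", "##aff", "##able", "play", "##ing"}   # hypothetical vocabulary
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
```

A word never seen during training is still representable as long as its pieces are in the vocabulary, which is how sub‑word tokenization sidesteps the out‑of‑vocabulary problem.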

The result is a deep, bidirectional contextual encoder that can be fine‑tuned on a wide range of downstream tasks (classification, QA, NER, etc.) with minimal architectural changes.

Pre‑training in Computer Vision (CV)

CV initially relied on supervised ImageNet pre‑training, where a convolutional network learns generic visual features from millions of labelled images. This approach works well for transfer learning but depends heavily on large annotated datasets.

Contrastive Learning

Contrastive self‑supervised methods construct two augmented views of the same image and train a backbone to produce similar embeddings for the positive pair while pushing apart embeddings of different images. The loss typically follows an InfoNCE formulation:

\mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(z_i, z_k)/\tau)}

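where z_i and z_j are embeddings of two augmentations of the same image, z_k ranges over the N candidate embeddings (the positive z_j plus negatives drawn from other images), sim is cosine similarity, and \tau is a temperature hyper‑parameter.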
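
A direct transcription of this loss for a single anchor, assuming the candidate set is the positive plus an explicit list of negatives. The function names and the default temperature of 0.1 are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(z_i, z_j, negatives, tau=0.1):
    """-log of the softmax score of the positive z_j among the candidates [z_j] + negatives."""
    logits = [cosine(z_i, z_j) / tau] + [cosine(z_i, z_k) / tau for z_k in negatives]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
aligned = info_nce(anchor, [0.9, 0.1], negatives)      # positive close to the anchor
misaligned = info_nce(anchor, [0.1, 0.9], negatives)   # positive far from the anchor
```

The loss is small when the positive is the most similar candidate and grows as negatives overtake it; a lower temperature sharpens that contrast.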

iGPT (2020)

iGPT adapts the GPT‑style autoregressive objective to images:

Pixels are raster‑scanned (row‑major) and flattened into a 1‑D token sequence.

The model predicts the next pixel token given all previous tokens (auto‑regressive loss).

As an alternative objective, iGPT also explores BERT‑style masking at the pixel level: a subset of pixel tokens is hidden and predicted from the remaining ones.

Because raw images are high‑dimensional, iGPT reduces computational cost by aggressive down‑sampling and colour quantization (e.g., K‑means to 9‑bit colour). Despite achieving strong generative performance, iGPT requires >2× the parameters of comparable CNNs and training times on the order of thousands of GPU days.
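
The preprocessing described above can be sketched as follows: quantize each pixel to its nearest palette colour, flatten the image in raster order, and cut the sequence into next‑token examples. The 4‑colour palette is a hypothetical stand‑in for a k‑means palette, and all function names are illustrative.

```python
def nearest_colour(pixel, palette):
    """Index of the palette entry closest to an (r, g, b) pixel (squared Euclidean distance)."""
    return min(range(len(palette)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(pixel, palette[k])))

def image_to_sequence(image, palette):
    """Quantize every pixel, then flatten rows left to right, top to bottom."""
    return [nearest_colour(px, palette) for row in image for px in row]

def next_token_examples(sequence):
    """(prefix, target) pairs for the autoregressive objective."""
    return [(sequence[:t], sequence[t]) for t in range(1, len(sequence))]

# Hypothetical 4-colour palette; a 9-bit palette would have 512 entries.
palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (255, 255, 255)]
image = [[(10, 5, 0), (250, 10, 10)],
         [(5, 240, 20), (245, 250, 255)]]
sequence = image_to_sequence(image, palette)
examples = next_token_examples(sequence)
```

Even a 32×32 image becomes a 1,024‑token sequence under this scheme, which gives a sense of why down‑sampling is needed before a Transformer can process it.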

Vision Transformer (ViT, 2020)

ViT splits an image into fixed‑size patches (e.g., 16×16 pixels), linearly projects each flattened patch to a token embedding, adds learned position embeddings, and feeds the sequence to a standard Transformer encoder. Beyond this initial patch split there is no convolutional down‑sampling, so the token grid keeps the same resolution through every layer. A special [CLS] token is prepended to the sequence, and its final state serves as the global representation for classification.
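
The patch split itself is simple enough to write out. The sketch below handles a single‑channel image stored as nested lists; the linear projection and position embeddings are left out, and `patchify` is a hypothetical helper name.

```python
def patchify(image, patch):
    """Split an H x W single-channel image into flattened patch x patch tiles (row-major)."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image size must be a multiple of patch"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4 x 4 image
patches = patchify(image, patch=2)

# A 224 x 224 input with 16 x 16 patches gives (224 // 16) ** 2 = 196 patch tokens, plus [CLS].
num_tokens = (224 // 16) ** 2 + 1
```

Each flattened patch plays the role that a word piece plays in BERT: it is the unit that gets embedded, positioned, and attended over.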

BEiT (2021)

BEiT builds on ViT’s patch tokenization but replaces the classification pre‑training objective with a masked image modelling (MIM) task:

A discrete VAE (dVAE) first learns a visual token vocabulary from image patches.

During pre‑training, a subset of patch tokens is masked and the model predicts the corresponding visual tokens (similar to BERT’s MLM).

Training proceeds in two stages: (1) pre‑train the dVAE to obtain a high‑quality visual tokenizer, (2) train the ViT encoder to recover masked tokens.
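
The stage‑2 objective reduces to a cross‑entropy over the visual vocabulary, computed only at masked positions. In this sketch the tokenizer output and the encoder logits are hand‑written stand‑ins, and `mim_loss` is a hypothetical name.

```python
import math

def mim_loss(logits, visual_tokens, masked):
    """Mean cross-entropy between predicted and true visual tokens, at masked positions only."""
    total = 0.0
    for pos in masked:
        row = logits[pos]
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[visual_tokens[pos]]
    return total / len(masked)

# Stage 1 output (assumed given): one visual-token id per patch from a frozen tokenizer.
visual_tokens = [2, 0, 1, 2]
masked = [1, 3]                      # patch positions whose input was replaced by a mask embedding
# Stage 2: hypothetical encoder logits over a 3-entry visual vocabulary, one row per patch.
good = [[0, 0, 0], [5, 0, 0], [0, 0, 0], [0, 0, 5]]
bad = [[0, 0, 0], [0, 5, 0], [0, 0, 0], [5, 0, 0]]
print(mim_loss(good, visual_tokens, masked) < mim_loss(bad, visual_tokens, masked))
```

Because the targets are discrete ids rather than pixels, the encoder is trained with a classification loss, mirroring BERT's prediction over a word‑piece vocabulary.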

Masked Autoencoder (MAE, 2021)

MAE simplifies BEiT by using a single‑stage encoder‑decoder architecture:

A large random subset of patches (e.g., 75 %) is masked out and dropped; only the remaining visible patches (about 25 %) are fed to the encoder, which never sees mask tokens.

The lightweight decoder receives the encoded visible patches together with mask tokens inserted at the masked positions, and reconstructs the missing pixel values.

Positional embeddings are added to both encoder and decoder inputs to preserve spatial information.
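
The bookkeeping behind this design, following the MAE paper, is that the encoder receives only the visible patches and mask tokens enter at the decoder. The sketch below tracks indices only; the encoder output is a string stand‑in, and the function names are hypothetical.

```python
import random

def mae_split(num_patches, mask_ratio=0.75, seed=0):
    """Shuffle patch indices and split them into visible (encoder input) and masked sets."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_visible = int(num_patches * (1 - mask_ratio))
    return sorted(order[:num_visible]), sorted(order[num_visible:])

def decoder_input(encoded_visible, visible_idx, num_patches, mask_token="[M]"):
    """Put encoded visible patches back in place; fill every other slot with the mask token."""
    sequence = [mask_token] * num_patches
    for idx, vec in zip(visible_idx, encoded_visible):
        sequence[idx] = vec
    return sequence

visible, masked = mae_split(196)                  # 14 x 14 patches
encoded = [f"enc({i})" for i in visible]          # stand-in for the encoder output
sequence = decoder_input(encoded, visible, 196)
print(len(visible), len(masked), len(sequence))
```

The reconstruction loss is then evaluated on the masked positions only, so the visible patches serve purely as context.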

This design dramatically reduces training FLOPs (the encoder processes far fewer tokens) while achieving state‑of‑the‑art self‑supervised performance on downstream tasks such as image classification, detection, and segmentation.
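
A rough back‑of‑the‑envelope for the encoder savings, assuming a 224×224 input with 16×16 patches (N = 196 tokens) and 75 % masking. The cost model (per‑token layers scale linearly with token count, attention scores quadratically) is a standard approximation, not a measured figure.

```latex
N_{\text{visible}} = (1 - 0.75) \times 196 = 49, \qquad
\frac{N_{\text{visible}}}{N} = \frac{1}{4}, \qquad
\left(\frac{N_{\text{visible}}}{N}\right)^{2} = \frac{1}{16}
```

Under these assumptions the encoder's per‑token layers do about a quarter of the work and its attention‑score computation about one sixteenth, with the full‑length sequence handled only by the much smaller decoder.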

Convergence of NLP and CV Pre‑training

Both fields have moved from shallow, hand‑crafted features to deep, transformer‑based contextual representations. The common pipeline now consists of:

Tokenization (words or sub‑words for NLP; image patches for CV).

Position embeddings (learned or fixed sinusoidal) to encode order or spatial layout.

A masked reconstruction objective (MLM for text, MIM for images) that forces the model to infer missing information from surrounding context.

These shared principles illustrate how self‑supervised learning has become the unifying foundation for modern AI, enabling large‑scale pre‑training on unlabelled data and efficient transfer to downstream tasks across modalities.

Figures (images not preserved): Pre‑training evolution overview; Semantic representation stages; BERT components; iGPT pipeline; ViT architecture; MAE encoder‑decoder.