
Comprehensive Guide to Vision Transformer (ViT): Architecture, Patch Tokenization, Embedding, Fine‑tuning, and Performance

This article provides an in‑depth overview of Vision Transformer (ViT), covering its Transformer‑based architecture, patch‑to‑token conversion, token and position embeddings, fine‑tuning strategies such as 2‑D position‑embedding interpolation, experimental results versus CNNs, and the model's broader significance for multimodal AI research.

Rare Earth Juejin Tech Community

Introduction

The recent AIGC wave has pushed the Transformer model to the forefront. BERT (an Encoder‑only Transformer) sparked the NLP pre‑training boom, while GPT (a Decoder‑only Transformer) drove research on generative self‑supervised training. In 2020, Google introduced Vision Transformer (ViT), a pure‑Transformer image‑classification model that demonstrated a unified architecture across language, image, and video tasks and has since become the backbone of many large‑scale models.

Model Architecture

ViT mirrors BERT’s Encoder structure. The input is an image split into fixed‑size patches, each treated like a token. A learnable <cls> token is prepended for classification. The architecture consists of a stack of Transformer Encoder layers (L blocks) that process the sequence of patch embeddings.
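As a rough illustration, the stack described above can be sketched with standard PyTorch building blocks (a minimal sketch using nn.TransformerEncoder rather than ViT's exact block implementation; hyperparameters follow ViT‑Base, and the class name is ours):

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style classifier: patchify -> <cls> + positions -> L encoder blocks."""
    def __init__(self, image_size=224, patch=16, dim=768, depth=12, heads=12, classes=1000):
        super().__init__()
        n = (image_size // patch) ** 2                       # 196 patches for 224x224
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # learnable <cls> token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # learnable 1-D positions
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)   # stack of L blocks
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, classes))

    def forward(self, x):                                    # x: (B, 3, 224, 224)
        t = self.proj(x).flatten(2).transpose(1, 2)          # (B, 196, 768)
        t = torch.cat([self.cls.expand(len(t), -1, -1), t], dim=1) + self.pos
        return self.head(self.encoder(t)[:, 0])              # classify from <cls>

logits = MiniViT(depth=2)(torch.randn(1, 3, 224, 224))       # depth=2 keeps the demo light
```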

1.1 BERT Architecture (Reference)

For context, BERT processes token embeddings, position embeddings, and optional segment embeddings, and is trained on two tasks: Next Sentence Prediction (using a special <cls> token) and Masked Language Modeling (using a <mask> token).

1.2 ViT Architecture

Each image patch (e.g., 16×16×3) is flattened and linearly projected to a 768‑dimensional token, yielding an input matrix X of shape (196, 768) for a 224×224 image. Learnable position embeddings are added, and the classification token <cls> is prepended. The resulting sequence of 197 tokens is fed into the Transformer Encoder.

From Patch to Token

Given an image of size H×W×C (e.g., 224×224×3) and patch size P=16, the image is divided into (H/P)*(W/P)=196 patches. Each patch is flattened to a 1×768 vector, forming the token matrix X.
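The arithmetic above can be checked directly (a small sketch using torch.Tensor.unfold; variable names are illustrative):

```python
import torch

H = W = 224; C = 3; P = 16
img = torch.randn(C, H, W)

# Split into (H/P)*(W/P) = 14*14 = 196 non-overlapping patches of P*P*C = 768 values each.
patches = img.unfold(1, P, P).unfold(2, P, P)                   # (3, 14, 14, 16, 16)
tokens = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * P * P)  # (196, 768)

print(tokens.shape)  # torch.Size([196, 768])
```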

In practice, ViT implements this patchify‑and‑project step as a single convolution: a kernel of size 16×16×3 with stride 16 and 768 output channels produces a feature map of shape 14×14×768, and each 1×1×768 column of that map is the token embedding of one patch.
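A minimal sketch of this convolutional tokenization (variable names are ours):

```python
import torch
import torch.nn as nn

# A 16x16 convolution with stride 16 and 768 output channels patchifies
# and linearly projects the image in one step.
to_tokens = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)
fmap = to_tokens(img)                     # (1, 768, 14, 14) feature map
tokens = fmap.flatten(2).transpose(1, 2)  # (1, 196, 768): one 768-d token per patch
```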

Embedding Details

ViT uses two main embeddings:

Token Embedding : a learnable matrix E of shape (768, 768) that projects the flattened patch vectors.

Position Embedding : a learnable matrix of shape (197, 768) — one row per patch plus one for the <cls> token — that encodes the spatial location of each position in the sequence.
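Putting the two embeddings together, with 197 = 196 patches + 1 <cls> token (an illustrative sketch; the zero/random initializations are untrained placeholders):

```python
import torch
import torch.nn as nn

dim, n_patches = 768, 196
E = nn.Linear(dim, dim)                                  # token embedding: (768, 768) projection
pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))   # 1-D learnable positions (incl. <cls>)
cls = nn.Parameter(torch.zeros(1, 1, dim))               # learnable <cls> token

flat_patches = torch.randn(1, n_patches, dim)            # flattened 16*16*3 patches
x = E(flat_patches)                                      # project each patch vector
x = torch.cat([cls, x], dim=1) + pos                     # prepend <cls>, add positions
print(x.shape)  # torch.Size([1, 197, 768])
```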

Four position‑encoding schemes were evaluated:

No positional information.

1‑D absolute positional encoding (learnable, used in the final ViT).

2‑D absolute positional encoding (splits the encoding into row and column components).

Relative positional encoding (adds bias based on pairwise patch distances).

Experiments showed that all schemes except “no positional encoding” performed similarly, so the simple learnable 1‑D absolute encoding was chosen.

Mathematical Formulation of the Architecture

The preprocessing step converts each patch into a token vector, prepends the <cls> token, and adds position embeddings. Each Transformer block then applies multi‑head self‑attention followed by a feed‑forward MLP, both wrapped in layer normalization and residual connections. The final classification head applies layer normalization and a linear layer to the <cls> representation.
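For reference, this computation corresponds to the equations in the ViT paper:

```latex
z_0 = [x_{\text{class}};\, x_p^1 E;\, x_p^2 E;\, \dots;\, x_p^N E] + E_{pos},
\quad E \in \mathbb{R}^{(P^2 C) \times D},\; E_{pos} \in \mathbb{R}^{(N+1) \times D} \\
z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1, \dots, L \\
z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1, \dots, L \\
y = \mathrm{LN}(z_L^0)
```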

Fine‑tuning

During fine‑tuning, the number of patches may grow (e.g., from 196 to 4096 when moving from 224×224 to 1024×1024 inputs). Since the learned position embeddings are tied to the original sequence length, ViT reshapes them into a 2‑D grid and resizes that grid to the new resolution with bicubic interpolation:

import torch.nn as nn

# pos_embedding_img: patch position embeddings reshaped to (1, hidden_dim, 14, 14);
# the <cls> embedding is split off beforehand and re-attached afterwards.
new_pos_embedding_img = nn.functional.interpolate(
    pos_embedding_img,
    size=new_seq_length_1d,   # new grid side, e.g. 64 for a 1024×1024 input
    mode=interpolation_mode,  # "bicubic"
    align_corners=True,
)

This preserves the original corner embeddings while smoothly adapting the interior positions.
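End to end, the resize looks roughly like this (a sketch modeled on torchvision's interpolate_embeddings helper; the function name and shapes here are illustrative):

```python
import torch
import torch.nn as nn

def resize_pos_embedding(pos_emb, new_grid, mode="bicubic"):
    """Resize (1, 1 + g*g, D) position embeddings to (1, 1 + new_grid**2, D)."""
    cls_emb, patch_emb = pos_emb[:, :1], pos_emb[:, 1:]       # keep <cls> untouched
    g = int(patch_emb.shape[1] ** 0.5)                        # old grid side, e.g. 14
    d = patch_emb.shape[2]
    grid = patch_emb.reshape(1, g, g, d).permute(0, 3, 1, 2)  # (1, D, g, g)
    grid = nn.functional.interpolate(grid, size=new_grid, mode=mode, align_corners=True)
    patch_emb = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_emb, patch_emb], dim=1)

# 224x224 (196 patches) -> 1024x1024 (4096 patches), as in the example above
resized = resize_pos_embedding(torch.randn(1, 197, 768), new_grid=64)
```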

Performance Comparison

ViT models (ViT‑Base, ViT‑Large, ViT‑Huge) were compared against CNN baselines (ResNet, Noisy Student) on several ImageNet variants. Accuracy was comparable, while training cost (TPU‑days) was substantially lower for ViT (≈2500 vs. >9900 TPU‑days). The advantage stems from the uniform matrix‑multiplication pattern of Transformers, which scales efficiently across hardware.

Inductive Bias Discussion

CNNs benefit from spatial locality and translation equivariance, which act as strong inductive biases for vision tasks. ViT lacks these built‑in biases and must learn the corresponding relationships from data. Experiments show that ViT underperforms on small datasets, but when pre‑trained on massive datasets (ImageNet‑21k with roughly 14M images, or Google's JFT‑300M) it catches up with and surpasses CNNs.

Attention Analysis

Analysis of multi‑head attention reveals that deeper layers attend to more distant patches, effectively expanding the receptive field similar to CNNs. Visualizations show that the final layer focuses on semantically important regions of the image.

Position‑Embedding Insights

Cosine‑similarity heatmaps of position embeddings demonstrate that ViT learns a form of spatial locality: patches have high similarity with patches in the same row, column, and nearby vicinity, indicating that position embeddings encode useful spatial relationships.
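Such a heatmap can be reproduced in a few lines (a sketch; random values stand in for a trained checkpoint, so only the self‑similarity of 1.0 is meaningful here):

```python
import torch
import torch.nn.functional as F

pos = torch.randn(196, 768)   # stand-in for trained patch position embeddings

# Cosine similarity of one patch's embedding against all 196 positions,
# reshaped back onto the 14x14 grid for visualization as a heatmap.
i = 7 * 14 + 7                # patch at row 7, column 7
sim = F.cosine_similarity(pos[i:i + 1], pos, dim=1).reshape(14, 14)
```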

Conclusion

ViT proves that a unified Transformer framework can handle vision tasks, matching CNNs in accuracy while offering scalability and a foundation for multimodal research. Its success has spurred subsequent work on detection, segmentation, and large‑scale self‑supervised learning in computer vision.

References

https://arxiv.org/pdf/2010.11929.pdf

https://www.bilibili.com/video/BV15P4y137jb/?spm_id_from=333.337.search-card.all.click

https://arxiv.org/pdf/1803.02155.pdf

https://blog.csdn.net/qq_44166630/article/details/127429697
