Deep Dive into Vision Transformer Patch Embedding Mechanisms
This article explains how Vision Transformers convert images into patch embeddings, compares flattening-based and convolutional approaches, discusses position embeddings and the CLS token, analyzes the effect of patch size, explores pixel-level tokens, and contrasts ViT's inductive bias with that of CNNs.
Introduction
The Vision Transformer (ViT) is one of the most widely used architectures for converting images into sequences of vector representations that downstream models, including multimodal language models, can process.
ViT Processing Flow
The diagram shows the generic ViT input‑output pipeline: the input is a matrix of patch embeddings, each representing a local region of the image. The output is a matrix of the same shape, but the embeddings now contain contextual semantic information, and a special CLS token captures global image information.
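For concreteness, here is a minimal PyTorch sketch of that input-output contract. The hidden dimension of 64, the nine patches, and the two-layer encoder are illustrative assumptions, not values from the article.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the ViT encoder's input/output shapes.
D, N = 64, 9                                   # embedding dimension, number of patches (assumed)
patch_embeddings = torch.randn(1, N, D)        # (batch, patches, dim)
cls_token = torch.zeros(1, 1, D)               # learnable parameter in a real model
tokens = torch.cat([cls_token, patch_embeddings], dim=1)   # (1, N+1, D)

encoder_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

out = encoder(tokens)                          # same shape as the input: (1, N+1, D)
cls_out = out[:, 0]                            # global image representation
patch_out = out[:, 1:]                         # contextualized patch embeddings
print(out.shape, cls_out.shape)                # torch.Size([1, 10, 64]) torch.Size([1, 64])
```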
Conventional Patch Embedding
First, the image is divided into a fixed-size grid of patches. For example, a 12×12 image can be split into nine 4×4 patches; in general, the number of patches is N = (width × height) / p², which here gives (12 × 12) / 4² = 9.
Each patch is flattened into a one-dimensional vector that spans all three RGB channels, giving a length of p² × 3 (4 × 4 × 3 = 48 in this example). The flattened vector is multiplied by a projection matrix of shape (p² · 3) × D, where D is the embedding dimension, to obtain the patch embedding. To retain spatial information, a learnable position-embedding vector is added to each patch embedding. A CLS token is prepended to the sequence; through self-attention it aggregates information from all patches and is typically used for image-level classification.
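The following PyTorch sketch walks through this flatten-and-project pipeline for the 12×12 RGB example; the embedding dimension D = 8 is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Flatten-and-project patch embedding for a 12x12 RGB image with 4x4 patches.
# The embedding dimension D = 8 is an assumed value for illustration.
B, C, H, W, p, D = 1, 3, 12, 12, 4, 8
img = torch.randn(B, C, H, W)

# Split into non-overlapping p x p patches: N = H*W / p^2 = 9
patches = img.unfold(2, p, p).unfold(3, p, p)        # (B, C, 3, 3, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5)          # (B, 3, 3, C, p, p)
patches = patches.reshape(B, -1, C * p * p)          # (B, 9, 48) flattened patch vectors

proj = nn.Linear(C * p * p, D)                       # linear projection to dimension D
x = proj(patches)                                    # (B, 9, D) patch embeddings

pos_embed = nn.Parameter(torch.zeros(1, x.size(1), D))   # learnable position embeddings
x = x + pos_embed

cls_token = nn.Parameter(torch.zeros(1, 1, D))       # learnable CLS token
x = torch.cat([cls_token.expand(B, -1, -1), x], dim=1)   # (B, 10, D)
print(x.shape)                                       # torch.Size([1, 10, 8])
```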
Convolutional Patch Embedding
Alternatively, convolutional filters can generate patch embeddings. A non-overlapping configuration (stride = 3, kernel_size = 3), applied to a 9×9 input, produces a 3×3 feature map, i.e., nine patches. Selecting two filters yields a hidden dimension of 2 for each patch. The feature maps are flattened with nn.Flatten(2), reshaping the tensor from (1, 2, 3, 3) to (1, 2, 9), which is then permuted to (1, 9, 2). Prepending the CLS token gives a final shape of (1, 10, 2): ten embedding vectors (nine patches + CLS).
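The same steps in a short PyTorch sketch; the 9×9 RGB input is an assumption chosen so that kernel_size = 3 and stride = 3 reproduce the 3×3 feature map described above.

```python
import torch
import torch.nn as nn

# Convolutional patch embedding matching the shapes in the text.
img = torch.randn(1, 3, 9, 9)                    # (batch, channels, H, W), assumed input size

conv = nn.Conv2d(in_channels=3, out_channels=2,  # two filters -> hidden dimension 2
                 kernel_size=3, stride=3)        # stride == kernel: non-overlapping patches
feat = conv(img)                                 # (1, 2, 3, 3)

x = nn.Flatten(2)(feat)                          # (1, 2, 9)
x = x.permute(0, 2, 1)                           # (1, 9, 2): nine patch embeddings

cls_token = torch.zeros(1, 1, 2)                 # learnable parameter in a real model
x = torch.cat([cls_token, x], dim=1)             # (1, 10, 2)
print(x.shape)                                   # torch.Size([1, 10, 2])
```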
Impact of Patch Size
Larger patches reduce the number of tokens, lowering the quadratic attention cost (since attention compares each token with every other). However, larger patches embed coarser information, which may miss fine‑grained details such as small text. Smaller patches capture finer details but increase computational load.
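A quick back-of-the-envelope calculation makes the trade-off concrete for a 224×224 image; the patch sizes below are illustrative choices, not values from the article.

```python
# Token count and pairwise attention cost as a function of patch size (224x224 image).
H = W = 224
for p in (32, 16, 8):
    n_tokens = (H // p) * (W // p)
    attn_pairs = n_tokens ** 2          # attention compares every token with every other
    print(f"patch {p:>2}: {n_tokens:>4} tokens, {attn_pairs:>8} attention pairs")
# patch 32:   49 tokens,     2401 attention pairs
# patch 16:  196 tokens,    38416 attention pairs
# patch  8:  784 tokens,   614656 attention pairs
```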
Pixel‑Level Tokens
Pixel Transformers (PiT) treat each pixel as an individual token, removing ViT's inherent locality bias. This can improve classification performance but dramatically increases sequence length: a 32×32 image yields 1,024 tokens, while a 224×224 image yields 50,176 tokens, leading to much higher memory and compute requirements.
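The growth is easy to verify; the snippet below counts tokens and the entries of a single full attention matrix for the two image sizes mentioned above.

```python
# Token counts when every pixel is a token, and the size of one full attention matrix.
for side in (32, 224):
    n_tokens = side * side
    attn_entries = n_tokens ** 2        # entries in one full attention matrix
    print(f"{side}x{side} image: {n_tokens} tokens, "
          f"{attn_entries:,} attention entries per head")
# 32x32 image: 1024 tokens, 1,048,576 attention entries per head
# 224x224 image: 50176 tokens, 2,517,630,976 attention entries per head
```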
Inductive Bias Comparison
CNNs possess strong inductive bias: locality (each layer focuses on small local regions), 2‑D neighborhood structure, and translation equivariance (shifting the input shifts the output similarly). ViT has far weaker image‑specific bias; its MLP layers retain some locality, but the self‑attention layer operates globally across the entire image. Consequently, ViT is more flexible but typically requires more training data to learn spatial relationships from scratch.
Conclusion
Patch embeddings give ViT only limited locality. Larger patches improve efficiency, while smaller patches preserve fine details. Non-overlapping patch embeddings, whether produced by flattening or by convolution, are simple but can lose continuity across patch boundaries; overlapping patches add computational cost but can boost performance. ViT's weaker inductive bias makes it adaptable to various modalities but also more data-hungry than CNNs.