Deep Dive into Vision Transformer Patch Embedding Mechanisms

This article explains how Vision Transformers convert images into patch embeddings, compares flattening versus convolutional approaches, discusses position and CLS tokens, analyzes the effect of patch size, explores pixel‑level tokens, and contrasts ViT’s inductive bias with CNNs.

AI Algorithm Path

Introduction

The Vision Transformer (ViT) is one of the most common techniques for converting images into vector representations that downstream Transformer models (including multimodal language models) can process.

ViT Processing Flow

The diagram shows the generic ViT input‑output pipeline: the input is a matrix of patch embeddings, each representing a local region of the image. The output is a matrix of the same shape, but the embeddings now contain contextual semantic information, and a special CLS token captures global image information.

ViT patch embedding to transformer input and output

Conventional Patch Embedding

First, the image is divided into a fixed‑size grid of patches. For example, a 12×12 image can be split into nine 4×4 patches; in general, the number of patches is N = (width × height) / p², here (12 × 12) / 4² = 9.
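The patch count above can be checked with a short helper (the function name is illustrative, not from any library):

```python
def num_patches(width: int, height: int, p: int) -> int:
    """Number of non-overlapping p x p patches in a width x height image."""
    assert width % p == 0 and height % p == 0, "image must divide evenly into patches"
    return (width * height) // (p * p)

print(num_patches(12, 12, 4))     # 9, the 3x3 grid from the example
print(num_patches(224, 224, 16))  # 196, the standard ViT-Base setting
```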

Creating patch embeddings for ViT

Each patch is flattened into a one‑dimensional vector that includes all three RGB channels, giving a vector of length p² × C (4 × 4 × 3 = 48 in the example). The flattened vector is multiplied by a projection matrix of shape (p² × C) × d to obtain a d‑dimensional patch embedding. To retain spatial information, a learnable position‑embedding vector is added to each patch embedding. A CLS token is prepended to the sequence; through self‑attention it aggregates information from all patches and is typically used for image‑level classification.
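The flatten-and-project route can be sketched in a few lines of PyTorch. The sizes below match the running example (12×12 image, 4×4 patches); the embedding dimension of 8 is an arbitrary small value chosen for readability:

```python
import torch
import torch.nn as nn

H = W = 12; P = 4; C = 3; D = 8          # image size, patch size, channels, embed dim
N = (H * W) // (P * P)                   # 9 patches

img = torch.randn(1, C, H, W)            # (batch, channels, height, width)

# Cut into non-overlapping P x P patches, then flatten each to a p^2*C vector.
patches = img.unfold(2, P, P).unfold(3, P, P)                          # (1, C, 3, 3, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, N, P * P * C)   # (1, 9, 48)

proj = nn.Linear(P * P * C, D)           # the (p^2*C) x d projection matrix
x = proj(patches)                        # (1, 9, 8) patch embeddings

cls = nn.Parameter(torch.zeros(1, 1, D))         # learnable CLS token
pos = nn.Parameter(torch.zeros(1, N + 1, D))     # learnable position embeddings

x = torch.cat([cls.expand(1, -1, -1), x], dim=1) + pos   # (1, 10, 8)
print(x.shape)
```

The CLS token and position embeddings are initialized to zeros here purely for the sketch; in a trained model they are learned parameters.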

Convolutional Patch Embedding

Alternatively, convolutional filters can generate patch embeddings. Using a non‑overlapping configuration (stride = 3, kernel_size = 3) on a 9×9 input produces a 3×3 feature map, i.e., nine patches. Selecting two filters yields a hidden dimension of 2 for each patch. The feature maps are flattened with nn.Flatten(2), reshaping the tensor from (1, 2, 3, 3) to (1, 2, 9), and then permuted to (1, 9, 2). Prepending the CLS token gives a final shape of (1, 10, 2), representing ten embedding vectors (nine patches plus CLS).
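A minimal PyTorch sketch of this convolutional route, assuming the 9×9 RGB input implied by the shapes above:

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 9, 9)        # 9x9 RGB input

# Two filters, kernel 3, stride 3: non-overlapping 3x3 patches -> 3x3 feature map.
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, stride=3)
fmap = conv(img)                     # (1, 2, 3, 3)

flat = nn.Flatten(2)(fmap)           # (1, 2, 9): flatten the spatial dims
tokens = flat.permute(0, 2, 1)       # (1, 9, 2): one 2-dim embedding per patch

cls = torch.zeros(1, 1, 2)           # a learnable parameter in a real model
out = torch.cat([cls, tokens], dim=1)
print(out.shape)                     # (1, 10, 2)
```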

Patch embedding using convolution

Impact of Patch Size

Larger patches reduce the number of tokens, lowering the attention cost, which grows quadratically because self‑attention compares every token with every other token. However, larger patches embed coarser information, which may miss fine‑grained details such as small text. Smaller patches capture finer details but increase computational load.
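The trade-off is easy to quantify. For a 224×224 image (a hypothetical but standard input size), the token count and the size of the pairwise attention matrix at a few patch sizes:

```python
def tokens_and_attention(image_size: int, patch_size: int) -> tuple:
    """Token count and attention-matrix entries (per head) for square patches."""
    n = (image_size // patch_size) ** 2
    return n, n * n

for p in (32, 16, 8):
    n, cost = tokens_and_attention(224, p)
    print(f"patch {p:>2}: {n:>4} tokens, {cost:>9,} attention entries")
```

Halving the patch size quadruples the token count and multiplies the attention cost by sixteen.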

Pixel‑Level Tokens

Pixel Transformers (PiT) treat each pixel as an individual token, removing ViT’s inherent locality bias. This can improve classification performance but dramatically increases sequence length: a 32×32 image yields 1,024 tokens, while a 224×224 image yields 50,176 tokens, leading to much higher memory and compute requirements.
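The sequence lengths quoted above follow directly from treating every pixel as a token, and the quadratic attention cost makes the gap even starker:

```python
def pixel_seq_len(h: int, w: int) -> int:
    """Sequence length when every pixel becomes its own token."""
    return h * w

for h, w in ((32, 32), (224, 224)):
    n = pixel_seq_len(h, w)
    print(f"{h}x{w}: {n:,} tokens, {n * n:,} attention entries")
```

At 224×224, the attention matrix alone has over 2.5 billion entries per head, which is why pixel-level tokenization is typically demonstrated only on small images.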

Inductive Bias Comparison

CNNs possess strong inductive bias: locality (each layer focuses on small local regions), 2‑D neighborhood structure, and translation equivariance (shifting the input shifts the output similarly). ViT has far weaker image‑specific bias; its MLP layers retain some locality, but the self‑attention layer operates globally across the entire image. Consequently, ViT is more flexible but typically requires more training data to learn spatial relationships from scratch.
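The translation equivariance attributed to CNNs above can be verified directly: shifting the input of a padding-free, stride-1 convolution shifts its output by the same amount (away from the borders). A small sketch with an arbitrary random kernel:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, bias=False)  # no padding, stride 1

x = torch.randn(1, 1, 8, 8)
shifted = torch.roll(x, shifts=2, dims=3)  # shift the input 2 pixels right

y = conv(x)                                # (1, 1, 6, 6)
y_shifted = conv(shifted)

# Interior columns of the shifted output equal the unshifted output, shifted by 2.
print(torch.allclose(y[..., :, :4], y_shifted[..., :, 2:]))  # True
```

Self-attention has no such built-in guarantee: without position embeddings it is permutation-invariant, and with them, spatial relationships must be learned from data.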

Conclusion

Patch embeddings give ViT limited locality. Larger patches improve efficiency, while smaller patches preserve fine details. Non‑overlapping convolutional embeddings are simple and efficient but can break continuity across patch boundaries; overlapping patches add computational cost but can boost performance. ViT’s weaker inductive bias makes it adaptable to various modalities but also more data‑hungry than CNNs.

Tags: computer vision, vision transformer, convolution, patch embedding, inductive bias
Written by

AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
