A Beginner’s Journey into Vision Transformers (ViT) for Computer Vision Engineers
This article introduces the fundamentals of Vision Transformers (ViT) for computer‑vision developers, starting with an overview of the transformer architecture, detailed explanation of self‑attention and multi‑head attention, and step‑by‑step PyTorch code examples that illustrate query, key, value computation and attention scoring.
Introduction
In recent years Vision Transformers (ViT) have dominated many benchmarks in image classification, object detection, and semantic segmentation, much like ResNet did in 2015. This article is the first in a three‑part series that aims to guide computer‑vision programmers, who may be unfamiliar with NLP, through the fundamentals of transformers and their application to vision.
Part 1: Introduction to the transformer architecture in the NLP domain – the essential prerequisite for understanding ViT.
Part 2: Detailed explanation of ViT, i.e., how the transformer model is applied to visual data.
Part 3: Walk‑through of ViT code, demystifying the implementation and encouraging hands‑on experimentation.
Let’s begin the deep dive into transformers!
Overall Framework
Before presenting the transformer framework, it is useful to understand why the transformer structure is advantageous. In NLP, prior to transformers the dominant models were RNNs and LSTMs, both of which suffer from poor parallelism because each time step depends on the previous one. For example, translating the sentence "I Love China" to "我爱中国" with an RNN requires sequential generation of each token, making parallel computation impossible.
The transformer solves this limitation by allowing all positions to be processed simultaneously, which will be explained in detail later.
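The contrast can be sketched in a few lines of PyTorch. This is an illustrative toy example, not code from any particular model: the RNN-style loop must walk through time steps one by one, while an attention-style score matrix covers all token pairs in a single matrix multiplication.

```python
import torch

torch.manual_seed(0)
seq_len, d = 3, 4
x = torch.randn(seq_len, d)  # a toy sequence of 3 token embeddings

# RNN-style: each hidden state depends on the previous one,
# so the loop over time steps cannot be parallelized.
W = torch.randn(d, d)
h = torch.zeros(d)
for t in range(seq_len):
    h = torch.tanh(x[t] @ W + h)  # h_t depends on h_{t-1}

# Attention-style: all pairwise token interactions are computed
# at once in a single (seq_len x seq_len) matrix multiplication.
scores = x @ x.T
```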
The overall transformer architecture is shown below (image extracted from the original paper, "Attention Is All You Need"):
The diagram may look intimidating at first glance, but the subsequent sections will break it down step by step.
Self‑Attention
The self‑attention module is the core of the transformer. The following steps describe its execution, accompanied by code snippets and visual illustrations.
Step 1: Prepare Input
import torch
x = [
[1, 0, 1, 0], # Input 1
[0, 2, 0, 2], # Input 2
[1, 1, 1, 1] # Input 3
]
x = torch.tensor(x, dtype=torch.float32)

The resulting tensor is:
tensor([[1., 0., 1., 0.],
[0., 2., 0., 2.],
        [1., 1., 1., 1.]])

Step 2: Initialize Weight Matrices
w_query = [
[1, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 1, 1]
]
w_key = [
[0, 0, 1],
[1, 1, 0],
[0, 1, 0],
[1, 1, 0]
]
w_value = [
[0, 2, 0],
[0, 3, 0],
[1, 0, 3],
[1, 1, 0]
]
w_query = torch.tensor(w_query, dtype=torch.float32)
w_key = torch.tensor(w_key, dtype=torch.float32)
w_value = torch.tensor(w_value, dtype=torch.float32)

Step 3: Generate Q, K, V
querys = x @ w_query
keys = x @ w_key
values = x @ w_value

Resulting tensors:
# Q
tensor([[1., 0., 2.],
[2., 2., 2.],
[2., 1., 3.]])
# K
tensor([[0., 1., 1.],
[4., 4., 0.],
[2., 3., 1.]])
# V
tensor([[1., 2., 3.],
[2., 8., 0.],
        [2., 6., 3.]])

Step 4: Compute Attention Scores
attn_scores = querys @ keys.T

Attention scores:
tensor([[ 2., 4., 4.],
[ 4., 16., 12.],
        [ 4., 12., 10.]])

Step 5: Apply Softmax
from torch.nn.functional import softmax
attn_scores_softmax = softmax(attn_scores, dim=-1)

Softmax output (rounded for readability):
[[0.0, 0.5, 0.5],
[0.0, 1.0, 0.0],
 [0.0, 0.9, 0.1]]

Step 6: Multiply by V to Obtain Final Output
outputs = attn_scores_softmax @ values

Final output tensor:
tensor([[2.0, 7.0, 1.5],
[2.0, 8.0, 0.0],
        [2.0, 7.8, 0.3]])

Note: The implementation shown here follows a simplified dot‑product attention without the scaling factor used in the original "Scaled Dot‑Product Attention".
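For completeness, the scaled variant only changes Step 4: the raw scores are divided by the square root of the key dimension before the softmax, which keeps the logits from growing with dimensionality. Below is a sketch of that variant, re-creating the Q, K, V tensors computed in Steps 1–3:

```python
import math
import torch
from torch.nn.functional import softmax

# Q, K, V as computed in Steps 1-3 above
querys = torch.tensor([[1., 0., 2.], [2., 2., 2.], [2., 1., 3.]])
keys   = torch.tensor([[0., 1., 1.], [4., 4., 0.], [2., 3., 1.]])
values = torch.tensor([[1., 2., 3.], [2., 8., 0.], [2., 6., 3.]])

# Scaled dot-product attention: divide the scores by sqrt(d_k),
# here sqrt(3), before applying softmax.
d_k = keys.shape[-1]
attn_scores = querys @ keys.T / math.sqrt(d_k)
attn_weights = softmax(attn_scores, dim=-1)
outputs = attn_weights @ values
```

With the scaling applied, the softmax distribution is slightly softer than in the unscaled walkthrough, so the output values differ a little from those shown above.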
Summary
We have walked through the complete self‑attention pipeline: from raw input tensors, through linear projections to queries, keys, and values, to attention‑score computation, softmax normalization, and the final weighted sum. Understanding this process is essential before moving on to multi‑head attention, encoder, and decoder modules in later articles.
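As a recap, the six steps above can be collected into a single function. This is an illustrative sketch (the function name is ours, not from any library), reproducing the unscaled attention of the walkthrough end to end:

```python
import torch
from torch.nn.functional import softmax

def self_attention(x, w_query, w_key, w_value):
    """Plain (unscaled) self-attention, mirroring Steps 1-6 above."""
    q = x @ w_query                     # Step 3: queries
    k = x @ w_key                       #         keys
    v = x @ w_value                     #         values
    scores = q @ k.T                    # Step 4: attention scores
    weights = softmax(scores, dim=-1)   # Step 5: normalize per row
    return weights @ v                  # Step 6: weighted sum of values

# The same inputs and weights as in Steps 1-2
x = torch.tensor([[1., 0., 1., 0.], [0., 2., 0., 2.], [1., 1., 1., 1.]])
w_query = torch.tensor([[1., 0., 1.], [1., 0., 0.], [0., 0., 1.], [0., 1., 1.]])
w_key   = torch.tensor([[0., 0., 1.], [1., 1., 0.], [0., 1., 0.], [1., 1., 0.]])
w_value = torch.tensor([[0., 2., 0.], [0., 3., 0.], [1., 0., 3.], [1., 1., 0.]])

out = self_attention(x, w_query, w_key, w_value)
```

Running this reproduces (up to rounding) the final output tensor shown in Step 6.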