
A Beginner’s Journey into Vision Transformers (ViT) for Computer Vision Engineers

This article introduces the fundamentals of Vision Transformers (ViT) for computer-vision developers, starting with an overview of the transformer architecture, a detailed explanation of self-attention and multi-head attention, and step-by-step PyTorch code examples that illustrate query, key, and value computation and attention scoring.

Rare Earth Juejin Tech Community
Introduction

In recent years, Vision Transformers (ViT) have come to dominate many benchmarks in image classification, object detection, and semantic segmentation, much as ResNet did after 2015. This article is the first in a three-part series that guides computer-vision programmers, who may be unfamiliar with NLP, through the fundamentals of transformers and their application to vision.

Part 1: Introduction to the transformer architecture in the NLP domain – the essential prerequisite for understanding ViT.

Part 2: Detailed explanation of ViT, i.e., how the transformer model is applied to visual data.

Part 3: Walk‑through of ViT code, demystifying the implementation and encouraging hands‑on experimentation.

Let’s begin the deep dive into transformers!

Overall Framework

Before presenting the transformer framework, it helps to understand why the architecture is advantageous. In NLP, the dominant models before transformers were RNNs and LSTMs, both of which parallelize poorly because each time step depends on the previous one. For example, translating "I love China" to "我爱中国" with an RNN requires generating each output token sequentially: the model emits "我", conditions on it to emit "爱", and so on, so the time steps cannot be computed in parallel.

The transformer solves this limitation by allowing all positions to be processed simultaneously, which will be explained in detail later.
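To make the contrast concrete, here is a minimal sketch (not from the article; all names and shapes are illustrative) comparing a sequential RNN-style recurrence with a single parallel attention matmul:

```python
import torch

seq_len, d = 5, 8
x = torch.randn(seq_len, d)  # 5 token embeddings of dimension 8

# RNN-style: hidden state at step t depends on step t-1,
# so the loop cannot be parallelized across time steps.
W = torch.randn(d, d)
h = torch.zeros(d)
hidden_states = []
for t in range(seq_len):            # strictly sequential
    h = torch.tanh(x[t] + h @ W)
    hidden_states.append(h)
rnn_out = torch.stack(hidden_states)

# Attention-style: every position attends to every other position
# in one batched matrix multiplication -- no step-to-step dependence.
attn_out = torch.softmax(x @ x.T, dim=-1) @ x

print(rnn_out.shape, attn_out.shape)  # both torch.Size([5, 8])
```

Both produce one output vector per position, but the attention version is a handful of matrix multiplications that a GPU can execute at once.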

The overall transformer architecture is shown below (figure from the original paper, "Attention Is All You Need", Vaswani et al., 2017):

[Figure: the encoder-decoder transformer architecture, with stacked multi-head attention and feed-forward layers]

The diagram may look intimidating at first glance, but the subsequent sections will break it down step by step.

Self‑Attention

The self‑attention module is the core of the transformer. The following steps describe its execution, accompanied by code snippets and visual illustrations.

Step 1: Prepare Input

import torch
x = [
  [1, 0, 1, 0],  # Input 1
  [0, 2, 0, 2],  # Input 2
  [1, 1, 1, 1]   # Input 3
]
x = torch.tensor(x, dtype=torch.float32)

The resulting tensor is:

tensor([[1., 0., 1., 0.],
        [0., 2., 0., 2.],
        [1., 1., 1., 1.]])

Step 2: Initialize Weight Matrices

In a real network these projection matrices are learned parameters (e.g., the weights of nn.Linear layers); small fixed integers are used here so the arithmetic is easy to follow by hand.

w_query = [
  [1, 0, 1],
  [1, 0, 0],
  [0, 0, 1],
  [0, 1, 1]
]

w_key = [
  [0, 0, 1],
  [1, 1, 0],
  [0, 1, 0],
  [1, 1, 0]
]

w_value = [
  [0, 2, 0],
  [0, 3, 0],
  [1, 0, 3],
  [1, 1, 0]
]

w_query = torch.tensor(w_query, dtype=torch.float32)
w_key   = torch.tensor(w_key,   dtype=torch.float32)
w_value = torch.tensor(w_value, dtype=torch.float32)

Step 3: Generate Q, K, V

querys = x @ w_query
keys   = x @ w_key
values = x @ w_value

Resulting tensors:

# Q
tensor([[1., 0., 2.],
        [2., 2., 2.],
        [2., 1., 3.]])
# K
tensor([[0., 1., 1.],
        [4., 4., 0.],
        [2., 3., 1.]])
# V
tensor([[1., 2., 3.],
        [2., 8., 0.],
        [2., 6., 3.]])

Step 4: Compute Attention Scores

attn_scores = querys @ keys.T

Attention scores:

tensor([[ 2.,  4.,  4.],
        [ 4., 16., 12.],
        [ 4., 12., 10.]])

Step 5: Apply Softmax

from torch.nn.functional import softmax
attn_scores_softmax = softmax(attn_scores, dim=-1)

Softmax output (rounded for readability; the exact first row is approximately [0.06, 0.47, 0.47]):

[[0.0, 0.5, 0.5],
 [0.0, 1.0, 0.0],
 [0.0, 0.9, 0.1]]

Step 6: Multiply by V to Obtain Final Output

outputs = attn_scores_softmax @ values

Final output tensor (computed with the rounded weights above; with exact softmax weights the values differ slightly, e.g. the first row is approximately [1.94, 6.68, 1.60]):

tensor([[2.0, 7.0, 1.5],
        [2.0, 8.0, 0.0],
        [2.0, 7.8, 0.3]])

Note: the implementation shown here is simplified dot-product attention; it omits the 1/√d_k scaling factor that the original "Scaled Dot-Product Attention" applies to the scores before the softmax.
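For completeness, the scaled variant divides the scores by √d_k (the key dimension) before the softmax; a minimal sketch reusing the Q, K, V values computed above:

```python
import math
import torch
from torch.nn.functional import softmax

querys = torch.tensor([[1., 0., 2.], [2., 2., 2.], [2., 1., 3.]])
keys   = torch.tensor([[0., 1., 1.], [4., 4., 0.], [2., 3., 1.]])
values = torch.tensor([[1., 2., 3.], [2., 8., 0.], [2., 6., 3.]])

d_k = keys.shape[-1]                              # key dimension, here 3
scaled_scores  = querys @ keys.T / math.sqrt(d_k)  # divide before softmax
scaled_weights = softmax(scaled_scores, dim=-1)    # each row sums to 1
scaled_outputs = scaled_weights @ values
```

The scaling keeps the dot products from growing with d_k, which would otherwise push the softmax into regions with vanishingly small gradients; with the scaling the attention distribution here is noticeably less peaked than in Step 5.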

Summary

We have walked through the complete self‑attention pipeline: from raw input tensors, through linear projections to queries, keys, and values, to attention‑score computation, softmax normalization, and the final weighted sum. Understanding this process is essential before moving on to multi‑head attention, encoder, and decoder modules in later articles.
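The six steps above can be collected into one small function; this is a sketch of the same unscaled dot-product attention, using the article's input and weight matrices:

```python
import torch
from torch.nn.functional import softmax

def self_attention(x, w_query, w_key, w_value):
    """Unscaled dot-product self-attention, mirroring Steps 1-6 above."""
    q = x @ w_query                    # Step 3: project inputs to Q
    k = x @ w_key                      #         ... to K
    v = x @ w_value                    #         ... to V
    scores = q @ k.T                   # Step 4: pairwise dot products
    weights = softmax(scores, dim=-1)  # Step 5: normalize each row
    return weights @ v                 # Step 6: weighted sum of values

x = torch.tensor([[1., 0., 1., 0.],
                  [0., 2., 0., 2.],
                  [1., 1., 1., 1.]])
w_query = torch.tensor([[1., 0., 1.], [1., 0., 0.], [0., 0., 1.], [0., 1., 1.]])
w_key   = torch.tensor([[0., 0., 1.], [1., 1., 0.], [0., 1., 0.], [1., 1., 0.]])
w_value = torch.tensor([[0., 2., 0.], [0., 3., 0.], [1., 0., 3.], [1., 1., 0.]])

out = self_attention(x, w_query, w_key, w_value)
print(out)  # ≈ [[1.94, 6.68, 1.60], [2.00, 7.96, 0.05], [2.00, 7.76, 0.36]]
```

These values use the exact (unrounded) softmax weights, so they differ slightly from the hand-rounded tensor in Step 6.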
