Understanding Compact Transformers: Build and Train Vision & NLP Models on a Personal PC

This article walks through the design of Compact Transformers, explaining scaled dot‑product self‑attention, positional embeddings, multi‑head attention, and Vision Transformer architecture, and provides full PyTorch code so readers can train lightweight CV and NLP classifiers on a single PC.


The piece introduces Compact Transformers, a lightweight variant of Vision Transformers that can be trained on a personal computer to obtain computer‑vision (CV) and natural‑language‑processing (NLP) classification results.

It begins with a concise review of the attention mechanism, focusing on scaled dot-product self-attention. The query (q) and key (k) vectors are compared via a dot product, divided by \(\sqrt{d}\) to keep the softmax inputs in a stable range, and passed through a softmax that yields attention weights between 0 and 1; those weights are then used to form a weighted sum of the value (v) vectors.
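
In matrix form this is the scaled dot-product attention of the original Transformer paper:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V
\]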

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        self.scale = embed_dim ** -0.5
        # One linear projection produces queries, keys, and values in a single pass.
        self.qkv = nn.Linear(embed_dim, embed_dim * 3, bias=False)
    def forward(self, x):
        B, N, _ = x.shape
        # (B, N, 3*C) -> (3, B, N, C) so q, k, v can be unpacked along the first dimension
        qkv = self.qkv(x).reshape(B, N, 3, self.embed_dim).permute(2, 0, 1, 3)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Scaled dot-product attention over the token dimension
        attn = torch.bmm(q, k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        x = torch.bmm(attn, v)  # (B, N, C); no head reshaping is needed for single-head attention
        return x

The article then explains sinusoidal positional embeddings and provides their implementation:
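
For position \(pos\) and dimension index \(2i\), the standard formulation is \(PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d}\big)\) and \(PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d}\big)\), which is exactly what the code below computes, with the divisor evaluated in log-space for numerical stability.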

import math

class PositionalEmbedding(nn.Module):
    def __init__(self, embedding_dim, max_len=5000, freq=10000.):
        super().__init__()
        pe = torch.zeros(max_len, embedding_dim)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # 1 / freq^(2i/d), computed in log-space for numerical stability
        div = torch.exp(torch.arange(0, embedding_dim, 2).float() * (-math.log(freq) / embedding_dim))
        pe[:, 0::2] = torch.sin(position * div)
        pe[:, 1::2] = torch.cos(position * div)
        pe = pe.unsqueeze(0)  # (1, max_len, embedding_dim), broadcast over the batch
        self.register_buffer('pe', pe)
    def forward(self, x):
        # x: (batch, seq_len, embedding_dim), matching the batch-first attention blocks
        x = x + self.pe[:, :x.size(1), :]
        return x

Multi‑head attention is built by splitting the embedding dimension across heads, adding dropout and a projection layer:

class MultiheadedSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads=12, attn_dropout=0., proj_dropout=0.):
        super().__init__()
        assert embed_dim % num_heads == 0, "Embedding dim must be divisible by number of heads."
        head_dim = embed_dim // num_heads
        self.scale = head_dim ** -0.5
        self.num_heads = num_heads
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.attn_dropout = nn.Dropout(attn_dropout)
        self.projection = nn.Linear(embed_dim, embed_dim)
        self.proj_dropout = nn.Dropout(proj_dropout)
    def forward(self, x):
        B, N, C = x.shape
        # (B, N, 3*C) -> (3, B, num_heads, N, head_dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # matmul broadcasts over batch and head dimensions (bmm only accepts 3-D tensors)
        attn = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_dropout(attn)
        # (B, num_heads, N, head_dim) -> (B, N, C): merge the heads back together
        x = torch.matmul(attn, v).transpose(1, 2).reshape(B, N, C)
        x = self.projection(x)
        x = self.proj_dropout(x)
        return x
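
A quick sanity check of the shapes (the tensor sizes here are illustrative, not from the source):

mha = MultiheadedSelfAttention(embed_dim=768, num_heads=12)
tokens = torch.randn(2, 197, 768)   # batch of 2, 196 patches + 1 class token
print(mha(tokens).shape)            # torch.Size([2, 197, 768]) -- attention preserves the shape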

A Transformer encoder layer combines the multi‑head attention, a feed‑forward network, layer‑norms, and residual connections:

class TransformerEncoderLayer(nn.Module):
    def __init__(self, embed_dim=768, num_heads=12, attn_dropout=0., proj_dropout=0., mlp_dropout=0.1, feedforward_dim=3072):
        super().__init__()
        self.norm_1 = nn.LayerNorm(embed_dim)
        self.norm_2 = nn.LayerNorm(embed_dim)
        self.MHA = MultiheadedSelfAttention(embed_dim, num_heads, attn_dropout, proj_dropout)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, feedforward_dim),
            nn.GELU(),
            nn.Dropout(mlp_dropout),
            nn.Linear(feedforward_dim, embed_dim),
            nn.Dropout(mlp_dropout)
        )
    def forward(self, x):
        # Pre-norm residual blocks: normalize, transform, then add back to the residual stream
        x = x + self.MHA(self.norm_1(x))
        x = x + self.ff(self.norm_2(x))
        return x

Using these blocks, the Vision Transformer (ViT) is assembled. Images are split into 16×16 patches (or 4×4 for CIFAR‑10), flattened, and linearly projected. A learnable class token is prepended, learnable positional embeddings are added, and a stack of encoder layers processes the sequence. The final class token is layer-normalized and passed through a linear head for classification.
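
The Patching module used below is not shown in the source. A minimal sketch that matches its usage here projects non-overlapping patches with a strided convolution and flattens them into a token sequence (the class name and constructor arguments come from how ViT calls it; the implementation itself is an assumption):

class Patching(nn.Module):
    # Hypothetical reconstruction: split the image into patch_size x patch_size patches
    # and linearly project each one, implemented as a Conv2d with stride == kernel_size.
    def __init__(self, in_channels, patch_size, embed_dim):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
    def forward(self, x):
        x = self.proj(x)                     # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)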

class ViT(nn.Module):
    def __init__(self, img_size=224, in_channels=3, patch_size=16, embed_dim=768, num_layers=12, num_heads=12, attn_dropout=0., proj_dropout=0., mlp_dropout=0.1, mlp_ratio=4, n_classes=1000):
        super().__init__()
        assert img_size % patch_size == 0, "Image size must be divisible by patch size."
        self.patchAndEmbed = Patching(in_channels, patch_size, embed_dim)
        seq_len = (img_size // patch_size) ** 2
        self.class_embed = nn.Parameter(torch.zeros(1, 1, embed_dim), requires_grad=True)
        self.pe = nn.Parameter(torch.zeros(1, seq_len + 1, embed_dim), requires_grad=True)
        hidden_dim = int(embed_dim * mlp_ratio)
        self.transformerEncoder = nn.Sequential(*[TransformerEncoderLayer(embed_dim, num_heads, attn_dropout, proj_dropout, mlp_dropout, hidden_dim) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(embed_dim)
        self.mlp = nn.Linear(embed_dim, n_classes)
    def forward(self, x):
        x = self.patchAndEmbed(x)                                   # (B, num_patches, embed_dim)
        class_token = self.class_embed.expand(x.shape[0], -1, -1)   # one learnable class token per sample
        x = torch.cat((class_token, x), dim=1)
        x = x + self.pe                                             # learnable positional embeddings
        x = self.transformerEncoder(x)
        x = x[:, 0]                                                 # keep only the class token
        x = self.mlp(self.norm(x))
        return x
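
For the CIFAR-10 setting mentioned above (4×4 patches on 32×32 images), a smaller configuration can be instantiated directly; the hyperparameters below are illustrative rather than the author's exact choices:

vit_cifar = ViT(img_size=32, patch_size=4, embed_dim=256, num_layers=6, num_heads=4, n_classes=10)
logits = vit_cifar(torch.randn(8, 3, 32, 32))   # a batch of 8 CIFAR-sized images
print(logits.shape)                             # torch.Size([8, 10])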

The article then presents Compact Vision Transformers (CVT) from the paper "Escaping the Big Data Paradigm with Compact Transformers". CVT replaces the class token with sequence pooling (SeqPool): a linear layer assigns each encoder output token a score, a softmax over the sequence turns the scores into weights, and the weighted sum of the tokens becomes the representation passed to the classifier, letting the model focus on the most informative tokens.
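
In symbols, for encoder output \(x \in \mathbb{R}^{N \times d}\) and pooling weights \(W \in \mathbb{R}^{d \times 1}\), the pooled representation is \(z = \mathrm{softmax}(xW)^{\top} x \in \mathbb{R}^{1 \times d}\), which is exactly the matmul in the snippet below.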

import torch.nn.functional as F

class TransformerClassifier(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()
        ...
        # SeqPool: a single learned score per token
        self.attention_pool = nn.Linear(self.embedding_dim, 1)
        ...
    def forward(self, x):
        # (B, N, 1) scores -> softmax over the sequence -> weighted sum of tokens -> (B, embedding_dim)
        x = torch.matmul(F.softmax(self.attention_pool(x), dim=1).transpose(-1, -2), x).squeeze(-2)
        ...

To mitigate the lack of CNN-style inductive bias, the author introduces a Tokenizer module built from overlapping convolutions and max-pooling, injecting the locality bias that helps the transformer learn better embeddings for visual data.

class Tokenizer(nn.Module):
    def __init__(self, kernel_size, stride, padding, pooling_kernel_size=3, pooling_stride=2,
                 pooling_padding=1, n_conv_layers=1, n_input_channels=3, n_output_channels=64, in_planes=64):
        super().__init__()
        n_filter_list = [n_input_channels] + [in_planes for _ in range(n_conv_layers - 1)] + [n_output_channels]
        # Overlapping conv + ReLU + max-pool blocks; the final conv maps to the embedding width.
        self.conv_layers = nn.Sequential(*[nn.Sequential(
            nn.Conv2d(n_filter_list[i], n_filter_list[i + 1], kernel_size=kernel_size, stride=stride, padding=padding, bias=False),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=pooling_kernel_size, stride=pooling_stride, padding=pooling_padding)
        ) for i in range(n_conv_layers)])
        self.flattener = nn.Flatten(2, 3)
    def forward(self, x):
        # (B, C, H, W) -> (B, num_tokens, C): flatten the spatial grid into a token sequence
        return self.flattener(self.conv_layers(x)).transpose(-2, -1)
"Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data."

The article notes that large‑scale pre‑training (e.g., on 14M‑300M images) is what makes ViT competitive, while training from scratch on smaller datasets tends to over‑fit. It also points out the size gap: ViT‑Base has roughly 86 M parameters, compared with about 23 M for ResNet‑50.
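
A quick way to verify such parameter counts for the models defined here (this small helper is not part of the source):

def count_params(model: nn.Module) -> int:
    # Total number of trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"ViT-Base: {count_params(ViT()) / 1e6:.1f} M parameters")   # roughly 86 M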

Finally, the author mentions that the same architecture can be applied to NLP tasks, using word embeddings such as GloVe, and provides a simple WordEmbedder class.
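
The WordEmbedder class itself is not reproduced in this summary; a minimal sketch, assuming a pre-loaded GloVe weight matrix of shape (vocab_size, embedding_dim), could simply wrap nn.Embedding.from_pretrained:

class WordEmbedder(nn.Module):
    # Hypothetical reconstruction: maps token indices to fixed pre-trained GloVe vectors.
    def __init__(self, glove_weights, freeze=True):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(glove_weights, freeze=freeze)
    def forward(self, token_ids):
        # (B, seq_len) integer ids -> (B, seq_len, embedding_dim)
        return self.embedding(token_ids)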
