What Interviewers Expect: Understanding Transformers Beyond Codex and AI Code Generation
The article explains why modern interviewers ask about Transformer fundamentals, breaks down its core components such as self‑attention, multi‑head attention, feed‑forward networks, residual connections and positional encodings, and demonstrates a complete PyTorch toy model that predicts the sum‑mod‑10 of integer sequences while visualizing loss curves, attention heatmaps, embedding PCA and early‑stage gradient norms.
Transformer intuition
Each token in a sequence can attend to every other token. Self‑Attention computes a relevance score for every pair, normalises the scores with a softmax, and uses the resulting weights to aggregate information. Multiple attention heads run in parallel (Multi‑Head Attention) to capture different patterns. After attention, a position‑wise feed‑forward network (FFN) adds non‑linearity, and a residual connection followed by LayerNorm stabilises training. Positional encodings inject order information because attention itself is order‑agnostic. Masks hide future tokens during generation or ignore padding.
Key modules
Self‑Attention (scaled dot‑product): Queries, Keys and Values are linear projections of the input. The dot product QKᵀ is divided by √dₖ so that the scores do not grow with the key dimension and saturate the softmax; the softmax weights are then multiplied by V.
Multi‑Head Attention: The projection matrices are split into several heads, each learning a different view (e.g., short‑range vs. long‑range). The head outputs are concatenated and linearly projected back to the model dimension.
Feed‑Forward Network: Two linear layers with a non‑linear activation (GELU or ReLU) applied independently to each position.
Residual + LayerNorm: Adds the sub‑layer input to its output and normalises the sum, preventing degradation as depth grows.
Positional Encoding: Sinusoidal or learnable embeddings added to token embeddings to convey position.
Masking: Prevents attention to future positions (causal mask) or padding tokens.
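As a rough illustration of the self-attention and masking entries above, the following is a minimal single-head sketch. The function name scaled_dot_product_attention, the causal flag and the toy tensor shapes are assumptions made for this example; the toy model later in the article uses nn.MultiheadAttention instead.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    """q, k, v: tensors of shape [batch, seq_len, d_k] (single head)."""
    d_k = q.size(-1)
    # Pairwise relevance scores, divided by sqrt(d_k) so their scale does not grow with d_k.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # [batch, L, L]
    if causal:
        L = scores.size(-1)
        # True above the diagonal marks "future" keys that a query must not see.
        future = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=scores.device), diagonal=1
        )
        scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                   # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(2, 5, 8)   # toy input: batch of 2, 5 tokens, d_k = 8
out, w = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape, w.shape)
```

With causal=True the first query can only attend to itself, so the first row of every weight matrix is 1 followed by zeros.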
Core formulas
1. Scaled Dot‑Product Attention: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
2. Multi‑Head Attention: concatenates the outputs of several scaled dot‑product attentions and projects the result back to the model dimension.
3. Positional Encoding: uses sine and cosine functions of varying frequencies.
4. Cross‑Entropy Loss: used for classification tasks.
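For reference, the four formulas can be written out in full. The symbols below follow common Transformer notation and are assumptions of this sketch rather than definitions from the article: h is the number of heads, W_i^Q, W_i^K, W_i^V and W^O are projection matrices, pos is the position index, i indexes pairs of embedding dimensions, z is the logit vector, y is the true class and C is the number of classes.

```latex
\begin{aligned}
\mathrm{Attention}(Q,K,V) &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(XW_i^{Q},\, XW_i^{K},\, XW_i^{V}\right) \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^{O} \\
PE_{(pos,\,2i)} &= \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right) \\
\mathcal{L}_{\mathrm{CE}} &= -\log \frac{e^{z_y}}{\sum_{c=1}^{C} e^{z_c}}
\end{aligned}
```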
Toy example: sum mod 10 classification
The task is to predict (sum of a fixed‑length integer sequence) mod 10 as one of ten classes. Because every token contributes to the sum, the model must aggregate information from all positions, which makes the task suitable for visualising attention.
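To make the label concrete, here is a small worked example. The four token ids are hypothetical and the sequence is shorter than the 16 tokens used in the full listing. It also shows that only each token's last digit affects the class.

```python
import numpy as np

num_classes = 10
seq = np.array([3, 14, 25, 97])        # hypothetical token ids from the range [0, 100)
label = int(seq.sum() % num_classes)   # 3 + 14 + 25 + 97 = 139, and 139 mod 10 = 9

# Only the last digit of each token matters: summing the residues gives the same class.
label_from_digits = int((seq % num_classes).sum() % num_classes)   # 3 + 4 + 5 + 7 = 19 -> 9

print(label, label_from_digits)
```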
import math, random, numpy as np, torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt, seaborn as sns
from sklearn.decomposition import PCA
import pandas as pd
# reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
# hyper‑parameters
vocab_size = 100
d_model = 64
num_heads = 4
num_layers = 2
ffn_dim = 128
seq_len = 16
num_classes = 10
batch_size = 64
epochs = 8
lr = 1e-3
device = torch.device('cpu')
class SumModDataset(Dataset):
    def __init__(self, size, seq_len, vocab_size, num_classes):
        xs = np.random.randint(0, vocab_size, size=(size, seq_len), dtype=np.int64)
        ys = xs.sum(axis=1) % num_classes
        self.x = torch.tensor(xs, dtype=torch.long)
        self.y = torch.tensor(ys, dtype=torch.long)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

train_ds = SumModDataset(4096, seq_len, vocab_size, num_classes)
val_ds = SumModDataset(512, seq_len, vocab_size, num_classes)
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        denom = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * denom)
        pe[:, 1::2] = torch.cos(pos * denom)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        L = x.size(1)
        return x + self.pe[:, :L, :]
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, ffn_dim, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ffn_dim, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None, need_weights=False):
        residual = x
        attn_out, attn_weights = self.mha(
            x, x, x,
            attn_mask=attn_mask,
            need_weights=need_weights,
            average_attn_weights=False,
        )
        x = self.norm1(residual + self.drop(attn_out))
        residual = x
        x = self.ffn(x)
        x = self.norm2(residual + self.drop(x))
        return x, attn_weights
class TinyTransformer(nn.Module):
    # Note: seq_len is accepted but not used; PositionalEncoding keeps its default max_len=512.
    def __init__(self, vocab_size, d_model, num_heads, num_layers, ffn_dim, num_classes, dropout=0.1, seq_len=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.pos = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, ffn_dim, dropout) for _ in range(num_layers)])
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x, need_attn=False):
        x = self.emb(x)
        x = self.pos(x)
        attn_collect = []
        for layer in self.layers:
            x, attn = layer(x, need_weights=need_attn)
            if need_attn:
                attn_collect.append(attn)  # per layer: [B, heads, L, L]
        x = x.transpose(1, 2)  # [B, d, L]
        x = self.pool(x).squeeze(-1)  # mean over positions -> [B, d]
        logits = self.head(x)
        return (logits, attn_collect) if need_attn else logits

model = TinyTransformer(vocab_size, d_model, num_heads, num_layers, ffn_dim, num_classes, dropout=0.1, seq_len=seq_len).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-2)
train_losses, val_losses = [], []
saved_attn = None
saved_tokens = None

def param_to_layer(name):
    # Map a parameter name to a coarse group; LayerNorm parameters fall into 'others'.
    if name.startswith('emb'): return 'emb'
    if name.startswith('layers.0.mha'): return 'layer0_attn'
    if name.startswith('layers.0.ffn'): return 'layer0_ffn'
    if name.startswith('layers.1.mha'): return 'layer1_attn'
    if name.startswith('layers.1.ffn'): return 'layer1_ffn'
    if name.startswith('head'): return 'head'
    return 'others'

grad_records = []
max_steps_for_grad = 120
global_step = 0

def record_grad(step):
    # Store one gradient norm per parameter tensor for the current step.
    for n, p in model.named_parameters():
        if p.grad is None:
            continue
        layer = param_to_layer(n)
        norm = p.grad.norm().item()
        grad_records.append({'layer': layer, 'step': step, 'norm': norm})
for epoch in range(1, epochs + 1):
    model.train()
    epoch_losses = []
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        # Record gradient norms only during the first max_steps_for_grad steps.
        if global_step < max_steps_for_grad:
            record_grad(global_step)
        optimizer.step()
        epoch_losses.append(loss.item())
        global_step += 1
        # Snapshot attention weights once, right after the first optimisation step.
        if saved_attn is None:
            model.eval()
            with torch.no_grad():
                logits2, attn_list = model(xb[:1], need_attn=True)
            saved_attn = [a.cpu().numpy() for a in attn_list]
            saved_tokens = xb[:1].cpu().numpy()
            model.train()
    train_losses.append(np.mean(epoch_losses))

    model.eval()
    with torch.no_grad():
        val_epoch_losses = []
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            logits = model(xb)
            loss = criterion(logits, yb)
            val_epoch_losses.append(loss.item())
    val_losses.append(np.mean(val_epoch_losses))
    print(f"Epoch {epoch:02d} | train_loss={train_losses[-1]:.4f}, val_loss={val_losses[-1]:.4f}")
# Visualisation: loss curves
plt.figure(figsize=(8,5))
plt.plot(train_losses, color='magenta', lw=2.5, marker='o', label='Train Loss')
plt.plot(val_losses, color='cyan', lw=2.5, marker='s', label='Val Loss')
plt.title('Loss Curves (Train vs Val)')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Visualisation: attention heatmaps (first layer, four heads)
if saved_attn is not None:
    attn_l0 = saved_attn[0][0]  # [heads, L, L]
    cmaps = ['plasma','magma','viridis','inferno']
    fig, axes = plt.subplots(1, num_heads, figsize=(4*num_heads,4))
    fig.suptitle('Attention Heatmaps (Layer 0, 4 Heads)', fontsize=16)
    for h in range(num_heads):
        ax = axes[h]
        im = ax.imshow(attn_l0[h], cmap=cmaps[h], vmin=0, vmax=1)
        ax.set_title(f'Head {h}', fontsize=12)
        ax.set_xlabel('Key Positions')
        ax.set_ylabel('Query Positions')
        plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
    plt.tight_layout()
    plt.show()
# Visualisation: token embedding PCA
emb_matrix = model.emb.weight.detach().cpu().numpy()
emb_2d = PCA(n_components=2, random_state=42).fit_transform(emb_matrix)
colors = plt.cm.hsv(np.linspace(0,1,vocab_size))
plt.figure(figsize=(6,6))
plt.scatter(emb_2d[:,0], emb_2d[:,1], c=colors, s=30, alpha=0.9, edgecolors='k', linewidths=0.3)
plt.title('Token Embeddings (PCA to 2D) - HSV Colors')
plt.xlabel('PC1')
plt.ylabel('PC2')
for tid in range(0, vocab_size, 10):
    plt.text(emb_2d[tid,0]+0.02, emb_2d[tid,1]+0.02, str(tid), fontsize=8, color='black')
plt.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()
# Visualisation: gradient norm distribution (early steps)
df_grad = pd.DataFrame(grad_records)
layer_order = ['emb','layer0_attn','layer0_ffn','layer1_attn','layer1_ffn','head']
df_grad['layer'] = pd.Categorical(df_grad['layer'], categories=layer_order, ordered=True)
plt.figure(figsize=(10,4))
sns.violinplot(data=df_grad, x='layer', y='norm', inner='quartile', palette='Set2', cut=0, scale='width')
plt.title('Gradient Norm Distribution by Layer (Early Steps)')
plt.xlabel('Layer')
plt.ylabel('Grad Norm')
plt.tight_layout()
plt.show()
<img src="https://mmbiz.qpic.cn/sz_mmbiz_png/gTp499JicyvBLoykCRnmvRuE1CNvHgP8zkkokwN2gn5qcfTVtcbBAxdEbBw86k3oRmulxxX6C5n9a0GQvZnnrg6gGwSe1moeicba4tqJhTXEs/640?wx_fmt=png"/>
<p>Loss curves show steady reduction of both training and validation loss, indicating that the model learns to aggregate global information without over‑fitting.</p>
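A useful reference point when reading the loss curves is the loss of a model that guesses uniformly over the ten classes. The sketch below computes it with the same criterion as the listing; the all-zero logits and the four labels are stand-ins chosen for illustration, not values from the training run.

```python
import math
import torch
import torch.nn as nn

num_classes = 10
criterion = nn.CrossEntropyLoss()

# All-zero logits give a uniform softmax, i.e. a model that has learned nothing yet.
logits = torch.zeros(4, num_classes)
targets = torch.tensor([0, 3, 7, 9])   # arbitrary labels; the loss is the same for any labels
chance_loss = criterion(logits, targets).item()

print(f"chance-level loss: {chance_loss:.4f}, ln(10): {math.log(num_classes):.4f}")
```

Both numbers come out at about 2.303, so a curve that stays near that value suggests the model has not yet moved beyond guessing.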
<img src="https://mmbiz.qpic.cn/sz_mmbiz_png/gTp499JicyvAFibEOEliclrH6ic2NN37xqhUSZF4ABxpELM52JFicwgxgwtRdiaJFycgdNWVJttGu6QSlRMpfSZJe0r6S8p0FiapqMzxq3TF0mjUz4/640?wx_fmt=png"/>
<p>Attention heatmaps visualise the query‑to‑key weight distribution for each head. Different heads focus on different patterns (some uniform, some local), illustrating the multi‑view nature of Multi‑Head Attention.</p>
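One way to put a number on "uniform" versus "focused" heads is the average entropy of each head's attention rows. The helper attention_entropy and the two synthetic heads below are assumptions made for this sketch; in the listing above, the real input would be saved_attn[0][0] with shape [heads, L, L].

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """attn: array [heads, L, L] whose rows sum to 1. Returns the mean row entropy per head."""
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)   # [heads, L]
    return row_entropy.mean(axis=-1)                           # [heads]

L = 16
uniform_head = np.full((L, L), 1.0 / L)   # every query spreads its weight evenly
local_head = np.eye(L)                    # every query attends only to its own position
stand_in = np.stack([uniform_head, local_head])   # synthetic stand-in for saved_attn[0][0]

ent = attention_entropy(stand_in)
print(ent, "upper bound:", np.log(L))
```

A perfectly uniform head reaches the upper bound ln(L), while a head that puts all weight on one key per query has entropy close to zero.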
<img src="https://mmbiz.qpic.cn/sz_mmbiz_png/gTp499JicyvDMotTwMmicwr6FJNfzpTvxqvHiaxMxHZgQVSHbeJNYobqeEsrxNaGm9nYUwBdNoXI0Znsvic8LztTAGEqicXQAl8w5NjOumZvsPHk/640?wx_fmt=png"/>
<p>PCA of token embeddings shows how the model clusters tokens that are useful for distinguishing the ten classes. Even though inputs are random integers, training pulls together tokens with a similar contribution to the modulo‑10 label, that is, tokens that share the same last digit.</p>
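The clustering can also be checked numerically rather than by eye. The sketch below compares the mean distance between tokens sharing a last digit with the mean distance between all tokens. The function digit_separation and both synthetic matrices are hypothetical; they stand in for model.emb.weight and do not reproduce any result from the article.

```python
import numpy as np

def digit_separation(emb, num_classes=10):
    """Mean within-digit distance divided by mean overall distance (lower = tighter clusters)."""
    digits = np.arange(emb.shape[0]) % num_classes
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)   # [V, V]
    off_diag = ~np.eye(emb.shape[0], dtype=bool)
    same_digit = (digits[:, None] == digits[None, :]) & off_diag
    return dists[same_digit].mean() / dists[off_diag].mean()

rng = np.random.default_rng(0)
random_emb = rng.normal(size=(100, 64))   # stand-in for an untrained embedding table
centres = rng.normal(size=(10, 64))
# Synthetic "ideal" case: one centre per last digit plus a little noise.
clustered_emb = centres[np.arange(100) % 10] + 0.1 * rng.normal(size=(100, 64))

ratio_random = digit_separation(random_emb)
ratio_clustered = digit_separation(clustered_emb)
print(ratio_random, ratio_clustered)
```

On the trained model the same function could be called as digit_separation(model.emb.weight.detach().cpu().numpy()); a ratio well below 1 would support the clustering reading of the PCA plot.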
<img src="https://mmbiz.qpic.cn/sz_mmbiz_png/gTp499JicyvD33JPMnC7y2tAwjIOKpKxhnTicwr5z8fLJzmrfbnbfG00wqBImch7CicP72Kk6cpZDxFIQAdJwsAdODwaxNPxia0Hic3F59G5ztR4/640?wx_fmt=png"/>
<p>Gradient‑norm violin plots reveal that early training steps produce different gradient magnitudes across layers. Larger norms in attention layers versus FFN layers, or smaller norms in the embedding layer, can guide learning‑rate or layer‑wise optimisation decisions.</p>
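If the violin plot suggests that one layer should be treated differently, PyTorch optimizers accept per-group settings. The following is a minimal sketch under stated assumptions: the Stub model only mirrors the emb and head attribute names of TinyTransformer, and the 3x multiplier for the embedding group is an arbitrary illustrative value, not a recommendation from the article.

```python
import torch
import torch.nn as nn

class Stub(nn.Module):
    """Small stand-in model; attribute names mirror TinyTransformer's emb and head."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(100, 64)
        self.head = nn.Linear(64, 10)

model = Stub()
base_lr = 1e-3
emb_params = [p for n, p in model.named_parameters() if n.startswith("emb")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("emb")]

optimizer = torch.optim.AdamW(
    [
        {"params": emb_params, "lr": base_lr * 3},   # hypothetical boost for the embedding group
        {"params": other_params},                     # falls back to the default lr below
    ],
    lr=base_lr,
    weight_decay=1e-2,
)
print([group["lr"] for group in optimizer.param_groups])
```

Because AdamW already rescales updates per parameter, whether a layer-wise learning rate actually helps is an empirical question that should be checked against the validation loss.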
IT Services Circle
