Artificial Intelligence · 49 min read

Fundamentals and Implementation of Neural Networks and Transformers with PyTorch Examples

This article provides a comprehensive overview of neural network fundamentals, loss functions, activation functions, embedding techniques, attention mechanisms, multi‑head attention, residual networks, and the full Transformer encoder‑decoder architecture, illustrated with detailed PyTorch code and a practical MiniRBT fine‑tuning case for Chinese text classification.

Cognitive Technology Team

The article begins by introducing the basic principles of neural networks, describing how a single neuron computes a linear function f(x)=x·w+b and how multiple neurons are stacked to form shallow and deep networks.
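The single-neuron computation described above can be sketched in a few lines of PyTorch; the weights and input below are hand-picked purely for illustration.

```python
import torch

# A single neuron computes f(x) = x · w + b: a weighted sum of inputs plus a bias.
w = torch.tensor([2.0, -1.0])   # one weight per input feature
b = torch.tensor(0.5)           # scalar bias
x = torch.tensor([3.0, 4.0])    # a single two-feature input

y = x @ w + b                   # dot product plus bias
print(y.item())                 # 2*3 + (-1)*4 + 0.5 = 2.5
```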

It then explains the training process using gradient descent and mean‑square error loss, showing a concrete example where a linear model learns the relationship f(x)=x1·w1+x2·w2+b from synthetic data. The training loop, convergence criteria, and parameter updates are illustrated with PyTorch code.
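A minimal version of that training loop might look as follows; the true parameters (w1=2, w2=3, b=1), learning rate, and epoch count are illustrative choices, not values from the article.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Synthetic data for a target function f(x) = x1*w1 + x2*w2 + b.
X = torch.rand(200, 2)
y = X @ torch.tensor([2.0, 3.0]) + 1.0

model = nn.Linear(2, 1)                           # learns w1, w2 and b
criterion = nn.MSELoss()                          # mean-square error loss
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(2000):                         # full-batch gradient descent
    optimizer.zero_grad()
    loss = criterion(model(X).squeeze(-1), y)
    loss.backward()                               # gradients via autograd
    optimizer.step()                              # parameter update

print(model.weight.data, model.bias.data)         # approaches [2, 3] and 1
```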

Activation functions such as Sigmoid, ReLU, and tanh are discussed, highlighting their role in introducing non‑linearity and addressing gradient‑vanishing problems.
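The three activation functions can be compared on a few sample inputs:

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])

sigmoid = torch.sigmoid(x)  # squashes to (0, 1); saturates for large |x|
relu = torch.relu(x)        # zero for negatives, identity for positives
tanh = torch.tanh(x)        # squashes to (-1, 1), zero-centred

print(relu)                 # tensor([0., 0., 2.])
```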

Next, the article covers input preprocessing (embedding), describing how tokens are converted into dense vectors, optionally combined with positional encodings, and how these embeddings are the foundation for downstream models.
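The token-to-vector lookup is a single `nn.Embedding` layer; the vocabulary size below is a placeholder, and the embedding dimension matches the d_model=6 used later in the article.

```python
import torch
from torch import nn

# Hypothetical vocabulary of 10 tokens, each mapped to a 6-dimensional vector.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=6)

token_ids = torch.tensor([[1, 4, 7]])   # batch of one 3-token sequence
vectors = embedding(token_ids)          # dense lookup
print(vectors.shape)                    # torch.Size([1, 3, 6])
```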

The core of the article focuses on the attention mechanism and multi‑head attention. It explains the Q‑K‑V formulation, scaled dot‑product attention, and how multiple heads allow the model to attend to information from different representation subspaces.
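Before the full module implementation below, the scaled dot-product formula Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V can be sketched functionally; the shapes (3 tokens, d_k = 4) are chosen only for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_k = 4
Q = torch.randn(3, d_k)   # queries
K = torch.randn(3, d_k)   # keys
V = torch.randn(3, d_k)   # values

scores = Q @ K.T / d_k ** 0.5        # similarity of every query with every key
weights = F.softmax(scores, dim=-1)  # each row is a probability distribution
output = weights @ V                 # weighted mixture of value vectors

print(output.shape)                  # torch.Size([3, 4])
```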

Residual connections and layer normalization are introduced as essential components for training deep networks without degradation, forming the basis of the Feed‑Forward Neural Network (FFNN) sub‑layer.
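The residual-plus-normalization pattern is the same around every sub-layer; a minimal sketch, using a linear layer as a stand-in for the attention or FFN sub-layer:

```python
import torch
from torch import nn

torch.manual_seed(0)

d_model = 6                             # matches the toy size used in the article
sublayer = nn.Linear(d_model, d_model)  # stand-in for an attention or FFN sub-layer
layer_norm = nn.LayerNorm(d_model)

x = torch.randn(2, 3, d_model)          # [batch, seq_len, d_model]
out = layer_norm(x + sublayer(x))       # residual connection, then LayerNorm

print(out.shape)                        # same shape as the input
```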

All these components are assembled into the Transformer architecture. The encoder consists of stacked self‑attention and FFNN layers, while the decoder adds masked self‑attention and encoder‑decoder attention, enabling tasks such as translation, text generation, and classification.

The training pipeline is described step‑by‑step, including teacher‑forcing, masking strategies, and loss computation. The prediction phase is shown as an iterative token‑by‑token generation until an end‑of‑sequence token is produced.

To illustrate practical usage, the article presents a complete PyTorch implementation of a Transformer model, including data preparation, tokenization, model definition, training loop, and inference. The full code is provided below:

import torch
from torch import nn
from torch import optim
from torch.utils import data as Data
import numpy as np

# Toy hyper-parameters for illustration; in practice d_model is much larger
# and typically equals n_heads * d_k.
d_model = 6 # Embedding size
max_len = 1024 # Maximum sequence length
d_ff = 12 # Feed-forward hidden size
d_k = d_v = 3 # Per-head dimensions for Q/K and V
n_layers = 1 # Number of encoder/decoder layers
n_heads = 8 # Number of attention heads
p_drop = 0.1 # Dropout probability

# Vocabulary sizes come from the data-preparation step (omitted here);
# the values below are placeholders so the model definition runs standalone.
source_vocab_size = 1000
target_vocab_size = 1000

# Mask for padding tokens
def get_attn_pad_mask(seq_q, seq_k):
  batch, len_q = seq_q.size()
  batch, len_k = seq_k.size()
  pad_attn_mask = seq_k.data.eq(0).unsqueeze(1) # [batch, 1, len_k]
  return pad_attn_mask.expand(batch, len_q, len_k) # [batch, len_q, len_k]

# Subsequent mask for decoder
def get_attn_subsequent_mask(seq):
  attn_shape = [seq.size(0), seq.size(1), seq.size(1)]
  subsequent_mask = np.triu(np.ones(attn_shape), k=1)
  subsequent_mask = torch.from_numpy(subsequent_mask)
  return subsequent_mask

class PositionalEncoding(nn.Module):
  def __init__(self, d_model, dropout=.1, max_len=1024):
    super(PositionalEncoding, self).__init__()
    self.dropout = nn.Dropout(p=dropout)  # use the constructor argument, not the global
    positional_encoding = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).float().unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.Tensor([10000])) / d_model))
    positional_encoding[:, 0::2] = torch.sin(position * div_term)
    positional_encoding[:, 1::2] = torch.cos(position * div_term)
    positional_encoding = positional_encoding.unsqueeze(0).transpose(0, 1)
    self.register_buffer('pe', positional_encoding)
  def forward(self, x):
    x = x + self.pe[:x.size(0), ...]
    return self.dropout(x)

class FeedForwardNetwork(nn.Module):
  def __init__(self):
    super(FeedForwardNetwork, self).__init__()
    self.ff1 = nn.Linear(d_model, d_ff)
    self.ff2 = nn.Linear(d_ff, d_model)
    self.relu = nn.ReLU()
    self.dropout = nn.Dropout(p=p_drop)
    self.layer_norm = nn.LayerNorm(d_model)
  def forward(self, x):
    residual = x
    x = self.ff1(x)
    x = self.relu(x)
    x = self.dropout(x)
    x = self.ff2(x)
    return self.layer_norm(residual + x)  # residual connection, then LayerNorm

class ScaledDotProductAttention(nn.Module):
  def __init__(self):
    super(ScaledDotProductAttention, self).__init__()
  def forward(self, Q, K, V, attn_mask):
    scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k)
    scores.masked_fill_(attn_mask, -1e9)
    attn = nn.Softmax(dim=-1)(scores)
    prob = torch.matmul(attn, V)
    return prob, attn

class MultiHeadAttention(nn.Module):
  def __init__(self, n_heads=8):
    super(MultiHeadAttention, self).__init__()
    self.n_heads = n_heads
    self.W_Q = nn.Linear(d_model, d_k * n_heads, bias=False)
    self.W_K = nn.Linear(d_model, d_k * n_heads, bias=False)
    self.W_V = nn.Linear(d_model, d_v * n_heads, bias=False)
    self.fc = nn.Linear(d_v * n_heads, d_model, bias=False)
    self.layer_norm = nn.LayerNorm(d_model)
  def forward(self, input_Q, input_K, input_V, attn_mask):
    residual, batch = input_Q, input_Q.size(0)
    Q = self.W_Q(input_Q).view(batch, -1, self.n_heads, d_k).transpose(1, 2)
    K = self.W_K(input_K).view(batch, -1, self.n_heads, d_k).transpose(1, 2)
    V = self.W_V(input_V).view(batch, -1, self.n_heads, d_v).transpose(1, 2)
    attn_mask = attn_mask.unsqueeze(1).repeat(1, self.n_heads, 1, 1)
    prob, attn = ScaledDotProductAttention()(Q, K, V, attn_mask)
    prob = prob.transpose(1, 2).contiguous().view(batch, -1, self.n_heads * d_v)
    output = self.fc(prob)
    return self.layer_norm(residual + output), attn

class EncoderLayer(nn.Module):
  def __init__(self):
    super(EncoderLayer, self).__init__()
    self.encoder_self_attn = MultiHeadAttention()
    self.ffn = FeedForwardNetwork()
  def forward(self, encoder_input, encoder_pad_mask):
    encoder_output, attn = self.encoder_self_attn(encoder_input, encoder_input, encoder_input, encoder_pad_mask)
    encoder_output = self.ffn(encoder_output)
    return encoder_output, attn

class Encoder(nn.Module):
  def __init__(self):
    super(Encoder, self).__init__()
    self.source_embedding = nn.Embedding(source_vocab_size, d_model)
    self.positional_embedding = PositionalEncoding(d_model)
    self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])
  def forward(self, encoder_input):
    encoder_output = self.source_embedding(encoder_input)
    encoder_output = self.positional_embedding(encoder_output.transpose(0,1)).transpose(0,1)
    encoder_self_attn_mask = get_attn_pad_mask(encoder_input, encoder_input)
    encoder_self_attns = []
    for layer in self.layers:
      encoder_output, encoder_self_attn = layer(encoder_output, encoder_self_attn_mask)
      encoder_self_attns.append(encoder_self_attn)
    return encoder_output, encoder_self_attns

class DecoderLayer(nn.Module):
  def __init__(self):
    super(DecoderLayer, self).__init__()
    self.decoder_self_attn = MultiHeadAttention()
    self.encoder_decoder_attn = MultiHeadAttention()
    self.ffn = FeedForwardNetwork()
  def forward(self, decoder_input, encoder_output, decoder_self_mask, decoder_encoder_mask):
    decoder_output, decoder_self_attn = self.decoder_self_attn(decoder_input, decoder_input, decoder_input, decoder_self_mask)
    decoder_output, decoder_encoder_attn = self.encoder_decoder_attn(decoder_output, encoder_output, encoder_output, decoder_encoder_mask)
    decoder_output = self.ffn(decoder_output)
    return decoder_output, decoder_self_attn, decoder_encoder_attn

class Decoder(nn.Module):
  def __init__(self):
    super(Decoder, self).__init__()
    self.target_embedding = nn.Embedding(target_vocab_size, d_model)
    self.positional_embedding = PositionalEncoding(d_model)
    self.layers = nn.ModuleList([DecoderLayer() for _ in range(n_layers)])
  def forward(self, decoder_input, encoder_input, encoder_output):
    decoder_output = self.target_embedding(decoder_input)
    decoder_output = self.positional_embedding(decoder_output.transpose(0,1)).transpose(0,1)
    decoder_self_attn_mask = get_attn_pad_mask(decoder_input, decoder_input)
    decoder_subsequent_mask = get_attn_subsequent_mask(decoder_input)
    decoder_encoder_attn_mask = get_attn_pad_mask(decoder_input, encoder_input)
    decoder_self_mask = torch.gt(decoder_self_attn_mask + decoder_subsequent_mask, 0)
    decoder_self_attns, decoder_encoder_attns = [], []
    for layer in self.layers:
      decoder_output, decoder_self_attn, decoder_encoder_attn = layer(decoder_output, encoder_output, decoder_self_mask, decoder_encoder_attn_mask)
      decoder_self_attns.append(decoder_self_attn)
      decoder_encoder_attns.append(decoder_encoder_attn)
    return decoder_output, decoder_self_attns, decoder_encoder_attns

class Transformer(nn.Module):
  def __init__(self):
    super(Transformer, self).__init__()
    self.encoder = Encoder()
    self.decoder = Decoder()
    self.fc = nn.Linear(d_model, target_vocab_size, bias=False)
  def forward(self, encoder_input, decoder_input):
    encoder_output, encoder_attns = self.encoder(encoder_input)
    decoder_output, decoder_self_attns, decoder_encoder_attns = self.decoder(decoder_input, encoder_input, encoder_output)
    decoder_logits = self.fc(decoder_output)
    return decoder_logits.view(-1, decoder_logits.size(-1))
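The iterative prediction phase can be sketched as a greedy decoder. The `start_id`/`end_id` arguments and the flattened-logits convention (shape [tgt_len, target_vocab_size] for a batch of one, as returned by the Transformer's forward above) are assumptions.

```python
import torch

def greedy_decode(model, encoder_input, start_id, end_id, max_steps=50):
    # Token-by-token generation: keep feeding the tokens produced so far
    # back into the decoder until the end-of-sequence token appears.
    decoder_input = torch.tensor([[start_id]])
    for _ in range(max_steps):
        logits = model(encoder_input, decoder_input)  # [tgt_len, vocab]
        next_id = logits[-1].argmax().item()          # most likely next token
        decoder_input = torch.cat(
            [decoder_input, torch.tensor([[next_id]])], dim=-1)
        if next_id == end_id:
            break
    return decoder_input.squeeze(0).tolist()
```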

Finally, the article demonstrates how to fine‑tune a pre‑trained MiniRBT model from Hugging Face for a Chinese text‑classification task, covering data loading, tokenizer setup, training loop with AdamW optimizer, evaluation using the Trainer API, and inference on new sentences.
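A hedged sketch of that fine-tuning flow with the Hugging Face Trainer API might look as follows; the checkpoint name, label count, and hyper-parameters are assumptions, and you would substitute your own tokenized datasets.

```python
# Assumed MiniRBT checkpoint on the Hugging Face Hub and a binary label set.
MODEL_NAME = "hfl/minirbt-h256"
NUM_LABELS = 2   # e.g. a binary Chinese sentiment task

def build_trainer(train_dataset, eval_dataset):
    # Imports are kept inside the function so the sketch only requires the
    # transformers library when a Trainer is actually built.
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LABELS)
    args = TrainingArguments(output_dir="minirbt-cls", num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    return Trainer(model=model, args=args, train_dataset=train_dataset,
                   eval_dataset=eval_dataset, tokenizer=tokenizer)
```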

The comprehensive walkthrough equips readers with both theoretical understanding and practical code to build, train, and deploy neural‑network‑based models, especially Transformers, for a wide range of AI applications.

Tags: machine learning, AI, deep learning, Transformer, neural networks, PyTorch
Written by Cognitive Technology Team

Cognitive Technology Team regularly delivers the latest IT news, original content, programming tutorials and experience sharing, with daily perks awaiting you.
