
Demystifying Neural Networks and Transformers: From Basics to Hands‑On Code

This comprehensive guide walks you through the fundamentals of neural networks, explains the evolution to transformer models, provides detailed Python code for training and inference, and shows how to fine‑tune open‑source AI models for real‑world tasks such as automated technical PM prediction.

Alibaba Cloud Developer

Nov 22, 2024

Introduction

This article provides a step‑by‑step exploration of neural networks, from the basic perceptron to modern transformer architectures, and demonstrates practical implementations using PyTorch and HuggingFace models.

1. Basics of Neural Networks

An artificial neural network is loosely modeled on biological neurons: dendrites receive signals, the cell body integrates them, and the axon transmits the result. The human brain contains roughly 86 billion neurons, each with thousands of synapses, for a total on the order of 100 trillion connections. These connections are loosely comparable to the parameters (weights) of an artificial network.
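The analogy can be made concrete with a single artificial neuron. The following is a minimal sketch, not code from the original article: the function name `neuron` and the hand-picked weights are assumptions chosen so that the neuron behaves like a logical AND.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a step activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

# Hypothetical weights chosen by hand so the neuron fires only when both inputs are 1.
and_weights, and_bias = [1.0, 1.0], -1.5
outputs = [neuron([a, b], and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```

In this picture the inputs play the role of dendrites, the weighted sum is the cell body, and the thresholded output is the axon. Training replaces the hand-picked weights with learned ones.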

2. Training a Simple Neural Network

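The following PyTorch code builds a single-layer linear model and trains it to learn the function f(x)=x₁·w₁+x₂·w₂+b using mean-squared-error loss. The target data is generated with w₁ = w₂ = 1 and b = 6.6260693, so training drives the model's parameters toward those values.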

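import torch
from torch import nn
from torch.optim import Adam

model = nn.Linear(2, 1)  # 2 inputs -> 1 output; weight shape (1, 2) plus a bias
optimizer = Adam(model.parameters(), lr=1e-1)
loss_fn = nn.MSELoss()

# Synthetic data: target = x1 + x2 + bias, i.e. w1 = w2 = 1
inputs = torch.randn(10, 2) * 10  # named `inputs` to avoid shadowing the built-in input()
bias = 6.6260693
targets = inputs.sum(dim=1, keepdim=True) + bias

for epoch in range(100):
    pred = model(inputs)
    loss = loss_fn(pred, targets)
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()        # backpropagate to compute new gradients
    optimizer.step()       # update the weights and bias
    if epoch % 10 == 9:
        print(f"Epoch {epoch} | loss: {loss.item():.2f}")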

3. Loss Function and Gradient Descent

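The loss (mean-squared error) is the average of the squared differences between the model output and the true target. Gradient descent computes the gradient of the loss with respect to the weights w₁, w₂ and the bias b, then moves each parameter a small step against its gradient; the learning rate sets the step size. The Adam optimizer used in the code above is an adaptive variant of this update.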
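To show what the optimizer does under the hood, here is a minimal sketch of plain gradient descent written without PyTorch. The four data points, the learning rate, and the step count are assumptions chosen for illustration (targets follow y = x₁ + x₂ + 2). This is vanilla gradient descent, not the Adam update.

```python
# Four hypothetical samples generated from y = 1*x1 + 1*x2 + 2
data = [((1.0, 1.0), 4.0), ((1.0, -1.0), 2.0), ((-1.0, 1.0), 2.0), ((-1.0, -1.0), 0.0)]
w1, w2, b = 0.0, 0.0, 0.0
lr = 0.1  # learning rate

def mse(w1, w2, b):
    """Mean-squared error of the linear model over the dataset."""
    return sum((x1 * w1 + x2 * w2 + b - y) ** 2 for (x1, x2), y in data) / len(data)

initial_loss = mse(w1, w2, b)
for step in range(100):
    g1 = g2 = gb = 0.0
    for (x1, x2), y in data:
        err = x1 * w1 + x2 * w2 + b - y   # prediction minus target
        g1 += 2 * err * x1 / len(data)    # d(loss)/d(w1)
        g2 += 2 * err * x2 / len(data)    # d(loss)/d(w2)
        gb += 2 * err / len(data)         # d(loss)/d(b)
    # Move every parameter against its gradient, scaled by the learning rate
    w1, w2, b = w1 - lr * g1, w2 - lr * g2, b - lr * gb

final_loss = mse(w1, w2, b)
print(f"w1={w1:.3f} w2={w2:.3f} b={b:.3f}")  # w1=1.000 w2=1.000 b=2.000
```

The inputs here are centered around zero, which keeps the problem well conditioned; with poorly scaled inputs the same loop would need many more steps or a different learning rate.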

4. Activation Functions

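Activation functions introduce non-linearity; without them, stacked linear layers collapse into a single linear map. The sigmoid function maps values to the interval (0, 1) but saturates for large |x|, so its gradient approaches zero (the vanishing-gradient problem). tanh is zero-centered but saturates in the same way, while ReLU and ELU keep a gradient of 1 for positive inputs and are the more common choices in deep networks.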
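The saturation effect is easy to see numerically. The sketch below uses only the standard library; the helper names (`sigmoid_grad`, `relu_grad`) are illustrative and not part of any library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (0.0, 5.0, 10.0):
    print(f"x={x:>4}: sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

The sigmoid gradient peaks at 0.25 for x = 0 and drops below 1e-4 by x = 10, whereas the ReLU gradient stays at 1 for any positive input.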

5. Neural Network Model Evolution

Before transformers, convolutional neural networks (CNNs) excelled at image tasks and recurrent neural networks (RNNs) handled sequences, but each had limitations: a CNN layer only sees a local receptive field, and an RNN processes tokens one at a time, which limits parallelism and makes long-range dependencies hard to learn. The 2017 "Attention Is All You Need" paper introduced the transformer, which relies on self-attention to capture global dependencies efficiently.

6. Transformer Architecture

The transformer consists of an encoder and a decoder. Core components include:

Embedding and positional encoding to convert tokens into dense vectors.

Multi-head self-attention, which scores every pair of tokens and uses those scores to mix their representations.

Feed‑forward networks and layer normalization.

Residual connections that add the input to the output of each sub‑layer.
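As a rough illustration of how the last two components fit together, the sketch below wraps a position-wise feed-forward network in a residual connection followed by layer normalization. The class name `FeedForwardSublayer` and the sizes (`d_model=16`, `d_ff=64`) are assumptions made for this example, not values from the article.

```python
import torch
from torch import nn

class FeedForwardSublayer(nn.Module):
    """Position-wise feed-forward network with a residual connection and layer norm."""

    def __init__(self, d_model=16, d_ff=64, dropout=0.1):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Post-norm arrangement: LayerNorm(x + Sublayer(x))
        return self.norm(x + self.dropout(self.ff(x)))

layer = FeedForwardSublayer().eval()  # eval() disables dropout for a deterministic pass
x = torch.randn(2, 5, 16)             # (batch, seq_len, d_model)
y = layer(x)
print(y.shape)  # torch.Size([2, 5, 16])
```

The output keeps the input shape, which is what allows identical layers to be stacked into an encoder or decoder.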

The decoder adds two things: masked self-attention, which prevents a position from attending to future tokens, and cross-attention over the encoder output. Because of the mask, all target positions can be trained in parallel with teacher forcing.
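The masking idea can be sketched with a single attention head. This is a simplified illustration under stated assumptions: it omits the learned query/key/value projections and the multi-head split, and the function name and tensor sizes are made up for the example.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    """q, k, v: (batch, seq_len, d_k). Returns (output, attention weights)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # pairwise token similarity
    if causal:
        seq_len = q.size(-2)
        # True above the diagonal marks future positions that must be hidden
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(1, 4, 8)  # one sequence of 4 tokens with 8 features each
out, weights = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape)   # torch.Size([1, 4, 8])
print(weights[0])  # lower-triangular: no weight is placed on future tokens
```

With `causal=True`, row i of the weight matrix is zero for every column j > i, so the prediction at position i depends only on earlier tokens even though all positions are computed in one pass.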

7. Practical Implementation with PyTorch

The original article provides a full transformer implementation covering positional encoding, multi-head attention, feed-forward layers, encoder/decoder stacks, and training loops. The code follows the original paper while using PyTorch modules for efficiency. The positional-encoding module is reproduced below.

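import math
import torch
from torch import nn


class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding; expects input of shape (seq_len, batch, d_model)."""

    def __init__(self, d_model, dropout=0.1, max_len=1024):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).float().unsqueeze(1)  # (max_len, 1)
        # 1 / 10000^(2i / d_model) for each even index 2i, computed in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions (assumes even d_model)
        pe = pe.unsqueeze(1)  # (max_len, 1, d_model), broadcasts over the batch dimension
        self.register_buffer('pe', pe)  # stored with the module but not trained
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)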
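To make the formula behind `div_term` easier to check by hand, here is a small plain-Python sketch of the same sinusoidal table. The function name `sinusoidal_table` is hypothetical, and the sketch assumes an even `d_model`.

```python
import math

def sinusoidal_table(max_len, d_model):
    """Return a max_len x d_model list of sinusoidal position encodings (d_model even)."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, d_model, 2):  # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            row += [math.sin(angle), math.cos(angle)]  # sin on even, cos on odd dimensions
        table.append(row)
    return table

table = sinusoidal_table(max_len=3, d_model=4)
print(table[0])  # [0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes to alternating 0 and 1 because sin(0) = 0 and cos(0) = 1. Higher dimensions use longer wavelengths, so each position receives a distinct pattern.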

8. Applying Open‑Source Models for Business Tasks

The article shows how to fine-tune a small Chinese BERT-style model (hfl/minirbt-h288) from HuggingFace for a classification task, such as predicting the "sub-technical PM" field in a product-requirement database.

import torch
from torch.optim import AdamW  # the AdamW formerly exported by transformers is deprecated
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "hfl/minirbt-h288"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# Prepare data (example shown in the article)
train_sentences = ["I have a green apple", "apple", 0]
# ... (data loading omitted for brevity)
# `train_dataloader` is assumed to come from the omitted step and to yield dicts of
# tensors that include `labels`, which the model needs in order to return a loss.

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(10):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

After fine‑tuning, the model can infer the missing PM value for new demand records, demonstrating a practical AI‑assisted workflow.
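The last step of inference, turning the classifier's logits into a predicted label, can be sketched as follows. The label names in `id2label` and the logit values are made up for illustration; in practice the logits would come from `model(**batch).logits` with the model in `eval()` mode.

```python
import torch

# Hypothetical label names for the three classes (num_labels=3);
# the real PM names are not given in the article.
id2label = {0: "pm_a", 1: "pm_b", 2: "pm_c"}

# Stand-in for the model's logits on two new records (made-up values).
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])

probs = torch.softmax(logits, dim=-1)     # convert logits to class probabilities
pred_ids = probs.argmax(dim=-1).tolist()  # index of the most likely class per record
predictions = [id2label[i] for i in pred_ids]
print(predictions)  # ['pm_a', 'pm_c']
```

The probabilities can also be used as a rough confidence signal, for example to route low-confidence records to a human reviewer.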
