Demystifying Neural Networks and Transformers: From Basics to Hands‑On Code
This comprehensive guide walks you through the fundamentals of neural networks, explains the evolution to transformer models, provides detailed Python code for training and inference, and shows how to fine‑tune open‑source AI models for real‑world tasks such as automated technical PM prediction.
Introduction
This article provides a step‑by‑step exploration of neural networks, from the basic perceptron to modern transformer architectures, and demonstrates practical implementations using PyTorch and HuggingFace models.
1. Basics of Neural Networks
An artificial neural network is loosely modeled on the biological neuron: dendrites receive signals, the cell body processes them, and the axon outputs the result. The human brain contains roughly 86 billion neurons, each with thousands of synapses, for a total on the order of 100 trillion connections. That figure is sometimes loosely compared to a model with about 100 trillion parameters.
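As a rough illustration of the analogy, an artificial neuron can be sketched as a weighted sum of its inputs (the "dendrites"), plus a bias, passed through an activation (the "axon" output). The function name `neuron` and the example values are illustrative assumptions, not part of the original article.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example values: the weighted sum is 0.5 - 0.5 + 0.0 = 0.0
out = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
print(out)
```

A layer such as `nn.Linear` computes many of these weighted sums at once as a matrix multiplication.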
2. Training a Simple Neural Network
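The following PyTorch code builds a single linear layer and trains it to fit f(x) = x₁·w₁ + x₂·w₂ + b using mean‑squared‑error loss. The targets are generated as x₁ + x₂ + 6.6260693, so training should drive w₁ and w₂ toward 1 and b toward 6.6260693.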
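import torch
from torch import nn
from torch.optim import Adam

model = nn.Linear(2, 1)  # 2 inputs, 1 output: weight shape (1, 2) plus a bias
optimizer = Adam(model.parameters(), lr=1e-1)
loss_fn = nn.MSELoss()

# Synthetic data: 10 samples with 2 features each
inputs = torch.randn(10, 2) * 10
bias = 6.6260693
# Target is x1 + x2 + bias, i.e. true weights of 1 and the bias above
target = torch.add(inputs.sum(dim=1, keepdim=True), bias)

for epoch in range(100):
    pred = model(inputs)
    loss = loss_fn(pred, target)
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()        # backpropagate to compute new gradients
    optimizer.step()       # update the weights and bias
    if epoch % 10 == 9:
        print(f"Epoch {epoch} | loss: {loss.item():.2f}")
3. Loss Function and Gradient Descent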
The loss (mean‑squared error) measures the distance between the model output and the true target. Gradient descent updates the weights w₁, w₂ and bias b in the direction that reduces the loss, scaled by a learning rate.
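To make the update rule concrete, here is a minimal sketch of plain gradient descent written without PyTorch, so the gradients of the mean-squared error are visible. The helper `mse_and_grads`, the four data points, the true bias of 3.0, and the learning rate are all illustrative assumptions. The article's own code uses the Adam optimizer rather than this plain update.

```python
def mse_and_grads(xs, ys, w1, w2, b):
    """Return the MSE loss and its gradients with respect to w1, w2 and b."""
    n = len(xs)
    loss = gw1 = gw2 = gb = 0.0
    for (x1, x2), y in zip(xs, ys):
        err = (x1 * w1 + x2 * w2 + b) - y  # prediction minus target
        loss += err * err / n
        gw1 += 2 * err * x1 / n            # d(loss)/d(w1)
        gw2 += 2 * err * x2 / n            # d(loss)/d(w2)
        gb += 2 * err / n                  # d(loss)/d(b)
    return loss, gw1, gw2, gb

# Hypothetical data generated with w1 = w2 = 1 and b = 3
xs = [(1.0, 2.0), (2.0, 0.5), (-1.0, 1.5), (0.5, -2.0)]
ys = [x1 + x2 + 3.0 for x1, x2 in xs]

w1 = w2 = b = 0.0
lr = 0.05  # learning rate: scales each step along the negative gradient
initial_loss, *_ = mse_and_grads(xs, ys, w1, w2, b)
for step in range(2000):
    loss, gw1, gw2, gb = mse_and_grads(xs, ys, w1, w2, b)
    w1 -= lr * gw1
    w2 -= lr * gw2
    b -= lr * gb
final_loss, *_ = mse_and_grads(xs, ys, w1, w2, b)
print(f"w1={w1:.3f} w2={w2:.3f} b={b:.3f} loss={final_loss:.2e}")
```

Each parameter moves opposite to its gradient, so the loss shrinks from step to step as long as the learning rate is small enough.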
4. Activation Functions
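Activation functions introduce non‑linearity. The sigmoid function maps values to the interval (0, 1), but its gradient approaches zero for inputs of large magnitude, which causes vanishing gradients in deep networks. Common alternatives include tanh, which is zero‑centered but also saturates, and ReLU and ELU, which do not saturate for positive inputs.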
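The saturation behavior can be checked numerically. The sketch below defines the activations and two of their derivatives in plain Python. The function names and the sample inputs are illustrative choices, and `alpha=1.0` is an assumed default for ELU.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0, shrinks toward 0 elsewhere

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # constant slope for positive inputs

def elu(z, alpha=1.0):
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

for z in (-10.0, 0.0, 10.0):
    print(f"z={z:6.1f} sigmoid'={sigmoid_grad(z):.6f} relu'={relu_grad(z):.1f} "
          f"tanh={math.tanh(z):.4f} elu={elu(z):.4f}")
```

At z = 10 the sigmoid gradient is already far below its peak, while the ReLU gradient stays at 1.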
5. Neural Network Model Evolution
Before transformers, convolutional neural networks (CNN) excelled at image tasks and recurrent neural networks (RNN) handled sequences, but each had limitations (local receptive fields for CNN, sequential processing for RNN). The 2017 "Attention Is All You Need" paper introduced the transformer, which relies on self‑attention to capture global dependencies efficiently.
6. Transformer Architecture
The transformer consists of an encoder and a decoder. Core components include:
Embedding and positional encoding to convert tokens into dense vectors.
Multi‑head self‑attention that computes pairwise token similarity.
Feed‑forward networks and layer normalization.
Residual connections that add the input to the output of each sub‑layer.
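The self-attention component listed above can be sketched for a single head as scaled dot-product attention. This is a minimal illustration under stated assumptions: the function name `self_attention`, the random projection matrices, and the sizes are hypothetical, and multi-head attention would run several such heads in parallel on smaller projections and concatenate the results.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise similarity between every query and every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
seq_len, d_model = 4, 8  # illustrative sizes
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Because every token attends to every other token in one matrix product, the dependency between any two positions is captured in a single step.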
The decoder uses masked self‑attention to prevent each position from attending to future tokens, and cross‑attention to read the encoder output. Because of the mask, all target positions can be trained in parallel with teacher forcing.
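A causal mask of this kind can be sketched as follows. The all-zero score matrix is a placeholder assumption that stands in for real attention scores, chosen so the effect of the mask is easy to read.

```python
import torch

seq_len = 4  # illustrative length
# True above the diagonal marks the "future" positions that must be hidden
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.zeros(seq_len, seq_len)  # placeholder attention scores
masked = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(masked, dim=-1)  # -inf scores become weight 0
print(weights)
```

Row i of `weights` is non-zero only for positions 0 through i, so the prediction at each position depends only on earlier tokens even though all rows are computed at once.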
7. Practical Implementation with PyTorch
The original article provides a full transformer implementation covering positional encoding, multi‑head attention, feed‑forward layers, encoder/decoder stacks, and training loops. The code follows the original paper while using PyTorch modules for efficiency. Only the positional‑encoding module is reproduced here.
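import torch
from torch import nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding for inputs shaped (seq_len, batch, d_model)."""

    def __init__(self, d_model, dropout=0.1, max_len=1024):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).float().unsqueeze(1)
        # Frequencies 1 / 10000^(2i / d_model), one per pair of dimensions
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0).transpose(0, 1)  # shape (max_len, 1, d_model)
        self.register_buffer('pe', pe)  # stored with the module but not trained
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        # Add the encoding for the first seq_len positions; broadcasts over the batch
        x = x + self.pe[:x.size(0), ...]
        return self.dropout(x)
8. Applying Open‑Source Models for Business Tasks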
The article shows how to fine‑tune a Chinese mini‑BERT model (hfl/minirbt‑h288) from HuggingFace for a classification task such as predicting the "sub‑technical PM" field in a product‑requirement database.
import torch
from torch.optim import AdamW  # the AdamW once exported by transformers is deprecated
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "hfl/minirbt-h288"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
model.to('cuda' if torch.cuda.is_available() else 'cpu')

# Prepare data (example shown in the article)
train_sentences = ["I have a green apple", "apple", 0]
# ... (data loading omitted for brevity)
# train_dataloader is assumed to come from the omitted step and to yield
# dicts of tensors (input_ids, attention_mask, labels)

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(10):
    for batch in train_dataloader:
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss  # returned by the model when the batch contains labels
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
After fine‑tuning, the model can infer the missing PM value for new demand records, demonstrating a practical AI‑assisted workflow.
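The inference step is not shown in the article. As a hedged sketch, the classifier's raw logits can be turned into a predicted label and a confidence score as below. The label names in `id2label`, the helper `logits_to_predictions`, and the example logits are hypothetical. In practice the logits would come from the fine-tuned model, for example `model(**tokenizer(text, return_tensors="pt")).logits`, computed after `model.eval()` and inside `torch.no_grad()`.

```python
import torch

# Hypothetical label names; the real mapping depends on the training data
id2label = {0: "pm_a", 1: "pm_b", 2: "pm_c"}

def logits_to_predictions(logits):
    """Map a (batch, num_labels) logits tensor to (label, confidence) pairs."""
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    return [(id2label[i], c) for i, c in zip(idx.tolist(), conf.tolist())]

# Stand-in logits for two records, in place of real model output
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
preds = logits_to_predictions(logits)
print(preds)
```

Keeping the confidence alongside the label makes it possible to leave low-confidence records for manual review instead of filling them automatically.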
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
