Master 20 Essential PyTorch Concepts: From Tensors to Model Deployment
This guide walks you through 20 fundamental PyTorch concepts—including tensor creation, operations, autograd, model building, data loading, GPU acceleration, and best‑practice tricks—providing clear code snippets and step‑by‑step explanations so you can quickly prototype, train, and deploy neural networks.
01 PyTorch Basics: Tensors
PyTorch's core data structure is the torch.Tensor, a GPU‑optimized analogue of a NumPy array. Tensors can be created from Python lists, with explicit shapes, or filled with zeros, ones, uniform random values, or values drawn from a standard normal distribution.
import torch
x = torch.tensor([1, 2, 3])
zeros_tensor = torch.zeros(3, 2) # 3×2 zero tensor
ones_tensor = torch.ones(3, 2) # 3×2 ones tensor
random_tensor = torch.rand(2, 2) # uniform random
normal_tensor = torch.randn(2, 2) # standard normal
Tensors interoperate seamlessly with NumPy arrays, enabling smooth integration with the broader scientific‑computing ecosystem.
import numpy as np
x_numpy = np.array([0.1, 0.2, 0.3])
x_torch = torch.from_numpy(x_numpy) # shares memory with x_numpy
y_torch = torch.tensor([3, 4, 5.])
y_numpy = y_torch.numpy()
02 Tensor Operations
Key manipulation methods include view (requires contiguous memory), reshape (works regardless), unsqueeze, squeeze, transpose, and permute. Understanding contiguity prevents runtime errors when reshaping.
x = torch.randn(2, 3, 4)
print(x.is_contiguous()) # True
# view works on contiguous tensor
y = x.view(6, 4)
# after transpose the tensor becomes non‑contiguous
x_t = x.transpose(1, 2)
print(x_t.is_contiguous()) # False
# safe reshaping
y = x_t.reshape(6, 4)
# dimension tricks
x = torch.randn(3, 4)
x_unsq = x.unsqueeze(0) # (1, 3, 4)
x_sq = x_unsq.squeeze(0) # back to (3, 4)
y = x.transpose(0, 1) # swap dim 0 and 1
z = x_unsq.permute(1, 2, 0) # custom order: (1, 3, 4) -> (3, 4, 1)
03 Automatic Differentiation (autograd)
Operations on tensors with requires_grad=True are recorded in a computation graph. Calling backward() computes gradients automatically.
x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 2*x + 2
y.backward()
print(f"Derivative at x=2: {x.grad}") # 6.0For multivariate functions gradients accumulate by default; clear them each iteration with optimizer.zero_grad() (or tensor.grad.zero_()).
def g(w):
    return 2*w[0]*w[1] + w[1]*torch.cos(w[0])
w = torch.tensor([3.14, 1.0], requires_grad=True)
z = g(w)
z.backward()
print(f"Gradients: {w.grad}") # approx [2.0, 6.28]04 Building Neural Networks
04 Building Neural Networks
Two primary patterns:
Subclass nn.Module for full flexibility (custom forward logic).
Stack layers with nn.Sequential for quick prototypes.
import torch.nn as nn
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden = nn.Linear(input_size, hidden_size)
        self.predict = nn.Linear(hidden_size, output_size)
    def forward(self, x):
        x = torch.relu(self.hidden(x))
        return self.predict(x)
model_seq = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 20),
    nn.ReLU(),
    nn.Linear(20, 1)
)
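Both definitions behave the same from the caller's side; a quick forward-pass sketch (the batch size here is illustrative):
net = NeuralNet(input_size=10, hidden_size=50, output_size=1)
x = torch.randn(4, 10)    # batch of 4 samples, 10 features each
print(net(x).shape)       # torch.Size([4, 1])
print(model_seq(x).shape) # torch.Size([4, 1])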
05 Core Components
Activation functions introduce non‑linearity. Common choices:
ReLU: max(0, x) – simple, mitigates vanishing gradients; used in most hidden layers.
Sigmoid: 1/(1+e^{-x}) – outputs in (0, 1); typical for binary‑classification outputs.
Tanh: (e^x − e^{-x})/(e^x + e^{-x}) – outputs in (−1, 1); often used in RNN hidden layers.
Leaky ReLU: max(αx, x) – alleviates the dead‑neuron problem; an alternative to ReLU.
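Each is available as a function or an nn module; a minimal sketch with arbitrary input values:
x = torch.linspace(-2, 2, 5)  # tensor([-2., -1., 0., 1., 2.])
print(torch.relu(x))          # negatives clamped to 0
print(torch.sigmoid(x))       # squashed into (0, 1)
print(torch.tanh(x))          # squashed into (-1, 1)
print(nn.LeakyReLU(0.01)(x))  # negatives scaled by 0.01 instead of zeroed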
Loss functions depend on the task:
# Regression
mse_loss = nn.MSELoss()
mae_loss = nn.L1Loss()
# Classification
ce_loss = nn.CrossEntropyLoss() # expects raw logits
bce_loss = nn.BCELoss() # expects probabilities (apply sigmoid first)
Optimizers control parameter updates. Frequently used variants:
import torch.optim as optim
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)
optimizer_adamw = optim.AdamW(model.parameters(), lr=0.001)
optimizer_rms = optim.RMSprop(model.parameters(), lr=0.001)
06 Training Loop
A typical training pipeline combines model, loss, optimizer, and an epoch loop.
# Prepare components
model = NeuralNet(input_size=10, hidden_size=20, output_size=1)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
for epoch in range(100):
    outputs = model(train_data) # train_data, targets: pre-built tensors
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')
07 Data Processing: Dataset & DataLoader
Custom Dataset subclasses define how individual samples are fetched; DataLoader handles batching, shuffling, and multi‑process loading.
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]
dataset = CustomDataset(data, labels) # data, labels: your tensors or arrays
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
for batch_data, batch_labels in dataloader:
    # training code here
    pass
08 Specialized Layers
Convolutional layers for image or sequence data:
# 2D convolution (image)
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
# 1D convolution (time series / text)
conv1d = nn.Conv1d(in_channels=2, out_channels=32, kernel_size=5)
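A quick shape check for these layers (a sketch; batch size and spatial sizes are illustrative):
img = torch.randn(8, 3, 32, 32) # (batch, channels, height, width)
print(conv2d(img).shape)        # torch.Size([8, 64, 32, 32]) -- padding=1 keeps H and W
seq = torch.randn(8, 2, 100)    # (batch, channels, length)
print(conv1d(seq).shape)        # torch.Size([8, 32, 96]) -- no padding: length - kernel_size + 1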
Recurrent layers for temporal dependencies:
# LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True, dropout=0.2)
# GRU
gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
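With batch_first=True the expected input is (batch, seq_len, features); a small sketch of the resulting shapes:
seq = torch.randn(8, 15, 10) # (batch, seq_len, input_size)
out, (h_n, c_n) = lstm(seq)
print(out.shape)             # torch.Size([8, 15, 20]) -- hidden state at every step
print(h_n.shape)             # torch.Size([2, 8, 20]) -- final state for each layer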
09 Regularization Techniques
Dropout randomly disables neurons during training:
dropout = nn.Dropout(p=0.2)
x = torch.randn(32, 100)
x_dropped = dropout(x)
Normalization layers stabilize and accelerate learning:
# BatchNorm for fully‑connected layers
batch_norm = nn.BatchNorm1d(256)
# LayerNorm for RNN/Transformer layers
layer_norm = nn.LayerNorm(256)
BatchNorm normalizes each feature across the batch dimension; LayerNorm normalizes each sample across its feature dimensions.
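A small sketch of the difference (assuming the default affine initialization, so outputs come out with roughly zero mean):
x = torch.randn(32, 256)             # (batch, features)
# BatchNorm: statistics computed per feature, across the batch
print(batch_norm(x).mean(dim=0)[:3]) # approx 0 for every feature column
# LayerNorm: statistics computed per sample, across its features
print(layer_norm(x).mean(dim=1)[:3]) # approx 0 for every sample row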
10 Model Mode Switching: Train vs Inference
Use model.train() to enable dropout and let batch norm update its running statistics; model.eval() disables dropout and uses the stored running statistics. Wrap inference in torch.no_grad() to skip gradient tracking and save memory.
model.train() # training mode
model.eval() # evaluation / inference mode
with torch.no_grad():
    predictions = model(test_data)
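A small illustration of the mode switch, reusing the dropout layer from section 09 (a sketch):
dropout.train()
print(dropout(torch.ones(8))) # ~20% of entries zeroed, survivors scaled by 1/(1-p)=1.25
dropout.eval()
print(dropout(torch.ones(8))) # identity: all ones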
11 GPU Acceleration
PyTorch can offload computation to CUDA‑enabled GPUs, dramatically speeding up training.
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"Current GPU: {torch.cuda.get_device_name(0)}")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
data = data.to(device)
labels = labels.to(device)
12 Model Saving & Loading
Two common strategies:
Full model (convenient but less flexible):
# Save entire model
torch.save(model, 'full_model.pth')
# Load
model = torch.load('full_model.pth') # PyTorch >= 2.6 may require weights_only=False
model.eval()
State dictionary (recommended):
# Save state dict
torch.save(model.state_dict(), 'model_weights.pth')
# Load
model = NeuralNet(input_size=10, hidden_size=20, output_size=1) # instantiate with the same architecture
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()
Checkpoints can store additional training metadata (epoch, optimizer state, loss).
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')
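To resume training later, a sketch of loading the checkpoint back (assuming the same model and optimizer classes):
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1 # continue from the next epoch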
13 Practical Tips & Best Practices
Mixed‑precision training reduces memory usage and speeds up computation:
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for data, target in dataloader:
    optimizer.zero_grad()
    with autocast():
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Performance profiling with torch.profiler helps locate bottlenecks:
from torch.profiler import profile, record_function, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("model_inference"):
        output = model(data)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
Mastering these fundamentals enables building models ranging from simple regression to complex image‑recognition and natural‑language‑processing tasks.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.