Master 20 Essential PyTorch Concepts: From Tensors to Model Deployment
This guide walks you through 20 fundamental PyTorch concepts—including tensor creation, operations, autograd, model building, data loading, GPU acceleration, and best‑practice tricks—providing clear code snippets and step‑by‑step explanations so you can quickly prototype, train, and deploy neural networks.
01 PyTorch Basics: Tensors
PyTorch's core data structure is the torch.Tensor, a GPU‑optimized analogue of a NumPy array. Tensors can be created from Python lists, with explicit shapes, or filled with zeros, ones, uniform random values, or values drawn from a standard normal distribution.
import torch
x = torch.tensor([1, 2, 3])
zeros_tensor = torch.zeros(3, 2) # 3×2 zero tensor
ones_tensor = torch.ones(3, 2) # 3×2 ones tensor
random_tensor = torch.rand(2, 2) # uniform random
normal_tensor = torch.randn(2, 2) # standard normal
Tensors interoperate seamlessly with NumPy arrays, enabling smooth integration with the broader scientific‑computing ecosystem.
import numpy as np
x_numpy = np.array([0.1, 0.2, 0.3])
x_torch = torch.from_numpy(x_numpy) # shares memory with x_numpy
y_torch = torch.tensor([3, 4, 5.])
y_numpy = y_torch.numpy()
02 Tensor Operations
Key manipulation methods include view (requires contiguous memory), reshape (works regardless), unsqueeze, squeeze, transpose, and permute. Understanding contiguity prevents runtime errors when reshaping.
x = torch.randn(2, 3, 4)
print(x.is_contiguous()) # True
# view works on contiguous tensor
y = x.view(6, 4)
# after transpose the tensor becomes non‑contiguous
x_t = x.transpose(1, 2)
print(x_t.is_contiguous()) # False
# safe reshaping
y = x_t.reshape(6, 4)
# dimension tricks
x = torch.randn(3, 4)
x_unsq = x.unsqueeze(0) # (1, 3, 4)
x_sq = x_unsq.squeeze(0) # back to (3, 4)
y = x.transpose(0, 1) # swap dim 0 and 1
z = x_unsq.permute(1, 2, 0) # custom order: (1, 3, 4) -> (3, 4, 1)
03 Automatic Differentiation (autograd)
Operations on tensors with requires_grad=True are recorded in a computation graph. Calling backward() computes gradients automatically.
x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 2*x + 2
y.backward()
print(f"Derivative at x=2: {x.grad}") # 6.0For multivariate functions gradients accumulate by default; clear them each iteration with optimizer.zero_grad() (or tensor.grad.zero_()).
def g(w):
    return 2*w[0]*w[1] + w[1]*torch.cos(w[0])
w = torch.tensor([3.14, 1.0], requires_grad=True)
z = g(w)
z.backward()
print(f"Gradients: {w.grad}") # approx [2.0, 6.28]04 Building Neural Networks
04 Building Neural Networks
Two primary patterns:
Subclass nn.Module for full flexibility (custom forward logic).
Stack layers with nn.Sequential for quick prototypes.
import torch.nn as nn
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden = nn.Linear(input_size, hidden_size)
        self.predict = nn.Linear(hidden_size, output_size)
    def forward(self, x):
        x = torch.relu(self.hidden(x))
        return self.predict(x)
model_seq = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 20),
    nn.ReLU(),
    nn.Linear(20, 1)
)
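Both definitions behave the same from the caller's side; a quick forward-pass sketch (the batch size here is illustrative):
net = NeuralNet(input_size=10, hidden_size=50, output_size=1)
x = torch.randn(4, 10)    # batch of 4 samples, 10 features each
print(net(x).shape)       # torch.Size([4, 1])
print(model_seq(x).shape) # torch.Size([4, 1])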
05 Core Components
Activation functions introduce non‑linearity. Common choices:
ReLU: max(0, x) – simple, mitigates vanishing gradients; used in most hidden layers.
Sigmoid: 1/(1+e^{-x}) – outputs in (0, 1); typical for binary‑classification outputs.
Tanh: (e^x − e^{-x})/(e^x + e^{-x}) – outputs in (−1, 1); often used in RNN hidden layers.
Leaky ReLU: max(αx, x) – alleviates the dead‑neuron problem; an alternative to ReLU.
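Each is available as a function or an nn module; a minimal sketch with arbitrary input values:
x = torch.linspace(-2, 2, 5)  # tensor([-2., -1., 0., 1., 2.])
print(torch.relu(x))          # negatives clamped to 0
print(torch.sigmoid(x))       # squashed into (0, 1)
print(torch.tanh(x))          # squashed into (-1, 1)
print(nn.LeakyReLU(0.01)(x))  # negatives scaled by 0.01 instead of zeroed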
Loss functions depend on the task:
# Regression
mse_loss = nn.MSELoss()
mae_loss = nn.L1Loss()
# Classification
ce_loss = nn.CrossEntropyLoss() # expects raw logits
bce_loss = nn.BCELoss() # expects probabilities (apply sigmoid first)
Optimizers control parameter updates. Frequently used variants:
import torch.optim as optim
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)
optimizer_adamw = optim.AdamW(model.parameters(), lr=0.001)
optimizer_rms = optim.RMSprop(model.parameters(), lr=0.001)
06 Training Loop
A typical training pipeline combines model, loss, optimizer, and an epoch loop.
# Prepare components
model = NeuralNet(input_size=10, hidden_size=20, output_size=1)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
for epoch in range(100):
    outputs = model(train_data) # train_data, targets: pre-built tensors
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')
07 Data Processing: Dataset & DataLoader
Custom Dataset subclasses define how individual samples are fetched; DataLoader handles batching, shuffling, and multi‑process loading.
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]
dataset = CustomDataset(data, labels) # data, labels: your tensors or arrays
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
for batch_data, batch_labels in dataloader:
    # training code here
    pass
08 Specialized Layers
Convolutional layers for image or sequence data:
# 2D convolution (image)
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
# 1D convolution (time series / text)
conv1d = nn.Conv1d(in_channels=2, out_channels=32, kernel_size=5)
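A quick shape check for these layers (a sketch; batch size and spatial sizes are illustrative):
img = torch.randn(8, 3, 32, 32) # (batch, channels, height, width)
print(conv2d(img).shape)        # torch.Size([8, 64, 32, 32]) -- padding=1 keeps H and W
seq = torch.randn(8, 2, 100)    # (batch, channels, length)
print(conv1d(seq).shape)        # torch.Size([8, 32, 96]) -- no padding: length - kernel_size + 1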
Recurrent layers for temporal dependencies:
# LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True, dropout=0.2)
# GRU
gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
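With batch_first=True the expected input is (batch, seq_len, features); a small sketch of the resulting shapes:
seq = torch.randn(8, 15, 10) # (batch, seq_len, input_size)
out, (h_n, c_n) = lstm(seq)
print(out.shape)             # torch.Size([8, 15, 20]) -- hidden state at every step
print(h_n.shape)             # torch.Size([2, 8, 20]) -- final state for each layer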
09 Regularization Techniques
Dropout randomly disables neurons during training:
dropout = nn.Dropout(p=0.2)
x = torch.randn(32, 100)
x_dropped = dropout(x)
Normalization layers stabilize and accelerate learning:
# BatchNorm for fully‑connected layers
batch_norm = nn.BatchNorm1d(256)
# LayerNorm for RNN/Transformer layers
layer_norm = nn.LayerNorm(256)
BatchNorm normalizes each feature across the batch dimension; LayerNorm normalizes each sample across its feature dimensions.
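A small sketch of the difference (assuming the default affine initialization, so outputs come out with roughly zero mean):
x = torch.randn(32, 256)             # (batch, features)
# BatchNorm: statistics computed per feature, across the batch
print(batch_norm(x).mean(dim=0)[:3]) # approx 0 for every feature column
# LayerNorm: statistics computed per sample, across its features
print(layer_norm(x).mean(dim=1)[:3]) # approx 0 for every sample row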
10 Model Mode Switching: Train vs Inference
Use model.train() to enable dropout and let batch norm update its running statistics; model.eval() disables dropout and uses the stored running statistics. Wrap inference in torch.no_grad() to skip gradient tracking and save memory.
model.train() # training mode
model.eval() # evaluation / inference mode
with torch.no_grad():
    predictions = model(test_data)
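A small illustration of the mode switch, reusing the dropout layer from section 09 (a sketch):
dropout.train()
print(dropout(torch.ones(8))) # ~20% of entries zeroed, survivors scaled by 1/(1-p)=1.25
dropout.eval()
print(dropout(torch.ones(8))) # identity: all ones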
11 GPU Acceleration
PyTorch can offload computation to CUDA‑enabled GPUs, dramatically speeding up training.
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"Current GPU: {torch.cuda.get_device_name(0)}")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
data = data.to(device)
labels = labels.to(device)
12 Model Saving & Loading
Two common strategies:
Full model (convenient but less flexible):
# Save entire model
torch.save(model, 'full_model.pth')
# Load
model = torch.load('full_model.pth') # PyTorch >= 2.6 may require weights_only=False
model.eval()
State dictionary (recommended):
# Save state dict
torch.save(model.state_dict(), 'model_weights.pth')
# Load
model = NeuralNet(input_size=10, hidden_size=20, output_size=1) # instantiate with the same architecture
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()
Checkpoints can store additional training metadata (epoch, optimizer state, loss).
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')
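To resume training later, a sketch of loading the checkpoint back (assuming the same model and optimizer classes):
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1 # continue from the next epoch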
13 Practical Tips & Best Practices
Mixed‑precision training reduces memory usage and speeds up computation:
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for data, target in dataloader:
    optimizer.zero_grad()
    with autocast():
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Performance profiling with torch.profiler helps locate bottlenecks:
from torch.profiler import profile, record_function, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("model_inference"):
        output = model(data)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
Mastering these fundamentals enables building models ranging from simple regression to complex image‑recognition and natural‑language‑processing tasks.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.