PyTorch GPU Memory Profiling: Checkpointing, Mixed Precision, Optimizer Choice

The article explains the seven sources of GPU memory usage during PyTorch training, shows how to measure them with built‑in profiling APIs and the memory‑viz tool, and evaluates three effective optimizations—gradient checkpointing, mixed‑precision training, and optimizer selection—detailing their memory savings and performance costs.

DeepHub IMBA

GPU memory consumption factors

Model parameters – the weight tensors.

Gradients – one tensor per parameter, same size as the parameters.

Optimizer state – Adam stores two extra tensors (m and v) per parameter.

Activations – outputs of each layer that must be kept for back‑propagation.

Input batches – data loaded onto the GPU.

CUDA workspace – temporary kernel buffers and cuDNN caches.

Memory fragmentation – reserved memory that cannot be reused because the free space is split into gaps too small for new allocations.

For a 200 million‑parameter fp32 model trained with Adam, the memory breakdown is roughly:

Parameters: 800 MB

Gradients: 800 MB (same as parameters)

Adam state (m + v): 1 600 MB (2 × parameters)

Activations: 2–10 × parameters (highly variable)

Input batches: depends on batch size

CUDA workspace: 500 MB – 1 GB

Fragmentation: 5 % – 20 % of total memory

Thus a model that theoretically needs only 800 MB can occupy 5–8 GB in practice.
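The fixed part of this breakdown is simple arithmetic and can be reproduced with a few lines of Python. This is a minimal sketch, assuming fp32 values, Adam, and decimal megabytes (1 MB = 10**6 bytes); the function name `static_memory_mb` is made up for illustration.

```python
BYTES_FP32 = 4  # bytes per fp32 value


def static_memory_mb(n_params: int) -> dict:
    """Estimate the fixed training footprint in MB for an fp32 model with Adam."""
    params = n_params * BYTES_FP32
    grads = params            # one gradient value per parameter
    adam_state = 2 * params   # first and second moment estimates (m and v)
    return {
        "parameters": params / 10**6,
        "gradients": grads / 10**6,
        "adam_state": adam_state / 10**6,
        "total": (params + grads + adam_state) / 10**6,
    }


breakdown = static_memory_mb(200_000_000)
for name, mb in breakdown.items():
    print(f"{name:>10}: {mb:,.0f} MB")
```

For 200 million parameters the fixed part comes to 3,200 MB. Activations, input batches, CUDA workspace, and fragmentation account for the rest of the observed footprint.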

Measuring actual usage

PyTorch provides precise memory‑visibility utilities. The key metrics are:

import torch

# GPU memory actually allocated for tensors (GB)
allocated = torch.cuda.memory_allocated() / 1024**3
# GPU memory reserved by the allocator, including unused portions (GB)
reserved = torch.cuda.memory_reserved() / 1024**3
# Peak allocated memory since the last reset (GB)
peak = torch.cuda.max_memory_allocated() / 1024**3
# Reset the peak‑memory counter
torch.cuda.reset_peak_memory_stats()

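The difference reserved - allocated is memory that the caching allocator holds but no live tensor currently uses. Part of it is cache that later allocations can reuse, and part of it is fragmentation. For example, if allocated is 5 GB and reserved is 8 GB, 3 GB are reserved but not backing any tensor.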
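The same numbers are also available in one call through `torch.cuda.memory_stats()`, which returns a flat dictionary of counters. The helper below is a sketch that condenses such a dictionary; `summarize_stats` is a hypothetical name, and the example dictionary holds made-up values that mirror the 5 GB / 8 GB case so the code runs without a GPU.

```python
GIB = 1024**3


def summarize_stats(stats: dict) -> dict:
    """Condense an allocator stats dictionary into a few values in GiB."""
    allocated = stats["allocated_bytes.all.current"]
    reserved = stats["reserved_bytes.all.current"]
    peak = stats["allocated_bytes.all.peak"]
    # Inactive split blocks are free pieces of partially used segments,
    # which makes this counter a more direct hint of fragmentation.
    inactive_split = stats.get("inactive_split_bytes.all.current", 0)
    return {
        "allocated_gib": allocated / GIB,
        "reserved_gib": reserved / GIB,
        "peak_gib": peak / GIB,
        "reserved_unused_gib": (reserved - allocated) / GIB,
        "inactive_split_gib": inactive_split / GIB,
    }


# Made-up values for illustration: 5 GiB allocated, 8 GiB reserved
example_stats = {
    "allocated_bytes.all.current": 5 * GIB,
    "reserved_bytes.all.current": 8 * GIB,
    "allocated_bytes.all.peak": 6 * GIB,
    "inactive_split_bytes.all.current": 1 * GIB,
}
report = summarize_stats(example_stats)
print(report)

# On a CUDA machine: summarize_stats(torch.cuda.memory_stats())
```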

torch.cuda.memory_summary() prints a table of allocator statistics (allocated, active, and reserved memory, among others), each with current and peak values and split by allocator pool:

print(torch.cuda.memory_summary())

Memory‑history visualization

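PyTorch can record every allocation and dump a snapshot. The functions involved are underscore‑prefixed private APIs, so their signatures may change between releases: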

torch.cuda.memory._record_memory_history(max_entries=100_000)

# Run one training step (model, x, y, criterion, and optimizer are assumed to exist)
output = model(x)
loss = criterion(output, y)
loss.backward()
optimizer.step()

# Save the snapshot
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
# Disable further recording
torch.cuda.memory._record_memory_history(enabled=None)

Upload the generated memory_snapshot.pickle to https://pytorch.org/memory_viz to view an interactive UI that shows each allocation, release, and the full call stack that triggered it.
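If the recorded step can fail, for example with an out‑of‑memory error, it helps to make sure the snapshot is still written. One way to do that is a small context manager around the same three calls. This is a sketch, and `memory_snapshot` is a hypothetical helper name, not part of PyTorch.

```python
from contextlib import contextmanager


@contextmanager
def memory_snapshot(path: str, max_entries: int = 100_000):
    """Record allocator history for the enclosed code and dump it to `path`."""
    import torch  # imported lazily so the helper can be defined without PyTorch

    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        yield
    finally:
        # Runs even if the enclosed step raises, so the snapshot is not lost
        torch.cuda.memory._dump_snapshot(path)
        torch.cuda.memory._record_memory_history(enabled=None)


# Usage on a CUDA machine (model, x, y, criterion, optimizer assumed to exist):
# with memory_snapshot("memory_snapshot.pickle"):
#     loss = criterion(model(x), y)
#     loss.backward()
#     optimizer.step()
```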

Optimization techniques

1. Gradient checkpointing (compute‑for‑memory)

Activations are usually the largest memory consumer. Gradient checkpointing recomputes activations during the backward pass instead of storing them.

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MyBlock(nn.Module):
    def forward(self, x):
        # Activations inside _forward are recomputed during backward instead of stored
        return checkpoint(self._forward, x, use_reentrant=False)

    def _forward(self, x):
        # Expensive computation here
        return x

Typical savings: 40 %–60 % reduction in activation memory, at the cost of a 20 %–30 % slower backward pass, because the forward computation of each checkpointed block runs again there.
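A back‑of‑the‑envelope model shows where figures of this size come from. The sketch below assumes a chain of equally sized layers split into equal segments, with only each segment's input kept and one segment recomputed at a time. It also assumes a backward pass that costs about twice a forward pass. These are simplifying assumptions for illustration, not measurements, and `checkpoint_tradeoff` is a made‑up name.

```python
def checkpoint_tradeoff(n_layers: int, n_segments: int, backward_cost: float = 2.0):
    """Return (fraction of activation memory saved, fraction of extra compute)."""
    if n_layers % n_segments:
        raise ValueError("n_layers must be divisible by n_segments in this sketch")
    baseline = n_layers  # without checkpointing, every layer output is stored
    # Stored: one input per segment, plus the activations of the one
    # segment that is currently being recomputed during backward.
    checkpointed = n_segments + n_layers // n_segments
    saved = 1 - checkpointed / baseline
    # One extra forward pass (cost 1) on top of forward (1) + backward (backward_cost)
    extra_compute = 1.0 / (1.0 + backward_cost)
    return saved, extra_compute


saved, extra = checkpoint_tradeoff(n_layers=16, n_segments=4)
print(f"activation memory saved: {saved:.0%}, extra compute: {extra:.0%}")
```

With 16 layers in 4 segments this model gives a 50 % activation saving for about one third more compute. Checkpointing only some blocks lowers both numbers.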

2. Mixed‑precision training

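import torch
from torch.amp import autocast, GradScaler

scaler = GradScaler('cuda')  # scales the loss so small fp16 gradients do not underflow
with autocast('cuda', dtype=torch.float16):
    output = model(x)
    loss = criterion(output, y)

scaler.scale(loss).backward()
scaler.step(optimizer)  # unscales gradients and skips the step if they contain inf/NaN
scaler.update()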

Under autocast, eligible operations such as matrix multiplications and convolutions run in fp16, so the activations they save take 2 bytes per value. Parameters, their gradients, and optimizer states stay in fp32 for stability. Typical savings: 30 %–50 % total memory reduction when activations dominate, and fp16 operations often run faster on modern GPUs.
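How much of the total is saved depends on how large the activations are relative to everything else. The sketch below makes that explicit under a deliberately simple assumption: only activation memory shrinks from 4 to 2 bytes per value, while the fixed 3,200 MB of parameters, gradients, and Adam state from the 200‑million‑parameter example stays in fp32. The function `amp_saving` and its `fp16_fraction` parameter are hypothetical, and workspace and fragmentation are ignored.

```python
def amp_saving(static_mb: float, activations_mb: float, fp16_fraction: float = 1.0) -> float:
    """Fraction of total memory saved when a share of activations moves to fp16."""
    before = static_mb + activations_mb
    after = static_mb + activations_mb * (1 - 0.5 * fp16_fraction)
    return 1 - after / before


static_mb = 3200  # parameters + gradients + Adam state, all fp32
for multiple in (2, 5, 10):
    activations_mb = multiple * 800  # activations as a multiple of parameter memory
    saving = amp_saving(static_mb, activations_mb)
    print(f"activations = {multiple:>2} x parameters: {saving:.0%} saved")
```

In this model the saving grows with the activation share, so large batch sizes and long sequences benefit the most.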

3. Optimizer choice

Adam stores two extra tensors per parameter; for a 1‑billion‑parameter fp32 model the optimizer state alone consumes ~8 GB.

SGD with momentum: one extra tensor per parameter (half the Adam overhead).

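8‑bit AdamW (bnb.optim.AdamW8bit from the bitsandbytes library): stores the optimizer state in 8 bits, cutting its memory by roughly a factor of four, usually with negligible accuracy loss.

Lion: keeps a single momentum tensor per parameter, so its memory is comparable to SGD with momentum; convergence is often reported as similar to Adam.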

For models exceeding one billion parameters, optimizer selection can be the deciding factor for whether training fits on the available hardware.
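The comparison above reduces to bytes of optimizer state per parameter. The following sketch tabulates it for a 1‑billion‑parameter model, assuming fp32 state except for the 8‑bit variant and ignoring the small quantization metadata that 8‑bit optimizers keep. The dictionary keys are labels chosen for this example, not library identifiers.

```python
# Bytes of optimizer state per parameter
STATE_BYTES_PER_PARAM = {
    "adam_fp32": 8,      # m and v, 4 bytes each
    "sgd_momentum": 4,   # one momentum buffer
    "adamw_8bit": 2,     # m and v at 1 byte each, quantization metadata ignored
    "lion": 4,           # one momentum buffer
}


def optimizer_state_gb(n_params: int, optimizer: str) -> float:
    """Optimizer state size in GB (1 GB = 10**9 bytes)."""
    return n_params * STATE_BYTES_PER_PARAM[optimizer] / 10**9


for name in STATE_BYTES_PER_PARAM:
    print(f"{name:>12}: {optimizer_state_gb(1_000_000_000, name):.0f} GB")
```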

Conclusion

Measuring GPU memory with these PyTorch utilities shows where the memory goes. Applying the optimizations above can then cut usage by 30 %–60 %, which allows larger batch sizes, faster training, and better gradient estimates.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
