Beginner’s Guide to VAE: Theory, Training, and Full Implementation

This article walks readers through the fundamentals of Variational Autoencoders, compares five major generative model paradigms, explains VAE architecture, training and inference steps, provides PyTorch code, and analyzes experimental results on MNIST and Flowers datasets.

xkx's Tech General Store

VAE Overview

A Variational Autoencoder (VAE) is a probabilistic generative model that maps an input image to a continuous latent distribution (parameterized by a mean μ and a variance σ²) and reconstructs images by decoding samples drawn from that distribution. The training objective combines a reconstruction loss (MSE) with a KL‑divergence loss that regularizes the latent space toward a standard normal distribution.
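Written out, the objective described above can be sketched as follows. This assumes a diagonal Gaussian posterior over d latent dimensions, uses x′ for the reconstruction and β for the KL weight as elsewhere in this article, and leaves the exact reduction over pixels and batch open, since that depends on the implementation.

```latex
\mathcal{L}(x) \;=\; \lVert x - x' \rVert_2^2
  \;+\; \beta \, D_{\mathrm{KL}}\!\left( \mathcal{N}\!\left(\mu, \operatorname{diag}(\sigma^2)\right) \,\middle\|\, \mathcal{N}(0, I) \right)

D_{\mathrm{KL}} \;=\; -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
```

With logvar = log σ², the second line is term by term the expression -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) used in the training code.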

Architecture

Probabilistic Encoder: a stack of Conv2d layers with BatchNorm2d and LeakyReLU that down‑samples a 64×64 image to an 8192‑dimensional feature vector (512 channels × 4×4, matching the 4×4×512 reshape used by the decoder), then projects it to μ and logvar via fully‑connected layers.
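The shape arithmetic of the encoder can be checked with a minimal sketch. The class name ConvEncoder, the channel widths (64, 128, 256, 512), the kernel/stride/padding values, the LeakyReLU slope, and latent_dim=128 are assumptions chosen so that a 64×64 input ends at 512 × 4 × 4 = 8192 features; the article's actual layer settings may differ.

```python
import torch
import torch.nn as nn


class ConvEncoder(nn.Module):
    """Hypothetical encoder: 3x64x64 image -> (mu, logvar)."""

    def __init__(self, in_channels: int = 3, latent_dim: int = 128):
        super().__init__()
        channels = [in_channels, 64, 128, 256, 512]  # assumed widths
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [
                # kernel 4, stride 2, padding 1 halves the spatial size
                nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.2),
            ]
        self.features = nn.Sequential(*layers)
        # Two linear heads read the same flattened 8192-dim feature vector
        self.fc_mu = nn.Linear(512 * 4 * 4, latent_dim)
        self.fc_logvar = nn.Linear(512 * 4 * 4, latent_dim)

    def forward(self, x):
        h = self.features(x).flatten(start_dim=1)  # (batch, 8192)
        return self.fc_mu(h), self.fc_logvar(h)


encoder = ConvEncoder()
images = torch.rand(8, 3, 64, 64)  # random stand-in for a batch of images
mu, logvar = encoder(images)
feature_dim = encoder.features(images).flatten(start_dim=1).shape[1]
print(feature_dim, tuple(mu.shape), tuple(logvar.shape))
```

Four stride‑2 convolutions take the spatial size through 64 → 32 → 16 → 8 → 4, which is where the 8192 figure comes from.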

Reparameterization: implements

std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
z = mu + eps * std

to obtain a differentiable latent sample.
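To see what "differentiable" means here, the following small check wraps the three lines above in a function and confirms that gradients reach both mu and logvar. The tensor shapes and the seed are arbitrary choices for illustration.

```python
import torch


def reparameterize(mu, logvar):
    # Same three lines as above: the noise eps is sampled independently
    # of the parameters, so z is a deterministic function of mu and logvar.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std


torch.manual_seed(0)
mu = torch.zeros(4, 2, requires_grad=True)
logvar = torch.zeros(4, 2, requires_grad=True)

z = reparameterize(mu, logvar)
z.sum().backward()

# dz/dmu is 1 everywhere; dz/dlogvar is 0.5 * eps * std, so it varies with the noise.
print(mu.grad)
print(logvar.grad)
```

Sampling z directly from N(μ, σ²) would not expose such a gradient path, which is why the noise is drawn separately and then shifted and scaled.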

Probabilistic Decoder: symmetric ConvTranspose2d up‑sampling layers ending with a Sigmoid to produce a reconstructed image of the original size.
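A decoder that mirrors this description might look like the sketch below. The class name ConvDecoder, the channel widths, the kernel/stride/padding values, and latent_dim=128 are assumptions; only the 4×4×512 starting shape, the ConvTranspose2d layers, and the final Sigmoid come from the article.

```python
import torch
import torch.nn as nn


class ConvDecoder(nn.Module):
    """Hypothetical decoder: latent vector -> 3x64x64 image in [0, 1]."""

    def __init__(self, out_channels: int = 3, latent_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 512 * 4 * 4)
        channels = [512, 256, 128, 64]  # assumed widths
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [
                # kernel 4, stride 2, padding 1 doubles the spatial size
                nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.2),
            ]
        layers += [
            nn.ConvTranspose2d(64, out_channels, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),  # keeps pixel values in [0, 1]
        ]
        self.up = nn.Sequential(*layers)

    def forward(self, z):
        h = self.fc(z).view(-1, 512, 4, 4)
        return self.up(h)


decoder = ConvDecoder()
z = torch.randn(8, 128)
reconstruction = decoder(z)
print(tuple(reconstruction.shape))
```

The spatial size goes 4 → 8 → 16 → 32 → 64, the reverse of the encoder path.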

Training Process

Load a batch of pre‑processed images (e.g., 64×64 flower images with batch_size=32).

Encode each image to obtain μ and logvar.

Apply the reparameterization trick to sample z = μ + σ·ε, where σ = exp(0.5·logvar) and ε ∼ N(0, I).

Decode z to obtain the reconstruction x′.

Compute the total loss as the weighted sum of the reconstruction loss (MSE) and the KL‑divergence loss. The KL term uses the standard formula

-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

and is scaled by a β weight (e.g., beta=0.5).

Back‑propagate the loss to update model parameters and repeat until convergence.
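The steps above can be sketched as a short training loop. To keep the example self-contained, TinyVAE is a small fully-connected stand-in rather than the convolutional model, the batch is random data instead of flower images, and the choice to sum the MSE and divide the total by the batch size is an assumption; the article does not state which reduction it uses, and that choice changes the effective strength of β.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVAE(nn.Module):
    """Fully-connected stand-in for the convolutional VAE (illustration only)."""

    def __init__(self, input_dim=3 * 64 * 64, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = F.relu(self.enc(x.flatten(start_dim=1)))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)        # encode
        std = torch.exp(0.5 * logvar)
        z = mu + torch.randn_like(std) * std                 # reparameterize
        return self.dec(z).view_as(x), mu, logvar            # decode


def vae_loss(recon, x, mu, logvar, beta=0.5):
    recon_loss = F.mse_loss(recon, x, reduction="sum")       # assumed reduction
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    total = (recon_loss + beta * kl) / x.size(0)             # average per image
    return total, recon_loss, kl


torch.manual_seed(0)
model = TinyVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.rand(32, 3, 64, 64)  # random stand-in for one batch of images

losses = []
for step in range(5):
    recon, mu, logvar = model(batch)
    loss, recon_loss, kl = vae_loss(recon, batch, mu, logvar, beta=0.5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses)
```

In a real run the loop would iterate over a DataLoader for several epochs; five steps on one fixed batch only show that the pieces fit together.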

Inference

Generate arbitrary images by sampling z ∼ N(0, I) (with the same dimensionality as the training latent space) and feeding it to the decoder.

Generate a specific class (e.g., red roses) by averaging the μ vectors of a set of target images to obtain μ̄, adding a small Gaussian perturbation (z = μ̄ + 0.1·ε, ε ∼ N(0, I)), and decoding the result.
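Both inference modes can be sketched in a few lines. The modules encoder_mu and decoder below are untrained linear stand-ins with hypothetical names, used only so that the example runs on its own; with a trained model they would be replaced by its encoder mean head and decoder.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim = 32

# Untrained stand-ins (assumptions for illustration, not the article's model)
encoder_mu = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, latent_dim))
decoder = nn.Sequential(
    nn.Linear(latent_dim, 3 * 64 * 64),
    nn.Sigmoid(),
    nn.Unflatten(1, (3, 64, 64)),
)


@torch.no_grad()
def sample_from_prior(n):
    """Arbitrary images: z ~ N(0, I), then decode."""
    z = torch.randn(n, latent_dim)
    return decoder(z)


@torch.no_grad()
def sample_near_class(target_images, n, noise_scale=0.1):
    """Class-specific images: average the means, perturb slightly, decode."""
    mu_bar = encoder_mu(target_images).mean(dim=0, keepdim=True)  # (1, latent_dim)
    z = mu_bar + noise_scale * torch.randn(n, latent_dim)
    return decoder(z)


red_roses = torch.rand(10, 3, 64, 64)  # random stand-in for the target images
random_images = sample_from_prior(4)
rose_like_images = sample_near_class(red_roses, 4)
print(tuple(random_images.shape), tuple(rose_like_images.shape))
```

For a model that contains BatchNorm2d layers, calling model.eval() before sampling is advisable so that the running statistics are used instead of per-batch statistics.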

Implementation Details

The PyTorch implementation provides the following methods:

encode: returns μ and logvar for an input image.

reparameterize: executes the three‑line reparameterization code shown above.

decode: maps a latent vector z through a fully‑connected layer, reshapes to 4×4×512, and applies transposed convolutions to reconstruct the image.

forward: combines encode → reparameterize → decode and returns the reconstruction together with μ and logvar for loss computation.

sample: draws z ∼ N(0, I) and runs decode to produce a generated image without requiring an input.
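A compact class exposing these five methods might be wired together as follows. This is a sketch for 64×64 inputs, not the article's code: the helper names conv_block and deconv_block, the channel widths, and the default latent_dim are assumptions.

```python
import torch
import torch.nn as nn


def conv_block(c_in, c_out):
    # Halves the spatial size (kernel 4, stride 2, padding 1)
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, 2, 1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))


def deconv_block(c_in, c_out):
    # Doubles the spatial size (kernel 4, stride 2, padding 1)
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, 2, 1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))


class VAE(nn.Module):
    def __init__(self, channels=3, latent_dim=128):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = nn.Sequential(
            conv_block(channels, 64), conv_block(64, 128),
            conv_block(128, 256), conv_block(256, 512), nn.Flatten())
        self.fc_mu = nn.Linear(512 * 4 * 4, latent_dim)
        self.fc_logvar = nn.Linear(512 * 4 * 4, latent_dim)
        self.fc_decode = nn.Linear(latent_dim, 512 * 4 * 4)
        self.decoder = nn.Sequential(
            deconv_block(512, 256), deconv_block(256, 128), deconv_block(128, 64),
            nn.ConvTranspose2d(64, channels, 4, 2, 1), nn.Sigmoid())

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(self.fc_decode(z).view(-1, 512, 4, 4))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

    @torch.no_grad()
    def sample(self, n):
        z = torch.randn(n, self.latent_dim, device=self.fc_decode.weight.device)
        return self.decode(z)


model = VAE()
recon, mu, logvar = model(torch.rand(4, 3, 64, 64))
model.eval()  # use BatchNorm running statistics when sampling
generated = model.sample(2)
print(tuple(recon.shape), tuple(mu.shape), tuple(generated.shape))
```

Returning μ and logvar from forward keeps the loss computation outside the model, so the same class can be trained with different β values without modification.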

Training Configuration

--data_dir ./mnist: path to the MNIST dataset.

--epochs 30: number of training epochs.

--batch_size 128: batch size per iteration.

--image_size 28: input image size (MNIST is 28×28).

--latent_dim 128: dimensionality of the latent vector.

--lr 0.001: learning rate.

--beta 0.5: weight of the KL loss term.
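One plausible way to expose these options is an argparse parser whose defaults match the values listed above. The function name build_parser and the help strings are assumptions; the article only shows the flags and their values.

```python
import argparse


def build_parser():
    """Hypothetical CLI mirroring the flags listed above."""
    parser = argparse.ArgumentParser(description="Train a VAE")
    parser.add_argument("--data_dir", type=str, default="./mnist",
                        help="path to the dataset")
    parser.add_argument("--epochs", type=int, default=30,
                        help="number of training epochs")
    parser.add_argument("--batch_size", type=int, default=128,
                        help="batch size per iteration")
    parser.add_argument("--image_size", type=int, default=28,
                        help="input image size in pixels")
    parser.add_argument("--latent_dim", type=int, default=128,
                        help="dimensionality of the latent vector")
    parser.add_argument("--lr", type=float, default=0.001,
                        help="learning rate")
    parser.add_argument("--beta", type=float, default=0.5,
                        help="weight of the KL loss term")
    return parser


# An empty list selects the defaults; call parse_args() with no argument
# to read the real command line instead.
args = build_parser().parse_args([])
override = build_parser().parse_args(["--beta", "1.0", "--epochs", "5"])
print(vars(args))
```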

Experimental Results

Loss curves gradually stabilize over training. On MNIST the reconstructions are sharp and preserve digit identity; on the Flowers dataset the reconstructions appear blurry. Reported visualizations include:

Reconstruction examples at epochs 5, 15, and 30.

Latent‑space distribution plots showing μ centered near zero and σ close to one, indicating effective KL regularization.

Linear interpolation between digits (e.g., 4 → 9) demonstrating smooth semantic transitions.

Per‑dimension traversal showing limited semantic change, suggesting redundancy in the latent representation for the current data.

Reconstruction error analysis and samples generated from the prior distribution.
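The linear interpolation mentioned in the list above amounts to blending two latent vectors and decoding each intermediate point. A minimal sketch of the latent-space part follows; the function name interpolate_latents is hypothetical, and the two random vectors stand in for the encoder means of a "4" and a "9".

```python
import torch


def interpolate_latents(z_start, z_end, steps=8):
    """Return `steps` points on the straight line from z_start to z_end."""
    weights = torch.linspace(0.0, 1.0, steps).unsqueeze(1)  # (steps, 1)
    return (1.0 - weights) * z_start + weights * z_end      # (steps, latent_dim)


torch.manual_seed(0)
z_four = torch.randn(128)  # stand-in for the encoded mean of a "4"
z_nine = torch.randn(128)  # stand-in for the encoded mean of a "9"

path = interpolate_latents(z_four, z_nine, steps=8)
print(tuple(path.shape))
```

Passing each row of path through the trained decoder would produce the sequence of images that forms the 4 → 9 transition.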

Analysis

VAE performs well on simple grayscale datasets such as MNIST because the Gaussian prior and KL regularization can capture the dominant low‑frequency structure. On more complex color images the model struggles: the KL term penalizes deviation from the Gaussian prior, which pushes the latent code toward average features rather than image‑specific detail, and the MSE reconstruction loss averages over plausible outputs, which smooths high‑frequency details. The result is blurred outputs.
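The smoothing effect of MSE can be illustrated with a toy example. Suppose the decoder cannot tell which of two equally likely sharp patterns is correct; the two four-pixel "images" below are invented purely for illustration. The prediction that minimizes the expected MSE is then their pixel-wise average, which has no sharp edges at all.

```python
# Two equally likely sharp targets (toy four-pixel "images")
sharp_a = [0.0, 1.0, 0.0, 1.0]
sharp_b = [1.0, 0.0, 1.0, 0.0]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def expected_mse(pred):
    # Each target occurs with probability 0.5
    return 0.5 * mse(pred, sharp_a) + 0.5 * mse(pred, sharp_b)


blurry = [(a + b) / 2 for a, b in zip(sharp_a, sharp_b)]  # all pixels 0.5

loss_sharp = expected_mse(sharp_a)   # commit to one sharp answer
loss_blurry = expected_mse(blurry)   # hedge with the average
print(loss_sharp, loss_blurry)       # 0.5 0.25
```

Committing to a sharp guess costs twice as much on average as predicting the gray mean, so a model trained with MSE is rewarded for blurring whenever the latent code leaves fine detail ambiguous.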

Conclusion

VAE provides a solid foundation for generative modeling and illustrates the trade‑off between latent‑space regularity and reconstruction fidelity. Its limitations on high‑frequency details and color images motivate the use of more advanced paradigms (e.g., diffusion models or flow‑based models) for later stages of generative research.

