Probability Basics, Discriminative vs Generative Models, and Autoencoders (including Variational Autoencoders)
This article introduces fundamental probability notation, explains the difference between discriminative and generative models, and provides a comprehensive overview of autoencoders and variational autoencoders, covering their architectures, loss functions, latent spaces, and practical applications in image manipulation.
Background Knowledge
Probability Notation
P(x) = probability that event x occurs.
Joint probability: P(x, y) or P(x ∩ y) = probability that both x and y occur.
Conditional probability: P(x | y) = probability that x occurs given that y has occurred.
It relates to joint probability via P(x | y) = P(x, y) / P(y), provided P(y) > 0.
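As a concrete check, P(x | y) = P(x, y) / P(y) can be verified on a toy joint distribution; the numbers below are purely illustrative:

```python
# Toy joint distribution over two binary events (illustrative numbers only).
# Keys are (rain, umbrella) outcomes.
joint = {
    (True, True): 0.25,
    (True, False): 0.05,
    (False, True): 0.10,
    (False, False): 0.60,
}

# Marginal P(umbrella=True): sum the joint over all outcomes where umbrella is True.
p_umbrella = sum(p for (rain, umb), p in joint.items() if umb)

# Conditional P(rain=True | umbrella=True) = P(rain, umbrella) / P(umbrella).
p_rain_given_umbrella = joint[(True, True)] / p_umbrella

print(round(p_umbrella, 4))             # 0.35
print(round(p_rain_given_umbrella, 4))  # 0.7143
```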
Probability Distribution
Describes the probabilities of all possible values of a random variable.
Represented as a function p(x).
Assigns a non‑negative value to each possible x, indicating its likelihood.
The sum (or integral for continuous variables) equals 1 (normalisation).
Example
An image shows the probability distribution of letters a‑z in English text.
The chart reflects the relative frequency of each letter in a corpus.
All probabilities sum to 1.
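Normalisation is easy to see in code: dividing any set of non-negative counts by their total yields a valid distribution. The letter counts below are approximate illustrative values, not measured corpus frequencies:

```python
# Illustrative (approximate) letter counts; normalising any non-negative
# counts produces probabilities that sum to 1 by construction.
counts = {"e": 1270, "t": 906, "a": 817, "o": 751, "z": 7}

total = sum(counts.values())
dist = {letter: c / total for letter, c in counts.items()}

# Normalisation check: the probabilities sum to 1.
print(abs(sum(dist.values()) - 1.0) < 1e-12)  # True
```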
Probability Density Function (PDF)
Describes the distribution of a continuous random variable.
Represented as a function p(x).
Assigns a non‑negative density to each possible x.
The total area under the curve (integral) equals 1; the probability of a range is the integral over that range.
Example
An image shows the PDF of weekly income in Australia.
The chart visualises the density of each income level and confirms normalisation.
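As a numerical illustration (a standard normal stands in for the income density), both normalisation and the probability-of-a-range property can be checked with simple integration:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, n=100_000):
    """Simple midpoint-rule numerical integration of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Total area under the curve is (numerically) 1.
total = integrate(normal_pdf, -10, 10)

# Probability of a range = integral over that range, e.g. P(-1 <= X <= 1).
p_range = integrate(normal_pdf, -1, 1)
print(round(total, 4), round(p_range, 4))  # 1.0 0.6827
```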
Discriminative vs Generative Models
Discriminative Models
Learn the relationship between input data and output labels, i.e., p(y|x).
Typical examples: logistic regression, support vector machines, neural networks.
Generative Models
Generative models can produce new samples from the learned distribution.
Conditional Generative Model
Input: a label (e.g., "frog"). Output: a probability distribution p(x|y) over images. Example: generate images that correspond to the given label.
Unconditional Generative Model
Output: a probability distribution p(x) over images; generates data without any conditioning label. Example: trained on a mixed animal dataset, it can produce a random animal image.
Bayes' Rule
P(y|x) = P(x|y) P(y) / P(x)
This rule links discriminative and generative models: combining a generative model p(x|y) with a prior p(y) yields the discriminative quantity p(y|x).
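As a sketch, Bayes' rule P(y|x) = P(x|y) P(y) / P(x) can be applied to toy class priors and likelihoods (numbers invented for illustration):

```python
# Bayes' rule: P(y|x) = P(x|y) * P(y) / P(x), with P(x) obtained by
# marginalising over the classes. Numbers below are illustrative.
priors = {"cat": 0.5, "dog": 0.5}          # p(y)
likelihoods = {"cat": 0.09, "dog": 0.01}   # p(x|y) for one particular image x

# Evidence: p(x) = sum over y of p(x|y) * p(y).
p_x = sum(likelihoods[y] * priors[y] for y in priors)

# Posterior p(y|x) for each class.
posterior = {y: likelihoods[y] * priors[y] / p_x for y in priors}
print(round(posterior["cat"], 2))  # 0.9
```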
Comparison
Discriminative Models
Goal: learn the conditional probability p(y|x). Directly model the decision boundary between classes.
Generative Models
Goal: learn the joint probability p(x, y). Capture how data are generated, enabling sampling and reverse inference of p(x|y).
What makes one set of pixels more probable than another? Typically, the frequency of that pixel pattern in the training data and the contextual structure of images.
Autoencoders
Unsupervised Learning
Learn a model from unlabeled data.
Goal: find a compact representation of the data, often with fewer parameters.
Applications:
Provide a simplified model for another ML system.
Dimensionality reduction.
Potentially improve generalisation.
Autoencoder Architecture
Basic architecture
Deep architecture
Neural network used for unsupervised learning.
Sometimes called “self‑supervised” learning.
Output reproduces the input (e.g., an image).
Hidden layer learns a low‑dimensional representation.
Structure
Encoder compresses the input into a hidden (bottleneck) layer.
Hidden layer has fewer neurons than the input layer.
Decoder reconstructs the input from the hidden representation.
Often the encoder and decoder share weights.
Goal: make the output as close as possible to the original input.
Hidden Layer
“Bottleneck” layer smaller than the input.
Represents the input with latent variables.
With a linear activation and one hidden layer, it learns PCA.
Why must the hidden layer be smaller than the input? A smaller hidden layer forces the network to capture the intrinsic structure of the data rather than merely copying it, enabling dimensionality reduction and more informative features.
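A minimal NumPy sketch of the PCA connection: for synthetic data that truly lies in a 2-D subspace, an encoder/decoder built from the top principal components (with tied weights) reconstructs the input through a 2-D bottleneck essentially exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in 5-D that actually lie in a 2-D subspace.
Z_true = rng.normal(size=(100, 2))
X = Z_true @ rng.normal(size=(2, 5))
X -= X.mean(axis=0)  # centre the data, as PCA assumes

# With linear activations and one hidden layer, the optimal encoder/decoder
# span the top principal components, so PCA gives the solution in closed form.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W_enc = Vt[:2].T        # encoder: 5-D input -> 2-D bottleneck
W_dec = Vt[:2]          # decoder: 2-D bottleneck -> 5-D output (tied weights)

Z = X @ W_enc           # encode into the bottleneck
X_hat = Z @ W_dec       # decode back to the input space

mse = np.mean((X - X_hat) ** 2)
print(mse < 1e-12)  # True: rank-2 data is reconstructed near-exactly
```

In practice the encoder and decoder are trained by gradient descent on the reconstruction loss rather than solved in closed form, but for the linear case the two coincide.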
Output and Loss
The output has the same type as the input (e.g., an image) rather than a class label.
For binary images, use tanh or sigmoid activation.
For natural images, use linear activation.
Loss = difference between input and output (e.g., Mean Squared Error).
Latent Representation
The encoder maps the input to a low‑dimensional latent space that captures the main features and structure of the data.
Variational Autoencoder (VAE)
A probabilistic version of the autoencoder that learns a latent representation and can sample from the model to generate new images.
Assume images are generated from latent variables z following some distribution.
Typical prior p(z) is a standard normal distribution.
Probabilistic Encoder
Input: image x → outputs the mean μ_{z|x} and diagonal covariance Σ_{z|x}, defining a Gaussian q(z|x).
Probabilistic Decoder
Input: latent variable z → outputs the mean μ_{x|z} and diagonal covariance Σ_{x|z}, defining a Gaussian p(x|z).
Goal: maximise the likelihood p(x) using a variational lower bound.
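To draw z from the encoder's Gaussian while keeping the model trainable, VAEs conventionally use the reparameterisation trick; a minimal NumPy sketch (array sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var, rng):
    """Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I).

    Writing the sample this way keeps the randomness in eps, so gradients
    can flow through mu and log_var during training.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([0.0, 1.0])       # encoder mean for one input
log_var = np.array([0.0, 0.0])  # log-variance 0 => unit variance
z = sample_latent(mu, log_var, rng)
print(z.shape)  # (2,)
```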
VAE Architecture
Loss Function
Goal: maximise p(x). The loss is based on the variational lower bound and consists of two terms:
Reconstruction loss – encourages the decoder to produce outputs as similar as possible to the inputs.
Regularisation loss – encourages the latent distribution q(z|x) to be close to the prior p(z) (usually a standard normal) using the Kullback‑Leibler divergence.
Reconstruction Loss
Measures how well the decoder can reconstruct the input (e.g., log‑likelihood of x given the reconstruction).
Regularisation Loss
Measures the KL‑divergence between q(z|x) and the prior p(z), encouraging a meaningful latent space.
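For a diagonal Gaussian q(z|x) = N(μ, σ²) and a standard-normal prior, this KL divergence has a closed form, 0.5 Σ (μ² + σ² − log σ² − 1); a small NumPy sketch:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    KL = 0.5 * sum( mu^2 + sigma^2 - log(sigma^2) - 1 )
    """
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

# When q(z|x) already matches the prior, the penalty is zero.
print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))  # 0.0

# Any deviation from the standard normal is penalised.
print(kl_to_standard_normal(np.array([1.0, 0.0]), np.array([0.0, 0.0])))  # 0.5
```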
Latent Space Properties
Continuous: nearby points generate similar images.
Complete: every point corresponds to a valid image.
A standard normal distribution satisfies both properties, and a diagonal covariance ensures independence of latent variables.
Applications – Image Manipulation
By modifying the latent vector z of an image x, one can create variations of x; however, latent directions may not correspond to interpretable attributes without additional constraints.
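One common manipulation is linear interpolation between two latent vectors; a minimal sketch assuming a trained encoder has already produced the z vectors (decoding each intermediate z, not shown here, yields a smooth morph because the latent space is continuous):

```python
import numpy as np

def interpolate_latents(z1, z2, steps=5):
    """Linearly interpolate between two latent vectors.

    Each intermediate z would be passed to a trained decoder (omitted here)
    to produce an image partway between the two source images.
    """
    ts = np.linspace(0.0, 1.0, steps)
    return [(1.0 - t) * z1 + t * z2 for t in ts]

z_a = np.array([0.5, -1.0])   # hypothetical latent codes for two images
z_b = np.array([-0.5, 1.0])
path = interpolate_latents(z_a, z_b, steps=5)
# Endpoints match z_a and z_b; the midpoint is their mean.
print(path[0], path[2], path[-1])
```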
VAE Pros and Cons
Pros
Learns approximations of p(z|x) and p(x|z), enabling generation of new instances.
Provides a probabilistic framework for sampling.
Cons
Generated images are often blurry because the model averages over many possible outputs.
Summary
Discriminative models predict labels from images; generative models predict the probability distribution of images.
Autoencoders assume images can be generated from a low‑dimensional latent space.
Regular autoencoders learn a latent representation for reconstruction.
Variational autoencoders are a probabilistic version that allows sampling from the latent space.