Understanding Diffusion Models, Autoencoders, and VAEs for AIGC with Code Examples
This article introduces the rapidly growing AIGC field: it explains diffusion-based image generation, details the principles and mathematics of the AutoEncoder (AE) and Variational AutoEncoder (VAE) models, and provides complete TensorFlow code examples so readers can master these generative techniques step by step.
1. Introduction
AIGC is currently a very hot direction; models such as DALL·E 2, Imagen, and Stable Diffusion can generate photorealistic yet creatively imaginative images. The pictures below were generated with the open-source Stable Diffusion model.
All of these models rely on diffusion model technology, but without the right background knowledge the learning curve is steep. Following the progression from AE to VAE, CVAE, and finally DDPM provides a clearer path to understanding and mastering them.
2. AE (AutoEncoder)
The AE model extracts core features (latent attributes) from data; if the low‑dimensional features can perfectly reconstruct the original data, they serve as an excellent representation.
The AE architecture is shown below.
Training data are encoded into a latent vector, which is then decoded back to reconstructed data; the reconstruction loss guides training. The following TensorFlow code implements a simple convolutional AE on MNIST:
```python
import tensorflow as tf


class DownConvLayer(tf.keras.layers.Layer):
    """Convolution followed by 2x max-pooling: halves the spatial resolution."""
    def __init__(self, dim):
        super(DownConvLayer, self).__init__()
        self.conv = tf.keras.layers.Conv2D(
            dim, 3, activation=tf.keras.layers.ReLU(),
            use_bias=False, padding='same')
        self.pool = tf.keras.layers.MaxPool2D(2)

    def call(self, x, training=False, **kwargs):
        x = self.conv(x)
        x = self.pool(x)
        return x


class UpConvLayer(tf.keras.layers.Layer):
    """Convolution followed by 2x upsampling: doubles the spatial resolution."""
    def __init__(self, dim):
        super(UpConvLayer, self).__init__()
        self.conv = tf.keras.layers.Conv2D(
            dim, 3, activation=tf.keras.layers.ReLU(),
            use_bias=False, padding='same')
        # Upsampling
        self.pool = tf.keras.layers.UpSampling2D(2)

    def call(self, x, training=False, **kwargs):
        x = self.conv(x)
        x = self.pool(x)
        return x


class Encoder(tf.keras.layers.Layer):
    def __init__(self, dim, layer_num=3):
        super(Encoder, self).__init__()
        self.convs = [DownConvLayer(dim) for _ in range(layer_num)]

    def call(self, x, training=False, **kwargs):
        for conv in self.convs:
            x = conv(x, training=training)
        return x


class Decoder(tf.keras.layers.Layer):
    def __init__(self, dim, layer_num=3):
        super(Decoder, self).__init__()
        self.convs = [UpConvLayer(dim) for _ in range(layer_num)]
        # padding='same' keeps the output the same size as the upsampled
        # feature map, so the reconstruction matches the input resolution
        self.final_conv = tf.keras.layers.Conv2D(1, 3, strides=1, padding='same')

    def call(self, x, training=False, **kwargs):
        for conv in self.convs:
            x = conv(x, training=training)
        reconstruct = self.final_conv(x)
        return reconstruct


class AutoEncoderModel(tf.keras.Model):
    def __init__(self):
        super(AutoEncoderModel, self).__init__()
        self.encoder = Encoder(64, layer_num=3)
        self.decoder = Decoder(64, layer_num=3)

    def call(self, inputs, training=None, mask=None):
        image = inputs[0]
        latent = self.encoder(image, training=training)
        reconstruct_img = self.decoder(latent, training=training)
        return reconstruct_img

    @tf.function
    def train_step(self, data):
        # Note: with three 2x down/up-sampling stages, the input height and
        # width must be divisible by 8 (e.g. pad 28x28 MNIST images to 32x32).
        img = data["image"]
        with tf.GradientTape() as tape:
            reconstruct_img = self((img,), training=True)
            l2_loss = (reconstruct_img - img) ** 2
            l2_loss = tf.reduce_mean(tf.reduce_sum(l2_loss, axis=(1, 2, 3)))
        gradients = tape.gradient(l2_loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        return {"l2_loss": l2_loss}
```

From the AE model we can see that as long as the latent representation captures the data well, the decoder can reconstruct the input. However, the latent is always derived from an existing sample, so an AE cannot generate truly novel data.
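The train/reconstruct loop above can be exercised end to end with a miniature stand-in model. The snippet below is a sketch, not the article's code: a tiny dense autoencoder replaces the convolutional one purely to keep the snippet self-contained, and a random 32×32 batch stands in for MNIST, but the gradient-tape training step and L2 objective are the same.

```python
import tensorflow as tf

# Tiny stand-in autoencoder (dense instead of convolutional, purely to keep
# this snippet self-contained); the loop mirrors train_step above.
encoder = tf.keras.Sequential([tf.keras.layers.Flatten(),
                               tf.keras.layers.Dense(32, activation='relu')])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(32 * 32),
                               tf.keras.layers.Reshape((32, 32, 1))])
optimizer = tf.keras.optimizers.Adam(1e-3)

img = tf.random.uniform((8, 32, 32, 1))  # random batch standing in for MNIST
for step in range(5):
    with tf.GradientTape() as tape:
        reconstruct = decoder(encoder(img))
        # Same L2 objective as train_step: sum over pixels, mean over batch.
        loss = tf.reduce_mean(
            tf.reduce_sum((reconstruct - img) ** 2, axis=(1, 2, 3)))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
```

Swapping the dense pair for the `Encoder`/`Decoder` classes above recovers the convolutional version, provided the input size is divisible by the total pooling factor.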
Therefore we hypothesize that if the latent follows a known distribution that can be parameterized, we could sample new latents and generate new data—this idea leads to the Variational AutoEncoder (VAE).
3. VAE (Variational AutoEncoder)
VAE assumes the latent variable \(z\) follows a normal distribution; during training the model learns the mean and variance of this distribution.
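Formally, a VAE is trained by maximizing the evidence lower bound (ELBO) on the data log-likelihood, the standard VAE objective:

\[
\log p(x) \;\ge\; \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big), \qquad p(z)=\mathcal{N}(0, I)
\]

The first term rewards accurate reconstruction of \(x\) from \(z\); the second keeps the learned posterior \(q_\phi(z\mid x)\) close to the standard normal prior, which is what makes sampling new latents possible.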
Training a VAE requires optimizing two objectives: (1) reconstruction loss (e.g., L2 or L1) to make generated data close to the input, and (2) a KL‑divergence term that forces the learned latent distribution to match the standard normal distribution.
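Both objectives, together with the reparameterization trick that lets gradients flow through the sampling step, fit in a short sketch. The snippet below is illustrative, not the article's implementation: dense layers and the layer sizes are assumptions chosen for brevity.

```python
import tensorflow as tf

# Minimal VAE sketch (dense layers and sizes are illustrative assumptions).
latent_dim = 8
inputs = tf.random.uniform((4, 784))  # stand-in batch of flattened images

# Encoder outputs both the mean and the log-variance of q(z|x).
encoder = tf.keras.Sequential([tf.keras.layers.Dense(64, activation='relu'),
                               tf.keras.layers.Dense(2 * latent_dim)])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(64, activation='relu'),
                               tf.keras.layers.Dense(784)])

stats = encoder(inputs)
mean, logvar = tf.split(stats, 2, axis=-1)

# Reparameterization trick: z = mean + sigma * eps with eps ~ N(0, I),
# so sampling is differentiable with respect to mean and logvar.
eps = tf.random.normal(tf.shape(mean))
z = mean + tf.exp(0.5 * logvar) * eps

reconstruct = decoder(z)

# Objective (1): L2 reconstruction loss.
recon_loss = tf.reduce_mean(
    tf.reduce_sum((reconstruct - inputs) ** 2, axis=-1))
# Objective (2): KL divergence between N(mean, sigma^2) and N(0, I),
# in closed form: -0.5 * sum(1 + logvar - mean^2 - exp(logvar)).
kl_loss = tf.reduce_mean(
    -0.5 * tf.reduce_sum(1.0 + logvar - mean ** 2 - tf.exp(logvar), axis=-1))
loss = recon_loss + kl_loss
```

After training, generation only needs the decoder: sample \(z \sim \mathcal{N}(0, I)\) and decode it, which is exactly what the KL term makes valid.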
Because the article contains many formulas (over 140), the original platform cannot display them well; a link to the full document is provided for detailed reading.
Laiye Technology Team
Official account of Laiye Technology, featuring its best tech innovations, practical implementations, and cutting‑edge industry insights.