Understanding Diffusion Models, Autoencoders, and VAEs for AIGC with Code Examples
This article introduces the hot AIGC field by explaining diffusion‑based image generation, detailing the principles and mathematics of AutoEncoder and Variational AutoEncoder models, and providing complete TensorFlow code examples to help readers master these generative techniques step by step.
1. Introduction
AIGC is currently a very active area: models such as DALL·E 2, Imagen, and Stable Diffusion can generate photorealistic yet imaginative images. The pictures below were generated with the open‑source Stable Diffusion.
All of these models build on diffusion models, but the learning curve is steep without background knowledge. Working through AE (AutoEncoder), VAE (Variational AutoEncoder), CVAE (Conditional VAE), and DDPM (Denoising Diffusion Probabilistic Models) in that order gives a clearer path to understanding and mastering them.
2. AE (AutoEncoder)
The AE model extracts core features (latent attributes) from data; if the low‑dimensional features can perfectly reconstruct the original data, they serve as an excellent representation.
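As a small illustration of "low-dimensional features that perfectly reconstruct the data", the sketch below uses a linear encoder/decoder pair built from an SVD rather than a trained network. The toy data, the `encode`/`decode` helpers, and the choice of NumPy are assumptions made for this example only; they are not part of the article's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples in 5 dimensions that actually lie on a
# 2-dimensional subspace, so two latent features are enough to describe them.
true_latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = true_latent @ mixing

# Use the top-2 right singular vectors as a linear encoder/decoder pair.
_, _, vt = np.linalg.svd(data, full_matrices=False)
basis = vt[:2]  # shape (2, 5)


def encode(x):
    """Project 5-D samples onto the 2-D latent space."""
    return x @ basis.T


def decode(z):
    """Map 2-D latents back to the original 5-D space."""
    return z @ basis


latent = encode(data)
reconstruction = decode(latent)
max_error = np.max(np.abs(reconstruction - data))
print(latent.shape, max_error)
```

Because the data really is two-dimensional, the reconstruction error is at the level of floating-point round-off. A convolutional AE plays the same role for images, where the mapping to the latent space has to be nonlinear and learned.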
The AE architecture is shown below.
Training data are encoded into a latent vector, which is then decoded back to reconstructed data; the reconstruction loss guides training. The following TensorFlow code implements a simple convolutional AE on MNIST:
```python
import tensorflow as tf


class DownConvLayer(tf.keras.layers.Layer):
    """3x3 convolution followed by 2x max pooling (halves height and width)."""

    def __init__(self, dim):
        super().__init__()
        self.conv = tf.keras.layers.Conv2D(dim, 3, activation='relu', use_bias=False, padding='same')
        self.pool = tf.keras.layers.MaxPool2D(2)

    def call(self, x, training=False, **kwargs):
        x = self.conv(x)
        x = self.pool(x)
        return x


class UpConvLayer(tf.keras.layers.Layer):
    """3x3 convolution followed by 2x upsampling (doubles height and width)."""

    def __init__(self, dim):
        super().__init__()
        self.conv = tf.keras.layers.Conv2D(dim, 3, activation='relu', use_bias=False, padding='same')
        # Upsampling
        self.pool = tf.keras.layers.UpSampling2D(2)

    def call(self, x, training=False, **kwargs):
        x = self.conv(x)
        x = self.pool(x)
        return x


class Encoder(tf.keras.layers.Layer):
    def __init__(self, dim, layer_num=3):
        super().__init__()
        self.convs = [DownConvLayer(dim) for _ in range(layer_num)]

    def call(self, x, training=False, **kwargs):
        for conv in self.convs:
            x = conv(x, training=training)
        return x


class Decoder(tf.keras.layers.Layer):
    def __init__(self, dim, layer_num=3):
        super().__init__()
        self.convs = [UpConvLayer(dim) for _ in range(layer_num)]
        # padding='same' keeps the output the same size as the decoder features;
        # the default 'valid' padding would shrink each side by 2 pixels.
        self.final_conv = tf.keras.layers.Conv2D(1, 3, strides=1, padding='same')

    def call(self, x, training=False, **kwargs):
        for conv in self.convs:
            x = conv(x, training=training)
        reconstruct = self.final_conv(x)
        return reconstruct


class AutoEncoderModel(tf.keras.Model):
    # Note: with three 2x pooling steps the input side must be divisible by 8
    # for the output to match the input, so 28x28 MNIST digits are assumed to
    # be padded or resized to 32x32 before training.
    def __init__(self):
        super().__init__()
        self.encoder = Encoder(64, layer_num=3)
        self.decoder = Decoder(64, layer_num=3)

    def call(self, inputs, training=None, mask=None):
        image = inputs[0]
        latent = self.encoder(image, training=training)
        reconstruct_img = self.decoder(latent, training=training)
        return reconstruct_img

    @tf.function
    def train_step(self, data):
        img = data["image"]
        with tf.GradientTape() as tape:
            reconstruct_img = self((img,), training=True)
            # Per-image sum of squared errors, averaged over the batch.
            l2_loss = (reconstruct_img - img) ** 2
            l2_loss = tf.reduce_mean(tf.reduce_sum(l2_loss, axis=(1, 2, 3)))
        gradients = tape.gradient(l2_loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        return {"l2_loss": l2_loss}
```

The AE model shows that as long as the latent representation captures the data well, the decoder can reconstruct the input. However, each latent is derived from an existing sample, so an AE cannot generate truly novel samples.
Therefore we hypothesize that if the latent follows a known, parameterizable distribution, we could sample new latents from it and decode them into new data. This idea leads to the Variational AutoEncoder (VAE).
3. VAE
VAE assumes the latent variable \(z\) follows a normal distribution; during training the model learns the mean and variance of this distribution.
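One common way to draw \(z\) from a learned mean and variance while keeping the sampling step differentiable is the reparameterization trick: \(z = \mu + \sigma \cdot \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\). The sketch below is a minimal NumPy illustration under that assumption. The function name `sample_latent`, the log-variance parameterization, and the example values are hypothetical and are not taken from the article's code.

```python
import numpy as np


def sample_latent(mean, log_var, rng):
    """Draw z ~ N(mean, exp(log_var)) as mean + std * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


rng = np.random.default_rng(0)

# Hypothetical encoder outputs for a 2-D latent: means and log-variances.
mean = np.array([1.0, -2.0])
variance = np.array([0.25, 4.0])
log_var = np.log(variance)

# Repeat the same parameters for a large batch to check the sample statistics.
batch_mean = np.tile(mean, (20000, 1))
batch_log_var = np.tile(log_var, (20000, 1))
samples = sample_latent(batch_mean, batch_log_var, rng)

sample_mean = samples.mean(axis=0)
sample_var = samples.var(axis=0)
print(sample_mean, sample_var)
```

Predicting the log-variance instead of the variance is a frequent design choice because it keeps the variance positive without any explicit constraint.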
Training a VAE requires optimizing two objectives: (1) reconstruction loss (e.g., L2 or L1) to make generated data close to the input, and (2) a KL‑divergence term that forces the learned latent distribution to match the standard normal distribution.
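The two objectives can be sketched numerically as follows. For a diagonal Gaussian \(\mathcal{N}(\mu, \sigma^2)\) and a standard normal prior, the KL term has the closed form \(\frac{1}{2}\sum_i (\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2)\). The function `vae_loss`, the unweighted sum of the two terms, and the toy inputs are assumptions for illustration; in practice the relative weight of the two terms is often tuned.

```python
import numpy as np


def vae_loss(x, x_recon, mean, log_var):
    """Return (total, reconstruction, kl), each averaged over the batch."""
    # Reconstruction term: per-sample sum of squared errors (L2).
    recon = np.sum((x_recon - x) ** 2, axis=tuple(range(1, x.ndim)))
    # KL( N(mean, exp(log_var)) || N(0, I) ) in closed form, per sample.
    kl = 0.5 * np.sum(np.exp(log_var) + mean ** 2 - 1.0 - log_var, axis=-1)
    return np.mean(recon + kl), np.mean(recon), np.mean(kl)


# Hypothetical batch: 4 images of 8x8x1 with a perfect reconstruction, and a
# 2-D latent whose mean is shifted away from the prior's mean of zero.
x = np.zeros((4, 8, 8, 1))
x_recon = np.zeros((4, 8, 8, 1))
mean = np.ones((4, 2))
log_var = np.zeros((4, 2))  # variance 1 in every latent dimension

total, recon, kl = vae_loss(x, x_recon, mean, log_var)
print(total, recon, kl)
```

Here the reconstruction term is zero, so the whole loss comes from the KL term, which penalizes the latent mean for moving away from the standard normal prior. When the mean is zero and the variance is one, the KL term vanishes.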
Because the article contains many formulas (over 140), the original platform cannot display them well; a link to the full document is provided for detailed reading.
Laiye Technology Team
