
Survey of Text‑Controlled Image Generation Models: DALL·E‑2, Imagen, Stable Diffusion, and ControlNet

This article reviews the key components and design choices of recent text‑controlled image generation systems—including DALL·E‑2, Google Imagen, Stability AI's Latent Stable Diffusion, and the ControlNet extension—highlighting how diffusion models, text encoders, prior modules, super‑resolution, and conditioning mechanisms enable high‑quality, controllable visual synthesis.

Laiye Technology Team

The article begins by outlining practical challenges in controllable image generation such as injecting conditions, producing high‑resolution realistic images, reducing model size, and using non‑textual controls.

It then surveys representative models:

DALL·E‑2 is described as a classifier‑free diffusion system composed of four stages: a text encoder (using CLIP), a prior module that maps text embeddings to image embeddings, a diffusion decoder, and a super‑resolution (SR) module that upsamples 64×64 outputs to 1024×1024. The prior and decoder are trained separately, and the article provides a Python implementation of the CLIP contrastive loss used in the text encoder:

import tensorflow as tf

# img_encoder and text_encoder are assumed to be defined elsewhere
def clip_training(imgs, texts):
    # Contrastive learning benefits from a large batch size: the larger
    # the batch, the better the alignment that is learned.
    # The image and text encoders map their inputs to embeddings of the
    # same dimensionality.
    img_embedding = img_encoder(imgs)
    txt_embedding = text_encoder(texts)
    norm_img_embedding = tf.nn.l2_normalize(img_embedding, -1)
    norm_txt_embedding = tf.nn.l2_normalize(txt_embedding, -1)
    logits = tf.matmul(norm_txt_embedding, norm_img_embedding, transpose_b=True)
    batch_size = tf.shape(imgs)[0]
    label = tf.range(batch_size)
    # Simplified InfoNCE: the diagonal of the similarity matrix holds the
    # matching (positive) pairs; all off-diagonal entries are negatives.
    loss1 = tf.keras.losses.sparse_categorical_crossentropy(
        label, logits, from_logits=True
    )
    # Transpose the logits to compute the symmetric loss in the other
    # direction (image-to-text as well as text-to-image).
    loss2 = tf.keras.losses.sparse_categorical_crossentropy(
        label, tf.transpose(logits), from_logits=True
    )
    return (loss1 + loss2) / 2.0

The prior module predicts image embeddings from text embeddings, optionally selecting the most similar embedding among multiple candidates. The decoder generates images from these embeddings, and the SR module uses diffusion‑based super‑resolution to upscale images in two stages.
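The candidate-selection step described above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not DALL·E‑2's actual code: the prior would produce `candidate_img_embs`, and the candidate closest to the text embedding (by cosine similarity) is kept.

```python
import numpy as np

def select_best_candidate(text_emb, candidate_img_embs):
    """Pick the candidate image embedding most similar to the text embedding.

    text_emb: (d,) array; candidate_img_embs: (k, d) array.
    """
    t = text_emb / np.linalg.norm(text_emb)
    c = candidate_img_embs / np.linalg.norm(candidate_img_embs, axis=-1, keepdims=True)
    sims = c @ t  # cosine similarity of each candidate to the text
    return candidate_img_embs[np.argmax(sims)]
```

In practice the prior samples several candidates per prompt, and this re-ranking step trades a little extra compute for noticeably better text-image alignment.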

Imagen follows a similar pipeline but replaces the CLIP text encoder with a large T5‑XXL model, and its diffusion decoder directly consumes the text embedding as the conditioning signal. Two cascaded SR modules further upscale the 64×64 output, incorporating both low‑resolution images and text embeddings via cross‑attention.
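The cross-attention mechanism that injects text embeddings into the decoder and SR modules can be sketched as a single attention head, with image features as queries and text tokens as keys/values. This is a minimal NumPy illustration under simplified assumptions (one head, no masking, caller-supplied projection matrices), not Imagen's actual implementation:

```python
import numpy as np

def cross_attention(img_feats, txt_embs, Wq, Wk, Wv):
    """Single-head cross-attention: image features attend to text tokens.

    img_feats: (n, d) flattened spatial features (queries).
    txt_embs:  (m, d) text token embeddings (keys and values).
    Wq, Wk, Wv: (d, d) projection matrices.
    """
    Q = img_feats @ Wq
    K = txt_embs @ Wk
    V = txt_embs @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product scores
    # Numerically stable softmax over the text tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # (n, d): text-conditioned image features
```

Each spatial location thus receives a weighted mixture of text-token values, which is how the language signal steers generation at every resolution stage.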

The Latent Diffusion Model (LDM), the architecture underlying Stable Diffusion, operates in a compressed latent space learned by a VAE or VQ‑VAE, dramatically reducing computation. It uses cross‑attention to fuse conditioning information (e.g., text, images) with latent diffusion steps, and includes regularization terms (a KL penalty or VQ‑VAE losses) to keep the latent space well‑behaved, with variance close to unit scale.

The article explains LDM’s three‑part architecture: a latent auto‑encoder, a conditioning module that can accept arbitrary inputs, and a diffusion model that predicts noise in latent space, later decoded back to pixel space.
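The three-part pipeline can be summarized as a single toy training step: encode to latent space, diffuse, and regress the added noise. The `encoder` and `eps_model` below are hypothetical stand-ins for the VAE encoder and conditioned UNet, and the cosine schedule is illustrative, not the one LDM uses:

```python
import numpy as np

rng = np.random.default_rng(0)

def ldm_training_step(x, encoder, eps_model, T=1000):
    """One toy LDM training step: diffuse in latent space, regress the noise.

    encoder/eps_model are stand-in callables; a real LDM uses a VAE encoder
    and a UNet noise predictor with cross-attention conditioning.
    """
    z = encoder(x)                                 # compress image to latent space
    t = int(rng.integers(1, T))                    # random diffusion timestep
    alpha_bar = np.cos(0.5 * np.pi * t / T) ** 2   # toy noise schedule
    eps = rng.standard_normal(z.shape)             # noise to be predicted
    z_t = np.sqrt(alpha_bar) * z + np.sqrt(1 - alpha_bar) * eps
    eps_hat = eps_model(z_t, t)                    # UNet predicts the added noise
    return np.mean((eps - eps_hat) ** 2)           # simple noise-regression loss
```

Because `z` is far smaller than the pixel array, every diffusion step is correspondingly cheaper, which is the core efficiency argument of the paper.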

ControlNet extends LDM by adding a trainable copy of the frozen base model and two zero‑initialized 1×1 convolutions. The first zero‑conv receives external control features (e.g., edge maps, pose), while the second merges them with the base UNet via residual connections, enabling fine‑grained control with minimal additional data.
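The key trick is the zero-initialized convolution: because its weights start at zero, the control branch contributes nothing at initialization, so training begins from the frozen base model's behavior. A minimal NumPy sketch of a 1×1 zero-conv (illustrative only; the real layers are trainable framework modules with bias terms):

```python
import numpy as np

def zero_conv(channels_in, channels_out):
    """A 1x1 convolution initialized to zero: at the start of training it
    contributes exactly nothing, leaving the frozen base model unchanged."""
    W = np.zeros((channels_out, channels_in))  # gradients make this non-zero later

    def apply(x):  # x: (channels_in, h, w) feature map
        return np.einsum('oc,chw->ohw', W, x)

    return apply

# At initialization the ControlNet branch adds exactly zero to the base UNet:
conv = zero_conv(3, 8)
control_feat = np.ones((3, 4, 4))          # e.g., encoded edge-map features
residual = conv(control_feat)              # all zeros until training updates W
```

This design means a bad gradient step early in training cannot corrupt the pretrained weights, which is why ControlNet fine-tunes stably on small control datasets.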

Finally, the article lists references to original papers, blogs, and code repositories for each model.

Tags: AI, Stable Diffusion, text-to-image, diffusion models, ControlNet, DALL·E‑2, Imagen