A Comprehensive Overview of Text-to-Image Generation: From GANs to Stable Diffusion and Advanced Techniques
The article traces the evolution of text‑to‑image generation from early GANs through auto‑regressive and CLIP‑guided diffusion models, explains Stable Diffusion’s architecture and prompt engineering, and reviews advanced personalization techniques such as Textual Inversion, DreamBooth, ControlNet, plus efficient OneFlow deployment and diverse real‑world applications.
AIGC (Artificial Intelligence Generated Content) has become a buzzword. This article reviews the evolution of text‑to‑image generation, focusing on the current mainstream model Stable Diffusion and presenting experimental results under various scenes and style controls.
Technical Evolution 1: Early GAN Family
GANs introduced a new paradigm for image synthesis. The generator (G) creates images from random noise, while the discriminator (D) judges realism, forming a zero‑sum game that drives G to learn the data distribution.
G receives a random vector z and outputs an image.
D takes an image x and outputs the probability that x is real.
The training objective is for G to fool D, while D tries to distinguish real from fake, converging to a Nash equilibrium.
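The zero-sum game described above is usually written as the standard GAN minimax objective (the general formulation, not specific to any one model in this article):

```latex
\min_G \max_D \; V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

D is trained to push the first term up (score real images high) and the second term down (score fakes low), while G is trained to make D(G(z)) large, i.e., to fool D.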
Technical Evolution 2: Auto‑Regressive Models
Inspired by GPT, image‑GPT (iGPT) treats a flattened image as a sequence of discrete tokens and uses a Transformer for auto‑regressive generation. Two common strategies are:
Combine VQ‑VAE tokenization with GPT to generate images from text.
Use CLIP to embed an image, convert it to VQGAN tokens, and train a Transformer to map text embeddings to image tokens.
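The auto-regressive loop itself can be sketched with a toy stand-in for the Transformer. Everything below is illustrative: the codebook size, sequence length, and `next_token_probs` are hypothetical placeholders, and a real pipeline would decode the sampled tokens back to pixels with the VQ decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16      # toy codebook size (real VQGAN codebooks have thousands of entries)
SEQ_LEN = 8     # toy token count (a 256x256 image maps to e.g. 16x16 = 256 tokens)

def next_token_probs(prefix):
    """Stand-in for a text-conditioned Transformer: returns a distribution
    over the codebook. Here it just mildly favors repeating the last token."""
    logits = np.ones(VOCAB)
    if prefix:
        logits[prefix[-1]] += 2.0
    return np.exp(logits) / np.exp(logits).sum()

def sample_image_tokens():
    """Sample image tokens one at a time, each conditioned on all previous ones."""
    tokens = []
    for _ in range(SEQ_LEN):
        probs = next_token_probs(tokens)
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens  # a real pipeline feeds these to the VQ decoder

tokens = sample_image_tokens()
print(tokens)
```

The key property is visible even in the toy: each token's distribution depends on the full prefix, which is what makes generation sequential (and slow) compared to diffusion.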
Technical Evolution 3: CLIP + Diffusion Models
Diffusion models generate images in two stages: adding noise (forward diffusion) and denoising (reverse diffusion). Stable Diffusion leverages CLIP for text encoding and a latent diffusion process for efficient generation.
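The forward (noising) stage has a convenient closed form: any step t can be reached in one jump. A minimal numpy sketch, assuming an illustrative linear beta schedule (Stable Diffusion actually uses a scaled-linear schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear beta schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

def add_noise(x0, t):
    """Closed-form forward diffusion:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

x0 = rng.standard_normal((4, 64, 64))   # a latent-sized "image"
xt, eps = add_noise(x0, t=T - 1)
# At the last step almost all signal is gone: x_t is close to pure noise.
print(np.corrcoef(xt.ravel(), x0.ravel())[0, 1])
```

The reverse stage trains a network to predict `eps` from `xt` and t, so sampling can iteratively undo this process.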
Stable Diffusion Architecture
The system consists of three main components:
ClipText: encodes the text prompt into 77 tokens (each 768‑dimensional).
UNet + Scheduler: processes the latent noise conditioned on the text embeddings.
Autoencoder Decoder: decodes the processed latent into a final image.
Introducing a latent space replaces direct pixel‑space denoising, improving efficiency.
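The data flow through the three components can be sketched with placeholder functions. This is a shape-level sketch only: the real ClipText, UNet, and VAE decoder are large trained networks, but the tensor shapes below (77x768 text embedding, 4x64x64 latent, 3x512x512 output) match the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_text_encode(prompt):
    """Placeholder for ClipText: 77 tokens, each 768-dimensional."""
    return rng.standard_normal((77, 768))

def unet_denoise_step(latent, text_emb):
    """Placeholder for one UNet step: predict and subtract a bit of noise.
    A real UNet conditions on text_emb via cross-attention."""
    predicted_noise = 0.1 * latent
    return latent - predicted_noise

def vae_decode(latent):
    """Placeholder for the autoencoder decoder: 4x64x64 latent -> 3x512x512 image."""
    return np.repeat(np.repeat(latent[:3], 8, axis=1), 8, axis=2)

text_emb = clip_text_encode("a lighthouse at night")
latent = rng.standard_normal((4, 64, 64))   # all denoising happens in latent space
for _ in range(50):                          # the scheduler drives a loop of such steps
    latent = unet_denoise_step(latent, text_emb)
image = vae_decode(latent)
print(image.shape)
```

Because the loop runs on a 4x64x64 latent rather than a 3x512x512 image, each UNet step touches ~48x fewer values, which is the efficiency win of latent diffusion.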
Prompt Design
A prompt is typically composed of three parts: (1) quality and style tags, (2) the subject description, and (3) scene or embedding tags. Example:
A beautiful painting of a singular lighthouse, shining its light across a tumultuous sea of blood, by Greg Rutkowski and Thomas Kinkade, trending on ArtStation, yellow color scheme
Several online tools (e.g., promptomania.com) help generate such prompts.
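Assembling the three parts programmatically is trivial; the helper below is a hypothetical illustration (Stable Diffusion itself just takes a plain comma-separated string):

```python
def build_prompt(quality_tags, subject, scene_tags):
    """Join the three prompt parts (quality/style, subject, scene)
    into one comma-separated prompt string."""
    return ", ".join(quality_tags + [subject] + scene_tags)

prompt = build_prompt(
    quality_tags=["highly detailed", "trending on ArtStation"],
    subject="a beautiful painting of a singular lighthouse",
    scene_tags=["tumultuous sea of blood", "yellow color scheme"],
)
print(prompt)
```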
Stage 1: Blind‑Box Era
In this stage, CLIP + Diffusion with randomly chosen seeds and samplers produced diverse but unpredictable outputs, with usable results at a rate of only ~15% — hence "blind box."
Stage 2: Model‑Customization Era
Techniques like Textual Inversion and DreamBooth enable personalized concepts:
Textual Inversion adds a special token S* representing a new concept, learns its embedding from a few example images, and uses it in prompts.
DreamBooth fine‑tunes the model on a small set of images (3‑5) with a class‑specific prior loss to preserve generalization.
The two differ in what they update: Textual Inversion leaves the model weights untouched and learns only the new token embedding, while DreamBooth fine-tunes the model itself and uses the prior-preservation loss to keep it from forgetting the original class.
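The core mechanic of Textual Inversion — append one new embedding row for S* and optimize only that row — can be shown with a toy numpy loop. The "training signal" here is a fixed target vector standing in for gradients from the example images; real training backpropagates the diffusion loss through the frozen model.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# Frozen embedding table for the existing vocabulary.
vocab = {"a": 0, "photo": 1, "of": 2}
table = rng.standard_normal((len(vocab), DIM))
frozen = table.copy()

# Textual Inversion: append ONE new row for the pseudo-token S*.
vocab["S*"] = len(vocab)
table = np.vstack([table, np.zeros(DIM)])

target = rng.standard_normal(DIM)   # stand-in for the signal from example images
for _ in range(100):
    grad = table[vocab["S*"]] - target     # gradient of ||e - target||^2 / 2
    table[vocab["S*"]] -= 0.1 * grad       # update ONLY the S* row

# The original vocabulary embeddings are untouched.
print(np.allclose(table[:3], frozen))
```

Because only one 768-dimensional vector is learned in the real setting, the result is a tiny file that can be dropped into any prompt containing S*.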
Stage 3: ControlNet Era
ControlNet adds an extra conditioning input (e.g., sketches, edge maps, depth) to a frozen diffusion model via a trainable copy, enabling precise control over generated details.
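ControlNet's "zero convolution" trick is what lets the trainable copy attach to a frozen model without disturbing it at initialization: the copy's output passes through a zero-initialized projection, so it contributes nothing until training moves those weights. A conceptual numpy sketch (dense matrices stand in for the real convolutional blocks):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

W_frozen = rng.standard_normal((DIM, DIM))   # frozen SD block (weights locked)
W_copy = W_frozen.copy()                     # trainable copy, initialized from it
W_zero = np.zeros((DIM, DIM))                # "zero convolution": starts at zero

def frozen_block(x):
    return np.tanh(W_frozen @ x)

def controlnet_block(x, cond):
    # The trainable copy sees the extra condition (edge map, depth, sketch, ...).
    h = np.tanh(W_copy @ (x + cond))
    # Zero projection: adds nothing until training moves W_zero off zero.
    return frozen_block(x) + W_zero @ h

x = rng.standard_normal(DIM)
cond = rng.standard_normal(DIM)
print(np.allclose(controlnet_block(x, cond), frozen_block(x)))  # True at init
```

This is why ControlNet training is stable: step zero reproduces the frozen model exactly, and control is learned as a gradual additive correction.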
OneFlow Implementation
OneFlow provides a static‑graph mode for efficient inference. Example code to switch from PyTorch to OneFlow:
```python
import oneflow as torch
from diffusers import OneFlowStableDiffusionPipeline as StableDiffusionPipeline
```
A performance table compares PyTorch and OneFlow on V100 GPUs, showing higher iteration rates and lower latency for OneFlow.
Applications
Product generation for e‑commerce platforms.
Style transfer and artistic creation.
One‑click makeup simulation.
Scenario generation with specified styles.
Video production pipelines (scene creation, character modeling, script design, voice synthesis, animation rendering).
References
https://baijiahao.baidu.com/…
https://github.com/CompVis/stable-diffusion
Textual Inversion paper: “An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion”.
DreamBooth paper: “DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation”.
ControlNet paper: “Adding Conditional Control to Text‑to‑Image Diffusion Models”.
DaTaobao Tech
Official account of DaTaobao Technology