
A Comprehensive Overview of Text-to-Image Generation: From GANs to Stable Diffusion and Advanced Techniques

This article traces the evolution of text‑to‑image generation from early GANs through auto‑regressive and CLIP‑guided diffusion models, explains Stable Diffusion's architecture and prompt engineering, and reviews advanced personalization techniques such as Textual Inversion, DreamBooth, and ControlNet, along with efficient OneFlow deployment and real‑world applications.

DaTaobao Tech

AIGC (Artificial Intelligence Generated Content) has become a buzzword. This article reviews the evolution of text‑to‑image generation, focusing on the current mainstream model Stable Diffusion and presenting experimental results under various scenes and style controls.

Technical Evolution 1: Early GAN Family

GANs introduced a new paradigm for image synthesis. The generator (G) creates images from random noise, while the discriminator (D) judges realism, forming a zero‑sum game that drives G to learn the data distribution.

G receives a random vector z and outputs an image.

D takes an image x and outputs the probability that x is real.

The training objective is for G to fool D, while D tries to distinguish real from fake, converging to a Nash equilibrium.
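This minimax game is usually written as the value function below, where G tries to minimize what D maximizes:

```latex
\min_G \max_D V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
+ \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

D pushes $D(x)$ toward 1 on real images and $D(G(z))$ toward 0 on fakes; at the Nash equilibrium, G's samples match the data distribution and D outputs 1/2 everywhere.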

Technical Evolution 2: Auto‑Regressive Models

Inspired by GPT, image‑GPT (iGPT) treats a flattened image as a sequence of discrete tokens and uses a Transformer for auto‑regressive generation. Two common strategies are:

Combine VQ‑VAE tokenization with GPT to generate images from text.

Use CLIP to embed an image, convert it to VQGAN tokens, and train a Transformer to map text embeddings to image tokens.
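The auto‑regressive decoding loop common to both strategies can be sketched in pure Python. The `toy_logits` function is a stand‑in assumption for a trained Transformer (a real model would produce context‑dependent logits over a large VQ codebook), but the sample‑append‑repeat structure is the same:

```python
import math
import random

CODEBOOK_SIZE = 16   # toy VQ codebook (real VQGAN codebooks hold thousands of entries)
IMAGE_TOKENS = 8     # toy sequence length (a 32x32 latent grid would need 1024 tokens)

def toy_logits(context):
    """Stand-in for a Transformer: returns one logit per codebook entry.
    Here we simply bias toward (last_token + 1) for illustration."""
    last = context[-1] if context else 0
    return [3.0 if i == (last + 1) % CODEBOOK_SIZE else 0.0
            for i in range(CODEBOOK_SIZE)]

def sample(logits, rng):
    """Softmax + categorical sampling over discrete image tokens."""
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

def generate(text_tokens, rng):
    """Auto-regressive loop: condition on the text tokens, then grow the
    image-token sequence one token at a time, feeding each choice back in."""
    seq = list(text_tokens)
    for _ in range(IMAGE_TOKENS):
        seq.append(sample(toy_logits(seq), rng))
    return seq[len(text_tokens):]  # return only the image tokens

tokens = generate([1, 2], random.Random(0))
print(len(tokens))  # 8 image-token indices, later decoded by the VQ decoder
```

In a real pipeline, the generated token indices are handed to the VQ‑VAE/VQGAN decoder, which maps them back to pixels.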

Technical Evolution 3: CLIP + Diffusion Models

Diffusion models generate images in two stages: adding noise (forward diffusion) and denoising (reverse diffusion). Stable Diffusion leverages CLIP for text encoding and a latent diffusion process for efficient generation.
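The forward (noising) stage has a convenient closed form: instead of adding noise step by step, one can jump directly to any timestep t. A minimal NumPy sketch, assuming the standard linear beta schedule (the exact schedule values are illustrative):

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Closed-form forward diffusion: jump straight from x0 to the noised x_t.
    alpha_bar[t] is the cumulative product of per-step (1 - beta) values."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Illustrative linear beta schedule (an assumption for this sketch).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))                         # a toy "image"
xt_early, _ = forward_diffuse(x0, 10, alpha_bar, rng)    # barely noised
xt_late, _ = forward_diffuse(x0, T - 1, alpha_bar, rng)  # almost pure noise
```

At small t, `alpha_bar[t]` is close to 1 and `x_t` still resembles `x0`; at t near T it is nearly 0 and `x_t` is essentially Gaussian noise. The reverse stage trains a network to predict `eps` and walk this process backward.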

Stable Diffusion Architecture

The system consists of three main components:

ClipText: encodes the text prompt into 77 tokens (each 768‑dimensional).

UNet + Scheduler: processes the latent noise conditioned on the text embeddings.

Autoencoder Decoder: decodes the processed latent into a final image.

Introducing a latent space replaces direct pixel‑space denoising, improving efficiency.
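The efficiency gain is easy to quantify for Stable Diffusion v1, whose VAE compresses a 512×512 RGB image into a 64×64×4 latent:

```python
# Stable Diffusion v1 denoises a 64x64x4 latent instead of a 512x512x3 image.
pixel_elems = 512 * 512 * 3       # 786,432 values per RGB image
latent_elems = 64 * 64 * 4        # 16,384 values per latent
reduction = pixel_elems // latent_elems
print(reduction)  # 48x fewer elements for the UNet to process per step
```

Since the UNet runs once per denoising step (often 20 to 50 steps), this 48× reduction compounds across the whole sampling loop.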

Prompt Design

A common prompt format has three parts: (1) quality and style tags, (2) the subject description, and (3) scene, artist, or embedding tags. Example:

A beautiful painting of a singular lighthouse, shining its light across a tumultuous sea of blood, by Greg Rutkowski and Thomas Kinkade, trending on ArtStation, yellow color scheme

Several online tools (e.g., promptomania.com) help generate such prompts.
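The three‑part structure lends itself to a tiny assembly helper. The function below is a hypothetical illustration (not a standard API), showing how such prompt builders typically concatenate sections into a comma‑separated string:

```python
def build_prompt(quality_tags, subject, scene_tags):
    """Join the three prompt sections into one comma-separated string.
    (Helper name and structure are illustrative, not a standard API.)"""
    parts = list(quality_tags) + [subject] + list(scene_tags)
    return ", ".join(p.strip() for p in parts if p.strip())

prompt = build_prompt(
    ["A beautiful painting", "trending on ArtStation"],
    "a singular lighthouse shining its light across a tumultuous sea of blood",
    ["by Greg Rutkowski and Thomas Kinkade", "yellow color scheme"],
)
print(prompt)
```

Tools like promptomania.com perform essentially this assembly, with curated menus of style and artist tags for each section.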

Stage 1: Blind‑Box Era

In the CLIP + diffusion "blind‑box" stage, varying random seeds and samplers produces diverse but unpredictable outputs, with only roughly 15% of generations meeting quality expectations.

Stage 2: Model‑Customization Era

Techniques like Textual Inversion and DreamBooth enable personalized concepts:

Textual Inversion adds a special token S* representing a new concept, learns its embedding from a few example images, and uses it in prompts.

DreamBooth fine‑tunes the model on a small set of images (3‑5) with a class‑specific prior‑preservation loss to retain the model's general knowledge of the class.

The two methods differ in what they train: Textual Inversion keeps the model weights frozen and learns only the new token's embedding, whereas DreamBooth fine‑tunes the model weights themselves and relies on the prior‑preservation loss to avoid catastrophic forgetting.
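The core of Textual Inversion — gradients flow only into the one embedding row for S* — can be sketched with NumPy. The squared‑distance loss toward a fixed `target` vector is a stand‑in assumption for the real diffusion denoising loss:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 6, 4
embeddings = rng.standard_normal((vocab, dim))   # pretrained table (frozen)
s_star = vocab - 1                               # index of the new token S*
frozen_copy = embeddings.copy()

# Stand-in training signal; real Textual Inversion backprops the diffusion loss.
target = rng.standard_normal(dim)
lr = 0.1
for _ in range(100):
    grad = 2.0 * (embeddings[s_star] - target)   # d/dv ||v - target||^2
    embeddings[s_star] -= lr * grad              # update ONLY the S* row

# Every pretrained embedding is untouched; only the S* row has moved.
print(np.allclose(embeddings[:-1], frozen_copy[:-1]))
```

Because only one embedding vector is learned, the result is tiny (a few KB) and can be shared and dropped into any copy of the same base model.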

Stage 3: ControlNet Era

ControlNet adds an extra conditioning input (e.g., sketches, edge maps, depth) to a frozen diffusion model via a trainable copy, enabling precise control over generated details.
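A key ControlNet detail is that the trainable copy's output enters through a zero‑initialized projection ("zero convolution"), so before any training the conditioned model behaves exactly like the frozen base model. A NumPy sketch with linear maps standing in for UNet blocks (the block shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Frozen diffusion block: a fixed linear map (stand-in for a UNet block).
W_frozen = rng.standard_normal((dim, dim))
def frozen_block(x):
    return W_frozen @ x

# Trainable copy of the block, plus a zero-initialized "zero convolution".
W_copy = W_frozen.copy()          # initialized from the frozen weights
W_zero = np.zeros((dim, dim))     # zero-initialized -> no effect at step 0

def controlnet_block(x, condition):
    base = frozen_block(x)                          # frozen path, never trained
    control = W_zero @ (W_copy @ (x + condition))   # trainable conditioned path
    return base + control

x = rng.standard_normal(dim)
cond = rng.standard_normal(dim)   # e.g. an encoded edge map or depth map
# Before training, the conditioned output equals the frozen output exactly.
print(np.allclose(controlnet_block(x, cond), frozen_block(x)))
```

As training updates `W_zero` and `W_copy`, the control signal gradually steers generation without ever destabilizing the pretrained weights — which is why ControlNet can be trained on relatively small condition datasets.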

OneFlow Implementation

OneFlow provides a static‑graph mode for efficient inference. Example code to switch from PyTorch to OneFlow:

import oneflow as torch
from diffusers import OneFlowStableDiffusionPipeline as StableDiffusionPipeline

A performance table compares PyTorch and OneFlow on V100 GPUs, showing higher iteration rates and lower latency for OneFlow.

Applications

Product generation for e‑commerce platforms.

Style transfer and artistic creation.

One‑click makeup simulation.

Scenario generation with specified styles.

Video production pipelines (scene creation, character modeling, script design, voice synthesis, animation rendering).

References

https://baijiahao.baidu.com/…

https://github.com/CompVis/stable-diffusion

Textual Inversion paper: "An Image Is Worth One Word: Personalizing Text‑to‑Image Generation Using Textual Inversion".

DreamBooth paper: "DreamBooth: Fine Tuning Text‑to‑Image Diffusion Models for Subject‑Driven Generation".

ControlNet paper: “Adding Conditional Control to Text‑to‑Image Diffusion Models”.
