Artificial Intelligence · 12 min read

Understanding Stable Diffusion Architecture and Implementing It with the Diffusers Library

This article reviews the evolution from GANs to diffusion models, explains the components of Stable Diffusion—including the CLIP text encoder, VAE, and UNet—and provides step‑by‑step Python code using HuggingFace's Diffusers library to generate images from text prompts.

Rare Earth Juejin Tech Community

1. Introduction

Reviewing the history of AI‑generated art, GANs were the first breakthrough, but recent progress is dominated by Diffusion Models (DM). Stable Diffusion, the most popular open‑source DM, powers many community projects such as WebUI, ComfyUI, Fooocus, Civitai, and the HuggingFace Diffusers library.

2. Network Structure

Stable Diffusion consists of three main sub‑networks: a CLIP‑based text encoder, a UNet noise‑prediction model, and a VAE for latent‑space compression and decoding.

2.1 Overall Architecture

The generation pipeline follows these steps:

Encode the input text with CLIP to obtain a text embedding.

Sample a random latent tensor from a normal distribution.

Feed the latent and text embedding into UNet to predict noise.

Let the scheduler remove the predicted noise from the latent (one denoising step).

Repeat steps 3‑4 for many denoising iterations.

Decode the final latent with the VAE decoder to produce the image.
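Steps 3–5 correspond to the DDPM reverse process. In standard notation (not spelled out in the original article), one denoising step computes:

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t, c)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)
```

where ε_θ is the UNet's noise prediction conditioned on timestep t and text embedding c, α_t and ᾱ_t come from the noise schedule, and σ_t controls the stochasticity of the step.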

2.2 Text Encoder

Stable Diffusion uses OpenAI's CLIP model rather than a generic BERT encoder because CLIP aligns image and text representations, enabling more faithful text‑conditioned generation.

2.3 VAE Model

The VAE compresses images into a lower‑dimensional latent space, reducing computational cost. Its encoder maps an image x to a mean μ and a log‑variance (from which a standard deviation σ is derived), samples a latent z from that distribution, and its decoder reconstructs the image from z.
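The sampling step can be sketched in a few lines of PyTorch. This is a minimal illustration of the reparameterization trick, not the actual AutoencoderKL code; the shapes match Stable Diffusion v1's 4‑channel latent for a 512×512 image:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # z = mu + sigma * eps keeps the sampling step differentiable,
    # so gradients can flow back into the encoder during training
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

mu = torch.zeros(1, 4, 64, 64)      # encoder mean (512x512 image downsampled 8x)
logvar = torch.zeros(1, 4, 64, 64)  # log-variance of 0 -> sigma of 1
z = reparameterize(mu, logvar)      # latent fed to the decoder or the diffusion process
```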

2.4 UNet Model

UNet acts as a noise‑prediction network. During generation it repeatedly receives a noisy latent, predicts the noise it contains, and the scheduler removes that noise, gradually denoising until a clean latent is obtained (which the VAE then decodes into an image).
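As a toy illustration (assumed DDPM notation, not code from the article): during training, the forward process mixes noise into a clean latent according to the schedule's cumulative product ᾱ_t, and the UNet's target is exactly that noise:

```python
import torch
import torch.nn.functional as F

alpha_bar_t = torch.tensor(0.5)     # cumulative schedule value at some timestep t (made-up)
clean = torch.randn(1, 4, 64, 64)   # clean latent
eps = torch.randn_like(clean)       # the noise the UNet must learn to predict
noisy = alpha_bar_t.sqrt() * clean + (1 - alpha_bar_t).sqrt() * eps  # forward diffusion

# Training minimizes the MSE between the UNet's output and eps;
# a perfect noise predictor would achieve zero loss:
perfect_loss = F.mse_loss(eps, eps)
```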

3. Diffusers Module

The HuggingFace diffusers library provides ready‑made pipelines and low‑level components to implement the above steps.

3.1 Using a Pipeline

A simple one‑liner can generate an image from a text prompt:

from diffusers import AutoPipelineForText2Image
import torch
pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
image = pipeline("stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k").images[0]
image  # displays inline in a notebook; use image.save("output.png") in a script

3.2 Loading Individual Components

For finer control, each sub‑module can be loaded separately:

from tqdm.auto import tqdm
from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

model_path = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_path, subfolder="vae")
tokenizer = CLIPTokenizer.from_pretrained(model_path, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_path, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_path, subfolder="scheduler")

torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)

3.3 Encoding the Prompt

Tokenize the prompt and obtain text embeddings:

prompt = ["a photograph of an astronaut riding a horse"]
height = 512
width = 512
num_inference_steps = 25
guidance_scale = 7.5
batch_size = len(prompt)
text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length,
    truncation=True, return_tensors="pt"
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

3.4 Getting the Latent Variable

Sample random noise and scale it according to the scheduler:

latents = torch.randn(
    # 4 latent channels; the VAE downsamples spatial dimensions by a factor of 8
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    device=torch_device
)
# Scale to the scheduler's expected initial noise level
latents = latents * scheduler.init_noise_sigma
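Because the starting latent (together with the prompt and scheduler) determines the output, seeding this noise makes generations reproducible. A small sketch, with shapes chosen to match the 512×512 example above:

```python
import torch

def seeded_latents(seed: int, shape=(1, 4, 64, 64)) -> torch.Tensor:
    # A fixed seed yields the same starting noise, hence the same image
    generator = torch.Generator().manual_seed(seed)
    return torch.randn(shape, generator=generator)

a = seeded_latents(42)
b = seeded_latents(42)  # identical to a
c = seeded_latents(43)  # different starting noise, different image
```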

3.5 Denoising Loop

Iteratively run UNet, predict noise, and let the scheduler update the latent:

scheduler.set_timesteps(num_inference_steps)
for t in tqdm(scheduler.timesteps):
    # Some schedulers rescale the model input at each timestep
    latent_model_input = scheduler.scale_model_input(latents, timestep=t)
    with torch.no_grad():
        noise_pred = unet(
            latent_model_input, t, encoder_hidden_states=text_embeddings
        ).sample
    # The scheduler removes one step's worth of predicted noise
    latents = scheduler.step(noise_pred, t, latents).prev_sample
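Note that guidance_scale is defined in section 3.3 but unused in this minimal loop. With classifier‑free guidance, the UNet would be run twice per step (once with empty‑prompt embeddings, once with the text embeddings) and the two predictions combined. The combination itself is just a weighted extrapolation; a sketch with dummy tensors, not the exact Diffusers implementation:

```python
import torch

def apply_cfg(noise_pred_uncond: torch.Tensor,
              noise_pred_text: torch.Tensor,
              guidance_scale: float) -> torch.Tensor:
    # Push the prediction away from the unconditional output,
    # toward the text-conditioned one
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

uncond = torch.zeros(1, 4, 64, 64)  # stand-in for the empty-prompt prediction
text = torch.ones(1, 4, 64, 64)     # stand-in for the text-conditioned prediction
guided = apply_cfg(uncond, text, 7.5)
```

With guidance_scale = 1.0 this reduces to the plain text‑conditioned prediction; larger values trade diversity for prompt adherence.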

3.6 Decoding with VAE

Transform the denoised latent back to pixel space:

latents = 1 / 0.18215 * latents  # undo the SD v1 latent scaling factor (vae.config.scaling_factor)
with torch.no_grad():
    image = vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1).squeeze()
    image = (image.permute(1, 2, 0) * 255).to(torch.uint8).cpu().numpy()
    image = Image.fromarray(image)
    image.show()

4. Conclusion

The tutorial traced AI‑painting from early GANs to modern diffusion models, detailed the internal architecture of Stable Diffusion, and demonstrated a complete end‑to‑end implementation using the Diffusers library. Future extensions may include LoRA, ControlNet, and other conditioning techniques.

Tags: Python, deep learning, diffusion model, Stable Diffusion, AI Painting
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
