Can Text-to-Image Models Forget Prompts? Prompt Reinjection Boosts Instruction Following Without Retraining

The paper shows that multimodal diffusion transformers often lose fine‑grained textual semantics in deeper layers, a phenomenon the authors call Prompt Forgetting. It introduces Prompt Reinjection, a training‑free inference technique that re‑injects shallow text features to improve text‑image alignment and instruction following while preserving visual quality at negligible computational cost.

Machine Heart

Recent text‑to‑image diffusion models such as Stable Diffusion, FLUX, and Qwen‑Image have achieved impressive image quality, yet they frequently fail on prompts containing multiple objects, colors, quantities, or spatial relations. The authors attribute this to a newly identified phenomenon called Prompt Forgetting, where textual token representations degrade as they pass through deeper layers of the multimodal diffusion Transformer (MMDiT).

In MMDiT architectures, text and image tokens evolve together within a shared Transformer stack. While image tokens receive direct supervision from the denoising objective, text tokens are only indirectly updated via their influence on image generation, causing the model to prioritize image reconstruction over preserving textual semantics.
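The shared-stack idea can be illustrated with a minimal NumPy sketch of joint attention, in which text and image tokens are projected with modality-specific weights and then attend to each other as one sequence. This is a simplified illustration, not the paper's code: the names `joint_attention`, `w_text`, and `w_image` are hypothetical, and the sketch uses a single head with no normalization, MLP, or timestep conditioning, all of which real MMDiT blocks include.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(text_tokens, image_tokens, w_text, w_image):
    """Single-head joint attention over concatenated text and image tokens.

    text_tokens:  (n_text, d) array
    image_tokens: (n_image, d) array
    w_text, w_image: dicts with keys "q", "k", "v", each a (d, d) matrix
                     (hypothetical per-modality projection weights).
    Returns the updated (text, image) token arrays.
    """
    # Project each modality with its own weights, then concatenate.
    q = np.concatenate([text_tokens @ w_text["q"], image_tokens @ w_image["q"]])
    k = np.concatenate([text_tokens @ w_text["k"], image_tokens @ w_image["k"]])
    v = np.concatenate([text_tokens @ w_text["v"], image_tokens @ w_image["v"]])

    # Every token attends to every other token, regardless of modality.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v

    n_text = text_tokens.shape[0]
    return out[:n_text], out[n_text:]
```

Because the text outputs depend on the image tokens (and vice versa), the text representation changes at every block even though only the image branch is tied to the denoising loss.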

Using layer‑wise probing, CKNNA analysis, and PCA visualizations, the researchers demonstrate that token‑level information becomes increasingly unrecoverable with depth. Experiments on SD3, SD3.5, and FLUX show a steady decline in the recognition accuracy of nouns, adjectives, numerals, and especially spatial relation tokens.
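Layer-wise probing can be sketched as follows: fit a small linear classifier on the text-token features taken from one layer, then measure how well it recovers a token label (for example, the part of speech) on held-out tokens. The sketch below is an assumption-laden illustration, not the authors' protocol: `probe_accuracy` is a hypothetical helper, the probe is a closed-form ridge regression onto one-hot labels, and the "layers" are synthetic features with increasing noise rather than activations from a real model.

```python
import numpy as np


def probe_accuracy(features, labels, n_classes, ridge=1e-2, train_frac=0.8, seed=0):
    """Held-out accuracy of a linear probe on one layer's token features.

    features: (n_tokens, d) array of token representations
    labels:   (n_tokens,) integer class ids
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    n_train = int(train_frac * len(labels))
    train, test = order[:n_train], order[n_train:]

    x_train, x_test = features[train], features[test]
    y_onehot = np.eye(n_classes)[labels[train]]

    # Closed-form ridge regression from features to one-hot targets.
    d = features.shape[1]
    gram = x_train.T @ x_train + ridge * np.eye(d)
    weights = np.linalg.solve(gram, x_train.T @ y_onehot)

    predictions = (x_test @ weights).argmax(axis=1)
    return float((predictions == labels[test]).mean())


# Synthetic stand-in for "shallow", "middle", and "deep" layers:
# class prototypes plus noise that grows with depth (toy data only).
rng = np.random.default_rng(0)
n_classes, dim, n_tokens = 4, 16, 400
labels = rng.integers(0, n_classes, size=n_tokens)
prototypes = rng.normal(size=(n_classes, dim))
toy_layers = [
    prototypes[labels] + noise * rng.normal(size=(n_tokens, dim))
    for noise in (0.1, 1.0, 10.0)
]
accuracies = [probe_accuracy(f, labels, n_classes) for f in toy_layers]
```

With real models, `features` would be the text-token hidden states captured at each block; a falling accuracy curve across blocks is the kind of signal the paper describes.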

To counteract this, the authors propose Prompt Reinjection, a simple inference‑time method that re‑injects shallow text features into deeper MMDiT blocks. Because shallow layers retain richer prompt semantics, the technique restores lost information without any model retraining or parameter changes.

The method includes two alignment modules to handle distribution and geometric mismatches between shallow and deep features:

Distribution Anchoring: normalizes and rescales shallow features to match the statistical scale of target deep layers, preventing disruption of the generation distribution.

Geometry Alignment: applies an orthogonal Procrustes transform to align the geometric orientation of shallow and deep feature spaces.
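The two modules and the reinjection step can be sketched in NumPy under several assumptions that the summary does not confirm: anchoring is taken to mean centering followed by matching the overall standard deviation, the rotation is estimated once from paired shallow and deep features, and reinjection is a weighted residual addition controlled by a hypothetical strength `alpha`. The function names are illustrative only.

```python
import numpy as np


def distribution_anchor(shallow, deep, eps=1e-6):
    """Center shallow features and rescale them to the deep layer's overall std.

    Assumption: a single global scale is matched; the paper may use
    per-channel or per-token statistics instead.
    """
    centered = shallow - shallow.mean(axis=0, keepdims=True)
    scale = deep.std() / (centered.std() + eps)
    return centered * scale


def procrustes_rotation(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target||_F.

    Classic orthogonal Procrustes solution: with source.T @ target = U S V^T,
    the minimizer is R = U V^T.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def reinject(shallow, deep, rotation, alpha=0.1):
    """Add anchored, rotated shallow text features to deep text features.

    Assumption: reinjection is a weighted residual addition; alpha is a
    hypothetical strength parameter.
    """
    anchored = distribution_anchor(shallow, deep)
    aligned = anchored @ rotation  # an orthogonal map keeps the matched scale
    return deep + alpha * aligned
```

Applying the rotation after anchoring is a deliberate choice in this sketch: an orthogonal transform preserves norms, so the scale matched in the first step is not disturbed by the second.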

Extensive evaluation on five mainstream MMDiT models (SD3‑medium, SD3.5‑large, FLUX.1‑dev, HunyuanImage‑2.1, Qwen‑Image) across GenEval, DPG‑Bench, and T2I‑CompBench++ shows consistent gains. For example, Prompt Reinjection raises GenEval scores by 6.48% for SD3.5 and 7.75% for HunyuanImage‑2.1. Improvements are most pronounced on tasks requiring fine‑grained textual understanding (attribute binding, quantity reasoning, multi‑object composition, and spatial relation modeling), which mirrors the probing finding that spatial‑relation tokens are forgotten most severely.

Importantly, visual quality metrics (HPSv2, ImageReward, PickScore, CLIP) remain stable or improve slightly, indicating that the technique does not trade image fidelity for better semantic alignment.

Computational overhead is minimal. On SD3‑medium, the basic reinjection adds only ~0.00002× the FLOPs of a single Transformer block, while the full version with both alignment modules adds ~0.088× FLOPs, confirming near‑zero impact on inference cost.

In summary, Prompt Reinjection uncovers a critical internal limitation of current MMDiT designs—insufficient preservation of textual conditions in deep layers—and offers a lightweight, plug‑and‑play solution that enhances instruction compliance without sacrificing image quality, providing valuable insights for future controllable diffusion model architectures.

Tags: Text-to-Image Generation, ICML 2026, Multimodal Diffusion Transformers, Prompt Forgetting, Prompt Reinjection