How VICTORIA Revolutionizes Multi‑Object Image Editing with Language‑Aware Diffusion

The VICTORIA algorithm, presented by Alibaba Cloud AI Platform (PAI) and South China University of Technology at ACM MM 2024, leverages linguistic dependency parsing to guide cross‑attention in Stable Diffusion. This enables accurate, training‑free multi‑object image editing that preserves spatial structure and achieves state‑of‑the‑art results on benchmark datasets.

Alibaba Cloud Big Data AI Platform

Paper Overview

Bingyan Liu, Chengyu Wang, Jun Huang, and Kui Jia propose VICTORIA in the paper "Attentive Linguistic Tracking in Diffusion Models for Training‑free Text‑guided Image Editing," published at ACM MM 2024. The algorithm introduces a language‑aware approach to multi‑object editing for Stable Diffusion.

Background

Recent text‑to‑image synthesis models such as Stable Diffusion, DALL‑E 2, and Imagen have demonstrated strong generation and editing capabilities. Text‑guided image editing (TIE) methods like Prompt‑to‑Prompt and InstructPix2Pix can edit specific regions by steering the diffusion process, for example through cross‑attention maps, but they often struggle when a prompt involves multiple objects, leading to object loss, attribute loss, or incomplete backgrounds.

Algorithm Architecture

VICTORIA consists of three main components:

Spatial consistency enforcement via self‑attention control.

Dependency‑based linguistic linking that injects syntactic relations into intermediate attention representations.

Cross‑attention mask extraction and conversion to preserve untouched regions.

Self‑Attention Control for Source Structure Retention

The query and key vectors from the self‑attention layers of the source image's denoising pass are extracted and swapped into the corresponding positions of the target generation process, so the attention pattern, and hence the spatial layout, of the edited image follows the source.
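This swap can be sketched in a few lines; the following is a minimal NumPy illustration (function names and the flattened `(tokens, dim)` shapes are our own simplification, not the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn_with_source_qk(q_src, k_src, v_tgt):
    """Self-attention step in which the target branch reuses the
    source image's query/key vectors: the attention pattern (spatial
    structure) follows the source, while the values come from the
    target generation."""
    scale = q_src.shape[-1] ** -0.5
    attn = softmax(q_src @ k_src.T * scale)  # (tokens, tokens)
    return attn @ v_tgt                      # (tokens, dim)

# Toy example: 16 spatial tokens with 8 channels each.
rng = np.random.default_rng(0)
q_src, k_src = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
v_tgt = rng.normal(size=(16, 8))
out = self_attn_with_source_qk(q_src, k_src, v_tgt)
print(out.shape)  # (16, 8)
```

In a real Stable Diffusion pipeline this substitution would happen inside the U-Net's attention layers at each denoising step, with the source Q/K cached during inversion.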

Language Link Enhancement

Dependency parsing extracts modifier‑head word pairs from the input prompt, forming a set S. For each pair, the distance between their cross‑attention matrices is minimized (positive loss) while unrelated word pairs are pushed apart using a symmetric KL‑divergence (negative loss). An additional attention loss encourages high activation for head words, focusing attention on target objects.
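A plausible reading of these three losses, as a minimal NumPy sketch (the specific distance function, the weighting `lam`, and the index convention are our assumptions; `attn` holds one flattened cross‑attention map per prompt token):

```python
import numpy as np

def sym_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between two attention maps,
    each normalized to a probability distribution."""
    p, q = p / p.sum(), q / q.sum()
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * (kl(p, q) + kl(q, p))

def language_link_loss(attn, related, unrelated, lam=0.5):
    """attn: (num_tokens, H*W) non-negative cross-attention maps.
    related: (modifier, head) index pairs from dependency parsing.
    unrelated: index pairs with no syntactic relation."""
    # Positive loss: pull related modifier/head maps together.
    pos = sum(np.abs(attn[i] - attn[j]).mean() for i, j in related)
    # Negative loss: push unrelated maps apart (subtracted, so a
    # larger divergence lowers the total loss).
    neg = sum(sym_kl(attn[i], attn[j]) for i, j in unrelated)
    # Attention loss: head words should reach a high peak activation.
    heads = {j for _, j in related}
    att = sum(1.0 - attn[h].max() for h in heads)
    return pos - lam * neg + att

# Toy maps: tokens 0 and 1 attend identically (a related pair),
# token 2 attends elsewhere (unrelated to token 0).
attn = np.array([[1.0, 0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0]])
loss = language_link_loss(attn, related=[(0, 1)], unrelated=[(0, 2)])
print(loss)
```

During editing, the gradient of such a loss with respect to the latent would be used to nudge the denoising trajectory at each step.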

Language‑Mixed Mask

For each editing word w, a mask is built that includes w and its related modifiers/heads. The mask guides the diffusion denoising process, balancing source and target latent codes while incorporating linguistic knowledge.
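One way to realize such a mask, again as a sketch (the threshold value, the min‑max normalization, and the latent‑blending formula are our assumptions):

```python
import numpy as np

def language_mixed_mask(attn, edit_idx, related_idx, thresh=0.3):
    """Union of cross-attention maps for the editing word and its
    dependency-linked modifiers/heads, normalized to [0, 1] and
    binarized so only the edited region is marked."""
    m = attn[[edit_idx, *related_idx]].max(axis=0)
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)
    return (m > thresh).astype(np.float32)

def blend_latents(z_src, z_tgt, mask):
    """Keep the source latent outside the mask and the target latent
    inside it, preserving untouched regions during denoising."""
    return mask * z_tgt + (1.0 - mask) * z_src

# Toy example: 3 spatial positions, edit word = token 0, modifier = token 1.
attn = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.8, 0.0]])
mask = language_mixed_mask(attn, edit_idx=0, related_idx=[1])
blended = blend_latents(np.zeros(3), np.full(3, 5.0), mask)
print(mask, blended)
```

The key design point is that the mask covers the edit word *together with* its syntactically linked tokens, so an attribute change ("red car" → "blue car") masks both the attribute and the object region.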

Algorithm Pseudocode

The combined techniques are summarized in the following pseudocode:
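The paper's original listing appears as a figure; the following is a paraphrased sketch of the overall loop (step names and ordering are our own summary):

```text
Input:  source image latent z_src, source prompt P_src, target prompt P_tgt
Output: edited image

1.  S ← dependency-parse(P_tgt)                  # modifier–head word pairs
2.  Invert z_src to noise; cache source self-attention Q, K per step
3.  for each denoising step t:
4.      run the source and target branches in parallel
5.      target self-attention reuses the cached source Q, K   # structure retention
6.      L ← L_pos(S) − λ · L_neg + L_attn                     # language link losses
7.      update the target latent with ∇L (attention guidance)
8.      M ← language-mixed mask from the cross-attention of edit words and S
9.      z_tgt ← M ⊙ z_tgt + (1 − M) ⊙ z_src                   # preserve untouched regions
10. decode z_tgt into the edited image
```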

Experimental Results

Qualitatively, VICTORIA successfully edits multiple objects, attributes, styles, scenes, and categories within a single image.

Comparisons with other state‑of‑the‑art methods demonstrate that VICTORIA achieves finer alignment with textual prompts while preserving original structural details.

Quantitative evaluation on multiple benchmark datasets shows VICTORIA outperforms competitors on the CDS metric, indicating superior spatial structure retention and prompt‑consistent editing.

Paper Information

Title: Attentive Linguistic Tracking in Diffusion Models for Training‑free Text‑guided Image Editing

Authors: Bingyan Liu, Chengyu Wang, Jun Huang, Kui Jia

PDF: https://openreview.net/pdf?id=efTur2naAS

References

Rombach et al., High‑resolution image synthesis with latent diffusion models, CVPR 2022.

Hertz et al., Prompt‑to‑Prompt image editing with cross attention control, arXiv 2022.

Tumanyan et al., Plug‑and‑play diffusion features for text‑driven image‑to‑image translation, CVPR 2023.

Meng et al., SDEdit: Guided image synthesis and editing with stochastic differential equations, ICLR 2022.

Parmar et al., Zero‑shot image‑to‑image translation, ACM SIGGRAPH 2023.

Rassin et al., Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment, NeurIPS 2023.

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
