DCEdit: Precise Text-Guided Image Editing that Preserves Backgrounds
DCEdit introduces a precise semantic localization strategy and a dual-level control mechanism for text-guided image editing, delivering superior background preservation and editing quality, as demonstrated on the new RW-800 benchmark and in extensive comparisons with state-of-the-art diffusion models.
Introduction
Text-guided image editing faces two critical challenges: accurately locating the target semantics in the source image, and editing the target without damaging the surrounding background. Existing diffusion-based methods, especially those built on UNet backbones, struggle with precise semantic localization and often produce artifacts.
Proposed Method: DCEdit
Precise Semantic Localization (PSL)
Because MM-DiT blocks in FLUX compute joint attention over concatenated text and image tokens, the authors extract the cross-attention between the two modalities from each MM-DiT layer and refine it by fusing the visual self-attention with the inverse of the text self-attention matrix. This refinement yields a high-quality attention map that serves as a region clue for the target semantics.
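To make the refinement concrete, here is a minimal sketch of one plausible formulation: the visual self-attention spreads attention mass across related image patches, while multiplying by a regularized inverse of the text self-attention undoes the mixing among text tokens. The exact fusion rule, the regularization, and all names here are illustrative assumptions, not the paper's verbatim math.

```python
import torch

def refine_cross_attention(cross_attn, img_self_attn, txt_self_attn, eps=1e-4):
    # cross_attn:    (N_img, N_txt) image-to-text cross-attention,
    #                averaged over heads and MM-DiT layers
    # img_self_attn: (N_img, N_img) visual self-attention
    # txt_self_attn: (N_txt, N_txt) text self-attention
    n_txt = txt_self_attn.shape[0]
    eye = torch.eye(n_txt, device=txt_self_attn.device, dtype=txt_self_attn.dtype)
    # Regularize before inverting so the text self-attention is well-conditioned.
    txt_inv = torch.linalg.inv(txt_self_attn + eps * eye)
    # Propagate attention between related image patches, then undo the
    # token mixing introduced by text self-attention.
    refined = img_self_attn @ cross_attn @ txt_inv
    # Min-max normalize each text token's spatial map into [0, 1]
    # so it can serve as a region clue.
    refined = refined - refined.min(dim=0, keepdim=True).values
    return refined / (refined.max(dim=0, keepdim=True).values + 1e-8)
```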
Dual‑Level Control
Two control pathways are introduced:
Feature‑level control: The refined attention maps are injected into the feature space of FLUX, guiding the sampling process with spatially precise cues.
Latent-level control: During diffusion sampling, the edited latents are blended with the source latents through an edit mask obtained by quantile-based binarization of the region clues, preserving background consistency while allowing fine-grained edits (see the sketch after this list).
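A minimal sketch of the latent-level pathway, assuming the standard masked-blend formulation: the region clue is binarized at a chosen quantile and used to composite the edited latents with the source latents. The function name, the default quantile, and the tensor shapes are illustrative assumptions.

```python
import torch

def latent_level_blend(edited_latent, source_latent, region_score, q=0.8):
    # edited_latent, source_latent: (C, H, W) latents at the current step
    # region_score: (H, W) refined attention map (the PSL region clue)
    # q: quantile threshold -- the 0.8 default is illustrative, not from the paper
    threshold = torch.quantile(region_score.flatten(), q)
    edit_mask = (region_score >= threshold).to(edited_latent.dtype)  # 1 inside the edit region
    # Keep the source latent outside the edit region and the edited latent inside it.
    return edit_mask * edited_latent + (1.0 - edit_mask) * source_latent
```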
Inversion and Soft Fusion
The inversion stage derives the initial noise latent corresponding to the source image, and PSL is applied early in the process (typically at step 10) to identify the regions to edit. Instead of hard-replacing features, a soft-fusion mechanism weights the edited features by the refined attention maps, preventing the loss of early-stage semantic information.
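A sketch of soft fusion under these assumptions: rather than a hard 0/1 replacement inside a binary mask, the edited and source features are blended with the continuous attention weights, so early-stage semantics from the source branch are retained. Names and shapes are hypothetical.

```python
import torch

def soft_fuse(edited_feat, source_feat, attn_map):
    # edited_feat, source_feat: (N_tokens, C) features at some MM-DiT block
    # attn_map: (N_tokens,) refined attention scores, assumed in [0, 1]
    w = attn_map.clamp(0.0, 1.0).unsqueeze(-1)  # (N_tokens, 1), broadcasts over channels
    # Continuous blend instead of hard replacement.
    return w * edited_feat + (1.0 - w) * source_feat
```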
RW‑800 Benchmark
Inspired by PIE‑Bench, the authors construct RW‑800, a real‑world image‑editing benchmark containing high‑resolution images, detailed source and target prompts, and manually refined masks generated via Grounded‑SAM and interactive segmentation. The dataset includes ten editing types, adding a new “text editing” category to evaluate DiT’s ability to modify textual content in images.
Experiments
Quantitative Comparison on PIE‑Bench
DCEdit is compared with UNet-based baselines (P2P, MasaCtrl, P2P-zero, PnP, PnP-Inv) and DiT-based methods (RF-Inv, Stable Flow, RF-Edit, Fireflow). The results show that DCEdit integrates with RF-Edit and Fireflow in a plug-and-play manner, improving background consistency and editing quality without extra computational cost.
Quantitative Comparison on RW‑800
On RW‑800, DCEdit reduces background mean‑square error (MSE) by 20 % for RF‑Edit and 38 % for Fireflow, while maintaining or improving CLIP similarity scores. Structural similarity (SSIM) and IoU metrics also show consistent gains.
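For concreteness, background MSE here means the per-pixel squared error restricted to pixels outside the edit mask; a minimal sketch of that metric (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def background_mse(edited_img, source_img, edit_mask):
    # edited_img, source_img: (3, H, W) tensors in [0, 1]
    # edit_mask: (H, W) mask, 1 inside the edited region
    bg = edit_mask < 0.5  # boolean selector for background pixels
    return F.mse_loss(edited_img[:, bg], source_img[:, bg])
```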
Qualitative Comparison
Visual examples illustrate that competing methods either alter the background significantly (RF‑Inv) or produce weak edits (Stable Flow), whereas DCEdit achieves strong, localized edits while keeping the original background intact.
Ablation Studies
Component‑wise ablations on RW‑800 reveal:
Feature‑level control alone improves editability but can increase structural distance if binary masks are inaccurate.
Using score maps instead of binary masks reduces structural distance while preserving edit strength.
Soft‑fusion retains feature integrity better than hard replacement.
Latent-level control applied in the first two sampling steps dramatically enhances background consistency with minimal loss of editability; extending control to more steps further lowers structural distance and raises CLIP scores by up to 4 % (a toy loop below illustrates this schedule).
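A toy sampling loop, just to make that schedule concrete: latent-level blending is applied only during the first few denoising steps, after which sampling runs unconstrained. `denoise_step` and `blend_step` are placeholder callables, not the authors' API.

```python
def sample_with_latent_control(latents, denoise_step, blend_step,
                               num_steps, control_steps=2):
    # denoise_step(latents, t) -> latents : one solver step (placeholder)
    # blend_step(latents, t)   -> latents : masked blend with the source
    #                                       latents (e.g. latent_level_blend)
    for t in range(num_steps):
        latents = denoise_step(latents, t)
        if t < control_steps:
            latents = blend_step(latents, t)
    return latents
```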
Conclusion
DCEdit presents a novel dual‑level controlled image‑editing framework that leverages precise semantic localization to guide diffusion sampling. The method outperforms prior approaches on both background preservation and edit quality, and the newly introduced RW‑800 benchmark provides a rigorous testbed for future text‑guided editing research.
