How SuperEdit Boosts Instruction-Based Image Editing with Rectified Supervision
SuperEdit introduces rectified instruction generation and contrastive supervision to address noisy supervision in instruction‑based image editing. It achieves up to a 9.19% performance gain on the Real‑Edit benchmark without extra model parameters or pre‑training, and all data and code are released publicly.
Summary Overview
Problem
Noisy supervision signal: Existing instruction‑based editing datasets contain edit instructions that are mismatched with their original‑edited image pairs, introducing substantial supervision noise.
Complex scene editing difficulty: Models struggle with edits involving multiple objects, quantities, positions, or spatial relations.
Reliance on extra modules: Prior methods need vision‑language models (VLMs), pre‑training tasks, or complex architectures, increasing computational cost without resolving the noise issue.
Proposed Solution
Rectified Instructions: Use a VLM (e.g., GPT‑4o) together with diffusion‑model priors to generate edit instructions that better match the original‑edited image pair.
Contrastive Supervision: Build positive‑negative instruction pairs and apply a triplet loss so the model learns from both correct and deliberately corrupted instructions.
Technical Components
Vision‑Language Model (VLM): GPT‑4o is employed to analyse image differences and produce refined instructions.
Diffusion Prior Knowledge: Early diffusion steps capture layout, middle steps capture shape and color, and late steps capture fine details; these attributes guide the VLM's instruction correction.
Triplet Loss: A contrastive loss distinguishes correct from incorrect instructions, improving the model's sensitivity to subtle differences between commands.
Achieved Effects
Significant performance boost: 9.19% improvement on Real‑Edit without extra parameters or pre‑training.
Simplified architecture: No additional VLM module or pre‑training task is required during inference.
Open‑source contribution: All data and models are released to the community.
Evaluation advantage: Both GPT‑4o automatic scores and human evaluations surpass the previous SOTA (SmartEdit).
Method
The baseline follows the InstructPix2Pix framework, which conditions a diffusion model on both the original image and an edit instruction to generate the edited image. During training, a random timestep t is sampled, noise ε is added to the latent of the edited image to obtain z_t, and the model learns to predict this noise. The loss can be expressed as L = E[‖ε − ε_θ(concat(z_t, z_I), t, c_T)‖²], where z_I is the latent of the original image, c_T is the instruction embedding, and concat denotes channel‑wise concatenation of the noisy edited‑image latent with the original‑image latent.
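As a rough sketch, the channel‑wise conditioning above can be illustrated in NumPy. The shapes, toy noise schedule, and stand‑in U‑Net below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shapes: (batch, channels, height, width).
B, C, H, W = 2, 4, 8, 8

def add_noise(z_edited, eps, t, T=1000):
    """Toy variance-preserving forward process (schedule is an assumption)."""
    alpha_bar = np.cos(0.5 * np.pi * t / T) ** 2
    return np.sqrt(alpha_bar) * z_edited + np.sqrt(1.0 - alpha_bar) * eps

def fake_unet(x, t, text_emb):
    """Stand-in for the denoising U-Net: returns a noise prediction
    with the same shape as a single latent."""
    return x[:, :C] * 0.0  # placeholder prediction

z_edited = rng.standard_normal((B, C, H, W))  # latent of the edited image
z_orig   = rng.standard_normal((B, C, H, W))  # latent of the original image
text_emb = rng.standard_normal((B, 77, 768))  # instruction embedding

t   = int(rng.integers(0, 1000))
eps = rng.standard_normal((B, C, H, W))       # ground-truth noise
z_t = add_noise(z_edited, eps, t)

# Channel-wise concatenation conditions the model on the original image.
unet_in  = np.concatenate([z_t, z_orig], axis=1)  # (B, 2C, H, W)
eps_pred = fake_unet(unet_in, t, text_emb)

# Standard denoising objective: mean squared error on the predicted noise.
loss = float(np.mean((eps - eps_pred) ** 2))
```

The doubled channel dimension of `unet_in` is why InstructPix2Pix‑style models widen the U‑Net's first convolution rather than add new modules.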
Diffusion Prior Supervision Correction
Existing pipelines use only two steps: (1) generate a textual edit prompt, and (2) synthesize the edited image with a diffusion model. This often yields mismatched instruction‑image pairs. By analysing generation across diffusion timesteps, the authors find that early steps determine layout, middle steps determine object attributes such as shape and color, and late steps refine fine details. These priors are used to guide GPT‑4o in producing edit instructions that accurately describe the change between the original and edited images.
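The stage‑to‑attribute prior can be sketched as a simple lookup that structures the VLM's rectification prompt. The timestep boundaries and prompt wording below are assumptions for illustration, not values from the paper:

```python
# Map diffusion timestep ranges to the attribute each stage controls
# (boundaries are illustrative; denoising runs from t=1000 down to t=0).
STAGE_PRIORS = [
    ((1000, 600), "global layout and object placement"),
    ((600, 300), "object shape and color attributes"),
    ((300, 0),   "fine textures and local details"),
]

def stage_for_timestep(t):
    """Return the attribute the diffusion process is resolving at timestep t."""
    for (hi, lo), attr in STAGE_PRIORS:
        if lo <= t < hi or t == hi == 1000:
            return attr
    raise ValueError(f"timestep {t} out of range")

def build_rectification_prompt(original_caption, edited_caption):
    """Assemble a hypothetical VLM prompt that walks through the priors
    in coarse-to-fine order before asking for a corrected instruction."""
    guidelines = "; ".join(attr for _, attr in STAGE_PRIORS)
    return (
        "Compare the original and edited images and write an edit "
        f"instruction that matches the actual change. Check, in order: "
        f"{guidelines}. "
        f"Original: {original_caption}. Edited: {edited_caption}."
    )
```

Ordering the checks coarse‑to‑fine mirrors the denoising process itself, which is the intuition behind using diffusion priors to steer instruction correction.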
Contrastive Supervision with Paired Instructions
To teach the model fine‑grained differences, we construct negative instructions by altering a single attribute (e.g., quantity, spatial relation) of the rectified instruction while keeping the rest unchanged. The model receives both positive and negative instructions during training and is optimized with a triplet loss of the form L_tri = max(0, ‖ε − ε_θ⁺‖² − ‖ε − ε_θ⁻‖² + m), where ε is the ground‑truth noise, ε_θ⁺ and ε_θ⁻ are the noise predictions conditioned on the positive and negative instructions, and m is a margin.
The final training loss combines the original diffusion loss with the triplet loss.
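A minimal sketch of such a triplet objective follows; the squared‑error distance, the margin value, and the weighting term are assumptions, not the paper's exact settings:

```python
import numpy as np

def triplet_loss(eps_true, eps_pos, eps_neg, margin=1.0):
    # d_pos: distance between the ground-truth noise and the prediction
    # made with the rectified (positive) instruction; d_neg: the same
    # with the deliberately corrupted (negative) instruction.
    d_pos = float(np.mean((eps_true - eps_pos) ** 2))
    d_neg = float(np.mean((eps_true - eps_neg) ** 2))
    # Hinge: the positive prediction must beat the negative by `margin`.
    return max(0.0, d_pos - d_neg + margin)

def total_loss(diffusion_loss, tri, lam=0.5):
    # Combined objective; `lam` is a hypothetical weighting coefficient.
    return diffusion_loss + lam * tri
```

When the positive prediction is already much closer to the true noise than the negative one, the hinge clips to zero and only the diffusion term drives training.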
Experiments
Data Collection
We aggregate 40,000 training pairs from InstructPix2Pix, MagicBrush, and Seed‑Data‑Edit, balancing various edit types. For MagicBrush we use existing human‑verified instructions; for Seed‑Data‑Edit we only use the image pairs. All data undergo rectified instruction generation and contrastive supervision construction.
Experimental Setup
Evaluation uses the Real‑Edit benchmark with three metrics: Following (instruction adherence), Preserving (non‑edited region retention), and Quality (overall aesthetic score). Both GPT‑4o automatic scoring and human evaluation are performed.
Experimental Results
On Real‑Edit, SuperEdit outperforms SmartEdit on all three metrics, achieving an 11.4% higher overall score despite using 1/30 of the training data and 1/13 of the model parameters. Human evaluators (15 participants) confirm these gains, reporting 1.8%–16% improvements across metrics.
Ablation Studies
Instruction correction vs. contrastive supervision: Using only rectified instructions improves scores by 0.95, 0.79, and 0.11 points on the three metrics; adding contrastive supervision yields additional gains of 0.19 and 0.08 points on Following and Preserving without affecting Quality.
Data scale: Performance scales with data size; 5k samples already achieve reasonable results, while 40k samples give the best scores, indicating no saturation yet.
Conclusion
By focusing on higher‑quality supervision rather than architectural changes, SuperEdit demonstrates that rectified instructions and contrastive learning can substantially improve instruction‑based image editing. The approach achieves state‑of‑the‑art results with fewer data, no extra modules, and open‑source releases, offering a valuable new direction for future research.
References
SuperEdit: Rectifying and Facilitating Supervision for Instruction‑Based Image Editing.