ImgEdit-Bench Exposes Weak Image-Editing Models – A Demanding Benchmark Reveals Who's Struggling
ImgEdit introduces a large-scale, high-quality editing dataset together with the ImgEdit-Bench benchmark. The work details a robust data-generation pipeline, multi-round editing tasks, and a specialized evaluation model, and demonstrates through extensive experiments that its ImgEdit-E1 model outperforms existing open-source editors and narrows the gap with closed-source systems.
Highlights
Robust data pipeline – a high‑quality, diverse data‑generation workflow ensures the dataset supports a wide range of editing tasks.
New dataset – ImgEdit contains 1.2 M high‑quality edit pairs (1.1 M single‑round, 0.11 M multi‑round) covering 10 edit operations and 3 novel interaction types.
Comprehensive benchmark – ImgEdit‑Bench evaluates models on three dimensions (basic, challenging, multi‑round) with layered difficulty.
Advanced models – ImgEdit‑E1 surpasses existing open‑source editors; ImgEdit‑Judge aligns evaluation with human preference.
Problem Statement
Open‑source image‑editing models lag behind closed‑source counterparts due to a lack of high‑quality datasets and reliable benchmarks.
Existing datasets suffer from low resolution, simplistic prompts, small edit regions, inaccurate edits, concept imbalance, and imprecise filtering.
Complex editing scenarios (identity preservation, multi‑object manipulation, multi‑turn interaction) are insufficiently supported.
Current evaluation frameworks lack diversity, do not stratify task difficulty, over‑focus on edit‑type count, and ignore measurement accuracy.
Proposed Solution
ImgEdit dataset : 1.2 M edit pairs (1.1 M single‑round, 0.11 M multi‑round) covering 10 edit types (local, global, visual, mixed) and three multi‑turn challenges (content memory, understanding, version rollback).
Automated data‑construction workflow :
Multi‑stage filtering (aesthetic score > 4.75, short side ≥ 1280 px).
Open‑vocabulary detector + SAM2 for object‑level annotations.
GPT‑4o generates diverse single‑ and multi‑round edit prompts.
Task‑specific pipelines (e.g., SOTA generators) create edit pairs.
GPT‑4o evaluates edit‑pair quality.
ImgEdit-E1 model : a vision-language-model (VLM) based editor that ingests the reference image and edit instruction, combines a visual encoder with a DiT backbone, and is trained in two stages (MLP-connector pre-training → joint fine-tuning with FLUX); a minimal architecture sketch follows this list.
ImgEdit‑Bench benchmark : three test suites – basic test set (9 edit categories, 734 cases), UGE (understanding‑location‑edit) suite (47 challenging scenes), and multi‑turn suite (10 images × 3 dialogue rounds).
Evaluation metrics : instruction‑following, edit quality, detail preservation (1–5 scale by GPT‑4o); authenticity score using FakeShield for forgery detection.
ImgEdit‑Judge : a Qwen2.5‑VL‑7B model fine‑tuned on 200 k post‑processing scores to align with human preference, achieving ~70 % agreement with human judges.
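To make the two-stage ImgEdit-E1 recipe above more concrete, here is a minimal PyTorch-style sketch. The module names, feature dimensions, and the connector design are illustrative assumptions; only the overall wiring (VLM and visual-encoder features bridged by an MLP into a FLUX-style DiT, with the connector trained first and the DiT fine-tuned jointly afterwards) follows the description in this summary.

```python
# Illustrative sketch only: names, sizes, and the connector design are assumptions,
# not the authors' actual implementation.
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Bridges VLM / visual-encoder features into the DiT conditioning space."""
    def __init__(self, in_dim=1024, out_dim=3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x):
        return self.proj(x)

class ImgEditE1Sketch(nn.Module):
    def __init__(self, vlm, vision_encoder, dit):
        super().__init__()
        self.vlm = vlm                        # encodes instruction + reference image (semantics)
        self.vision_encoder = vision_encoder  # pixel-level features of the reference image
        self.connector = MLPConnector()
        self.dit = dit                        # FLUX-style diffusion transformer

    def forward(self, ref_image, instruction_tokens, noisy_latents, timestep):
        cond_tokens = self.vlm(instruction_tokens, ref_image)
        vis_tokens = self.vision_encoder(ref_image)
        cond = self.connector(torch.cat([cond_tokens, vis_tokens], dim=1))
        return self.dit(noisy_latents, timestep, cond)

def set_trainable(model: ImgEditE1Sketch, stage: int) -> None:
    """Stage 1: train only the connector. Stage 2: also fine-tune the DiT (FLUX)."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.connector.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.dit.parameters():
            p.requires_grad = True
```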
Technical Details
The data source is LAION‑Aesthetics; after filtering, 600 k images are kept. GPT‑4o rewrites captions and extracts editable objects. Open‑vocabulary detection produces bounding boxes, refined to masks by SAM2. CLIPScore and area ratio filter out low‑similarity or tiny regions (e.g., background‑replace requires > 40 % area). For dynamic‑change edits, 160 k frames from Open‑Sora Plan are annotated with actions by GPT‑4o.
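As a rough illustration of this object-level filtering step, the sketch below keeps or discards a segmented region based on its CLIPScore and area ratio. All thresholds except the >40 % area requirement for background replacement are assumptions, not the paper's exact values.

```python
# Hypothetical region filter in the spirit of the pipeline described above.
import numpy as np

def keep_region(mask: np.ndarray, clip_score: float, edit_type: str,
                min_clip: float = 0.25, min_area: float = 0.01) -> bool:
    """Decide whether a detected/segmented region is usable for an edit pair.

    mask       -- boolean segmentation mask from SAM2 (H x W)
    clip_score -- image-text similarity between the crop and its object label
    edit_type  -- e.g. "add", "remove", "background_replace", ...
    """
    area_ratio = mask.sum() / mask.size
    if clip_score < min_clip:            # label does not match the region
        return False
    if area_ratio < min_area:            # region too tiny to yield a visible edit
        return False
    if edit_type == "background_replace" and area_ratio <= 0.40:
        return False                     # background edits must cover >40% of the image
    return True

# Example: a ~5%-area region with a confident label passes for "remove"
# but is rejected for "background_replace".
mask = np.zeros((512, 512), dtype=bool); mask[:160, :80] = True
print(keep_region(mask, clip_score=0.31, edit_type="remove"))              # True
print(keep_region(mask, clip_score=0.31, edit_type="background_replace"))  # False
```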
Instruction generation conditions on the image description, edit type, bounding box, and target object, forcing the language model to embed spatial information in each instruction. Multi-turn prompts are generated with few-shot examples and limited to 2–3 rounds per dialogue.
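A hypothetical prompt template for this conditioning step might look like the following; the exact wording used with GPT-4o is not given in this summary, so the template and field names are illustrative.

```python
# Illustrative prompt builder for instruction generation (not the authors' exact prompt).
def build_edit_instruction_prompt(caption: str, edit_type: str,
                                  bbox: tuple, target: str) -> str:
    x0, y0, x1, y1 = bbox
    return (
        "You write concise image-editing instructions.\n"
        f"Image description: {caption}\n"
        f"Edit type: {edit_type}\n"
        f"Target object: {target} at bounding box ({x0}, {y0}, {x1}, {y1}).\n"
        "Write one natural-language instruction for this edit. Refer to the object's "
        "position in the image (e.g. 'the cup on the left side of the table') so the "
        "instruction encodes spatial information."
    )

prompt = build_edit_instruction_prompt(
    caption="a wooden table with a red cup and a laptop",
    edit_type="remove",
    bbox=(120, 340, 260, 480),
    target="red cup",
)
# `prompt` is then sent to GPT-4o (or any capable LLM) to obtain the edit instruction.
```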
FLUX and SDXL serve as base generators; IP‑Adapters and ControlNet provide precise control. Post‑processing filters use object area, CLIPScore, and aesthetic score, followed by GPT‑4o fine‑grained scoring per edit type.
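As a stand-in for the task-specific generation step, the sketch below produces the edited half of a pair with an off-the-shelf SDXL inpainting pipeline from diffusers. The actual ImgEdit pipelines combine FLUX/SDXL with IP-Adapters and ControlNet, so treat the model choice and parameters here as placeholders.

```python
# Minimal mask-guided generation sketch (placeholder for the task-specific pipelines).
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16
).to("cuda")

source = load_image("source.png")        # filtered LAION image
mask = load_image("object_mask.png")     # SAM2 mask for the target object

edited = pipe(
    prompt="replace the red cup with a blue ceramic mug",
    image=source,
    mask_image=mask,
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]
edited.save("edited.png")                # (source, instruction, edited) forms one edit pair
```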
Dataset Statistics
ImgEdit contains 1.2 M edit pairs (including 110 k multi‑round samples), covering 13 edit categories, with an average short‑side resolution of 1280 px and 8.7 k unique instruction tokens. Human‑verified edit accuracy (sampled 1 k pairs) is the highest among comparable datasets. Pixel‑level difference analysis shows significantly larger local edit regions than prior datasets.
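One simple way to reproduce this kind of pixel-level difference analysis is to measure the fraction of visibly changed pixels per pair; the change threshold below is an assumption, since the summary does not specify the exact procedure.

```python
# Rough illustration: what fraction of the image actually changed between source and edit?
import numpy as np
from PIL import Image

def changed_area_ratio(src_path: str, edit_path: str, thresh: int = 16) -> float:
    # Assumes both images share the same resolution.
    src = np.asarray(Image.open(src_path).convert("RGB"), dtype=np.int16)
    out = np.asarray(Image.open(edit_path).convert("RGB"), dtype=np.int16)
    diff = np.abs(src - out).max(axis=-1)      # per-pixel max channel difference
    return float((diff > thresh).mean())       # fraction of visibly changed pixels

# A dataset whose local edits consistently score higher here has larger,
# more meaningful edit regions.
```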
Benchmark Construction
Basic test set: 734 cases across add, remove, modify, replace, style transfer, background replace, dynamic change, mixed edit, and matting. Each instruction is initially generated by GPT‑4o and manually filtered.
UGE suite: 47 complex scenes with challenges such as occlusion, multiple similar instances, camouflage, and rare objects, demanding spatial reasoning and multi‑object coordination.
Multi‑turn suite: evaluates content memory, contextual understanding, and version rollback across three dialogue rounds per image.
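A hypothetical layout for one multi-turn test case, covering the three evaluated skills, could look like this; the field names and schema are illustrative, not the benchmark's actual format.

```python
# Illustrative schema for a multi-turn test case (not the benchmark's real format).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    instruction: str                   # e.g. "add a red scarf to the dog"
    skill: str                         # "content_memory", "contextual_understanding", or "version_rollback"
    rollback_to: Optional[int] = None  # index of the earlier turn to restore (rollback turns only)

@dataclass
class MultiTurnCase:
    image_path: str
    turns: List[Turn] = field(default_factory=list)

case = MultiTurnCase(
    image_path="dog.png",
    turns=[
        Turn("add a red scarf to the dog", skill="content_memory"),
        Turn("make the scarf it is wearing blue", skill="contextual_understanding"),
        Turn("discard the last change and return to the first edit",
             skill="version_rollback", rollback_to=0),
    ],
)
```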
Experimental Evaluation
Setup
Closed-source baseline: GPT-4o-Image (Gemini-2.0-Flash was unavailable via API). Open-source baselines: Step1X-Edit, UltraEdit, AnySD, MagicBrush, InstructPix2Pix, and ImgEdit-E1. Architectures: ImgEdit-E1 and Step1X-Edit use VLM + DiT; the others rely on UNet + pre-trained text encoders (AnySD adds MoE).
Resolution: UltraEdit and AnySD output 512×512; the others output 1024×1024. Each model runs three independent trials, and averages are reported. Among the evaluated systems, only GPT-4o-Image and Gemini-2.0-Flash support the multi-turn tests.
Results
Quantitative : GPT‑4o‑Image leads across all metrics; ImgEdit‑E1 and Step1X‑Edit are the strongest open‑source models, with ImgEdit‑E1 excelling in object extraction and mixed‑edit tasks.
ImgEdit‑E1 shows balanced performance and superior scores on object‑extraction and mixed edits.
Step1X‑Edit matches ImgEdit‑E1 overall but lags on background‑replace and attribute‑modification.
AnySD delivers only middling performance, likely due to broad but lower-quality training data.
UltraEdit fails on removal tasks because its training data contains no removal examples.
MagicBrush and InstructPix2Pix suffer from artifacts and instruction mis‑alignment.
All models receive high "fake" scores from FakeShield, indicating that current forensic detectors can still recognize the edited content as synthetic.
Multi‑turn : Only GPT‑4o‑Image and Gemini‑2.0‑Flash handle version‑rollback within two turns; other models struggle with content memory and understanding.
Qualitative : Representative cases show ImgEdit‑E1 and GPT‑4o‑Image successfully preserve details (e.g., changing bicycle color while keeping snow) and perform object extraction, whereas other models produce blurry or incorrect results.
Discussion
The benchmark identifies three key factors influencing editing performance: instruction understanding (driven by the text encoder), region localization (requiring precise spatial cues), and edit execution (dependent on data quality and diversity). ImgEdit‑E1’s strong VLM encoder and high‑quality training data explain its advantage.
Conclusion
The ImgEdit framework advances image‑editing research by delivering a high‑quality dataset, a robust data‑generation pipeline, and a comprehensive benchmark. ImgEdit‑E1 validates the framework’s effectiveness, and ImgEdit‑Bench provides actionable insights for future model design, narrowing the gap between open‑source and state‑of‑the‑art closed‑source editors.
References
[1] ImgEdit: A Unified Image Editing Dataset and Benchmark.