ICRDrag: The First In‑Context Region Drag Model for Precise, Controllable Image Editing
ICRDrag, presented at ECCV 2026, introduces an in‑context region‑dragging framework that uses mask‑based attention and bidirectional source‑target constraints to achieve precise, natural image edits while overcoming the deformation and boundary issues of earlier point‑ and region‑drag methods.
ICRDrag (In‑Context Region‑based Drag) is a context‑aware region‑dragging model that lets users select a source region and a target region with masks, then moves, scales, or deforms the source region while preserving surrounding details. The demo shows a source image with a blue mask (region to move) and a red mask (target location); dragging the source to the target moves the object, keeps ancillary parts (e.g., mouth and chin) consistent, and minimizes unnecessary changes. The online demo supports up to five source‑target pairs and allows adding anchor masks to lock unaffected areas.
Technical contributions
In‑context learning framework: Built on DiT (Diffusion Transformer), the model receives the original image, source mask, and target mask as context in a single forward pass and directly outputs the edited image.
Image‑mask attention consistency: The attention map of the generated image must align with the spatial distribution of the target mask, ensuring strict adherence to the defined region.
Source‑target bidirectional attention: The target region attends to the corresponding source region and vice‑versa, establishing a clear correspondence between pre‑ and post‑edit objects.
Separate LoRA modules for image and mask: Independent LoRA adapters are trained for each modality because images contain rich texture while masks encode only shape.
Two‑stage progressive training: Stage 1 uses complete semantic masks to teach basic region‑transformation logic; Stage 2 introduces randomly expanded, coarse masks to simulate hand‑drawn selections, dramatically improving tolerance to imperfect user input.
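The Stage 2 idea of randomly expanding precise masks can be illustrated with a small sketch. This is not the authors' augmentation code: the function names `dilate` and `coarsen_mask`, the square neighbourhood, and the `max_radius` parameter are all assumptions made for illustration, and masks are represented as plain nested lists so the example runs without extra dependencies.

```python
import random
from typing import List, Optional

Mask = List[List[int]]  # 1 = pixel inside the region, 0 = outside


def dilate(mask: Mask, radius: int) -> Mask:
    """Grow the region by `radius` pixels using a square neighbourhood."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Mark every pixel within `radius` rows/columns of (y, x).
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def coarsen_mask(mask: Mask, max_radius: int = 3,
                 rng: Optional[random.Random] = None) -> Mask:
    """Hypothetical Stage 2 style augmentation: expand a precise mask by a
    random radius so it resembles a loose, hand-drawn selection."""
    rng = rng or random.Random()
    radius = rng.randint(1, max_radius)  # inclusive on both ends
    return dilate(mask, radius)


# Usage: a single-pixel "precise" mask becomes a larger, coarser one.
precise = [[0] * 7 for _ in range(7)]
precise[3][3] = 1
coarse = coarsen_mask(precise, max_radius=2, rng=random.Random(0))
for row in coarse:
    print("".join("#" if v else "." for v in row))
```

The coarse mask always contains the precise one, so a model trained on such pairs sees the same object under increasingly sloppy selections. A real pipeline would likely work on image arrays and might use irregular rather than square expansion.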
Dataset and evaluation
To train ICRDrag, the authors constructed the PRD (Paired Region Dataset) from the million‑scale video collection OpenVid, yielding 287,000 samples, each consisting of an original image, a source mask, a target image, and a target mask. For evaluation, PRDBench provides 1,000 manually verified high‑quality samples with masks and keypoints, enabling fair comparison between point‑drag and region‑drag models.
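One way to picture a single PRD training sample is as a small record holding the four components listed above. The sketch below is illustrative only: the class name `RegionDragSample`, its field names, and the checks (matching resolution, non-empty masks) are assumptions, not the dataset's actual schema, and images are stood in for by nested lists.

```python
from dataclasses import dataclass
from typing import List, Tuple

Grid = List[List[int]]  # toy stand-in for an image or a binary mask


def grid_shape(grid: Grid) -> Tuple[int, int]:
    """Return (height, width) of a nested-list grid."""
    return (len(grid), len(grid[0]) if grid else 0)


@dataclass
class RegionDragSample:
    """Hypothetical container for one PRD-style sample."""
    original_image: Grid
    source_mask: Grid   # region to be moved
    target_image: Grid  # image after the region has moved
    target_mask: Grid   # where the region ends up

    def __post_init__(self) -> None:
        # Assumption: all four components share one resolution.
        shape = grid_shape(self.original_image)
        for name in ("source_mask", "target_image", "target_mask"):
            if grid_shape(getattr(self, name)) != shape:
                raise ValueError(f"{name} does not match the image shape {shape}")
        # Assumption: an empty mask selects nothing and is not a valid sample.
        for name in ("source_mask", "target_mask"):
            if not any(any(row) for row in getattr(self, name)):
                raise ValueError(f"{name} is empty")


# Usage: a 2x2 toy sample whose region moves from the top-left to the bottom-right.
image = [[10, 20], [30, 40]]
sample = RegionDragSample(
    original_image=image,
    source_mask=[[1, 0], [0, 0]],
    target_image=[[0, 20], [30, 10]],
    target_mask=[[0, 0], [0, 1]],
)
print(grid_shape(sample.source_mask))
```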
Resources
Paper: https://arxiv.org/pdf/2606.25907
GitHub: https://github.com/bcmi/ICRDrag-Region-Drag-Editing
Demo: https://drag.ustcnewly.com/
