TurboFill: High‑Quality Image Inpainting in Just 4 Steps

TurboFill introduces a fast image‑inpainting model that trains a repair adapter on a few‑step text‑to‑image diffusion backbone, achieving state‑of‑the‑art results with only four diffusion steps while dramatically reducing computational cost.


Overview

Image inpainting has progressed rapidly with diffusion models, but standard multi‑step approaches are computationally expensive. TurboFill addresses this by augmenting a few‑step text‑to‑image diffusion model (DMD2) with a dedicated repair adapter, delivering high‑quality results in only four diffusion steps.

Method

The system comprises three components: a slow generator (SDXL + adapter), a fast generator (DMD2 + adapter), and a diffusion discriminator. The adapter shares weights between the two generators. Training alternates among three steps:

Update the adapter in the slow generator using diffusion loss.

Train the adapter in the fast generator with adversarial loss that minimizes the distance between real and fake latent distributions.

Train the diffusion discriminator while keeping the adapter fixed, using a combination of GAN loss, diffusion loss, and a background‑preserving reconstruction loss.
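The alternating schedule above can be sketched as a plain training loop. All update functions here are hypothetical placeholders; only the alternation pattern itself is taken from the description above.

```python
# Sketch of TurboFill's alternating three-step training schedule.
# The three *_step callables are hypothetical placeholders, each
# standing in for one optimizer update of that component.

def train_turbofill(num_iters, slow_step, fast_step, disc_step):
    """Run the three-phase schedule for `num_iters` iterations and
    return the order in which the phases executed."""
    history = []
    for _ in range(num_iters):
        slow_step()   # 1) adapter in slow generator: diffusion loss
        history.append("slow")
        fast_step()   # 2) adapter in fast generator: adversarial loss
        history.append("fast")
        disc_step()   # 3) discriminator: GAN + diffusion + background loss
        history.append("disc")
    return history
```

Note that because the adapter shares weights between the two generators, steps 1 and 2 update the same adapter parameters from different objectives.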

The discriminator consists of an SDXL encoder, an auxiliary encoder (mirroring the adapter architecture), and a convolutional classifier. The auxiliary encoder processes the concatenated noisy latent, down‑sampled binary mask, and masked‑image latent, feeding its features into the SDXL encoder. The classifier maps the final feature map to a scalar used for the GAN loss.
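As a concrete illustration of the auxiliary encoder's input, the three tensors are concatenated along the channel axis. Assuming the standard 4-channel SDXL VAE latent (this summary does not state channel counts), the input would have 4 + 1 + 4 = 9 channels:

```python
# Channel bookkeeping for the auxiliary encoder's input: noisy latent,
# down-sampled binary mask, and masked-image latent, concatenated along
# the channel dimension. The 4-channel latent is an assumption based on
# the standard SDXL VAE; the mask contributes a single channel.

def aux_encoder_in_channels(latent_ch=4, mask_ch=1):
    # noisy latent + binary mask + masked-image latent
    return latent_ch + mask_ch + latent_ch

def concat_channels(*tensors):
    """Concatenate (C, H, W) nested-list tensors along the channel
    axis, checking that spatial sizes match first."""
    h, w = len(tensors[0][0]), len(tensors[0][0][0])
    out = []
    for t in tensors:
        assert len(t[0]) == h and len(t[0][0]) == w, "spatial mismatch"
        out.extend(t)
    return out
```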

Datasets and Metrics

Two new benchmarks are introduced:

DilationBench: 300 mask‑prompt pairs generated by randomly dilating segmentation masks.

HumanBench: 150 manually annotated mask‑prompt pairs.

Evaluation uses four metrics: Q‑Align, CLIPIQA+, TOPIQ, and CLIP similarity, all higher‑is‑better. Both masked‑region and whole‑image quality are measured.
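CLIP similarity, for instance, reduces to the cosine similarity between the CLIP image embedding of the inpainted result and the CLIP text embedding of the prompt. The core computation, shown here on plain Python lists with the embeddings assumed to be precomputed by a CLIP encoder, is:

```python
import math

def clip_similarity(img_emb, txt_emb):
    """Cosine similarity between two embedding vectors, the core of
    CLIP-based text-alignment metrics. The embeddings are assumed to
    come from a CLIP image/text encoder (not shown here)."""
    dot = sum(a * b for a, b in zip(img_emb, txt_emb))
    n_img = math.sqrt(sum(a * a for a in img_emb))
    n_txt = math.sqrt(sum(b * b for b in txt_emb))
    return dot / (n_img * n_txt)
```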

Experimental Setup

All experiments run on eight A100 GPUs (40 GB). Training uses AdamW, a learning rate of 1e‑5, batch size 2, gradient accumulation 4, and mixed‑precision. For SDXL the DDIM scheduler samples 1000 timesteps; for DMD2 the LCM scheduler samples four specific timesteps.
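With batch size 2, gradient accumulation 4, and eight GPUs, the effective batch size per optimizer update works out to 64. A minimal sketch of that bookkeeping, assuming the usual one-process-per-GPU data-parallel setup (the summary does not spell this out):

```python
# Effective batch size under data parallelism with gradient
# accumulation: each of `num_gpus` processes sees `per_gpu_batch`
# samples per micro-step and accumulates `grad_accum` micro-steps
# before the synchronized optimizer update.

def effective_batch_size(per_gpu_batch=2, grad_accum=4, num_gpus=8):
    return per_gpu_batch * grad_accum * num_gpus
```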

Quantitative Results

TurboFill is compared against BLD, HD‑Painter, SDXL‑Inpainting, BrushNet‑Rand, Power‑Paint, and a version of BrushNet trained on the LocalCaptionData (BrushNet*). On both DilationBench and HumanBench, TurboFill with four steps outperforms the 50‑step baselines on Q‑Align, CLIPIQA+, and TOPIQ, demonstrating superior masked‑region quality, whole‑image quality, and text‑alignment.

BrushNet* improves text‑alignment by 3.75 (50‑step) and 3.52 (4‑step) points over the original BrushNet, confirming that training with local caption data enhances semantic consistency.

Qualitative Comparison

Visual inspection (Figures 4‑6) shows that TurboFill preserves fine details and texture without the over‑sharpening artifacts seen in Power‑Paint V2 or the color‑saturation issues of BrushNet‑4‑step. The method also avoids the unrealistic artifacts (e.g., double‑headed animals) produced by other approaches.

Ablation Studies

Removing any of the three loss components—GAN loss, diffusion loss, or background‑preserving loss—degrades all metrics and introduces visible artifacts such as color mismatches at mask boundaries or incoherent scene composition, confirming the necessity of the combined three‑step adversarial training.
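The combined discriminator objective can be written as a weighted sum of the three terms. The weights below are hypothetical (this summary does not report them); the ablation only establishes that all three terms are needed.

```python
def discriminator_loss(gan_loss, diffusion_loss, background_loss,
                       w_gan=1.0, w_diff=1.0, w_bg=1.0):
    """Weighted sum of the three discriminator-training losses.
    The weights are hypothetical placeholders; the ablation shows
    that dropping any term degrades all metrics."""
    return (w_gan * gan_loss
            + w_diff * diffusion_loss
            + w_bg * background_loss)
```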

Conclusion

TurboFill demonstrates that a few‑step diffusion backbone equipped with a repair adapter and a three‑step adversarial training scheme can achieve state‑of‑the‑art image inpainting with dramatically lower inference cost. The introduced DilationBench and HumanBench provide reliable evaluation for future fast‑inpainting research.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Computer Vision, diffusion models, image inpainting, few‑step generation, TurboFill
Written by

AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
