PixelHacker: Diffusion‑Based Image Inpainting with Latent Class Guidance Beats SOTA

PixelHacker introduces a latent class guidance (LCG) paradigm that injects foreground and background embeddings into a diffusion model, training on 14 million image‑mask pairs and achieving state‑of‑the‑art structural and semantic consistency across Places2, CelebA‑HQ and FFHQ benchmarks.


Highlights

Introduces Latent Class Guidance (LCG), a simple yet effective paradigm that encodes foreground and background features with two fixed‑size embeddings and intermittently injects them via linear attention during denoising.

Proposes PixelHacker, a diffusion‑based inpainting model trained on 14 M image‑mask pairs, fine‑tuned on open‑source benchmarks and consistently outperforming existing SOTA methods in structural and semantic consistency.

PixelHacker achieves superior results on Places2, CelebA‑HQ and FFHQ across multiple evaluation settings.

Problem Statement

Current image‑inpainting methods struggle with complex structures (textures, shapes, spatial relations) and semantics (color consistency, object restoration, logical correctness), often producing artifacts and incoherent generations.

Proposed Solution

We design a simple yet powerful inpainting paradigm, Latent Class Guidance (LCG), and build a diffusion model named PixelHacker on top of it.

Latent Class Guidance Construction

LCG uses only two broad categories—"foreground" and "background"—instead of explicit class labels. Masks are divided into four types (object semantic, scene semantic, random brush, random object) and assigned to the corresponding embeddings. The process is illustrated in Figure 3.
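As a rough illustration of this assignment, the snippet below maps each of the four mask types onto one of the two latent classes. The concrete mapping shown is a placeholder; the exact assignment follows Figure 3 of the paper.

```python
# Illustrative only: which mask type maps to "foreground" vs "background"
# follows Figure 3 in the paper; the assignment below is a placeholder.
MASK_TYPE_TO_CLASS = {
    "object_semantic": "foreground",   # placeholder assignment
    "random_object":   "foreground",   # placeholder assignment
    "scene_semantic":  "background",   # placeholder assignment
    "random_brush":    "background",   # placeholder assignment
}

def latent_class_id(mask_type: str) -> int:
    """Return 0 for the foreground embedding, 1 for the background embedding."""
    return 0 if MASK_TYPE_TO_CLASS[mask_type] == "foreground" else 1
```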

Figure 3: Mask assignment for LCG

During training, the model receives a noisy image, a clean mask, and the corresponding clean image, concatenates them, and encodes them with a VAE encoder into latent space. Two fixed‑size embeddings encode latent foreground and background features. Linear attention intermittently injects these embeddings into the denoising pipeline, enabling structural and semantic interaction.
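A minimal sketch of this conditioning path, assuming a PyTorch-style interface, is shown below. Module names, shapes, and the exact wiring are assumptions for illustration, not the released implementation; only the embedding dimension of 20 and the frozen VAE encoder come from the article.

```python
# Hedged sketch of the conditioning path described above.
import torch
import torch.nn as nn

class LCGCondition(nn.Module):
    def __init__(self, vae_encoder: nn.Module, embed_dim: int = 20):
        super().__init__()
        self.vae_encoder = vae_encoder       # frozen SDXL-VAE encoder in the paper
        # Two fixed-size embeddings: index 0 = foreground, index 1 = background.
        self.class_embed = nn.Embedding(2, embed_dim)

    def forward(self, noisy_image, mask, clean_image, class_id):
        # Concatenate the inputs along the channel dimension and encode to latents,
        # mirroring the description above (exact channel layout is an assumption).
        x = torch.cat([noisy_image, mask, clean_image], dim=1)
        latents = self.vae_encoder(x)
        # Look up the foreground/background guidance embedding for each sample.
        guidance = self.class_embed(class_id)  # (B, embed_dim)
        return latents, guidance
```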

Structural & Semantic Interaction

In each denoising step, gated linear attention (GLA) computes self‑attention on normalized latent features, followed by a standard transformer block (residual connection, layer‑norm, cross‑attention, MLP). The LCG embeddings are introduced via cross‑attention, allowing the latent features to decode jointly with the guidance. Once an embedding is injected, all subsequent decoding steps retain the guidance.
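A compressed sketch of one such interaction block, assuming PyTorch, follows. A softmax-free linear attention with a learned gate stands in for the paper's GLA; cross-attention consumes the LCG embedding, and an MLP closes the block, each step with residual connections and layer norm. Module names, shapes, and the simplified gating are assumptions, not the released code.

```python
import torch
import torch.nn as nn

class LCGBlock(nn.Module):
    def __init__(self, dim: int, guide_dim: int = 20, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        # Cross-attention from latent tokens (queries) to the LCG embedding.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, kdim=guide_dim,
                                                vdim=guide_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def linear_attention(self, x):
        # Kernelized (softmax-free) attention with a learned output gate,
        # a bare-bones stand-in for gated linear attention.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = q.softmax(dim=-1), k.softmax(dim=-2)
        ctx = k.transpose(1, 2) @ v            # (B, d, d)
        out = q @ ctx                          # (B, N, d)
        return self.proj(out * torch.sigmoid(self.gate(x)))

    def forward(self, latent, guidance):
        # latent: (B, N, dim) latent tokens; guidance: (B, M, guide_dim) LCG embedding.
        latent = latent + self.linear_attention(self.norm1(latent))
        latent = latent + self.cross_attn(self.norm2(latent), guidance, guidance)[0]
        latent = latent + self.mlp(self.norm3(latent))
        return latent
```

For example, LCGBlock(320)(torch.randn(2, 64, 320), torch.randn(2, 1, 20)) returns updated latent tokens of shape (2, 64, 320).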

GLA interaction diagram

Experiments

Dataset and Evaluation

We construct a 14 M image‑mask dataset with 116 foreground and 21 background categories using automatic annotation (AlphaCLIP + SAM) on COCONutLarge (360 k), Object365V2 (2.02 M), GoogleLandmarkV2 (4.13 M) and a self‑collected natural‑scene set (7.49 M). The final split contains 4.3 M foreground pairs and 9.7 M background pairs.

Implementation Details

Training uses 12 NVIDIA L40S GPUs, batch size 528, resolution 512×512, 200 k iterations on the full dataset. Fine‑tuning on Places2 (1.8 M images), CelebA‑HQ and FFHQ follows the protocols of prior works [14, 22, 25, 40]. AdamW (β₁=0.9, β₂=0.999) with a learning rate of 1e‑5 is employed. The SDXL‑VAE encoder is frozen.
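For reference, the optimizer setup with the reported hyperparameters looks as follows; the model object is a placeholder for the actual denoising network.

```python
import torch
import torch.nn as nn

# Placeholder for the PixelHacker denoising network; only the optimizer
# hyperparameters below are taken from the reported setup.
model = nn.Linear(320, 320)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,             # learning rate reported in the paper
    betas=(0.9, 0.999),  # β1, β2 reported in the paper
)
```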

Results on Places2

Following three evaluation settings from prior SOTA papers, PixelHacker achieves the best FID (8.59) and LPIPS (0.2026) among all methods, even surpassing SDXL‑Inpainting. The zero‑shot version also ranks second‑best in FID, demonstrating strong generalisation.

Quantitative comparison on Places2 (setting 1)

Results on CelebA‑HQ and FFHQ

PixelHacker consistently attains SOTA scores on both datasets, confirming its ability to transfer from natural scenes to facial imagery. Qualitative examples show clear, structurally coherent faces without artifacts.

Qualitative results on CelebA‑HQ
Qualitative results on FFHQ

Ablation Studies

Mask composition: Experiments varying the inclusion of object‑semantic, scene‑semantic, random‑brush and random‑object masks show that using all four yields the best performance on a 3 k‑image subset of Places2.

Embedding dimension: Comparing embedding dimensions of 20, 64, 256 and 1024 reveals no significant gain beyond 20, so a dimension of 20 is kept as the default.

Guidance scale: Guidance scales from 1.0 to 4.0 are tested on CelebA‑HQ; a scale of 2.0 provides the best trade‑off between fidelity and diversity.
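For context, a guidance scale in diffusion sampling is most commonly applied in the classifier-free-guidance style sketched below; whether PixelHacker uses exactly this formulation is an assumption, since the article only reports the scale sweep.

```python
import torch

def apply_guidance(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
                   scale: float = 2.0) -> torch.Tensor:
    """Blend unconditional and LCG-conditioned noise predictions.

    scale = 1.0 reduces to the purely conditional prediction; 2.0 is the value
    reported as the best trade-off. The CFG-style formula itself is an assumption.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)
```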

Conclusion

We present Latent Class Guidance (LCG), a straightforward paradigm that injects foreground and background embeddings into a diffusion model to achieve structural and semantic consistency. PixelHacker, built on LCG and trained on 14 M image‑mask pairs, consistently outperforms existing SOTA methods across multiple benchmarks, demonstrating the effectiveness and data efficiency of the proposed approach.

Tags: computer vision, Diffusion Models, image inpainting, SOTA, latent class guidance, PixelHacker
Written by AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
