PixelHacker: Diffusion‑Based Image Inpainting with Latent Class Guidance Beats SOTA
PixelHacker introduces a latent class guidance (LCG) paradigm that injects foreground and background embeddings into a diffusion model, training on 14 million image‑mask pairs and achieving state‑of‑the‑art structural and semantic consistency across Places2, CelebA‑HQ and FFHQ benchmarks.
Highlights
Introduces Latent Class Guidance (LCG), a simple yet effective paradigm that encodes foreground and background features with two fixed‑size embeddings and intermittently injects them via linear attention during denoising.
Proposes PixelHacker, a diffusion‑based inpainting model trained on 14 M image‑mask pairs and fine‑tuned on open‑source benchmarks; it consistently outperforms existing SOTA methods in structural and semantic consistency.
PixelHacker achieves superior results on Places2, CelebA‑HQ and FFHQ across multiple evaluation settings.
Problem Statement
Current image‑inpainting methods struggle with complex structures (textures, shapes, spatial relations) and semantics (color consistency, object restoration, logical correctness), often producing artifacts and incoherent generations.
Proposed Solution
We design a simple yet powerful inpainting paradigm called Latent Class Guidance (LCG) and build PixelHacker, a diffusion model that leverages it.
Latent Class Guidance Construction
LCG uses only two broad categories—"foreground" and "background"—instead of explicit class labels. Masks are divided into four types (object semantic, scene semantic, random brush, random object) and assigned to the corresponding embeddings. The process is illustrated in Figure 3.
During training, the model receives a noisy image, a clean mask, and the corresponding clean image, concatenates them, and encodes them with a VAE encoder into latent space. Two fixed‑size embeddings encode latent foreground and background features. Linear attention intermittently injects these embeddings into the denoising pipeline, enabling structural and semantic interaction.
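The two‑embedding conditioning described above can be sketched as follows. This is a hypothetical illustration, not the authors' code: the mask‑type‑to‑class mapping and the helper names are assumptions, and the embedding dimension of 20 matches the ablation default reported later.

```python
import numpy as np

# Illustrative sketch of LCG's two-class conditioning: each of the four mask
# types maps to one of two fixed-size embeddings (foreground or background).
EMBED_DIM = 20
rng = np.random.default_rng(0)
class_embeddings = rng.standard_normal((2, EMBED_DIM))  # row 0: fg, row 1: bg

MASK_TYPE_TO_CLASS = {
    "object_semantic": 0,  # foreground
    "random_object": 0,    # foreground (assumed assignment)
    "scene_semantic": 1,   # background
    "random_brush": 1,     # background (assumed assignment)
}

def lcg_embedding(mask_type: str, batch_size: int) -> np.ndarray:
    """Look up the class embedding for a batch of masks of one type."""
    row = class_embeddings[MASK_TYPE_TO_CLASS[mask_type]]
    return np.tile(row, (batch_size, 1))  # (B, EMBED_DIM)

emb = lcg_embedding("object_semantic", batch_size=4)
print(emb.shape)  # (4, 20)
```

In training these embeddings would be learned jointly with the denoiser and injected into the latent stream via attention, as the next section describes.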
Structural & Semantic Interaction
In each denoising step, gated linear attention (GLA) computes self‑attention on normalized latent features, followed by a standard transformer block (residual connection, layer‑norm, cross‑attention, MLP). The LCG embeddings are introduced via cross‑attention, allowing the latent features to decode jointly with the guidance. Once an embedding is injected, all subsequent decoding steps retain the guidance.
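One interaction step can be sketched numerically. This is a minimal approximation, not the paper's architecture: it uses plain (ungated) linear attention with identity projections instead of GLA, and a single‑block residual layout, purely to show how latent tokens first self‑attend and then cross‑attend to the LCG embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention(x):
    """Linear self-attention via the kernel trick phi(Q) (phi(K)^T V),
    with phi = elu + 1; O(N) in sequence length. Q/K/V projections are
    omitted (identity) to keep the sketch minimal."""
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))  # elu(t) + 1
    q, k, v = phi(x), phi(x), x
    kv = k.T @ v                   # (D, D)
    z = q @ k.sum(axis=0)          # (N,) normalizer
    return (q @ kv) / z[:, None]

def cross_attention(x, guidance):
    """Latent tokens attend to the LCG class embedding."""
    scores = softmax(x @ guidance.T / np.sqrt(x.shape[-1]))
    return scores @ guidance

def interaction_block(latent, guidance):
    """Linear self-attention, then cross-attention with the guidance,
    each applied to normalized features with a residual connection."""
    norm = lambda t: (t - t.mean(-1, keepdims=True)) / (t.std(-1, keepdims=True) + 1e-5)
    h = latent + linear_attention(norm(latent))
    h = h + cross_attention(norm(h), guidance)
    return h

rng = np.random.default_rng(1)
latent = rng.standard_normal((16, 20))   # 16 latent tokens, dim 20
guidance = rng.standard_normal((1, 20))  # one LCG class embedding
out = interaction_block(latent, guidance)
print(out.shape)  # (16, 20)
```

The real model adds the MLP sub‑block and repeats this interaction across denoising steps, so that once injected, the guidance persists through all subsequent decoding.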
Experiments
Dataset and Evaluation
We construct a 14 M image‑mask dataset with 116 foreground and 21 background categories using automatic annotation (AlphaCLIP + SAM) on COCONutLarge (360 k), Object365V2 (2.02 M), GoogleLandmarkV2 (4.13 M) and a self‑collected natural‑scene set (7.49 M). The final split contains 4.3 M foreground pairs and 9.7 M background pairs.
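The quoted source sizes and the foreground/background split can be cross‑checked; both sum to the stated 14 M total (values in millions):

```python
# Sanity check on the dataset totals quoted above (values in millions).
sources = {"COCONutLarge": 0.36, "Object365V2": 2.02,
           "GoogleLandmarkV2": 4.13, "self_collected": 7.49}
total = sum(sources.values())
print(round(total, 2))  # 14.0

fg, bg = 4.3, 9.7  # foreground / background pairs, in millions
assert abs((fg + bg) - total) < 0.05
```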
Implementation Details
Training uses 12 NVIDIA L40S GPUs, batch size 528, resolution 512×512, 200 k iterations on the full dataset. Fine‑tuning on Places2 (1.8 M images), CelebA‑HQ and FFHQ follows the protocols of prior works [14, 22, 25, 40]. AdamW (β₁=0.9, β₂=0.999) with a learning rate of 1e‑5 is employed. The SDXL‑VAE encoder is frozen.
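For reference, the training setup condenses to the configuration below. The dict schema is illustrative (not the authors' actual config format); note that the global batch of 528 divides evenly across the 12 GPUs.

```python
# Hypothetical training configuration mirroring the details above.
config = {
    "gpus": 12,                  # NVIDIA L40S
    "batch_size": 528,           # global batch across all GPUs
    "resolution": (512, 512),
    "iterations": 200_000,
    "optimizer": "AdamW",
    "betas": (0.9, 0.999),
    "lr": 1e-5,
    "vae": "SDXL-VAE (frozen)",
}
per_gpu = config["batch_size"] // config["gpus"]
print(per_gpu)  # 44
```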
Results on Places2
Following the three evaluation settings used by prior SOTA papers, PixelHacker achieves the best FID (8.59) and LPIPS (0.2026) among all compared methods, even surpassing SDXL‑Inpainting. The zero‑shot variant also ranks second in FID, demonstrating strong generalization.
Results on CelebA‑HQ and FFHQ
PixelHacker consistently attains SOTA scores on both datasets, confirming its ability to transfer from natural scenes to facial imagery. Qualitative examples show clear, structurally coherent faces without artifacts.
Ablation Studies
Mask composition: Experiments varying the inclusion of object‑semantic, scene‑semantic, random‑brush and random‑object masks show that using all four yields the best performance on a 3 k‑image subset of Places2.
Embedding dimension: Comparing dimensions 20, 64, 256, 1024 reveals no significant gain beyond 20, so E.Dim = 20 is kept as default.
Guidance scale: Guiding scales from 1.0 to 4.0 are tested on CelebA‑HQ; scale 2.0 provides the optimal trade‑off between fidelity and diversity.
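Assuming the guidance scale follows the standard classifier‑free guidance formulation (a common convention for diffusion models, though the post does not specify), the fidelity/diversity trade‑off it controls can be sketched as:

```python
import numpy as np

def apply_guidance(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one. scale=1.0 recovers the plain
    conditional prediction; larger scales strengthen the conditioning at
    the cost of diversity (2.0 was optimal in the ablation above)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.zeros(4)  # toy unconditional prediction
eps_c = np.ones(4)   # toy conditional prediction
print(apply_guidance(eps_u, eps_c, 2.0))  # [2. 2. 2. 2.]
```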
Conclusion
We present Latent Class Guidance (LCG), a straightforward paradigm that injects foreground and background embeddings into a diffusion model to achieve structural and semantic consistency. PixelHacker, built on LCG and trained on 14 M image‑mask pairs, consistently outperforms existing SOTA methods across multiple benchmarks, demonstrating the effectiveness and data efficiency of the proposed approach.
AIWalker
AIWalker covers computer vision, image processing, color science, and AI algorithms, sharing in‑depth technical analysis and engineering practice.
