IMAGPose: A Unified Conditional Framework for Photo‑Realistic Pose‑Guided Person Generation (NeurIPS 2024)

IMAGPose introduces a unified conditional diffusion framework that combines feature‑level, image‑level, and cross‑view attention modules to generate high‑fidelity, photo‑realistic person images under diverse pose and multi‑view scenarios, outperforming prior SOTA methods on DeepFashion and Market‑1501.

AIWalker

Introduction

Pose‑guided person image generation aims to transform a source person image into a target image with a specified pose while preserving appearance. Applications include virtual reality, film production, and e‑commerce, and the generated images can boost downstream tasks such as person re‑identification.

Method Overview

We propose IMAGPose, a unified conditional diffusion framework that addresses two overlooked user scenarios: (1) generating multiple target images with different poses from a single source, and (2) generating target images from multiple source views. IMAGPose consists of three core modules:

Feature‑Level Conditioning (FLC): combines low‑level texture features from a frozen Variational Auto‑Encoder (VAE) encoder with high‑level semantic features from a frozen image encoder. A learnable tokenizer (2‑D convolutions followed by flattening) aligns token dimensions, and the concatenated features form a rich appearance representation.
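To make the tokenizer step concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the 2‑D convolution is non‑overlapping (kernel size equals stride), that both feature streams are aligned to the same token width, and that concatenation happens along the token axis. All shapes, the `patch` size, and the name `conv_tokenizer` are hypothetical.

```python
import numpy as np

def conv_tokenizer(feature_map, weight, patch):
    """Non-overlapping 2-D convolution (kernel = stride = patch), then flatten.

    feature_map: (C, H, W), weight: (D, C, patch, patch).
    Returns tokens of shape (H // patch * W // patch, D).
    """
    c, h, w = feature_map.shape
    d = weight.shape[0]
    gh, gw = h // patch, w // patch
    # Cut the map into a (gh, gw) grid of patches, each flattened to C*patch*patch.
    patches = feature_map.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
    patches = patches.reshape(gh * gw, c * patch * patch)
    # A conv with kernel = stride is a linear projection of each patch.
    return patches @ weight.reshape(d, -1).T

rng = np.random.default_rng(0)
vae_latent = rng.standard_normal((4, 64, 64))      # hypothetical VAE encoder output
image_tokens = rng.standard_normal((257, 32))      # hypothetical image-encoder tokens
weight = rng.standard_normal((32, 4, 8, 8)) * 0.01 # stands in for learned conv weights

texture_tokens = conv_tokenizer(vae_latent, weight, patch=8)
# Assumption: the two streams are concatenated along the token axis.
appearance = np.concatenate([texture_tokens, image_tokens], axis=0)
print(appearance.shape)  # (321, 32)
```

The point of the sketch is only that the convolution maps the VAE feature map to tokens of the same width as the image‑encoder tokens, so the two can be concatenated into one conditioning sequence.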

Image‑Level Conditioning (ILC): injects a variable number of source‑image conditions and employs a mask strategy to align images and poses. Masks are binary (0 = masked, 1 = unmasked) and the input channel count becomes 9 (4 + 4 + 1). This enables flexible handling of single‑source/multi‑pose or multi‑source/single‑pose settings.
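The nine‑channel input can be sketched as follows. The article gives only the count 4 + 4 + 1. Reading it as 4 noisy latent channels, 4 masked image‑latent channels, and 1 mask channel is an assumption borrowed from inpainting‑style Stable Diffusion inputs, and `build_unet_input` is a hypothetical helper name.

```python
import numpy as np

def build_unet_input(noisy_latent, image_latent, mask):
    """Stack the denoiser input along the channel axis: 4 + 4 + 1 = 9.

    noisy_latent: (4, H, W), image_latent: (4, H, W), mask: (1, H, W).
    mask uses the article's convention: 0 = masked, 1 = unmasked.
    """
    masked_latent = image_latent * mask  # zero out the masked region
    return np.concatenate([noisy_latent, masked_latent, mask], axis=0)

rng = np.random.default_rng(0)
h, w = 64, 64
noisy = rng.standard_normal((4, h, w))
latent = rng.standard_normal((4, h, w))

# Hypothetical layout: left half visible (source), right half masked (target).
mask = np.ones((1, h, w))
mask[:, :, w // 2:] = 0.0

x = build_unet_input(noisy, latent, mask)
print(x.shape)  # (9, 64, 64)
```

Changing which regions of the mask are 0 or 1 is what would let the same network cover single‑source/multi‑pose and multi‑source/single‑pose settings without architectural changes.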

Cross‑View Attention (CVA): decomposes global and local cross‑attention. After the standard cross‑attention in the denoising UNet, CVA splits the feature map into four local person patches, adds a temporal dimension, learns attention across patches, and merges them back, ensuring local fidelity and global consistency.
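The split, attend, and merge steps can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the paper's code: the four patches are taken to be the 2×2 quadrants of the feature map, the "temporal dimension" is modeled as a patch axis of length 4, and the attention uses identity projections in place of learned query/key/value weights. All function names are hypothetical.

```python
import numpy as np

def split_quadrants(x):
    """(C, H, W) -> (4, C, H/2, W/2), patches ordered row by row."""
    c, h, w = x.shape
    x = x.reshape(c, 2, h // 2, 2, w // 2)
    return x.transpose(1, 3, 0, 2, 4).reshape(4, c, h // 2, w // 2)

def merge_quadrants(p):
    """Inverse of split_quadrants: (4, C, H/2, W/2) -> (C, H, W)."""
    _, c, h2, w2 = p.shape
    p = p.reshape(2, 2, c, h2, w2).transpose(2, 0, 3, 1, 4)
    return p.reshape(c, 2 * h2, 2 * w2)

def attend_across_patches(p):
    """Softmax attention over the patch axis, independently per spatial position."""
    n, c, h2, w2 = p.shape
    tokens = p.transpose(2, 3, 0, 1).reshape(h2 * w2, n, c)   # (positions, 4, C)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(c)  # (positions, 4, 4)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ tokens                                    # (positions, 4, C)
    return out.reshape(h2, w2, n, c).transpose(2, 3, 0, 1)

def cross_view_attention(x):
    return merge_quadrants(attend_across_patches(split_quadrants(x)))

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16, 16))
out = cross_view_attention(features)
print(out.shape)  # (8, 16, 16)
```

Each spatial position in one patch attends to the same position in the other three patches, which is one simple way to let the views exchange information before being merged back into a single map.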

The model is trained with classifier‑free guidance [15]; the guidance scale γ is applied at sampling time. The training loss is the diffusion objective conditioned on F, I, and P, which denote the features from FLC, ILC, and the pose encoder, respectively.
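The article describes the loss only in words. A plausible form, assuming the standard latent‑diffusion noise‑prediction objective, is sketched below. The symbols z_t (noisy latent at timestep t), ε (sampled Gaussian noise), and ε_θ (the denoising UNet) are assumptions not defined in the article; F, I, and P are the conditions named above.

```latex
\mathcal{L} \;=\; \mathbb{E}_{z_0,\; F,\; I,\; P,\; \epsilon \sim \mathcal{N}(0, \mathbf{1}),\; t}
\Big[ \big\lVert \epsilon - \epsilon_\theta(z_t,\, t,\, F,\, I,\, P) \big\rVert_2^2 \Big]
```

Here \mathbf{1} denotes the identity covariance, written this way to avoid a clash with the ILC feature symbol I.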

Experiments

Datasets and Metrics

We evaluate on DeepFashion (52,712 high‑resolution fashion images) and Market‑1501 (32,668 low‑resolution images), extracting pose skeletons with OpenPose. Objective metrics: SSIM, LPIPS, and FID. Subjective metrics: R2G (real‑to‑generated), G2R (generated‑to‑real), and Jab (percentage of images judged superior).

Implementation Details

Experiments run on eight NVIDIA V100 GPUs. We start from pretrained Stable Diffusion V1.5, modify the first convolution to accept nine channels, and use Dinov2‑G/14 as the image encoder. Training uses AdamW with learning rate 1e‑4 for 300k steps, batch size 4, and 1000 diffusion timesteps. Inference employs a 20‑step DDIM sampler with guidance scale 2.0.
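The inference settings can be illustrated with a short sketch of the usual classifier‑free guidance combination and a 20‑step subset of the 1000 training timesteps. The guidance formula is the standard one. The evenly spaced timestep schedule is one common choice and an assumption here, since the article does not say which spacing is used; both function names are hypothetical.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale=2.0):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def ddim_timesteps(num_train_steps=1000, num_inference_steps=20):
    """Evenly spaced timesteps from high noise to low noise (assumed spacing)."""
    steps = np.linspace(0, num_train_steps - 1, num_inference_steps)
    return steps.round().astype(int)[::-1]

rng = np.random.default_rng(0)
eps_u = rng.standard_normal((4, 64, 64))  # stand-in for the unconditional UNet output
eps_c = rng.standard_normal((4, 64, 64))  # stand-in for the conditional UNet output

eps_hat = guided_noise(eps_u, eps_c, scale=2.0)
timesteps = ddim_timesteps()
print(len(timesteps), timesteps[0], timesteps[-1])  # 20 999 0
```

With a scale of 1.0 the result reduces to the conditional prediction, and larger scales push the sample further toward the conditions at some cost in diversity.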

Quantitative Results

Table 1 (DeepFashion) shows IMAGPose surpassing all baselines, including GAN‑based ADGAN and diffusion‑based CFLD, especially on SSIM where it outperforms ADGAN by a large margin. On Market‑1501, IMAGPose again leads in SSIM, LPIPS, and FID, beating NTED thanks to VAE‑derived texture details and outperforming PCDMs despite their refinement stage.

Qualitative Results

Figure 5 compares IMAGPose with SOTA methods on challenging pose changes. Diffusion‑based baselines capture coarse clothing texture, but IMAGPose preserves finer details. In extreme pose transformations, competing methods hallucinate artifacts (e.g., misplaced hands), whereas IMAGPose maintains correct alignment and realistic textures.

User Study

A study with 50 volunteers evaluated R2G, G2R, and Jab. Participants rated IMAGPose‑generated images as real 18.4 % more often than the second‑best model, and IMAGPose achieved a Jab score of 42.3 %.

Consistency Across Scenarios

We test three inference settings: T1 (single source, single pose), T2 (single source, multiple poses), and T3 (multiple sources, single pose). All three achieve competitive metrics, with T3 reaching SSIM 0.7727, LPIPS 0.1172, and FID 5.33, demonstrating that a single training run supports diverse user needs. Speed tests show IMAGPose is ~8× faster than PIDM and ~3× faster than PoCoLD/CFLD.

Ablation Study

Table 2 evaluates module contributions. B0 is the baseline without CVA; adding CVA (B1) improves SSIM, LPIPS, and FID. Incorporating VAE texture features (B2) further raises SSIM by 0.0125, and adding ILC (B3) yields another 0.0172 gain. The full combination (B4) attains the best scores, confirming that each module adds complementary benefits.

Model Variants

Table 3 explores different backbone diffusion models and image encoders. Switching the backbone from Stable Diffusion V1.5 to V2.1 yields modest gains, while Dinov2‑G/14 slightly outperforms other encoders, indicating that the core diffusion model and encoder have limited impact compared to the conditioning modules.

Downstream Application

We assess the utility of IMAGPose‑generated images for person re‑identification. By augmenting Market‑1501 with synthetic images, BoT‑based re‑ID improves its rank‑1 accuracy, demonstrating that the generated data are beneficial for downstream tasks.

Conclusion

IMAGPose presents a unified conditional diffusion framework that integrates feature‑level, image‑level, and cross‑view attention conditioning to generate photo‑realistic person images under multiple poses and view configurations. Extensive quantitative and qualitative evaluations confirm its superiority over existing SOTA methods, its consistency across diverse scenarios, and its practical value for downstream vision tasks.

Figures

IMAGPose overview
Framework diagram
Cross‑View Attention
Quantitative results
Qualitative comparison
User study results
Scenario comparison
Speed and performance
Consistency visualization