DynamicFace: Controllable High‑Quality Face Swapping for Images and Video

DynamicFace introduces a diffusion‑based framework that explicitly decouples identity, pose, expression, illumination and background using composable 3D facial priors, achieving superior identity preservation, motion consistency and visual fidelity in both image and video face‑swapping tasks.


Face swapping is one of the most challenging tasks in AI‑generated video because human perception of faces is extremely sensitive to subtle changes in expression, pose, lighting and background. Existing methods often entangle identity and motion, leading to identity leakage, expression distortion, or temporal flickering.

Key Challenges

Spatial‑temporal modeling conflict: Image‑centric models excel at spatial identity extraction but couple motion information with target identity, causing inaccurate motion when extended to video diffusion.

Reduced identity consistency: Large pose or expression variations easily deform facial regions, breaking unique identity cues.

Overall video quality degradation: Post‑processing tools can repair key‑frame details but usually break visual continuity across frames.

DynamicFace Architecture

DynamicFace combines a diffusion model with composable 3D facial priors and introduces four fine‑grained, decoupled condition maps: a shape‑pose normal map (source identity shape rendered under the target pose), expression, illumination, and background. The conditions are extracted as follows, with a short sketch of the illumination and background conditions after the list:

Identity shape parameters α are obtained from a 3DMM reconstruction of the source image.

Pose β and expression θ are extracted frame‑wise from the target video.

Illumination is derived from a blurred UV texture, keeping only low‑frequency lighting.

Background is represented by an occlusion‑aware mask and random spatial offsets.
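
While the geometric conditions require a 3DMM fitting and rendering stack, the illumination and background conditions reduce to simple image operations. A minimal PyTorch sketch, assuming a heavy Gaussian blur stands in for the paper's low‑frequency UV blur and that an occlusion‑aware face mask has already been computed:

```python
import torch
from torchvision.transforms.functional import gaussian_blur

def illumination_condition(uv_texture: torch.Tensor) -> torch.Tensor:
    """Keep only low-frequency lighting by heavily blurring the UV texture.
    Kernel size and sigma are illustrative assumptions; the paper only
    specifies a blur. uv_texture: (B, 3, H, W) face texture in UV space."""
    return gaussian_blur(uv_texture, kernel_size=31, sigma=15.0)

def background_condition(target_frame: torch.Tensor,
                         face_mask: torch.Tensor) -> torch.Tensor:
    """Zero out the occlusion-aware face region so only background remains.
    target_frame: (B, 3, H, W); face_mask: (B, 1, H, W), 1 inside the face."""
    return target_frame * (1.0 - face_mask)
```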

All four condition maps are fed in parallel to a Mixture‑of‑Guiders module (three 3×3 convolutions followed by a zero‑conv layer). A FusionNet merges the condition features before injecting them into the diffusion backbone, preserving the pretrained Stable Diffusion prior while enabling precise control.
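A minimal sketch of this guider design, assuming ControlNet‑style zero initialization and illustrative channel widths (the exact sizes are not given here); FusionNet is reduced to a single zero‑conv merge for brevity:

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so each condition starts as a
    no-op and the pretrained Stable Diffusion prior is preserved."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class Guider(nn.Module):
    """One lightweight guider per condition: three 3x3 convs + a zero-conv.
    Hidden/output widths are illustrative assumptions."""
    def __init__(self, in_ch=3, hidden=64, out_ch=320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, out_ch, 3, padding=1), nn.SiLU(),
            zero_conv(out_ch),
        )

    def forward(self, x):
        return self.net(x)

class MixtureOfGuiders(nn.Module):
    """Run the four condition maps through parallel guiders, then merge the
    features (here by summation + zero-conv, standing in for FusionNet)."""
    def __init__(self, num_conditions=4, out_ch=320):
        super().__init__()
        self.guiders = nn.ModuleList(Guider(out_ch=out_ch)
                                     for _ in range(num_conditions))
        self.fusion = zero_conv(out_ch)

    def forward(self, conditions):  # list of (B, 3, H, W) condition maps
        fused = sum(g(c) for g, c in zip(self.guiders, conditions))
        return self.fusion(fused)   # injected into the diffusion backbone
```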

Identity‑Detail Dual‑Stream Injection

To guarantee high‑fidelity identity retention, DynamicFace uses two parallel streams (a minimal sketch of the identity stream follows the list):

Identity stream (Face Former): A 512‑dimensional ID embedding from ArcFace is transformed into learnable query tokens that attend to every U‑Net layer via cross‑attention, ensuring global identity consistency.

Detail stream (ReferenceNet): A trainable replica of the U‑Net receives the 512×512 source latent via spatial‑attention, injecting fine‑grained texture into the main network.
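
A minimal sketch of the identity stream, assuming learnable query tokens that cross‑attend to the projected ArcFace embedding; the token count and dimensions are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FaceFormer(nn.Module):
    """Expand a 512-d ArcFace ID embedding into learnable query tokens that
    every U-Net layer can cross-attend to for global identity consistency."""
    def __init__(self, id_dim=512, num_tokens=16, token_dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.id_proj = nn.Linear(id_dim, token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads=8,
                                          batch_first=True)

    def forward(self, id_embedding):               # (B, 512) from ArcFace
        kv = self.id_proj(id_embedding).unsqueeze(1)        # (B, 1, D)
        q = self.queries.unsqueeze(0).expand(id_embedding.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)                    # (B, 16, D)
        return tokens  # consumed by the U-Net's cross-attention layers
```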

Plug‑and‑Play Temporal Consistency Module (FusionTVO)

During training, a temporal attention layer improves frame‑wise stability, but long‑video generation still suffers from jitter. FusionTVO splits the video into overlapping segments, assigns fusion weights to each segment, and blends the overlapping regions. A total‑variation loss in latent space suppresses unnecessary inter‑frame fluctuations, while background latents from the target image replace non‑facial regions at each denoising step.
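
A sketch of the two core FusionTVO ingredients under simplifying assumptions: a latent‑space total‑variation penalty along the time axis, and a linear cross‑fade as one possible choice of per‑segment fusion weights for the overlapping regions:

```python
import torch

def tv_loss(latents: torch.Tensor) -> torch.Tensor:
    """Total-variation penalty along time to suppress inter-frame jitter.
    latents: (T, C, H, W) sequence of denoised latents."""
    return (latents[1:] - latents[:-1]).abs().mean()

def blend_segments(segments, overlap: int) -> torch.Tensor:
    """Blend overlapping latent segments with a linear cross-fade.
    segments: list of (T, C, H, W) tensors sharing `overlap` frames;
    the linear ramp is an illustrative weighting, not the paper's exact one."""
    out = segments[0]
    for seg in segments[1:]:
        w = torch.linspace(0, 1, overlap).view(-1, 1, 1, 1)
        mixed = (1 - w) * out[-overlap:] + w * seg[:overlap]
        out = torch.cat([out[:-overlap], mixed, seg[overlap:]], dim=0)
    return out
```

In the full method this blending happens within the denoising loop, where non‑facial regions are additionally overwritten with the target's background latents at each step.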

Quantitative Evaluation

The authors evaluated DynamicFace on FaceForensics++ (FF++) and FFHQ, comparing against six state‑of‑the‑art face‑swapping methods: Deepfakes, FaceShifter, MegaFS, SimSwap, DiffSwap and Face Adapter. For each test video, 10 random frames were sampled for image‑level metrics and 60 consecutive frames for video‑level metrics, all at 512×512 resolution using official weights or public inference scripts.

DynamicFace achieved the highest scores on both identity retrieval and mouth‑and‑eye motion consistency, demonstrating superior identity preservation and accurate motion reconstruction.

Ablation Studies

Four condition types (background, expression, illumination, shape‑pose normal map) were individually removed from the full model. Results on FF++ showed that each condition contributes uniquely: background ensures environmental consistency, expression locks micro‑movements, illumination maintains lighting harmony, and shape‑pose maps guarantee geometric fidelity. Removing any condition caused measurable degradation in the corresponding metric.

Additional ablations confirmed the necessity of both the Face Former and ReferenceNet modules—joint inclusion markedly improves identity injection performance. Likewise, the motion module and FusionTVO each provide clear gains in temporal consistency and overall video quality.

Qualitative Results

Visual comparisons illustrate that DynamicFace preserves facial shape, texture, expression and pose while maintaining background continuity. GAN‑based baselines produce blurry or identity‑drifted results; diffusion‑based baselines achieve higher resolution but suffer from inconsistent motion. DynamicFace’s fine‑grained condition injection yields coherent expression, eye‑gaze and pose across frames.

Further examples show post‑processing applications such as identity‑preserving body‑driven generation, where DynamicFace significantly improves face‑ID consistency and expression control.

DynamicFace therefore provides a scalable, controllable face‑swapping solution that outperforms current SOTA in identity fidelity, motion consistency and overall video quality, and its modular condition design offers a promising direction for future controllable AIGC research.


Tags: Diffusion Models · Face Swapping · Video Synthesis · Controllable Generation · 3D Facial Priors
Written by AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.