How DynamicFace Achieves High‑Quality, Consistent Face Swaps in Images and Video

DynamicFace introduces a novel face‑swapping framework that combines diffusion models with composable 3D facial priors. By explicitly decoupling identity, pose, expression, lighting, and background, it achieves superior identity preservation and motion consistency across both images and videos, as demonstrated by extensive qualitative and quantitative comparisons with state‑of‑the‑art methods.


Overview

DynamicFace is a new face‑swapping method that targets both image and video generation. It leverages diffusion models together with composable 3D facial priors to produce high‑quality results while maintaining strict consistency of identity and motion.

Key Challenges in Face Video Generation

Spatial‑Temporal Modeling Conflict: Existing models excel at spatial feature extraction for identity preservation but often entangle motion information with identity features, causing motion errors that amplify over video frames.

Identity Consistency Degradation: Large or rapid motions lead to facial deformation and loss of unique identity traits, reducing recognizability.

Overall Video Quality Loss: State‑of‑the‑art models still rely on post‑processing face‑swap tools that repair details but break visual continuity, producing a fragmented appearance.

Methodology

The core idea is to explicitly decompose facial conditioning into five independent representations: identity, pose, expression, illumination, and background. These are obtained via a 3D Morphable Model (3DMM) reconstruction pipeline.

Explicit Conditional Decoupling with Composable 3D Priors

For each source image, the identity shape parameters α are extracted. For each target video frame, pose β and expression θ are estimated. A shape‑pose normal map is rendered to suppress target identity leakage while preserving the source identity. Expression is derived from 2‑D keypoints, retaining only eyebrow, eye and lip motion priors. Illumination is obtained from a blurred UV texture, keeping low‑frequency lighting. Background conditioning uses an occlusion‑aware mask with random spatial shifts to align the target face during training and inference.
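To make the decomposition concrete, the following is a minimal sketch of how the four conditioning maps could be assembled once the 3DMM‑derived inputs (normal map, landmark map, UV texture, face mask) are available. All function and parameter names are illustrative assumptions, and the random‑shift augmentation is one plausible reading of the description above, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def assemble_conditions(normal_map, expr_landmark_map, uv_texture, frame, face_mask,
                        training=True, max_shift=8):
    """All inputs are (B, C, H, W) tensors in [0, 1]; face_mask is 1 on the face region."""
    # Illumination prior: keep only low-frequency lighting by heavily blurring
    # the UV texture (a box blur stands in for the blur used in the paper).
    lighting = F.avg_pool2d(uv_texture, kernel_size=15, stride=1, padding=7)

    # Background prior: remove the face region with the occlusion-aware mask.
    background = frame * (1.0 - face_mask)
    if training:
        # Hypothetical random spatial shift of the background crop, standing in
        # for the "random spatial shifts" mentioned above (assumption).
        dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
        dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
        background = torch.roll(background, shifts=(dy, dx), dims=(2, 3))

    # The four composable conditions later fused by FusionNet.
    return normal_map, expr_landmark_map, lighting, background
```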

The four resulting conditions (the shape‑pose normal map, expression landmarks, illumination map, and background) are fused by a lightweight FusionNet and injected into the diffusion backbone through a Mixture‑of‑Guiders module, in which each guider consists of a 3×3 convolution followed by a zero‑initialized convolution. This preserves the pretrained Stable Diffusion prior while enabling precise control.
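As a sketch of this injection mechanism, the module below implements one plausible reading of a guider (a 3×3 convolution followed by a zero‑initialized convolution) and a mixture that sums the per‑condition residuals into a U‑Net feature map, in the spirit of ControlNet‑style zero convolutions. Channel sizes, the activation, and the summation are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class Guider(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.SiLU()
        # Zero-initialized projection: at the start of training the guider
        # contributes nothing, so the pretrained diffusion prior is preserved.
        self.zero_conv = nn.Conv2d(out_ch, out_ch, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, cond):
        return self.zero_conv(self.act(self.conv(cond)))

class MixtureOfGuiders(nn.Module):
    """One guider per condition; their residuals are summed into a U-Net feature.
    Assumes the condition maps have been resized to the feature resolution."""
    def __init__(self, cond_channels=(3, 3, 3, 3), out_ch=320):
        super().__init__()
        self.guiders = nn.ModuleList([Guider(c, out_ch) for c in cond_channels])

    def forward(self, conditions, unet_feature):
        residual = sum(g(c) for g, c in zip(self.guiders, conditions))
        return unet_feature + residual
```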

Identity‑Detail Dual‑Stream Injection

The architecture contains two parallel streams:

Identity Stream: A Face Former extracts an ID embedding via ArcFace, which is then injected into every U‑Net layer through cross‑attention with learnable query tokens, ensuring global identity consistency.

Detail Texture Stream: ReferenceNet, a trainable replica of the U‑Net, processes the 512×512 source latent and injects its features into the main U‑Net through spatial attention, enabling fine‑grained texture transfer (a minimal sketch of the identity stream follows below).
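The snippet below sketches only the identity stream: a small set of learnable query tokens attends to a projected ArcFace embedding, producing tokens that the U‑Net's cross‑attention layers can consume. The token count, dimensions, and single‑token key/value layout are illustrative assumptions, not the Face Former's actual architecture.

```python
import torch
import torch.nn as nn

class IdentityTokenizer(nn.Module):
    def __init__(self, id_dim=512, token_dim=768, num_tokens=4, num_heads=8):
        super().__init__()
        # Learnable query tokens that extract identity information.
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.02)
        self.id_proj = nn.Linear(id_dim, token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)

    def forward(self, id_embedding):                 # (B, id_dim) from ArcFace
        kv = self.id_proj(id_embedding).unsqueeze(1)                  # (B, 1, D)
        q = self.queries.unsqueeze(0).expand(id_embedding.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)                              # (B, num_tokens, D)
        return tokens  # consumed by the U-Net cross-attention layers

# The detail-texture stream (ReferenceNet, a trainable replica of the U-Net)
# injects source-latent features through spatial attention and is omitted here.
```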

Plug‑and‑Play Temporal Consistency Module

To stabilize results across frames, a temporal attention layer is inserted during training. For long videos, the authors propose FusionTVO, which splits the video into overlapping segments, assigns fusion weights, and blends the overlapping regions. A total‑variation loss in latent space further suppresses unnecessary frame‑to‑frame fluctuations. At each denoising step, the background latents are swapped with those from the target image to preserve scene fidelity.
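The sketch below illustrates the two video‑level ideas in isolation, assuming a simple triangular weighting for blending overlapping segments and an L1 total‑variation penalty on consecutive latent frames; the exact FusionTVO weighting scheme is not reproduced here, so treat this as an approximation.

```python
import torch

def blend_overlapping_segments(segments, seg_starts, num_frames):
    """segments: list of (T, C, H, W) latents; seg_starts: their start frame indices.
    Assumes the segments together cover every frame at least once."""
    C, H, W = segments[0].shape[1:]
    out = torch.zeros(num_frames, C, H, W)
    weight = torch.zeros(num_frames, 1, 1, 1)
    for latents, start in zip(segments, seg_starts):
        T = latents.shape[0]
        # Triangular per-frame weights so overlapping regions blend smoothly.
        w = 1.0 - (torch.arange(T) - (T - 1) / 2).abs() / ((T - 1) / 2 + 1e-6)
        w = w.view(T, 1, 1, 1).clamp(min=1e-3)
        out[start:start + T] += latents * w
        weight[start:start + T] += w
    return out / weight

def latent_tv_loss(latents):
    """Penalize frame-to-frame changes in latent space; latents: (T, C, H, W)."""
    return (latents[1:] - latents[:-1]).abs().mean()
```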

Experiments

Qualitative Comparison

DynamicFace demonstrates superior identity preservation (shape and texture) and motion consistency (expression, eye gaze, pose) compared with GAN‑based methods (which often produce blurry results) and other diffusion‑based approaches (which may lose motion consistency). Visual examples show realistic, coherent face swaps in film, gaming, and e‑commerce scenarios.

Quantitative Evaluation

The method was benchmarked on FaceForensics++ (FF++) and FFHQ. Ten random frames per test video were evaluated for frame‑level metrics, and 60 continuous frames for video‑level metrics, all at 512×512 resolution, using the official pretrained weights of competing methods (Deepfakes, FaceShifter, MegaFS, SimSwap, DiffSwap, Face Adapter). DynamicFace achieved the highest scores on both identity retrieval (ID) and mouth‑and‑eye consistency, confirming its balanced performance.
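For reference, an identity‑retrieval‑style metric can be computed as below: each swapped frame's face embedding is matched against a gallery of source‑identity embeddings by cosine similarity and counted as correct if the true source is retrieved. The embedding model (ArcFace‑like) and the protocol are assumptions; this is a generic illustration, not the benchmark script used in the paper.

```python
import torch
import torch.nn.functional as F

def id_retrieval_accuracy(swapped_emb, gallery_emb, source_idx):
    """swapped_emb: (N, D) embeddings of swapped faces; gallery_emb: (M, D) source
    identities; source_idx: (N,) index of the true source identity per frame."""
    swapped = F.normalize(swapped_emb, dim=-1)
    gallery = F.normalize(gallery_emb, dim=-1)
    sims = swapped @ gallery.T                  # cosine similarity matrix (N, M)
    retrieved = sims.argmax(dim=-1)             # nearest gallery identity per frame
    return (retrieved == source_idx).float().mean().item()
```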

Additional Applications

Beyond face swapping, DynamicFace can be combined with post‑processing pipelines for identity‑preserving human‑driven generation, further improving ID consistency and expression control. More results are available on the project website.

Conclusion

DynamicFace introduces a finely decoupled conditional injection mechanism that unifies high‑fidelity identity preservation with accurate motion reconstruction, setting a new state of the art for controllable face generation in both images and videos.

Figure: DynamicFace overview.
Tags: diffusion model, face swapping, identity preservation, video synthesis, 3D facial priors
Written by AI Frontier Lectures