DynamicFace: Composable 3D Facial Priors for High‑Quality, Consistent Face Swaps
DynamicFace is a controllable face‑swapping framework built on composable 3D facial priors, dual‑stream identity injection, and a FusionTVO module. It achieves superior image and video quality, identity preservation, and temporal consistency, outperforming state‑of‑the‑art methods on benchmark datasets.
Introduction
Face swapping is a challenging yet crucial task in AI‑generated video because human perception of faces is extremely sensitive to identity, expression, pose, lighting, and background. Precise controllability is essential for practical use, and existing methods often struggle to preserve identity while accurately reproducing target expressions and motion.
Key Contributions
Precise Control: Four decoupled fine‑grained conditions derived from a 3D facial prior enable independent semantic control.
High Fidelity: Identity and detail are injected via Face Former and ReferenceNet, preserving both global identity and fine textures.
High Consistency: The FusionTVO module enhances temporal stability and background coherence across video frames.
Method Overview
DynamicFace combines a diffusion model with composable 3D facial priors. Facial information is explicitly decoupled into five attributes: identity, pose, expression, illumination, and background, with parameters obtained from a 3DMM reconstruction. Identity is injected separately through the dual‑stream mechanism described below, while shape and pose are fused into a single normal map, leaving four spatial conditions.
These four conditions are processed in parallel by a Mixture‑of‑Guiders network (3×3 convolutions followed by zero‑initialized convolutions). After fusion by FusionNet, the combined features are injected into the diffusion backbone, as sketched below.
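The following is a minimal PyTorch sketch of this conditioning path, assuming ControlNet‑style zero‑initialized output layers and summation‑based fusion; the module layouts, channel counts, and FusionNet design are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ConditionGuider(nn.Module):
    """Illustrative per-condition encoder: stacked 3x3 convs ending in a
    zero-initialized projection, so each guider starts as a no-op
    (ControlNet-style; the exact layout is an assumption)."""
    def __init__(self, in_ch: int, feat_ch: int = 320):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, feat_ch, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.zero_proj = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return self.zero_proj(self.body(cond))

class FusionNet(nn.Module):
    """Fuses the four guider outputs; summation plus a 1x1 mixing conv is
    the simplest plausible choice and an assumption here."""
    def __init__(self, feat_ch: int = 320):
        super().__init__()
        self.mix = nn.Conv2d(feat_ch, feat_ch, 1)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        return self.mix(torch.stack(feats, dim=0).sum(dim=0))

# Four spatial conditions: shape-pose normal map, expression map,
# illumination map, background (channel counts are assumptions).
guiders = nn.ModuleList([ConditionGuider(c) for c in (3, 3, 3, 3)])
fusion = FusionNet()
conds = [torch.randn(1, 3, 512, 512) for _ in range(4)]
fused = fusion([g(c) for g, c in zip(guiders, conds)])  # injected into the U-Net
print(fused.shape)  # torch.Size([1, 320, 128, 128])
```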
Explicit Condition Decoupling
Identity shape parameters α are extracted from the source image, while pose β and expression θ are extracted frame‑wise from the target video. Normal maps are rendered from the source shape under the target pose, so the target's facial geometry cannot leak into the condition. Expression priors retain only eyebrow, eye, and lip motions. Illumination is derived from a blurred UV texture that keeps only low‑frequency lighting. Background is handled with an occlusion‑aware mask and a random‑shift strategy that keeps training and inference aligned.
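A hedged sketch of this decoupling recipe follows, where fit_3dmm, render_normals, draw_expr_landmarks, unwrap_uv, and face_mask are hypothetical helpers standing in for a 3DMM fitter, renderer, landmark drawer, UV unwrapper, and face segmenter; their signatures are assumptions, not the paper's API.

```python
import cv2

def compose_conditions(src_img, tgt_frame, fit_3dmm, render_normals,
                       draw_expr_landmarks, unwrap_uv, face_mask):
    """Sketch of the four-condition recipe described above; every helper
    argument is a hypothetical stand-in, not the paper's implementation."""
    alpha_src, _, _ = fit_3dmm(src_img)           # identity shape (source)
    _, beta_tgt, theta_tgt = fit_3dmm(tgt_frame)  # pose + expression (target)

    # 1) Shape-pose normal map: source shape rendered under the target pose,
    #    so no target identity geometry leaks into the condition.
    normal_map = render_normals(shape=alpha_src, pose=beta_tgt)

    # 2) Expression prior restricted to eyebrow/eye/lip motion.
    expr_map = draw_expr_landmarks(theta_tgt, regions=("brows", "eyes", "lips"))

    # 3) Illumination prior: a strongly blurred UV texture keeps only
    #    low-frequency lighting (the kernel size is an arbitrary choice here).
    illum_map = cv2.GaussianBlur(unwrap_uv(tgt_frame), (61, 61), 0)

    # 4) Occlusion-aware background: everything outside the face region.
    bg = tgt_frame * (1.0 - face_mask(tgt_frame))
    return normal_map, expr_map, illum_map, bg
```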
Identity‑Detail Dual‑Stream Injection
High‑level identity is injected by Face Former: a 512‑dimensional ID embedding from ArcFace interacts with the U‑Net layers via cross‑attention. Fine‑grained texture is injected by ReferenceNet, a trainable copy of the U‑Net whose source‑image features are fed into the main network through spatial attention.
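A minimal sketch of the two injection streams follows; the token counts, feature dimensions, and residual placement are assumptions not specified in the article.

```python
import torch
import torch.nn as nn

class IDCrossAttention(nn.Module):
    """Sketch of Face Former-style identity injection: a 512-d ArcFace
    embedding is expanded into a few tokens that the U-Net attends to via
    cross-attention (token count and dims are assumptions)."""
    def __init__(self, dim: int = 320, id_dim: int = 512, n_tokens: int = 4):
        super().__init__()
        self.to_tokens = nn.Linear(id_dim, n_tokens * dim)
        self.n_tokens, self.dim = n_tokens, dim
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, unet_feats, id_embed):
        # unet_feats: (B, N, dim) flattened spatial features; id_embed: (B, 512)
        id_tokens = self.to_tokens(id_embed).view(-1, self.n_tokens, self.dim)
        out, _ = self.attn(unet_feats, id_tokens, id_tokens)
        return unet_feats + out  # residual identity injection

class ReferenceSpatialAttention(nn.Module):
    """Sketch of ReferenceNet-style detail injection: reference features are
    concatenated along the token axis as extra keys/values of self-attention."""
    def __init__(self, dim: int = 320):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, unet_feats, ref_feats):
        kv = torch.cat([unet_feats, ref_feats], dim=1)
        out, _ = self.attn(unet_feats, kv, kv)
        return unet_feats + out

x = torch.randn(1, 64 * 64, 320)    # flattened U-Net features
id_emb = torch.randn(1, 512)        # ArcFace embedding
ref = torch.randn(1, 64 * 64, 320)  # ReferenceNet features at matching scale
x = IDCrossAttention()(x, id_emb)
x = ReferenceSpatialAttention()(x, ref)
print(x.shape)  # torch.Size([1, 4096, 320])
```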
Plug‑and‑Play Temporal Consistency Module
During training, a temporal attention layer improves frame‑to‑frame stability. For long videos, FusionTVO divides the sequence into overlapping segments, applies weighted blending in the overlap regions, and adds a total‑variation constraint in latent space to suppress flicker. At each denoising step, the target's background latents are swapped back in to maintain scene fidelity.
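A minimal sketch of segment‑wise blending with a latent total‑variation penalty; the window length, overlap, and linear ramp weights are assumptions, not the paper's exact schedule.

```python
import torch

def tv_loss(latents: torch.Tensor) -> torch.Tensor:
    """Temporal total-variation penalty on a latent sequence (T, C, H, W):
    penalizes frame-to-frame differences to discourage flicker."""
    return (latents[1:] - latents[:-1]).abs().mean()

def fuse_segments(latents: torch.Tensor, denoise_fn, seg_len: int = 16,
                  overlap: int = 4) -> torch.Tensor:
    """Denoise overlapping segments and blend the overlaps with linear ramp
    weights, normalizing by the accumulated weight per frame."""
    T = latents.shape[0]
    out = torch.zeros_like(latents)
    weight = torch.zeros(T, 1, 1, 1)
    ramp = torch.linspace(0.0, 1.0, overlap)
    start = 0
    while start < T:
        end = min(start + seg_len, T)
        seg = denoise_fn(latents[start:end])  # one denoising pass per segment
        w = torch.ones(end - start, 1, 1, 1)
        if start > 0:                         # fade this segment in over the overlap
            w[:overlap, 0, 0, 0] = ramp[: end - start]
        out[start:end] += seg * w
        weight[start:end] += w
        if end == T:
            break
        start = end - overlap
    return out / weight.clamp_min(1e-8)

frames = torch.randn(40, 4, 64, 64)            # toy latent video
smoothed = fuse_segments(frames, lambda z: z)  # identity "denoiser" for demo
print(tv_loss(smoothed).item())
```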
Experiments
Quantitative evaluation on FaceForensics++ (FF++) and FFHQ compares DynamicFace with six recent face‑swapping methods (Deepfakes, FaceShifter, MegaFS, SimSwap, DiffSwap, Face Adapter). Ten random frames per video are used for image‑level metrics and 60 consecutive frames for video‑level metrics. DynamicFace achieves the best scores in both identity retrieval and mouth/eye consistency, demonstrating superior identity preservation and motion fidelity.
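For reference, identity retrieval is commonly computed as top‑1 retrieval accuracy over ArcFace embeddings; the sketch below assumes this standard protocol rather than the paper's exact evaluation script.

```python
import torch
import torch.nn.functional as F

def identity_retrieval(swapped_emb: torch.Tensor, gallery_emb: torch.Tensor,
                       src_ids: torch.Tensor) -> float:
    """Assumed standard protocol: match each swapped face's embedding against
    a gallery of source embeddings by cosine similarity; the score is the
    fraction of faces whose nearest gallery entry is the true source."""
    sims = F.normalize(swapped_emb, dim=-1) @ F.normalize(gallery_emb, dim=-1).T
    return (sims.argmax(dim=-1) == src_ids).float().mean().item()

# Toy example: 100 swapped faces matched against 50 gallery identities.
swapped = torch.randn(100, 512)
gallery = torch.randn(50, 512)
labels = torch.randint(0, 50, (100,))
print(identity_retrieval(swapped, gallery, labels))
```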
Ablation Studies
Four condition groups (background, expression, illumination, shape‑pose normal map) are removed one at a time. Results show each condition contributes uniquely: background ensures environmental consistency, expression locks micro‑movements, illumination maintains lighting harmony, and shape‑pose maps guarantee geometric fidelity. Removing any condition degrades corresponding metrics and visual quality.
Further ablations confirm the necessity of the motion module and FusionTVO for temporal consistency, and of the combined use of Face Former and ReferenceNet for strong identity injection.
Results Showcase
Visual comparisons illustrate that DynamicFace preserves identity (shape and texture) and motion (expressions, pose) while maintaining background consistency, outperforming GAN‑based methods that produce blurry results and diffusion‑based methods that struggle with motion consistency.
Conclusion
DynamicFace presents a diffusion‑based video face‑swapping framework that explicitly decouples 3D facial priors, injects identity and detail through a dual‑stream mechanism, and employs FusionTVO for temporal and background consistency. Extensive quantitative and ablation experiments on FF++ demonstrate state‑of‑the‑art performance in identity and motion consistency, offering a promising direction for controllable generative AI.