
Magic Mirror: Zero‑Shot Identity‑Preserved High‑Quality Personalized Video Generation

Magic Mirror introduces a single‑stage, zero‑shot framework that fuses dual facial embeddings with a conditional adaptive normalization module inside a Video Diffusion Transformer, achieving superior identity consistency, natural dynamics, and high visual quality compared with existing video generation methods.

AIWalker

Jan 15, 2025

Highlights

Proposes Magic Mirror, a novel framework that generates identity‑consistent videos without per‑subject fine‑tuning.

Designs a lightweight cross‑modal adapter combined with Conditional Adaptive Normalization (CAN) to fuse facial embeddings into a full‑attention diffusion Transformer.

Develops a data‑construction pipeline that mixes synthetic data generation with progressive training to address the scarcity of personalized video data.

Problem Statement

Current video generation approaches struggle to balance identity (ID) consistency with natural motion. Existing methods either require subject‑specific fine‑tuning, produce static “copy‑paste” results, or become unstable when generating long sequences. Two further obstacles compound the problem. Video diffusion models such as Video DiT sacrifice spatial fidelity for text‑video alignment, which makes fine‑grained identity features hard to retain. In addition, high‑quality identity‑preserving image‑video pairs are scarce.

Proposed Solution

Magic Mirror is a single‑stage framework that generates high‑quality, identity‑preserving videos with natural dynamics. Its training recipe and architecture rest on three elements:

Identity‑consistent synthetic data for initial training.

Fine‑tuning on video data to enforce temporal consistency.

Integration into the CogVideoX backbone.

Crucially, it incorporates Conditional Adaptive Normalization (CAN) to efficiently merge identity information.

Face Feature Extraction

The dual‑branch extractor captures high‑level identity features and structural cues from reference images. Two Q‑Former‑style perceivers attend to dense CLIP‑ViT feature maps, producing compressed embeddings that are merged via a decoupling mechanism and projected into the text‑embedding space with a fusion MLP.
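The perceiver idea above can be illustrated with a minimal, dependency‑free sketch: a small set of query vectors cross‑attends to a larger set of patch features, so each branch compresses the feature map into a few tokens. Everything here is an assumption for illustration: the names `cross_attend`, `id_queries`, and `struct_queries`, the random stand‑in features, the single attention head, and the omission of learned projections and the fusion MLP. It is not the authors' implementation.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, features):
    """Each query attends over all feature vectors (single head, no projections)."""
    dim = len(queries[0])
    tokens = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        tokens.append([sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)])
    return tokens

def random_vectors(count, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
dim = 8
patch_features = random_vectors(49, dim, rng)  # stand-in for a 7x7 dense ViT feature map
id_queries = random_vectors(4, dim, rng)       # hypothetical identity-branch queries
struct_queries = random_vectors(4, dim, rng)   # hypothetical structure-branch queries

id_tokens = cross_attend(id_queries, patch_features)
struct_tokens = cross_attend(struct_queries, patch_features)
face_tokens = id_tokens + struct_tokens        # simple concatenation; fusion MLP omitted

print(len(patch_features), "patches ->", len(face_tokens), "face tokens")
```

The point of the design is compression: 49 patch vectors become 8 tokens, which is cheap to append to the conditioning sequence of a large Transformer.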

Conditional Adaptive Normalization (CAN)

CAN adapts the per‑layer modulation factors of CogVideoX’s cross‑modal attention. Facial embeddings are injected as additional conditioning tokens, and the module learns scale, shift, and gating parameters to align the distribution of facial features with text and video streams. This design draws inspiration from conditional DiT and StyleGAN control methods.
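The scale, shift, and gating parameters can be sketched in the style of adaptive layer normalization as used in conditional DiT blocks. This is a simplified illustration, not the CAN module itself: the modulation values are hard‑coded here rather than predicted from a conditioning signal, and the function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and near-unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def modulate(x, scale, shift):
    """Normalize, then apply a condition-dependent scale and shift."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]

def gated_residual(x, branch_out, gate):
    """Add the branch output back to the stream, weighted by a per-channel gate."""
    return [v + g * o for v, g, o in zip(x, gate, branch_out)]

face_token = [0.5, -1.0, 2.0, 0.0]
zeros = [0.0] * len(face_token)
ones = [1.0] * len(face_token)

# Hypothetical modulation values; a real model would predict them per layer.
scale = [0.1, 0.1, 0.1, 0.1]
shift = [0.0, 0.5, 0.0, -0.5]

modulated = modulate(face_token, scale, shift)
branch_out = [2.0 * v for v in modulated]  # placeholder for an attention or MLP block

closed = gated_residual(face_token, branch_out, zeros)  # gate = 0: branch is ignored
opened = gated_residual(face_token, branch_out, ones)   # gate = 1: full residual update

print("gate closed keeps the token unchanged:", closed == face_token)
print("gate open changes the token:", opened != face_token)
```

With zero scale and shift, `modulate` reduces to plain layer normalization, and with a zero gate the face stream passes through untouched. That property is why this family of modules can be added to a pretrained backbone without disturbing it at the start of training.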

Data and Training

Training proceeds in two stages. First, image pre‑training uses the LAION‑Face and SFHQ datasets, augmented with identity‑conditioned pairs generated by PhotoMaker‑V2 and filtered by ArcFace cosine similarity. Second, video fine‑tuning uses high‑quality clips from Pexels and Mixkit plus a curated web‑scraped video set, each paired with synthetic facial frames. The loss combines identity‑aware denoising and generic diffusion objectives, with a balance factor applied to facial regions in 50 % of samples.
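One way to read the loss description is as a per‑element denoising error in which face pixels receive extra weight, with that weighting switched on for roughly half of the training samples. The sketch below follows that reading. The weight value of 2.0, the flat list representation, the normalization by element count, and the function name are all assumptions, since the article does not give the exact formula.

```python
import random

def denoising_loss(pred, target, face_mask, face_weight, use_face_weight):
    """Mean squared error; masked (face) positions are up-weighted when enabled."""
    total = 0.0
    for p, t, m in zip(pred, target, face_mask):
        weight = face_weight if (use_face_weight and m) else 1.0
        total += weight * (p - t) ** 2
    return total / len(pred)

pred = [0.0, 1.0, 2.0, 3.0]      # toy predicted noise
target = [0.0, 0.0, 0.0, 0.0]    # toy ground-truth noise
face_mask = [0, 1, 1, 0]         # 1 marks positions inside the face region

generic = denoising_loss(pred, target, face_mask, face_weight=2.0, use_face_weight=False)
identity_aware = denoising_loss(pred, target, face_mask, face_weight=2.0, use_face_weight=True)
print(generic)         # 3.5
print(identity_aware)  # 4.75

# During training, the identity-aware variant is chosen for about half of the samples.
rng = random.Random(0)
use_face_weight = rng.random() < 0.5
sample_loss = denoising_loss(pred, target, face_mask, 2.0, use_face_weight)
```

Applying the face weighting only part of the time keeps the model from over‑fitting to the face region at the expense of the rest of the frame.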

Implementation details: the adapter is inserted into every even‑indexed DiT layer of CogVideoX‑5B. Image pre‑training runs 30 K iterations (batch size 64), followed by 5 K video fine‑tuning iterations (batch size 8) on a node with eight NVIDIA A800 GPUs.
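The numbers above translate into a few concrete quantities. The short calculation below assumes a backbone depth of 42 layers for CogVideoX‑5B and zero‑based layer indexing, neither of which is stated in the article, and it assumes the batch is split evenly across the eight GPUs.

```python
NUM_LAYERS = 42  # assumed depth of CogVideoX-5B; not given in the article
NUM_GPUS = 8

# Adapters sit in every even-indexed layer (zero-based indexing assumed).
adapter_layers = [i for i in range(NUM_LAYERS) if i % 2 == 0]

image_iters, image_batch = 30_000, 64
video_iters, video_batch = 5_000, 8

image_samples_seen = image_iters * image_batch
video_samples_seen = video_iters * video_batch

print(len(adapter_layers))        # 21
print(image_samples_seen)         # 1920000
print(video_samples_seen)         # 40000
print(image_batch // NUM_GPUS)    # 8 images per GPU per step
print(video_batch // NUM_GPUS)    # 1 video per GPU per step
```

Under these assumptions the video stage sees about 2 % as many samples as the image stage, which fits the article's framing of video fine‑tuning as a short second phase.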

Experiments

Implementation Details

Dataset preparation follows the pipeline in Figure 5, using ArcFace for face detection and embedding extraction, and PhotoMaker‑V2 for reference frame synthesis. Text prompts (≈29 K) are generated by MiniGemini‑8B; CogVLM provides video captions for the second stage.
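The identity filter used when building synthetic pairs can be sketched as a cosine‑similarity threshold on face embeddings. The three‑dimensional vectors below are toy stand‑ins for real ArcFace embeddings, and the threshold of 0.5 and the function names are illustrative assumptions, not values from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_by_identity(reference, candidates, threshold):
    """Keep the names of candidates whose embedding is close enough to the reference."""
    return [name for name, emb in candidates
            if cosine_similarity(reference, emb) >= threshold]

reference = [1.0, 0.0, 0.0]               # toy embedding of the source face
candidates = [
    ("same_person", [0.9, 0.1, 0.0]),     # nearly parallel to the reference
    ("identity_drift", [0.1, 1.0, 0.0]),  # nearly orthogonal to the reference
]

kept = filter_by_identity(reference, candidates, threshold=0.5)
print(kept)  # ['same_person']
```

Filtering this way discards generated images whose identity has drifted away from the source face, so only consistent pairs reach training.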

Quantitative Evaluation

Magic Mirror is compared against ID‑Animator, DynamiCrafter, CogVideoX, and EasyAnimate using VBench metrics (Dynamics, Text‑Video Alignment, Inception Score) together with face‑specific scores: average identity similarity and the facial‑motion metrics FMref and FMinter. Results in Table 1 show superior performance across all metrics.
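The article does not define FMref and FMinter. One plausible reading, used here purely as an assumption, is that FMref measures how far facial landmarks in each frame move from the reference face, and FMinter measures how far they move between consecutive frames. The sketch below implements that reading on toy 2D landmarks with hypothetical function names.

```python
import math

def landmark_distance(a, b):
    """Mean Euclidean distance between corresponding 2D landmarks."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def face_motion(reference, frames):
    """Return (distance to reference, inter-frame distance), each averaged over the clip."""
    fm_ref = sum(landmark_distance(reference, f) for f in frames) / len(frames)
    pairs = list(zip(frames, frames[1:]))
    fm_inter = sum(landmark_distance(a, b) for a, b in pairs) / len(pairs)
    return fm_ref, fm_inter

reference = [(0.0, 0.0), (1.0, 0.0)]            # two toy landmarks on the reference face
static = [reference, reference, reference]       # "copy-paste" clip: the face never moves
moving = [reference, [(3.0, 4.0), (4.0, 4.0)]]   # face shifts by (3, 4) in the second frame

print(face_motion(reference, static))
print(face_motion(reference, moving))
```

Under this reading, a clip that simply pastes the reference face into every frame scores zero on both values, which is the static behavior the problem statement criticizes.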

Qualitative Evaluation

Figure 6 demonstrates that Magic Mirror maintains higher text consistency, motion dynamics, and video quality than vanilla CogVideoX, and achieves better frame‑wise identity consistency than existing image‑to‑video pipelines.

Ablation Studies

Removing the facial embedding branch drastically reduces identity fidelity, confirming its importance. Excluding CAN leads to poorer cross‑frame identity retention, as shown in Figure 7. Training strategy ablations reveal that image pre‑training is essential for robust identity preservation, while video fine‑tuning ensures temporal coherence; training only on images introduces color‑shift artifacts during inference.

Discussion

Computational Overhead

Generating a 49‑frame 480p video incurs minimal additional GPU memory and latency compared with baseline models, as most extra parameters reside in the one‑time embedding extraction stage (see Table 3).

Feature Distribution Analysis

t‑SNE visualizations of CAN’s modulation scales (σ) across Transformer layers reveal distinct distribution patterns that are invariant to time steps, confirming effective conditioning.

Limitations and Future Work

Magic Mirror currently supports only single‑identity generation and focuses on facial features; extending to multi‑identity scenarios and fine‑grained attributes such as clothing remains an open challenge.

Conclusion

Magic Mirror presents a zero‑shot, identity‑preserving video generation framework that integrates dual facial embeddings and CAN into a DiT‑based architecture. Extensive experiments demonstrate high‑quality, personalized video synthesis that outperforms state‑of‑the‑art baselines on both objective benchmarks and human evaluations.

References

[1] Magic Mirror: ID‑Preserved Video Generation in Video Diffusion Transformers
