DualReal: Seamless Identity and Motion Customization for Video Generation

DualReal introduces an adaptive joint training framework that customizes subject identity and motion dynamics simultaneously in video generation. By pairing a dual-domain perception adapter with a stage-fusion controller, it avoids the conflicts of traditional isolated approaches, improving CLIP-I by 21.7% and DINO-I by 31.8% on average.


Video Overview

Two illustrative screenshots showcase the visual quality of videos generated by DualReal.

Video overview image 1
Video overview image 2

Abstract

Traditional dual-mode video customization follows an "isolated" paradigm: identity and motion are customized separately, ignoring their intrinsic mutual constraints, which leads to conflicts and systematic quality degradation. The University of Science and Technology of China team proposes DualReal, a new framework that jointly trains identity and motion through adaptive cross-domain adaptation.

Motivation

Existing methods focus on a single dimension—either identity‑driven or motion‑driven—causing over‑fitting to one mode and degrading consistency in the other. The figure below illustrates how fixed‑identity customization with increasing motion steps eventually harms identity consistency.

Motivation illustration

Fixed identity customization with gradually increasing motion steps (red box indicates optimal identity consistency).

Experiments reveal that motion priors irreversibly damage identity consistency and that a universal motion training degree cannot minimize identity degradation across different subjects.

Furthermore, the DiT model’s attention to identity and motion features shifts across denoising stages and network depths, requiring dynamic coordination to achieve high‑quality joint customization.

Feature focus shift

Deeper denoising stages increasingly emphasize identity learning (orange dashed line).

In the deepest DiT layers, the trend reverses and motion modeling dominates (red solid line).

Method

Dual‑Domain Perception Adapter

The adapter dynamically switches the current training phase (identity or motion) and uses frozen‑domain priors to guide the other mode, while a regularization strategy prevents knowledge leakage.

Adapter architecture

Joint Identity‑Motion Optimization

Unlike methods that fine‑tune the entire diffusion model, DualReal dynamically switches the training mode before each denoising iteration according to a predefined ratio, feeding the corresponding data into the DiT backbone. The input to each block consists of joint features, where N_t and N_v denote the numbers of text and visual tokens, respectively. The adapter adopts a bottleneck design with skip connections; GELU is used as the activation function. Linear projection weights W_id and W_mv map reference images into the latent space.
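
A minimal PyTorch sketch of what such a bottleneck adapter and per-iteration mode switching could look like (module names, dimensions, and the switching schedule are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter with a GELU activation and a skip connection."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # compress to the bottleneck width
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # project back to the DiT hidden size

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))  # skip connection around the bottleneck

def pick_training_mode(step: int, identity_ratio: float = 0.5) -> str:
    """Switch between 'identity' and 'motion' before each denoising iteration
    according to a predefined ratio (a simple deterministic schedule assumed here)."""
    return "identity" if (step % 10) < identity_ratio * 10 else "motion"
```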

Through the stage‑fusion controller, the motion adapter’s output is scaled by a weight coefficient, while the identity output is weighted by a complementary coefficient. The modulated features are aggregated via residual connections into the DiT block output. Formally:

h_{l}^{'} = \alpha_{l} \cdot f_{mv}(h_{l}) + (1-\alpha_{l}) \cdot f_{id}(h_{l}) + h_{l}

where h_{l} is the output of the l -th DiT layer and h_{l}^{'} is the aggregated output after adaptation.
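
In code, this aggregation is a single gated residual step. A sketch, assuming f_id and f_mv are the two adapter modules and alpha_l is the layer-wise coefficient produced by the stage-fusion controller:

```python
def fuse_block_output(h_l, f_id, f_mv, alpha_l):
    """h'_l = alpha_l * f_mv(h_l) + (1 - alpha_l) * f_id(h_l) + h_l."""
    return alpha_l * f_mv(h_l) + (1.0 - alpha_l) * f_id(h_l) + h_l
```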

Regularization Strategy

Joint training suffers from a large distribution shift between identity and motion features. Unconstrained optimization can cause destructive interference, e.g., fine‑tuning the motion adapter on static images irreversibly reduces dynamic generation ability. To mitigate this, a binary mask selects which adapter’s parameters are active, enforcing gradient masking. The loss function is:

\mathcal{L}=\mathcal{L}_{recon}+\lambda\|M_{id}\odot\theta_{id}\|_{2}^{2}+\lambda\|M_{mv}\odot\theta_{mv}\|_{2}^{2}

where \mathcal{L}_{recon} is the video diffusion reconstruction loss, M_{id} and M_{mv} are binary masks for identity and motion adapters, and \theta denotes their parameters.
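
A hedged sketch of how the masked penalty could be implemented; the flattened parameter vectors and the mask construction are assumptions made for illustration:

```python
import torch

def masked_l2_penalty(theta: torch.Tensor, mask: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """lam * ||M ⊙ theta||_2^2 for one adapter's (flattened) parameters."""
    return lam * torch.sum((mask * theta) ** 2)

# Total objective, following the equation above:
# loss = recon_loss + masked_l2_penalty(theta_id, mask_id) + masked_l2_penalty(theta_mv, mask_mv)
```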

Stage Fusion Controller

The controller resolves the competition between the two dimensions at different denoising stages by generating time-aware scaling coefficients for each DiT depth. Input features are first pooled, then combined with a timestep embedding and modulated via LayerNorm. The resulting weight matrix W_{t} is defined as:

W_{t}=\text{MLP}(\text{Pool}(f) \oplus e_{t})

These weights act as a gated fusion between textual and visual tokens, enabling adaptive allocation of modality-specific importance across the diffusion process.
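
A possible PyTorch reading of the controller, assuming mean pooling over tokens, concatenation with the timestep embedding, and a sigmoid to keep the coefficient in (0, 1); the layer sizes and the placement of LayerNorm are assumptions:

```python
import torch
import torch.nn as nn

class StageFusionController(nn.Module):
    """Produces a time-aware fusion weight: W_t = MLP(Pool(f) ⊕ e_t)."""
    def __init__(self, dim: int, t_dim: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.LayerNorm(dim + t_dim)  # modulate pooled features + timestep embedding
        self.mlp = nn.Sequential(
            nn.Linear(dim + t_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        pooled = feats.mean(dim=1)                          # Pool(f): average over tokens
        x = self.norm(torch.cat([pooled, t_emb], dim=-1))   # Pool(f) ⊕ e_t
        return torch.sigmoid(self.mlp(x))                   # alpha_l in (0, 1)
```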

Experiments

Experimental Setup

Evaluation Datasets

Identity customization: 50 subjects (pets, plush toys, etc.) with 3–10 images each.

Motion customization: 21 challenging dynamic motion sequences from public datasets.

Each case provides 50 diverse prompts to test editability and scene variety.

Baseline Methods

DreamVideo (dual‑mode baseline) implemented on the same DiT backbone.

CogVideoX‑5B: sequential full‑parameter fine‑tuning of identity then motion (DreamBooth style).

LoRA fine‑tuning: separate LoRA modules for identity and motion, merged at inference.

MotionBooth: identity module trained with random videos to preserve motion capability.

Evaluation Metrics

Text‑Video Consistency (CLIP‑T): cosine similarity between the prompt and all generated frames (a computation sketch follows this list).

Identity Fidelity: CLIP‑I and DINO‑I scores measuring similarity between generated frames and reference identity images.

Temporal Motion Quality: T‑Cons (temporal consistency), Motion Smoothness (MS), Temporal Flickering (TF), and Dynamic Degree (DD) computed via RAFT optical flow.
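
As an example of the first metric, CLIP-T can be computed with an off-the-shelf CLIP model. This sketch uses the Hugging Face transformers API with openai/clip-vit-base-patch32, which is an assumption rather than the paper's exact evaluation pipeline:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_t_score(prompt: str, frames) -> float:
    """Mean cosine similarity between the prompt and every generated frame (PIL images)."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = txt / txt.norm(dim=-1, keepdim=True)   # unit-normalize embeddings
    img = img / img.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()           # average cosine similarity over frames
```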

Main Results

Qualitative: MotionBooth preserves identity but fails to model motion; DreamVideo suffers identity collapse due to mode conflict; CogVideoX-5B and LoRA struggle with identity retention. DualReal achieves both high identity consistency and coherent motion, demonstrating the advantage of joint training.

Quantitative: DualReal improves CLIP-I by 21.7% and DINO-I by 31.8% on average, attains the best scores on T-Cons, Motion Smoothness, and Temporal Flickering, and ranks second on CLIP-T. The average Dynamic Degree across all motions is 12.02, confirming faithful motion intensity.

Overall, DualReal significantly enhances both textual alignment and motion coherence while preserving identity fidelity, validating the effectiveness of its adaptive joint training strategy.

Resources

Paper: https://arxiv.org/abs/2505.02192

Project page: https://wenc-k.github.io/dualreal-customization/

Open‑source code: https://github.com/wenc-k/DualReal

Tags: Video Generation, Diffusion Models, joint training, identity preservation, dual-domain adaptation, motion consistency