
FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

FantasyTalking generates high-fidelity, coherent talking portraits from a single static image. It employs a two-stage audio-visual alignment strategy (global segment-level motion followed by frame-level lip refinement), face-centric cross-attention for identity preservation, and a motion-intensity module that lets users control expression and body movement, achieving superior realism, synchronization, and controllability compared with prior methods.

FantasyTalking is the core engine of Amap’s video digital‑human technology, designed to generate high‑fidelity, coherent talking portraits from a single static reference image. Existing methods struggle to capture subtle facial expressions, full‑body motions, and dynamic backgrounds, limiting realism.

The proposed framework introduces a novel two‑stage audio‑visual alignment strategy. In the first stage, a segment‑level training scheme aligns audio‑driven dynamics with the entire scene (reference portrait, contextual objects, background) to establish global coherent motion. In the second stage, a lip‑tracking mask refines lip motion at the frame level, ensuring precise synchronization with the audio signal.
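To make the difference in supervision scope between the two stages concrete, here is a minimal PyTorch-style sketch. This is our illustration, not the paper's code: the function names, tensor shapes, and the plain MSE formulation are assumptions; the key idea is that stage 1 supervises the whole scene per segment, while stage 2 restricts the loss to a lip-tracking mask per frame.

```python
import torch
import torch.nn.functional as F

def stage1_segment_loss(pred_video: torch.Tensor, target_video: torch.Tensor) -> torch.Tensor:
    """Stage 1 (illustrative): audio-driven dynamics are aligned with the
    whole scene, so supervision covers every pixel of every frame in the
    segment (portrait, contextual objects, and background alike)."""
    return F.mse_loss(pred_video, target_video)

def stage2_lip_loss(pred_video: torch.Tensor, target_video: torch.Tensor,
                    lip_mask: torch.Tensor) -> torch.Tensor:
    """Stage 2 (illustrative): a lip-tracking mask of shape (B, T, 1, H, W)
    restricts supervision to the mouth region, refining frame-level
    audio-lip synchronization without disturbing the global motion
    established in stage 1."""
    squared_error = (pred_video - target_video) ** 2 * lip_mask
    return squared_error.sum() / lip_mask.sum().clamp(min=1.0)
```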

To preserve identity without sacrificing motion flexibility, the authors replace the conventional reference network with a face‑focused cross‑attention module, maintaining facial consistency while allowing full‑body movement. An additional motion‑intensity modulation module explicitly controls the strength of facial expressions and body motions, extending controllability beyond lip movements.
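The sketch below shows one plausible shape for these two components, again as an assumption-laden illustration rather than the paper's exact design: the class names, dimensions, and the scale-and-shift modulation scheme are ours. The residual connection in the cross-attention is the design point worth noting: identity tokens bias the video latents toward the reference face without constraining how the body moves.

```python
import torch
import torch.nn as nn

class FaceCrossAttention(nn.Module):
    """Illustrative face-centric cross-attention: video latents attend to a
    small set of face-identity tokens instead of a full reference network."""
    def __init__(self, dim: int, face_dim: int, heads: int = 8):
        super().__init__()
        self.to_kv = nn.Linear(face_dim, dim)  # project identity tokens into latent width
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, latents: torch.Tensor, face_tokens: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, dim) video tokens; face_tokens: (B, M, face_dim)
        kv = self.to_kv(face_tokens)
        out, _ = self.attn(latents, kv, kv)
        return latents + out  # residual keeps motion flexibility

class MotionIntensityModulation(nn.Module):
    """Illustrative motion-intensity modulation: a scalar intensity is
    embedded and used to scale and shift features, controlling the strength
    of facial expressions and body motion."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, feats: torch.Tensor, intensity: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim); intensity: (B, 1), e.g. a user-chosen value in [0, 1]
        scale, shift = self.mlp(intensity).unsqueeze(1).chunk(2, dim=-1)
        return feats * (1 + scale) + shift

# Example: 16 video tokens of width 512 attend to 4 face-identity tokens.
attn = FaceCrossAttention(dim=512, face_dim=768)
mod = MotionIntensityModulation(dim=512)
latents = torch.randn(2, 16, 512)
face_tokens = torch.randn(2, 4, 768)
intensity = torch.full((2, 1), 0.7)  # stronger expressions and body motion
out = mod(attn(latents, face_tokens), intensity)
```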

Key contributions include:

Higher realism and coherence: supports natural facial, lip, and body motions together with dynamic backgrounds.

Precise audio‑lip synchronization via two‑stage alignment.

Balanced identity preservation and dynamic flexibility through face‑centric cross‑attention.

Controllable motion intensity allowing users to adjust expression and body movement strength.

Extensive experiments on both constrained and natural talking‑head datasets demonstrate significant improvements over state‑of‑the‑art methods (e.g., Hallo3, Sonic, OmniHuman‑1) in metrics such as FID, FVD, IDC, aesthetic scores, and audio‑visual sync.

The paper also details the implementation of the two‑stage alignment, the identity‑preserving cross‑attention, and the motion‑intensity modulation network, providing equations and architectural diagrams. Project page, code, and the full arXiv paper are linked for reproducibility.

Tags: deep learning, identity preservation, audio-visual alignment, motion synthesis, talking portrait, video diffusion
Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.