JoyGen: Audio‑Driven 3D Depth‑Aware Talking‑Face Video Editing Explained

JoyGen is a two‑stage framework for high‑quality talking‑face video editing: it synchronizes lip movements with input audio using 3DMM‑based identity and expression coefficients and depth‑aware supervision, is trained on a newly built high‑resolution Chinese speaking‑face dataset, and achieves state‑of‑the‑art performance on multiple benchmarks.

Abstract

This article presents JoyGen, a novel two‑stage framework for audio‑driven 3D depth‑aware talking‑face video editing. In the first stage, a 3D face reconstruction model provides 3DMM identity coefficients while an audio‑to‑motion model predicts expression coefficients; in the second stage, audio features are combined with facial depth maps to provide comprehensive supervision for accurate lip‑audio synchronization and high visual quality. A 130‑hour high‑quality Chinese speaking‑face dataset was constructed for training and evaluation.

Method

The overall pipeline consists of a training stage and an inference stage (see Figure 1). During training, a 3D Morphable Model (3DMM) provides per‑frame identity and expression coefficients, while audio features drive a motion‑generation model that predicts the expression coefficients. Facial depth maps rendered from the reconstructed 3D mesh serve as additional supervision. At inference, a pretrained audio‑to‑motion (A2M) model predicts the expression coefficients; the resulting depth maps and the audio features, injected via cross‑attention, guide a UNet‑based single‑step facial synthesis.
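To make the two stages concrete, here is a minimal pseudocode sketch of the inference flow. All function names (reconstruct_3dmm, audio_to_expression, render_depth, edit_frame) are hypothetical placeholders for the components described above, not the authors' API.

```python
# Minimal sketch of the JoyGen inference flow. Every helper called here is a
# hypothetical placeholder: reconstruct_3dmm stands in for Deep3DFaceRecon,
# audio_to_expression for the pretrained A2M model, render_depth for the mesh
# renderer, and edit_frame for the single-step UNet editor.

def joygen_inference(video_frames, audio_features):
    edited_frames = []
    for frame, audio_feat in zip(video_frames, audio_features):
        # Stage 1: 3D reconstruction and audio-driven motion.
        identity, _orig_expr, pose = reconstruct_3dmm(frame)   # identity from the image
        expr = audio_to_expression(audio_feat, identity)       # expression from audio
        depth = render_depth(identity, expr, pose)             # depth map from the 3D mesh

        # Stage 2: depth- and audio-conditioned single-step editing.
        edited = edit_frame(frame, depth, audio_feat)
        edited_frames.append(edited)
    return edited_frames
```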

3D Morphable Model

Following Blanz and Vetter (2003), the 3DMM represents a face shape as S = S̄ + U_id · α + U_exp · β, where S̄ is the mean shape, U_id and U_exp are orthogonal bases for identity and expression, and α, β are the corresponding coefficients.
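The linear model can be written out in a few lines. The sketch below uses toy dimensions (the vertex count and the 80/64 basis sizes are illustrative assumptions, not values stated in the article) and random arrays in place of a real basis such as BFM.

```python
import numpy as np

# Toy dimensions: N mesh vertices, 80 identity and 64 expression bases.
# These sizes are illustrative assumptions, not taken from the article.
N, ID_DIM, EXP_DIM = 35709, 80, 64

S_mean = np.random.randn(3 * N)            # mean face shape S̄ (flattened x/y/z)
U_id   = np.random.randn(3 * N, ID_DIM)    # identity basis
U_exp  = np.random.randn(3 * N, EXP_DIM)   # expression basis

alpha = np.random.randn(ID_DIM)            # identity coefficients
beta  = np.random.randn(EXP_DIM)           # expression coefficients

# S = S̄ + U_id @ alpha + U_exp @ beta, reshaped to (N, 3) vertex positions
S = (S_mean + U_id @ alpha + U_exp @ beta).reshape(N, 3)
```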

Audio‑to‑Lip Motion

A flow‑enhanced variational auto‑encoder learns a mapping from audio to facial motion. Since the 3D mesh is determined by identity and expression coefficients, fixing the identity during reconstruction allows the model to focus on expression dynamics driven by audio.
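As a rough illustration of the input/output contract, the sketch below replaces the flow‑enhanced VAE with a plain MLP that maps a per‑frame audio feature to expression coefficients; the feature and coefficient dimensions are assumptions, not values from the article.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Illustrative stand-in for the audio-to-motion model: it maps a window
    of audio features to 3DMM expression coefficients while the identity
    coefficients stay fixed. The real model is a flow-enhanced VAE; this
    plain MLP only shows the input/output contract."""

    def __init__(self, audio_dim=384, exp_dim=64, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, exp_dim),
        )

    def forward(self, audio_feat):           # (B, audio_dim)
        return self.net(audio_feat)          # (B, exp_dim) expression coefficients

# Usage: predicted expression coefficients for a batch of 4 audio frames.
beta = AudioToExpression()(torch.randn(4, 384))
```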

Facial Depth Map

Per‑frame 3DMM coefficients are obtained from a single image via Deep3DFaceRecon; the predicted coefficients produce a 3D mesh, which is rendered into a depth map. During inference, the expression coefficients predicted by the Real3DPortrait‑trained A2M model replace the reconstructed ones before the depth map is rendered (see Figure 1).
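The following sketch shows the idea of turning a reconstructed mesh into a depth map. It splats projected vertices instead of rasterizing triangles with a z‑buffer, and the camera parameters are assumptions, so it only approximates what a real renderer does.

```python
import numpy as np

def render_depth_map(vertices, image_size=256, focal=1015.0):
    """Very simplified depth rendering: project mesh vertices with a pinhole
    camera and keep the nearest depth per pixel (vertex splatting instead of
    triangle rasterization). The focal length and camera convention are
    assumptions, not values from the article."""
    depth = np.full((image_size, image_size), np.inf)
    x, y, z = vertices[:, 0], vertices[:, 1], vertices[:, 2]
    z = np.maximum(z, 1e-6)                        # avoid division by zero
    u = (focal * x / z + image_size / 2).astype(int)
    v = (focal * -y / z + image_size / 2).astype(int)
    valid = (u >= 0) & (u < image_size) & (v >= 0) & (v < image_size)
    for ui, vi, zi in zip(u[valid], v[valid], z[valid]):
        depth[vi, ui] = min(depth[vi, ui], zi)     # keep the nearest surface
    depth[np.isinf(depth)] = 0.0                   # background pixels
    return depth
```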

Talking‑Face Editing

The face encoder maps the input image to a low‑dimensional latent space, reducing computational load. A single‑step UNet, similar to MuseTalk, predicts lip motion conditioned on the masked target frame, a random reference frame, the corresponding depth map, and audio features encoded by Whisper. The latent features are concatenated along the channel dimension to form the UNet input.
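Below is a minimal sketch of how the conditioning inputs could be assembled, assuming a Stable‑Diffusion‑style VAE that maps 256×256 frames to 4×32×32 latents and Whisper embeddings of width 384; the channel counts and feature dimensions are assumptions, not values confirmed by the article.

```python
import torch

# Illustrative shapes only: a VAE-style face encoder is assumed to map
# 256x256 RGB frames to 4x32x32 latents, with the depth map resized to match.
B = 2
masked_latent = torch.randn(B, 4, 32, 32)   # masked target-frame latent
ref_latent    = torch.randn(B, 4, 32, 32)   # random reference-frame latent
depth_latent  = torch.randn(B, 1, 32, 32)   # mouth-region depth map, downsampled
audio_feat    = torch.randn(B, 50, 384)     # Whisper audio embeddings (assumed dims)

# Channel-wise concatenation forms the UNet input; audio features enter the
# UNet through cross-attention rather than concatenation.
unet_input = torch.cat([masked_latent, ref_latent, depth_latent], dim=1)  # (B, 9, 32, 32)
# pred_latent = unet(unet_input, encoder_hidden_states=audio_feat)  # one forward pass, no iterative diffusion
```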

Dataset

To address the lack of Chinese speaking‑face data, a high‑resolution dataset containing ~1.1k videos (≈130 hours) was curated from Bilibili and Douyin. Strict filtering ensured single‑face clips, clear mouth/teeth visibility, Chinese audio without background noise, and consistent speaker identity. Statistics of video length, frame rate, gender distribution, and face size are shown in Figure 2.

Training Details

An L1 loss is applied in both latent and pixel space (L_latent and L_pixel) to capture fine facial details. Depth information is restricted to the mouth region (delineated by 80 MediaPipe keypoints); random spatial perturbations and a 50% dropout of the depth cue improve robustness. Training uses 8 NVIDIA H800 GPUs for one day with 256×256 inputs, a batch size of 128, the Adam optimizer (lr = 1e‑5), and loss weights λ₁ = 2 and λ₂ = 1.
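A minimal sketch of the training objective and the depth‑dropout trick follows, assuming λ₁ weights the latent term and λ₂ the pixel term (the article does not spell out which weight attaches to which loss).

```python
import torch
import torch.nn.functional as F

def joygen_loss(pred_latent, target_latent, pred_image, target_image,
                lambda1=2.0, lambda2=1.0):
    """L1 reconstruction in both latent and pixel space. Which weight attaches
    to which term is an assumption here, not stated in the article."""
    l_latent = F.l1_loss(pred_latent, target_latent)
    l_pixel = F.l1_loss(pred_image, target_image)
    return lambda1 * l_latent + lambda2 * l_pixel

def maybe_drop_depth(depth, p=0.5):
    """Randomly zero out the depth condition with probability p (50% during
    training) so the model stays robust when depth cues are unreliable."""
    if torch.rand(()) < p:
        return torch.zeros_like(depth)
    return depth
```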

Experiments

Quantitative evaluation employs FID for visual quality and LSE‑C / LSE‑D (from Wav2Lip) for lip‑audio synchronization. Baselines include Wav2Lip and MuseTalk, retrained on the same data. On the HDTF benchmark, JoyGen outperforms baselines across all metrics (see Table 1 and Figure 3). On the newly collected Chinese dataset, JoyGen achieves the lowest FID (3.19) and synchronization scores close to ground truth (see Table 2 and Figure 4). Qualitative results (Figure 5) demonstrate superior lip alignment and sharper mouth regions compared with competing methods.

Conclusion

JoyGen delivers accurate audio‑driven lip synchronization and high‑fidelity visual output for talking‑face video editing, advancing AIGC applications in the Chinese language domain.

Tags: deep learning, AIGC, 3DMM, audio-driven video, facial synthesis, talking face
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
