How JoyGen Achieves High‑Quality Audio‑Driven 3D Talking‑Face Video Editing
JoyGen introduces a two‑stage framework that combines 3D morphable model reconstruction with audio‑driven lip motion generation and depth‑aware visual synthesis, delivering precise audio‑lip synchronization and superior visual quality on both the HDTF benchmark and a newly built high‑resolution Chinese talking‑face dataset.
Abstract
This article presents JoyGen, an audio-driven 3D depth-aware talking-face video editing system and a vertical application of AIGC. It proposes a novel two-stage framework that first predicts 3DMM identity and expression coefficients, then synthesizes visual appearance using audio-conditioned depth supervision. Experiments on the HDTF dataset and a newly constructed 130-hour Chinese talking-face dataset demonstrate state-of-the-art audio-lip synchronization and visual quality.
Method
2.1 3D Morphable Model
Following Blanz & Vetter (2003), JoyGen uses a 3D Morphable Model (3DMM) where the face shape S is represented as S = \bar{S} + U_{id}\alpha + U_{exp}\beta, with \alpha and \beta controlling identity and expression respectively.
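To make the formula concrete, the sketch below assembles a face shape from the mean shape, the identity and expression bases, and the two coefficient vectors; the basis sizes and vertex count are illustrative placeholders, not values from the paper.

```python
import numpy as np

def reconstruct_shape(mean_shape, id_basis, exp_basis, alpha, beta):
    """Assemble a 3DMM face shape S = S_bar + U_id * alpha + U_exp * beta.

    mean_shape: (3N,)       mean face vertices flattened as x1, y1, z1, ...
    id_basis:   (3N, K_id)  identity basis U_id
    exp_basis:  (3N, K_exp) expression basis U_exp
    alpha:      (K_id,)     identity coefficients
    beta:       (K_exp,)    expression coefficients
    """
    shape = mean_shape + id_basis @ alpha + exp_basis @ beta
    return shape.reshape(-1, 3)  # (N, 3) vertex positions

# Toy example with random bases; real models (e.g. BFM) have tens of
# thousands of vertices and fixed basis sizes.
rng = np.random.default_rng(0)
n_vertices, k_id, k_exp = 1000, 80, 64
S = reconstruct_shape(
    rng.standard_normal(3 * n_vertices),
    rng.standard_normal((3 * n_vertices, k_id)),
    rng.standard_normal((3 * n_vertices, k_exp)),
    rng.standard_normal(k_id),
    rng.standard_normal(k_exp),
)
print(S.shape)  # (1000, 3)
```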
2.2 Audio‑to‑Lip Motion
JoyGen employs a flow-enhanced variational auto-encoder to learn a mapping from audio to facial motion. Since the 3D face mesh is fully determined by the identity and expression coefficients, the identity coefficients are kept fixed to preserve the speaker's identity, while only the expression coefficients are predicted from the audio.
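The input/output contract of this audio-to-motion stage can be sketched as follows; the module below is a plain GRU regressor used only to show the interface, not the flow-enhanced VAE the paper uses, and every dimension is an illustrative assumption.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Illustrative audio-to-motion regressor: maps a window of audio
    features to per-frame 3DMM expression coefficients."""

    def __init__(self, audio_dim=384, hidden=256, n_exp=64):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_exp)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim)
        h, _ = self.rnn(audio_feats)
        return self.head(h)  # (batch, frames, n_exp) expression coefficients

model = AudioToExpression()
audio = torch.randn(1, 25, 384)      # one second of audio features at 25 fps (illustrative)
beta_pred = model(audio)             # audio-driven expression coefficients
alpha_fixed = torch.randn(1, 80)     # identity coefficients from the source frame, unchanged
```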
2.3 Facial Depth Map
Using Deep3DFaceRecon, JoyGen predicts 3DMM coefficients from a single image, renders a 3D face mesh, and extracts a depth map. During inference, the expression coefficients predicted by the A2M model replace the original ones to generate depth maps for the target frame.
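The coefficient swap can be written out as a short sketch: the source frame's expression coefficients are replaced by the audio-predicted ones before rendering depth. `reconstruct_shape`, `render_depth`, and the coefficient dictionary keys are hypothetical placeholders, since the article does not expose the Deep3DFaceRecon or renderer interfaces.

```python
def edited_depth_map(coeffs_src, beta_audio, reconstruct_shape, render_depth):
    """Build the target frame's depth map from swapped expression coefficients.

    coeffs_src: dict of 3DMM coefficients predicted from the input image
                (keys 'alpha', 'beta', 'pose' are illustrative)
    beta_audio: expression coefficients from the audio-to-motion model
    reconstruct_shape, render_depth: placeholder helpers standing in for the
                actual mesh assembly and depth rasterization calls
    """
    coeffs = dict(coeffs_src)
    coeffs["beta"] = beta_audio           # identity, pose, etc. stay as estimated
    vertices = reconstruct_shape(coeffs)  # 3D face mesh with the new expression
    return render_depth(vertices, coeffs["pose"])  # per-pixel depth of the face
```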
2.4 Editing Talking Face
The pipeline first encodes the input frame into a latent space with a pretrained image encoder, then uses a single‑step UNet (similar to MuseTalk) to predict the masked mouth region conditioned on audio features encoded by Whisper and the depth map. Cross‑attention fuses audio and visual features before decoding back to the image space.
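A minimal sketch of one editing step under these assumptions is shown below; `vae`, `unet`, the tensor shapes, and in particular the way the depth map is injected (concatenated as an extra latent here) are illustrative choices rather than the exact MuseTalk-style architecture.

```python
import torch

def edit_frame(frame, depth_map, audio_feats, vae, unet):
    """One editing step, following the pipeline described above.

    frame:       (1, 3, H, W) input frame with the mouth region masked
    depth_map:   (1, 1, H, W) audio-conditioned facial depth map
    audio_feats: (1, T, D)    Whisper embeddings for the current audio window
    vae, unet:   pretrained image autoencoder and single-step UNet
                 (placeholder modules)
    """
    latent = vae.encode(frame)                                # image -> latent space
    depth_latent = vae.encode(depth_map.repeat(1, 3, 1, 1))   # depth as an extra condition
    cond = torch.cat([latent, depth_latent], dim=1)           # channel-wise conditioning
    # Cross-attention inside the UNet fuses audio features with visual features.
    pred_latent = unet(cond, encoder_hidden_states=audio_feats)
    return vae.decode(pred_latent)                            # latent -> edited frame
```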
Dataset
To address the lack of Chinese talking-face data, a high-resolution dataset of ~1.1k videos (≈130 hours) was curated from Bilibili and Douyin, with strict filtering for single-face shots, clear mouth visibility, Chinese audio, and minimal background noise. Statistics of video length, frame rate, gender distribution, and face size are shown in Figure 2.
Training Details
Loss Functions: Both a latent-space L1 loss (L_{latent}) and a pixel-space L1 loss (L_{pixel}) are applied to encourage accurate reconstruction of facial details.
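Written out, a natural form of the combined objective (assuming the weights λ1 and λ2 reported in Section 5.2 scale the latent and pixel terms, respectively) is:

```latex
\mathcal{L} \;=\; \lambda_{1}\,\mathcal{L}_{\text{latent}} \;+\; \lambda_{2}\,\mathcal{L}_{\text{pixel}},
\qquad
\mathcal{L}_{\text{latent}} = \lVert \hat{z} - z \rVert_{1},
\quad
\mathcal{L}_{\text{pixel}} = \lVert \hat{I} - I \rVert_{1}
```

where z, ẑ are the ground-truth and predicted latents and I, Î the corresponding images.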
Depth Information Selection: Depth maps are generated only for the mouth region (all other regions are set to zero). Random spatial perturbations are applied to the mouth depth during training, and the depth condition is dropped with 50% probability to force the model to rely on audio cues.
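The sketch below mirrors this depth-conditioning strategy: mouth-only masking, a random spatial perturbation, and a 50% condition drop. Function and parameter names, and the perturbation magnitude, are illustrative.

```python
import torch

def prepare_depth_condition(depth, mouth_mask, p_drop=0.5, max_shift=5, training=True):
    """Prepare the depth condition as described above (shift range is illustrative).

    depth:      (1, 1, H, W) rendered facial depth map
    mouth_mask: (1, 1, H, W) binary mask of the mouth region
    """
    # Keep depth only in the mouth region; everything else is zeroed.
    depth = depth * mouth_mask
    if training:
        # Random spatial perturbation of the mouth depth (small translation).
        dx, dy = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
        depth = torch.roll(depth, shifts=(dy, dx), dims=(2, 3))
        # Drop the depth condition with probability p_drop so the model
        # learns to rely on audio cues alone when depth is absent.
        if torch.rand(1).item() < p_drop:
            depth = torch.zeros_like(depth)
    return depth
```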
Experiments
5.1 Data Preprocessing
Videos are segmented to retain only single-face clips; five facial keypoints are extracted with MTCNN for 3DMM fitting, and DWPose provides face bounding boxes for cropping. Frames are sampled at 25 fps, and 10,000 frames are randomly selected for training.
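As an example of the keypoint step, the snippet below uses the facenet-pytorch implementation of MTCNN to obtain the five facial landmarks; the article does not state which MTCNN implementation is used, the frame path is hypothetical, and the DWPose cropping step is omitted.

```python
# Keypoint extraction sketch using facenet-pytorch's MTCNN.
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=False)            # single-face clips, so keep the best detection
frame = Image.open("frame_000001.png")   # hypothetical frame path

boxes, probs, landmarks = mtcnn.detect(frame, landmarks=True)
if landmarks is not None:
    five_points = landmarks[0]  # (5, 2): eye centers, nose tip, mouth corners
    # five_points then feed the 3DMM fitting (Deep3DFaceRecon alignment step).
```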
5.2 Experimental Setup
The model is trained from scratch on eight NVIDIA H800 GPUs for one day. Input resolution is 256×256, batch size 128, Adam optimizer with learning rate 1e‑5, and loss weights λ1=2 (latent) and λ2=1 (pixel).
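For convenience, the reported hyperparameters can be gathered into a single configuration; the dictionary keys and the optimizer helper below are illustrative, while the values come from the setup described above.

```python
import torch

# Hyperparameters reported in the article, collected in one place.
config = {
    "resolution": 256,        # input frames are 256x256
    "batch_size": 128,
    "learning_rate": 1e-5,
    "lambda_latent": 2.0,     # weight on L_latent
    "lambda_pixel": 1.0,      # weight on L_pixel
    "num_gpus": 8,            # NVIDIA H800
}

def build_optimizer(model):
    # Adam with the reported learning rate.
    return torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
```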
5.3 Results
Quantitative Results
On the HDTF benchmark, JoyGen outperforms baselines (Wav2Lip, MuseTalk) on all metrics (FID, LSE‑C, LSE‑D). Similar superiority is observed on the collected Chinese dataset, achieving the lowest FID (3.19) and near‑ground‑truth lip‑audio synchronization scores.
Qualitative Results
Visual comparisons show that JoyGen preserves sharp mouth details and maintains consistent lip motion with the driving audio, whereas competing methods either blur the mouth region or lose synchronization.
JD Cloud Developers
JD Cloud Developers (JD Technology Developers) is JD Technology Group's platform for technical sharing and exchange among developers working in AI, cloud computing, IoT, and related fields. It publishes JD product technology, industry insights, and tech-event news. Embrace technology and partner with developers to envision the future.