How JoyGen Delivers High‑Quality Audio‑Driven 3D Talking‑Face Video Editing

JoyGen introduces a two‑stage framework that combines 3D facial reconstruction with audio‑driven motion generation to produce synchronized, high‑fidelity talking‑face videos, and validates its effectiveness on both the HDTF benchmark and a newly built high‑resolution Chinese speaking‑face dataset.


This article introduces JoyGen, an audio‑driven 3D depth‑aware talking‑face video editing system, a vertical application of AIGC. The authors present the project homepage, the arXiv paper, and the GitHub repository, inviting collaboration and feedback.

Abstract

Recent advances in talking‑face video generation have improved realism, yet precise audio‑mouth synchronization and high visual quality remain challenging. JoyGen proposes a novel two‑stage framework: first, a 3D reconstruction model and an audio‑to‑motion model predict identity and expression coefficients, respectively; second, audio features combined with facial depth maps provide comprehensive supervision for precise lip‑audio synchronization during facial generation. A 130‑hour high‑quality Chinese speaking‑face dataset is constructed, and JoyGen is trained on both the open‑source HDTF dataset and this curated set, achieving superior synchronization and visual quality.

Method

The overall pipeline (see Figure 1) consists of a training phase (upper half) and an inference phase (lower half). The method comprises the following components:

2.1 3D Morphable Model

Following Blanz & Vetter (2003), a 3DMM represents facial geometry using PCA: a mean shape S, identity bases U_id, and expression bases U_exp, with coefficients α (identity) and β (expression).
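Concretely, a face shape is reconstructed as a linear combination of these bases (notation as above): Ŝ(α, β) = S + U_id·α + U_exp·β, so that α determines the identity of the face while β controls the momentary expression.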

2.2 Audio‑to‑Mouth Motion

A flow‑enhanced variational auto‑encoder learns a mapping from audio to facial motion. Since the 3D mesh is determined by identity and expression coefficients, the identity remains fixed while expression varies during reconstruction.
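As a rough illustration of this identity-fixed, expression-varying reconstruction, here is a minimal PyTorch sketch; the GRU only stands in for the paper's flow-enhanced VAE to keep the example runnable, and all names, dimensions, and tensor shapes are assumptions rather than the authors' implementation:

```python
import torch

# Assumed sizes: 80 identity and 64 expression coefficients, 384-dim audio
# features, and a toy mesh with N vertices. None of these come from the paper.
N = 1000
alpha_id = torch.randn(80)              # identity coefficients, fixed per speaker
audio_feats = torch.randn(1, 50, 384)   # 50 frames of audio features (hypothetical)

# Placeholder for the audio-to-motion model (the actual model is a
# flow-enhanced VAE; a GRU is used here purely for illustration).
a2m = torch.nn.GRU(input_size=384, hidden_size=64, batch_first=True)
beta_exp, _ = a2m(audio_feats)          # (1, 50, 64): per-frame expression coefficients

def reconstruct_vertices(mean_shape, U_id, U_exp, alpha, beta):
    """Linear 3DMM reconstruction: mean shape plus identity and expression offsets."""
    return mean_shape + U_id @ alpha + U_exp @ beta

mean_shape = torch.randn(3 * N)
U_id, U_exp = torch.randn(3 * N, 80), torch.randn(3 * N, 64)

# For every generated frame, only the expression coefficients change.
frame0_vertices = reconstruct_vertices(mean_shape, U_id, U_exp, alpha_id, beta_exp[0, 0])
```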

2.3 Facial Depth Map

Using Deep3DFaceRecon, identity and expression coefficients are predicted from a single image, and a 3D mesh is then rendered to obtain a depth map. During inference, the audio‑to‑motion (A2M) model predicts the expression coefficients used to generate the corresponding depth map (see the inference pipeline in Figure 1).
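The coefficient-to-depth step can be pictured as follows; `face_model.build_mesh` and `renderer.render_depth` are hypothetical stand-ins for Deep3DFaceRecon's mesh construction and rasterizer, not real APIs:

```python
def depth_map_from_coeffs(alpha, beta, face_model, renderer, image_size=256):
    """Illustrative flow: 3DMM coefficients -> 3D mesh -> rasterized depth map."""
    vertices = face_model.build_mesh(alpha, beta)        # camera-space mesh vertices
    depth = renderer.render_depth(vertices, face_model.triangles,
                                  size=(image_size, image_size))
    return depth  # (H, W) depth values; background pixels remain zero
```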

2.4 Talking‑Face Editing

The face encoder maps the input image to a low‑dimensional latent space, reducing computational load. A single‑step UNet (similar to MuseTalk) predicts lip motions conditioned on the masked target frame, reference frame, and audio features encoded by Whisper, with cross‑attention between audio and image features. The latent representation is decoded back to the image space to produce the final frame.
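A condensed sketch of how these conditions could be wired together is shown below; every module name (vae_encoder, unet, whisper_encoder) is a hypothetical placeholder, and the conditioning layout (channel concatenation, half-frame masking, where the depth map enters) is an assumption rather than the paper's exact specification:

```python
import torch

def edit_frame(vae_encoder, vae_decoder, unet, whisper_encoder,
               target_frame, reference_frame, audio_chunk, depth_map=None):
    # Mask the lower half of the target frame so the network must
    # reconstruct the mouth region from audio and reference appearance.
    masked_target = target_frame.clone()
    masked_target[..., target_frame.shape[-2] // 2:, :] = 0

    # Encode image-space inputs into the low-dimensional latent space.
    z_masked = vae_encoder(masked_target)
    z_ref = vae_encoder(reference_frame)

    # Whisper-encoded audio features serve as cross-attention keys/values.
    audio_tokens = whisper_encoder(audio_chunk)

    # Concatenate latent conditions along channels; the depth map could be
    # injected similarly as an extra conditioning channel (assumption).
    cond = torch.cat([z_masked, z_ref], dim=1)
    z_pred = unet(cond, encoder_hidden_states=audio_tokens)

    # Decode the predicted latent back to image space for the final frame.
    return vae_decoder(z_pred)
```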

[Figure 1: Overview of the JoyGen pipeline, with the training phase in the upper half and the inference phase in the lower half.]

Dataset

To address the lack of Chinese speaking‑face data, a high‑resolution dataset of ~1.1k videos (≈130 hours) was built from Bilibili and Douyin, with strict filtering criteria (single face, clear mouth/teeth, Chinese audio, minimal background noise, etc.). Statistics of video length, frame rate, gender distribution, and face size are shown in Figure 2.

[Figure 2: Statistics of the curated dataset: video length, frame rate, gender distribution, and face size.]

Training Details

Loss Functions: L1 loss is applied in both latent space (L_latent) and pixel space (L_pixel) to capture fine facial details.
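A minimal sketch of such a combined objective, using the weights λ1 = 2 and λ2 = 1 reported later in the experimental setup (function and argument names are illustrative):

```python
import torch.nn.functional as F

def reconstruction_loss(z_pred, z_target, img_pred, img_target,
                        lambda_latent=2.0, lambda_pixel=1.0):
    """L1 reconstruction in both latent and pixel space."""
    l_latent = F.l1_loss(z_pred, z_target)
    l_pixel = F.l1_loss(img_pred, img_target)
    return lambda_latent * l_latent + lambda_pixel * l_pixel
```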

Depth Information Selection: Deep3DFaceRecon predicts identity and expression coefficients; only the mouth-region depth is retained (all other regions are set to zero). Random spatial perturbations and occasional depth dropout (50% chance) improve robustness and alignment.
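One way to express this depth-selection step in code; only the mouth-only retention and the 50% dropout come from the text, while the mask source, shift range, and perturbation scheme are assumptions:

```python
import torch

def prepare_depth_condition(depth_map, mouth_mask, p_drop=0.5, max_shift=5):
    """Keep depth only in the mouth region, jitter it spatially, and
    occasionally drop it entirely (illustrative sketch)."""
    mouth_depth = depth_map * mouth_mask          # zero out everything outside the mouth

    # Random spatial perturbation: shift by a few pixels in x and y.
    dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    mouth_depth = torch.roll(mouth_depth, shifts=(dy, dx), dims=(-2, -1))

    # Depth dropout: with 50% probability, train without the depth condition.
    if torch.rand(1).item() < p_drop:
        mouth_depth = torch.zeros_like(mouth_depth)
    return mouth_depth
```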

Experiments

5.1 Data Pre‑processing

Training uses the HDTF dataset and the newly collected Chinese dataset. Videos are segmented to keep only single-face clips, facial landmarks are extracted with MTCNN, and face bounding boxes are obtained via DWPose. Frames are sampled at 25 fps, with 10k frames randomly selected for training.
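To illustrate just the frame-sampling step, here is a small OpenCV sketch; face detection, landmarking, and clip segmentation are omitted, and the helper is hypothetical rather than the authors' preprocessing script:

```python
import random
import cv2

def sample_training_frames(video_paths, n_frames=10_000, target_fps=25):
    """Read clips, subsample to roughly the target frame rate, then draw a
    random subset of frames for training (illustrative only)."""
    frames = []
    for path in video_paths:
        cap = cv2.VideoCapture(path)
        src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
        step = max(1, round(src_fps / target_fps))
        idx = 0
        ok, frame = cap.read()
        while ok:
            if idx % step == 0:
                frames.append(frame)
            ok, frame = cap.read()
            idx += 1
        cap.release()
    return random.sample(frames, min(n_frames, len(frames)))
```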

5.2 Experimental Setup

The model is trained from scratch on eight NVIDIA H800 GPUs for one day, with input resolution 256×256, batch size 128, Adam optimizer (lr = 1e‑5), and loss weights λ1 = 2 (latent) and λ2 = 1 (pixel).

Evaluation Metrics: FID for visual quality; LSE-C and LSE-D (from Wav2Lip) for audio-mouth synchronization, using unpaired evaluation on roughly 500 (HDTF) and 900 (custom dataset) audio-video pairs.
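FID can be reproduced with off-the-shelf tooling; below is a minimal torchmetrics sketch (not the authors' evaluation code), while LSE-C/LSE-D require the pretrained SyncNet shipped with the Wav2Lip repository:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance  # needs torchmetrics[image]

fid = FrechetInceptionDistance(feature=2048)

# Placeholder tensors; in practice, feed many real and generated frames
# as uint8 images of shape (N, 3, H, W).
real_frames = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
generated_frames = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid.update(real_frames, real=True)
fid.update(generated_frames, real=False)
print(float(fid.compute()))  # lower is better
```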

Baselines: Wav2Lip (with a re-trained lip-sync expert) and MuseTalk (re-implemented generation pipeline) are compared.

5.3 Results

Quantitative: On HDTF, JoyGen outperforms the baselines on all metrics (see Table 1 and Figure 3). On the custom dataset, JoyGen achieves the lowest FID (3.19) and competitive LSE-C/D scores (see Table 2 and Figure 4).

[Table 1 and Figure 3: quantitative results on HDTF. Table 2 and Figure 4: quantitative results on the curated Chinese dataset.]

Qualitative: Visual comparisons (Figure 5) show that JoyGen produces clearer mouth regions and more accurate lip-audio alignment than Wav2Lip and MuseTalk, despite Wav2Lip sometimes achieving better LSE scores.

[Figure 5: Qualitative comparison of mouth regions generated by JoyGen, Wav2Lip, and MuseTalk.]

The authors thank readers and invite further discussion on the future of AIGC.

Tags: deep learning, AIGC, video editing, 3DMM, audio-driven, talking-face