EchoMimicV2: High‑Quality Audio‑Driven Half‑Body Human Animation with Simple Inputs

EchoMimicV2 is an open‑source digital‑human framework that generates high‑quality half‑body animation videos from a single reference image, an audio clip, and a hand‑gesture sequence, addressing challenges of facial portrait limits, complex condition injection, and inference latency in audio‑driven animation.

Alipay Experience Technology

EchoMimicV2 is an open‑source digital‑human project from Ant Group’s Alipay terminal algorithm data team. By providing a reference image, an audio clip, and a hand‑gesture sequence, it can generate high‑quality half‑body animation videos while ensuring coordination between the portrait and the audio content.

Demo image

Paper: "EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation" (https://arxiv.org/abs/2411.10061)

Project page: https://antgroup.github.io/ai/echomimic_v2/

Code repository: https://github.com/antgroup/echomimic_v2

1. Technical Overview

In the AI 2.0 era, digital‑human technology based on diffusion models has rapidly advanced, but practical challenges remain, such as the focus on facial portraits and the neglect of body and hand motions, as well as the complexity and instability introduced by multiple conditioning signals.

The team proposes an end‑to‑end audio‑driven framework with three key techniques:

Audio‑Pose Dynamic Harmonization (APDH) to coordinate audio and pose conditions while reducing pose redundancy.

Head Partial Attention to seamlessly integrate head‑only data augmentation for richer facial expressions without extra modules.

Multi‑stage PhD Loss to improve motion representation under incomplete pose conditions and enhance low‑level visual quality.

2. Background

Video diffusion has enabled significant progress in human animation generation, which aims to synthesize realistic 2D human videos from multimodal controls (text, audio, pose). Existing methods often focus on facial portraits and ignore synchronization of the body below the shoulders, leading to a gap between research and industrial needs.

3. EchoMimicV2 Results

3.1 Chinese‑driven examples

Result image

3.2 English‑driven examples

Result image

3.3 Algorithm comparison

4. Method

Overall architecture

4.1 Network Architecture

EchoMimicV2 follows the ReferenceNet design popularized by Alibaba's EMO framework: a Reference UNet extracts appearance features from the reference image, and a Denoising UNet generates the video latents. The audio‑driven module consists of a Pose Encoder, an Audio Encoder, and the Denoising UNet, which together map audio and pose conditions into image features for high‑quality video synthesis.
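To make the data flow concrete, the sketch below wires up toy versions of these components in PyTorch. The module names, layer sizes, and tensor shapes here are illustrative assumptions for exposition only; the released code uses full Reference/Denoising UNets with cross‑attention rather than the simple fusion shown.

```python
# Minimal sketch of the conditioning flow (not the official implementation).
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Encodes a hand/pose map into a latent the denoiser can consume."""
    def __init__(self, in_ch=3, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, latent_ch, 3, stride=2, padding=1),
        )

    def forward(self, pose_map):
        return self.net(pose_map)

class AudioEncoder(nn.Module):
    """Projects per-frame audio features (e.g. from a pretrained speech model)."""
    def __init__(self, in_dim=384, out_dim=128):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, audio_feats):        # (B, T, in_dim)
        return self.proj(audio_feats)      # (B, T, out_dim)

class DenoisingStep(nn.Module):
    """Toy stand-in for one Denoising UNet step: fuses the noisy latent,
    pose latent, and reference features, then gates by audio context.
    (The real model injects audio via cross-attention.)"""
    def __init__(self, latent_ch=4, audio_dim=128):
        super().__init__()
        self.fuse = nn.Conv2d(latent_ch * 3, latent_ch, 3, padding=1)
        self.audio_gate = nn.Linear(audio_dim, latent_ch)

    def forward(self, noisy_latent, pose_latent, ref_feat, audio_ctx):
        x = self.fuse(torch.cat([noisy_latent, pose_latent, ref_feat], dim=1))
        gate = self.audio_gate(audio_ctx.mean(dim=1))   # (B, latent_ch)
        return x * gate[:, :, None, None]               # predicted noise

# Toy forward pass with illustrative shapes.
B, H, W = 2, 64, 64
pose_latent = PoseEncoder()(torch.randn(B, 3, H, W))    # -> (B, 4, 16, 16)
audio_ctx = AudioEncoder()(torch.randn(B, 25, 384))     # 25 audio frames
ref_feat = torch.randn(B, 4, 16, 16)                    # from the Reference UNet
noise_pred = DenoisingStep()(torch.randn(B, 4, 16, 16), pose_latent, ref_feat, audio_ctx)
print(noise_pred.shape)                                  # torch.Size([2, 4, 16, 16])
```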

4.2 Audio‑Pose Dynamic Harmonization Training Strategy

APDH comprises Pose Sampling and Audio Diffusion, gradually simplifying condition complexity while synchronizing audio (primary) and pose (auxiliary) signals, improving robustness and generation smoothness.
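The sketch below illustrates the curriculum idea in a few lines of Python: pose conditions shrink from the full upper body toward hands only, while the region audio is allowed to drive widens from the lips outward. The stage boundaries, keypoint groups, and dropout probability are placeholder assumptions, not the schedule published in the paper.

```python
# Illustrative curriculum for harmonizing audio (primary) and pose (auxiliary).
import random

POSE_STAGES = [
    ["body", "arms", "hands", "head"],   # early training: full pose condition
    ["arms", "hands", "head"],
    ["hands", "head"],
    ["hands"],                           # late training: hands-only pose
]
AUDIO_STAGES = ["lips", "face", "head", "upper_body"]  # audio influence spreads

def sample_conditions(train_progress: float):
    """Pick which pose parts remain as conditions and how far audio reaches,
    given training progress in [0, 1]."""
    stage = min(int(train_progress * len(POSE_STAGES)), len(POSE_STAGES) - 1)
    pose_parts = POSE_STAGES[stage]
    audio_scope = AUDIO_STAGES[: stage + 1]
    # Occasionally drop one remaining pose part at random to reduce redundancy.
    if len(pose_parts) > 1 and random.random() < 0.3:
        pose_parts = [p for p in pose_parts if p != random.choice(pose_parts)]
    return pose_parts, audio_scope

for progress in (0.1, 0.4, 0.7, 0.95):
    print(progress, sample_conditions(progress))
```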

4.3 Facial Data Augmentation

With Head Partial Attention, audio is restricted to driving only the head region, which allows padded head‑only photos to be folded into training as augmentation data, enriching facial expressions without any extra plugin.
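A minimal sketch of the masked‑attention idea, assuming head‑only images are zero‑padded to the full frame and a binary mask marks the valid head tokens; everything outside the mask is excluded from attention, so the augmentation data cannot disturb body regions. Shapes and the masking scheme are illustrative, not the released implementation.

```python
# Spatially masked ("partial") attention over flattened image tokens.
import torch
import torch.nn.functional as F

def head_partial_attention(q, k, v, head_mask):
    """q, k, v: (B, N, D) token features; head_mask: (B, N) with 1 = head token,
    0 = padding. Keys/values outside the head region get -inf attention scores."""
    d = q.shape[-1]
    scores = q @ k.transpose(1, 2) / d ** 0.5                       # (B, N, N)
    scores = scores.masked_fill(head_mask[:, None, :] == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v                                                  # (B, N, D)

B, N, D = 1, 16, 8
q = k = v = torch.randn(B, N, D)
mask = torch.zeros(B, N)
mask[:, :6] = 1            # pretend the first 6 tokens are the head region
out = head_partial_attention(q, k, v, mask)
print(out.shape)           # torch.Size([1, 16, 8])
```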

4.4 Multi‑Stage Loss

The training is split into three stages: pose‑dominant, detail‑dominant, and quality‑dominant, each using a tailored loss (PhD Loss) to stabilize training and improve visual fidelity.
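The snippet below sketches how such a stage‑dependent loss could look, switching the dominant term by denoising‑timestep range. The stage boundaries, region masks, and weights are illustrative assumptions, not the paper's exact PhD Loss formulation.

```python
# Hedged sketch of a pose/detail/quality staged loss for a diffusion denoiser.
import torch
import torch.nn.functional as F

def phd_loss(pred, target, pose_mask, hand_mask, t, num_steps=1000):
    """pred/target: predicted vs. ground-truth noise (B, C, H, W);
    pose_mask/hand_mask: (B, 1, H, W) region weights; t: timesteps (B,)."""
    base = F.mse_loss(pred, target)
    pose = F.mse_loss(pred * pose_mask, target * pose_mask)      # pose-dominant term
    detail = F.mse_loss(pred * hand_mask, target * hand_mask)    # hand/detail term
    quality = (pred - target).abs().mean()                       # low-level L1 term

    frac = t.float().mean() / num_steps      # crude stage indicator
    if frac > 0.66:      # early (noisy) steps: structure and pose dominate
        return base + 2.0 * pose
    elif frac > 0.33:    # middle steps: hand and facial detail
        return base + 2.0 * detail
    else:                # late (low-noise) steps: low-level visual quality
        return base + 2.0 * quality

# Toy call with random tensors.
B, C, H, W = 2, 4, 16, 16
loss = phd_loss(torch.randn(B, C, H, W), torch.randn(B, C, H, W),
                torch.rand(B, 1, H, W), torch.rand(B, 1, H, W),
                torch.randint(0, 1000, (B,)))
print(loss.item())
```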

5. Future Outlook

Current limitations include reliance on pre‑defined hand‑keypoint sequences and reduced performance on uncropped full‑body images. Future work will explore audio‑to‑gesture generation and robust video synthesis from arbitrary reference images.

6. Related Work

Human animation generation can be driven by video, text, or audio. Recent diffusion‑based methods (e.g., MagicPose, AnimateAnyone, MimicMotion, UniAnimate) focus on pose‑driven or audio‑driven synthesis, often using lightweight pose encoders or ControlNet. EchoMimicV2 extends these ideas by jointly leveraging audio, partial pose, and head‑only augmentation.

7. References

Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 8780‑8794.

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840‑6851.

Rombach, R., et al. (2022). High‑resolution image synthesis with latent diffusion models. In Proceedings of CVPR, 10684‑10695.

Guo, Y., et al. (2023). AnimateDiff: Animate your personalized text‑to‑image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725.

Chen, H., et al. (2023). VideoCrafter1: Open diffusion models for high‑quality video generation. arXiv preprint arXiv:2310.19512.
... (additional references omitted for brevity) ...
