Creating Lifelike Talking Avatars from Voice and Photo with EchoMimic
This article introduces EchoMimic V1 and V2, open‑source generative digital‑human systems that turn a single voice clip and a portrait photo into synchronized talking avatars, covering their technical background, architecture, training strategies, performance comparisons, and potential application scenarios.
By providing just a voice segment and a photo, users can automatically generate a vivid video character whose lip‑sync closely matches the speech.
Alipay Multimodal Application Lab released two generative digital-human projects, EchoMimicV1 and EchoMimicV2, in 2024; the related technical papers were accepted at CVPR 2025 and AAAI 2025. The source code is available at https://github.com/antgroup/echomimic and https://github.com/antgroup/echomimic_v2. Researcher Li Yuming presented the work at QCon Global Software Development Conference 2025.
Traditional Digital Human vs. Generative Digital Human
2D Digital Human Technology Path
2D digital humans are created by pre-recording a person's appearance and body motions, then using speech-driven mouth-editing algorithms to generate talking videos. This approach is cost-effective and flexible, and is well suited to rapid content creation such as digital anchors, educational videos, and advertisements. Early methods relied on GANs (e.g., Wav2Lip, VideoReTalking, SadTalker) and later evolved to NeRF-based techniques (AD-NeRF, ER-NeRF, AvatarRex), which improve realism but require more high-quality training data.
3D Digital Human Technology Path
3D digital humans combine AI with computer graphics, becoming a core component of the metaverse. Advances in 3DMM and differentiable rendering now allow high‑fidelity facial reconstruction from a single photo and full‑body motion capture using a monocular camera, reducing reliance on expensive scanning equipment.
Generative Digital Human Technology Introduction
Recent breakthroughs in AIGC (AI‑generated content) have produced high‑quality images (e.g., Stable Diffusion) and videos (e.g., Sora, EMO). These methods dramatically lower the cost of creating realistic digital humans and have become a hot research and industry trend. They emphasize temporal dynamics, identity preservation, and multi‑style adaptability, expanding possibilities in entertainment, virtual communication, and interactive media.
Driving Modes for Character Animation
Three primary driving modes exist: video‑driven, text‑driven, and audio‑driven. Most current methods focus on pose‑driven video generation, extracting pose, dense pose, depth, mesh, or optical flow from a driving video and using them as control signals for diffusion models. Notable approaches include MagicPose (ControlNet), AnimateAnyone, MimicMotion, MotionFollower, UniAnimate, DreamPose, MagicAnimate, Human4DiT, and HumanDiT.
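To make the pose-driven control signal concrete, here is a minimal single-frame sketch built from publicly available components: an OpenPose detector from controlnet_aux and an openpose ControlNet checkpoint for Stable Diffusion 1.5. It only illustrates how an extracted skeleton map conditions a diffusion model; the video methods listed above add temporal modules and identity preservation on top of this idea, and the file names here are placeholders.

```python
import torch
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Extract a skeleton (pose) map from one frame of the driving video.
pose_detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
driving_frame = load_image("driving_frame.png")        # placeholder path
pose_map = pose_detector(driving_frame)

# Feed the pose map to a ControlNet branch as the spatial control signal.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

frame = pipe(
    "a person talking in a studio",
    image=pose_map,                 # the extracted pose controls the body layout
    num_inference_steps=30,
).images[0]
frame.save("generated_frame.png")
```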
EchoMimic: Voice‑Driven Portrait Animation
EchoMimic improves the efficiency of driving 2D digital humans. Users upload a portrait (or a photo of a real person) together with a voice or video clip, and the system generates a matching talking video. Its performance rivals commercial solutions while offering flexible driving modes (voice, pose, or combined).
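Before any voice-driven generation can happen, the speech clip has to be turned into per-frame features the video model can attend to. The sketch below uses a public wav2vec 2.0 encoder from Hugging Face for that step; it illustrates the general idea rather than EchoMimic's exact audio front end, and the audio file path is a placeholder.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Model, Wav2Vec2Processor

# Load the voice clip and resample to the 16 kHz mono input the encoder expects.
waveform, sr = torchaudio.load("speech.wav")            # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    audio_features = encoder(**inputs).last_hidden_state   # (1, T, 768)

# Roughly one 768-dim vector per 20 ms of speech; sequences like this are what
# a voice-driven animation model cross-attends to when synthesizing lip motion.
print(audio_features.shape)
```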
EchoMimic V1 Architecture
The V1 framework, inspired by Alibaba's EMO, consists of two UNets: a reference UNet that encodes the input portrait's appearance to preserve identity and background, and a denoising UNet that receives multimodal inputs (audio features, pose features) and performs diffusion in latent space to produce the final video. V1 supports pure voice driving, pure pose driving, and combined voice-and-pose driving.
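The sketch below is a heavily simplified, assumption-laden rendering of that dual-UNet layout in PyTorch: a stand-in "reference" encoder produces identity features, and one stand-in denoising block mixes them with the frame latents via self-attention while cross-attending to audio features and adding a pose bias. None of the module names or dimensions come from the EchoMimic code; they only show how the three conditions can meet inside one block.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Stand-in for the reference UNet: encodes the portrait latent into identity features."""
    def __init__(self, dim=320):
        super().__init__()
        self.conv = nn.Conv2d(4, dim, kernel_size=3, padding=1)

    def forward(self, ref_latent):
        return self.conv(ref_latent)

class DenoisingBlock(nn.Module):
    """Stand-in for one denoising-UNet block conditioned on identity, audio, and pose."""
    def __init__(self, dim=320, audio_dim=768, pose_dim=128):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, 8, kdim=audio_dim, vdim=audio_dim, batch_first=True)
        self.pose_proj = nn.Linear(pose_dim, dim)

    def forward(self, x, ref_feat, audio_feat, pose_feat):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (B, HW, C) frame latents
        ref_tokens = ref_feat.flatten(2).transpose(1, 2)   # identity tokens from the reference branch
        # Identity injection: attend over frame tokens concatenated with reference tokens.
        kv = torch.cat([tokens, ref_tokens], dim=1)
        merged, _ = self.self_attn(tokens, kv, kv)
        # Lip-sync signal: cross-attention to per-frame audio embeddings.
        out, _ = self.audio_attn(merged, audio_feat, audio_feat)
        # Pose control: added here as a simple global bias for illustration.
        out = out + self.pose_proj(pose_feat).unsqueeze(1)
        return out.transpose(1, 2).reshape(b, c, h, w)

# Toy forward pass with random tensors (32x32 latent, 50 audio tokens, 128-dim pose vector).
ref = ReferenceEncoder()(torch.randn(1, 4, 32, 32))
block = DenoisingBlock()
out = block(torch.randn(1, 320, 32, 32), ref, torch.randn(1, 50, 768), torch.randn(1, 128))
print(out.shape)   # torch.Size([1, 320, 32, 32])
```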
EchoMimic V2 Training Strategy
V2 introduces a three-part training strategy centered on Audio-Pose Dynamic Harmonization (APDH). APDH gradually dials back the auxiliary pose condition while strengthening the primary audio condition, coordinating the two in a "waltz-like" manner. It comprises Pose Sampling (PS) and Audio Diffusion (AD), and it improves multimodal adaptability, robustness, and video smoothness.
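A rough way to picture APDH is as a training-time schedule that decides, at every step, how much pose conditioning to keep and how far the audio signal is allowed to reach. The snippet below is a minimal sketch of that staged idea, not the actual EchoMimicV2 training code; the stage boundaries, probabilities, and region names are all assumptions.

```python
import random

def apdh_schedule(step: int, total_steps: int):
    """Illustrative schedule for Audio-Pose Dynamic Harmonization (assumed, simplified).

    Pose Sampling (PS): the chance of keeping the full pose condition decays over
    training, so the model relies less and less on the auxiliary pose signal.
    Audio Diffusion (AD): the spatial region influenced by audio cross-attention
    widens in stages, e.g. from the lips to the face to the upper body.
    """
    progress = step / total_steps

    keep_full_pose = random.random() > progress       # PS: drop the pose condition more often later

    if progress < 1 / 3:                               # AD: widen the audio-influence region
        audio_region = "lips"
    elif progress < 2 / 3:
        audio_region = "face"
    else:
        audio_region = "upper_body"

    return keep_full_pose, audio_region

# Inspect the schedule at a few points in a 10k-step run.
for step in (0, 5_000, 9_999):
    print(step, apdh_schedule(step, 10_000))
```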
Inference Acceleration
To address the slow inference of Stable Diffusion‑based models, knowledge distillation and pipeline optimization were applied, achieving a 9× speedup. Accelerated models for both V1 and V2 have been open‑sourced.
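The article does not disclose the distillation recipe, but the general mechanism (a distilled model that converges in a handful of denoising steps instead of dozens) can be illustrated with off-the-shelf components. The sketch below applies a latent-consistency LoRA to a standard Stable Diffusion pipeline; it is an analogy for the kind of step-count reduction involved, not the actual accelerated EchoMimic models.

```python
import torch
from diffusers import LCMScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in a consistency-distilled LoRA and its scheduler so that only a few
# denoising steps are needed instead of the usual 30-50.
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a portrait of a person talking",
    num_inference_steps=4,     # distilled sampling: a handful of steps
    guidance_scale=1.0,        # consistency models work with little or no CFG
).images[0]
image.save("fast_frame.png")
```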
Performance Comparison
Quantitative and qualitative evaluations show that V1 compares favorably against existing third-party methods, and that V2 outperforms the latest pose-driven algorithms (e.g., CyberHost) on both fronts.
Application Scenarios
Integrating generative digital humans with multimodal large models to serve as front-ends for real-time interaction; a pipeline sketch follows this list.
Combining generative digital humans with music‑generation models as new AI‑creativity tools.
Specialized digital‑human models for product‑interaction videos, providing fresh material for advertising.
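For the first scenario, the glue between the pieces is straightforward: the language model produces the reply text, a TTS engine turns it into speech, and the audio-driven avatar renders the spoken reply. The stub below only sketches that chaining; every function in it is a placeholder rather than an API from the EchoMimic repositories or any particular LLM/TTS library.

```python
def chat_reply(user_text: str) -> str:
    """Placeholder for a call to a multimodal LLM that returns the assistant's reply."""
    return f"Echoing: {user_text}"

def synthesize_speech(text: str) -> bytes:
    """Placeholder for a TTS engine that converts the reply text to audio."""
    return text.encode("utf-8")

def render_avatar(portrait_path: str, audio: bytes) -> str:
    """Placeholder for an audio-driven avatar renderer (e.g., EchoMimic inference)."""
    return "reply_video.mp4"

def interactive_turn(user_text: str) -> str:
    """One conversational turn: user text in, talking-avatar video path out."""
    reply = chat_reply(user_text)
    audio = synthesize_speech(reply)
    return render_avatar("assistant_portrait.png", audio)

print(interactive_turn("What's the weather like today?"))
```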
Summary and Outlook
Although generative digital humans have made significant progress over traditional methods, challenges remain, such as fidelity gaps, inconsistency, unnatural motions, and low resolution. The development paradigm has evolved from dual-tower Stable Diffusion to single-tower SVD and now to Image-to-Video (I2V) foundation models with modular components. Ongoing advances in video generation are expected to further improve the quality of generative digital humans.