Inside Tencent’s HunyuanVideo-Avatar: How Open‑Source AI Generates Digital Human Videos
Tencent’s HunyuanVideo-Avatar converts a static portrait and an audio clip into a lip‑synced, expressive video using a multimodal diffusion Transformer. This article covers its module design, hardware requirements, and usage, and gives a candid assessment of its strengths and current limitations; the model weights and code are openly released.
Model Overview
HunyuanVideo-Avatar is a multimodal diffusion Transformer (MM‑DiT) based model that converts a static portrait and an audio segment into a dynamic video with lip‑sync and facial motion. The model can run locally when sufficient GPU resources are available.
Architecture
Built on the MM‑DiT backbone, the system adds three key modules:
Character Image Injection Module: injects the reference character image in a way that is consistent between training and inference, preserving the character's appearance while still allowing expressive, dynamic motion.
Audio Emotion Module (AEM): extracts emotion cues from an emotion reference image and transfers them to the generated video, giving finer control of facial expression so that it matches the tone of the voice.
Face‑Aware Audio Adapter (FAA): isolates each character's face with a facial mask and injects the corresponding audio stream through cross-attention, so that multi-character scenes can be driven by separate audio tracks (see the sketch below).
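The masked cross-attention idea behind the FAA can be illustrated with a small PyTorch sketch. This is a hedged illustration only, not the repository's implementation: the class name, tensor shapes, audio-feature source, and the simple mask gating are all assumptions made for the example.

# Minimal sketch of face-masked audio cross-attention (the FAA idea).
# The class name, shapes, and gating below are illustrative assumptions,
# not the repository's actual implementation.
import torch
import torch.nn as nn

class FaceAwareAudioAttention(nn.Module):
    def __init__(self, latent_dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, kdim=audio_dim, vdim=audio_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, video_latents, audio_feats, face_mask):
        # video_latents: (B, N, latent_dim) flattened video latent tokens
        # audio_feats:   (B, T, audio_dim)  audio features for one character
        # face_mask:     (B, N, 1)          1.0 where a token belongs to that
        #                                   character's face region, else 0.0
        audio_ctx, _ = self.attn(video_latents, audio_feats, audio_feats)
        # Gate the audio signal so only face-region tokens are driven by this
        # audio stream; other characters keep their own audio (or none).
        return video_latents + face_mask * audio_ctx

# One adapter call per character, each with its own audio stream and mask.
layer = FaceAwareAudioAttention(latent_dim=1024, audio_dim=768)
latents = torch.randn(1, 4096, 1024)
audio = torch.randn(1, 129, 768)
mask = torch.zeros(1, 4096, 1)
mask[:, :512] = 1.0  # pretend the first 512 tokens cover the face
out = layer(latents, audio, mask)  # (1, 4096, 1024)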
Example Output
An example transforms a portrait into a singing video in which the character performs a campfire song with smooth lip‑sync, added head and eye motion, and exaggerated facial expressions. The same pipeline works for cartoon or 3D-rendered characters, producing lifelike gestures that would otherwise require manual animation.
Open‑Source Release
Both the code and the model weights are openly released. The source code is available at https://github.com/Tencent-Hunyuan/HunyuanVideo-Avatar/, the pre‑trained weights can be downloaded from Hugging Face, and the accompanying paper is hosted at https://arxiv.org/pdf/2505.20156.
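One way to fetch the weights programmatically is the huggingface_hub client, as in the sketch below; the repository id and target directory here are assumptions, so confirm them against the project README before running.

# Sketch: download the pre-trained weights from Hugging Face.
# The repo_id and local_dir are assumptions -- verify them against the README.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tencent/HunyuanVideo-Avatar",  # assumed Hugging Face repo id
    local_dir="./weights",                  # matches MODEL_BASE in the usage example below
)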
Hardware Requirements
CUDA‑compatible NVIDIA GPU; the model has been tested on an 8‑GPU setup.
Generating a 704×768, 129‑frame video requires at least 24 GB VRAM (generation speed is very slow).
96 GB VRAM is recommended for optimal quality; on an 80 GB card, lowering the resolution can avoid out‑of‑memory errors.
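Before launching a long run it can help to check how much VRAM the card actually has and scale the resolution down to fit. The tiers in the sketch below mirror the guidance above, but the concrete fallback sizes are assumptions, not values taken from the repository.

# Sketch: choose a generation resolution based on available VRAM.
# The fallback sizes are assumptions; only the 24/80/96 GB tiers come from
# the guidance above.
import torch

def pick_image_size() -> int:
    if not torch.cuda.is_available():
        raise RuntimeError("HunyuanVideo-Avatar needs a CUDA-capable NVIDIA GPU")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 96:
        return 704  # full quality, as in the usage example below
    if total_gb >= 80:
        return 576  # assumed lower resolution to avoid OOM on 80 GB cards
    if total_gb >= 24:
        return 512  # minimum configuration; expect very slow generation
    raise RuntimeError(f"only {total_gb:.0f} GB of VRAM detected; at least 24 GB is required")

print(pick_image_size())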
Usage Example
cd HunyuanVideo-Avatar
JOBS_DIR=$(dirname $(dirname "$0"))
export PYTHONPATH=./
export MODEL_BASE="./weights"
checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt
OUTPUT_BASEPATH=./results   # output directory used by --save-path below (path is an assumption)
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
--input 'assets/test.csv' \
--ckpt ${checkpoint_path} \
--sample-n-frames 129 \
--seed 128 \
--image-size 704 \
--cfg-scale 7.5 \
--infer-steps 50 \
--use-deepcache 1 \
--flow-shift-eval-video 5.0 \
--save-path ${OUTPUT_BASEPATH}

A Gradio UI is also provided for interactive use, but inference stalls on a laptop with 8 GB of memory because of the model's high VRAM demand.
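For anyone who wants a quick interactive front end without the bundled UI, a thin Gradio wrapper around the batch script might look like the sketch below. It is entirely hypothetical: the CSV schema, output location, and flag subset are assumptions, so check assets/test.csv and the repository's own Gradio app before relying on it.

# Hypothetical Gradio wrapper around the batch script above -- not the
# repository's bundled UI. CSV columns, output paths, and the flag subset
# are assumptions; check assets/test.csv for the real input schema.
import csv, glob, os, subprocess
import gradio as gr

def generate(image_path: str, audio_path: str) -> str:
    os.makedirs("results", exist_ok=True)
    with open("results/job.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "audio"])        # assumed header
        writer.writerow([image_path, audio_path])
    ckpt = os.path.join(os.environ["MODEL_BASE"],
                        "ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt")
    subprocess.run(
        ["torchrun", "--nnodes=1", "--nproc_per_node=8", "hymm_sp/sample_batch.py",
         "--input", "results/job.csv", "--ckpt", ckpt,
         "--image-size", "704", "--infer-steps", "50",
         "--save-path", "results"],
        check=True,
    )
    clips = sorted(glob.glob("results/**/*.mp4", recursive=True), key=os.path.getmtime)
    return clips[-1]  # newest generated clip

gr.Interface(
    fn=generate,
    inputs=[gr.Image(type="filepath"), gr.Audio(type="filepath")],
    outputs=gr.Video(),
).launch()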
Evaluation and Limitations
Compared with closed‑source models such as Google Veo 3, OpenAI Sora, and Kuaishou's Kling, HunyuanVideo‑Avatar has the advantage of running locally. The generated results are inconsistent, however: some videos show odd lip movements or blurry motion, emotion control depends on the reference image, and the model cannot pick up emotion changes directly from the audio. Generation speed remains a bottleneck: producing a 10‑second clip at a decent resolution can take many minutes, which rules out real‑time or live‑streaming use for now.
Conclusion
HunyuanVideo‑Avatar provides an open‑source alternative for e‑commerce, livestreaming, content creation, and animation, but it still requires high‑end hardware, exhibits occasional artifacts, and needs faster inference to become practical for interactive scenarios.
AI Algorithm Path
A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.