From High-Fidelity to Real-World Use: LongCat Video Avatar 1.5 Open‑Source Release

LongCat Video Avatar 1.5 is now open source, delivering commercial-grade lip sync, physical realism, long-video stability and multi-person interaction. It pairs Whisper-large audio encoding with 8-step DMD distillation for 15× faster inference and LoRA adapters for lower memory use, and it outperforms leading closed-source models in extensive human-rated benchmarks.

Meituan Technology Team

Today the LongCat‑Video‑Avatar 1.5 model is officially open‑sourced as a commercial‑grade digital‑human video generator. The release highlights three major upgrades: (1) a fully commercialized experience where lip sync, facial expression, head pose and body motion remain precise and smooth even for long sentences, fast speech and singing; (2) support for richer scenarios—including real people, anime characters, animals and multi‑person dialogues—thanks to a high‑quality multi‑stage data pipeline; and (3) a 15× inference speed boost achieved by Distribution Matching Distillation (DMD) compressed to eight generation steps.

In the audio feature extraction stage, the encoder is upgraded from Wav2Vec2 to Whisper-large, whose larger parameter count and multilingual priors capture phoneme changes and rhythm more finely. This upgrade improves lip sync and whole-body temporal stability, markedly reducing jitter, dropped frames, freezes and identity drift in long videos.
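As a rough illustration of how encoder output could be paired with video frames, the sketch below maps each video frame to a window of audio features. Whisper's encoder emits 1500 feature vectors per 30-second window, or 50 per second. The 25 fps video rate, the `context` padding and the helper name `audio_window_for_frame` are assumptions for illustration; the article does not describe the model's actual alignment scheme.

```python
WHISPER_FEATURES_PER_SECOND = 50  # 1500 encoder positions per 30 s window
VIDEO_FPS = 25  # assumed; the article does not state the video frame rate


def audio_window_for_frame(frame_idx: int, context: int = 2) -> range:
    """Return the audio-feature indices that overlap one video frame.

    `context` extra features are included on each side so the frame can
    see slightly before and after its own time slot (hypothetical choice).
    """
    # Integer arithmetic keeps the mapping exact even when the two rates
    # do not divide evenly.
    start = frame_idx * WHISPER_FEATURES_PER_SECOND // VIDEO_FPS
    end = (frame_idx + 1) * WHISPER_FEATURES_PER_SECOND // VIDEO_FPS
    return range(max(0, start - context), end + context)


for frame in (0, 1, 25):
    print(frame, list(audio_window_for_frame(frame)))
```

Under these assumptions each video frame owns two audio features, and the padding lets neighbouring frames share context, which is one plausible way to smooth mouth motion across frame boundaries.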

Extensive evaluation using a custom EvalTalker benchmark (covering news, education, entertainment and commercial scenes with varied audio speed, emotion, number of participants, pose and occlusion) involved 770 raters who contributed 13,240 subjective scores, plus structured analysis from ten domain experts. The radar‑chart area shows LongCat‑Video‑Avatar 1.5 leading in physical realism, temporal stability, identity consistency and audio‑video coordination. User‑preference win rates are 65.9% over Kling Avatar 2.0, 61.1% over OmniHuman‑1.5 and 54.3% over HeyGen.
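The article does not say how the preference win rates were computed. A minimal sketch, assuming a win rate is the share of decided pairwise votes that favor one model and that ties are excluded, looks like this; the vote labels and sample votes are hypothetical, and only the rater and score totals come from the article.

```python
from collections import Counter


def win_rate(votes: list[str], model: str = "ours") -> float:
    """Share of decided votes won by `model`.

    Each vote is "ours", "other" or "tie"; ties are ignored (an assumption).
    """
    counts = Counter(votes)
    decided = counts["ours"] + counts["other"]
    if decided == 0:
        raise ValueError("no decided votes")
    return counts[model] / decided


# Hypothetical votes, not data from the benchmark.
votes = ["ours"] * 6 + ["other"] * 3 + ["tie"]
print(round(win_rate(votes), 3))

# Totals reported in the article: 13,240 scores from 770 raters.
scores_per_rater = 13240 / 770
print(round(scores_per_rater, 1))
```

The reported totals work out to roughly 17 scores per rater on average.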

The single-person scene score reaches 3.336, far above HeyGen and OmniHuman-1.5, while the multi-person score is 2.730, substantially higher than InfiniteTalk (2.339), with clear speaker-listener discrimination. Physical deformation issues affect only 23.1% of subjects, lower than all competitors, and 9.4% of backgrounds. Jump-frame problems occur in just 0.8% of cases, the lowest among peers, which supports smooth long-video generation.

Audio‑video coordination metrics show a 5.1% face‑body sync issue rate and a 29.8% lip‑sync issue rate, both lower than competing models, indicating more natural alignment of speech, expression and motion.

To further improve hand stability and motion continuity, the team introduced GRPO (Group Relative Policy Optimization) for frame‑level human‑preference alignment and added a first‑frame hand detection mechanism to increase the proportion of hand‑visible samples during training, reducing hand distortion and short‑term structural collapse. Inference uses a shared base model plus multiple LoRA adapters instead of three parallel models, cutting memory usage.
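The core of GRPO is that each sampled output is scored relative to the other outputs generated for the same input, rather than by a separate value network. The sketch below shows that group-relative normalization only. The reward values are hypothetical, and the use of population standard deviation is a choice made here; the article does not describe the team's reward model or how frame-level scores are aggregated.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical preference scores for four clips sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Samples that beat their group average get a positive advantage and are reinforced, while below-average samples, such as clips with distorted hands, are pushed down.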

Real-world tests show that a 10-second video can be generated in about one minute, in line with the claimed 15× speedup. The open-source release includes GitHub and Hugging Face repositories, a technical report PDF and a project page, inviting developers and creators to experiment, test and contribute to further advances in digital-human video generation.
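A quick back-of-the-envelope check puts these figures in context. It assumes the 15× factor applies to end-to-end wall-clock time for the same clip on the same hardware, which the article does not specify; the variable names are illustrative.

```python
clip_seconds = 10        # clip length from the article
generation_seconds = 60  # "about one minute" from the article
speedup = 15             # claimed speedup

# Seconds of compute needed per second of generated video.
real_time_factor = generation_seconds / clip_seconds

# Time the same clip would have taken before the speedup, under the
# assumption stated above.
implied_baseline_minutes = generation_seconds * speedup / 60

print(real_time_factor)
print(implied_baseline_minutes)
```

Under that assumption, generation runs at about six times slower than real time, and the same clip would previously have taken around 15 minutes.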

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

