LongCat‑Video: Meituan’s Model for Text‑to‑Video, Image‑to‑Video & Continuation
LongCat‑Video, an open‑source video generation model from Meituan, adopts a unified multi‑task architecture covering text‑to‑video, image‑to‑video, and video continuation. It delivers minute‑long, high‑quality clips via coarse‑to‑fine inference, posts benchmark scores comparable to leading models such as Wan2.2, and comes with a one‑click deployment tutorial on HyperAI.
Meituan has open‑sourced LongCat‑Video, a video generation model that unifies three tasks—text‑to‑video, image‑to‑video, and video‑continuation—within a single architecture. The model distinguishes tasks by the number of conditioning frames and is pretrained on the video‑continuation task to enable generation of videos lasting several minutes while avoiding color distortion and other quality degradations.
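The task-by-conditioning-frames idea can be illustrated with a minimal sketch. This is not the actual LongCat-Video API; the function name and the exact frame-count thresholds are assumptions made for illustration only.

```python
# Hypothetical sketch: a unified model can route all three tasks through
# one architecture by looking only at how many conditioning frames it is
# given (names and thresholds are illustrative, not LongCat-Video's code).

def infer_task(num_conditioning_frames: int) -> str:
    """Map a conditioning-frame count to a generation task."""
    if num_conditioning_frames == 0:
        return "text-to-video"       # no frames: generate purely from text
    if num_conditioning_frames == 1:
        return "image-to-video"      # one frame: animate a still image
    return "video-continuation"      # several frames: extend an existing clip

print(infer_task(0))  # text-to-video
```

Because video continuation is just the many-frames case of the same interface, pretraining on it (as the article describes) exercises the whole architecture rather than a task-specific head.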
Key capabilities include a unified multi‑task framework, long‑video generation, efficient coarse‑to‑fine inference that produces 720p, 30 fps clips in a few minutes, and multi‑reward reinforcement learning from human feedback (RLHF) via Group Relative Policy Optimization (GRPO) to boost performance.
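The core of GRPO is that advantages are computed relative to a group of samples drawn for the same prompt, rather than from a learned value model. The sketch below shows only that group-normalization step; how LongCat-Video combines its multiple reward signals into one scalar per sample is an assumption here.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: z-score each sample's reward against its
    own group (one group = several generations for the same prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Assumption: per-sample rewards here already aggregate multiple signals
# (e.g. visual quality + motion quality) into one scalar.
group_rewards = [0.7, 0.5, 0.9, 0.5]
advantages = grpo_advantages(group_rewards)  # sums to ~0 within the group
```

The best sample in the group gets the largest positive advantage and is reinforced; below-average samples get negative advantages, all without training a separate critic.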
Benchmark evaluation (internal) shows that on the text‑to‑video task LongCat‑Video attains visual‑ and motion‑quality scores nearly equal to those of the top open‑source model Wan2.2. For image‑to‑video, its visual‑quality score surpasses Wan2.2's, though the authors note remaining gaps in image alignment and overall quality.
Efficient inference is achieved through a "coarse‑to‑fine" strategy: the model first produces a rough low‑cost draft and then refines it, generating 720p, 30 fps video within minutes and improving both speed and fidelity compared with prior approaches.
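A toy example can convey the coarse-to-fine idea. The snippet below shows only temporal upsampling by naive frame averaging; LongCat-Video's actual refinement stage is a learned model, and the representation of frames as flat pixel lists is purely an illustrative assumption.

```python
# Toy coarse-to-fine sketch: a cheap low-frame-rate draft is "refined" by
# doubling the frame rate. Real refiners are learned; this averaging step
# is only a stand-in for the concept.

def interpolate_frames(frames):
    """Double the frame rate by inserting the midpoint of adjacent frames.
    Each frame is modeled as a flat list of pixel values (an assumption)."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append([(x + y) / 2 for x, y in zip(a, b)])  # inserted frame
    out.append(frames[-1])
    return out

coarse = [[0.0, 0.0], [1.0, 1.0]]    # two low-fps draft "frames"
fine = interpolate_frames(coarse)    # three frames: midpoint inserted
```

The payoff of the two-stage split is that the expensive model only has to run at low temporal (or spatial) resolution, while a cheaper refinement pass fills in the rest.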
A deployment tutorial is provided on the HyperAI platform. Users open the tutorial link, clone the repository, select an NVIDIA RTX PRO 6000 Blackwell GPU with a PyTorch image, choose a pay‑as‑you‑go or subscription plan, and wait for resource allocation (about 3 minutes for the first clone). The demo then offers four example modes (Image‑to‑Video, Text‑to‑Video, Long Video, Video Continuation), and advanced options let users adjust the negative prompt, resolution, and random seed.
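The demo's knobs map naturally onto a small request structure. This is a hypothetical sketch: the mode names mirror the article, but the field names, defaults, and `build_request` helper are assumptions, not the tutorial's actual interface.

```python
# Illustrative request builder mirroring the demo's four modes and its
# advanced options (field names and defaults are assumptions).
MODES = {"image-to-video", "text-to-video", "long-video", "video-continuation"}

def build_request(mode, prompt, negative_prompt="", resolution=(1280, 720), seed=42):
    """Assemble one generation request for the hypothetical demo backend."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return {
        "mode": mode,
        "prompt": prompt,
        "negative_prompt": negative_prompt,  # content to steer away from
        "resolution": resolution,            # 720p default, per the article
        "seed": seed,                         # fix for reproducible sampling
    }

req = build_request("text-to-video", "a cat walking on a beach at sunset")
```

Pinning the seed is what makes runs repeatable when iterating on prompts; changing only the negative prompt or resolution while holding the seed fixed isolates their effect.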
Demo screenshots illustrate the four example modes and the interface for uploading an image, entering a prompt, and configuring generation parameters.
Overall, LongCat‑Video demonstrates that an open‑source model can match leading proprietary solutions on core quality metrics while offering a unified, extensible framework for multiple video generation tasks.
HyperAI Super Neural
Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.