Vidu S1 Launches Real‑Time Interactive AI Video Generation
Vidu S1, the new real-time interactive video generation model from Shengshu Technology, streams voice-controlled video of unlimited duration at 540p and 25 to 42 FPS on consumer GPUs, with custom avatars and voices. It shifts AI video creation from offline rendering to continuous, responsive digital characters.
Problem and motivation
Traditional video generation models output a fixed video after a single inference pass, and their development has focused on higher resolution, longer duration, and motion consistency. This offline workflow allows no user intervention during generation, which makes it unsuitable for real-time interaction scenarios such as video calls, virtual idols, and interactive live streams, where users continuously ask questions, interrupt, and guide the character.
Real‑time interactive generation with Vidu S1
Vidu S1 introduces a real‑time interactive video generation paradigm. It supports voice‑controlled, continuous video output at 960×540 px resolution, 25 FPS (up to 42 FPS) on consumer‑grade GPUs, unlimited generation length, and custom initial images and voice tones.
Model architecture
The system adopts an autoregressive diffusion (AR + Diffusion) architecture. Each frame is generated conditioned on previously generated frames and the current user input, enabling immediate interruption and adaptation without restarting the whole generation.
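The loop described above can be sketched in a few lines. This is a toy illustration only, not Vidu S1's model: `toy_denoise`, `stream_frames`, the list-of-floats frame type, and the bounded `context_len` window are all hypothetical stand-ins chosen so the example runs with the standard library. It shows the control flow that matters: each frame starts from noise, is refined in a few steps conditioned on recent frames and the current command, and is emitted before the next command is read.

```python
import random
from collections import deque
from typing import Deque, Iterable, Iterator, List

Frame = List[float]  # toy stand-in for one latent video frame


def toy_denoise(noisy: Frame, history: List[Frame], command: str, steps: int) -> Frame:
    """Hypothetical denoiser: pulls noise toward a target built from the
    previous frame (continuity) and the current command (control)."""
    prev = history[-1] if history else [0.0] * len(noisy)
    offset = (sum(command.encode()) % 10) / 10.0  # command-dependent shift
    target = [0.5 * p + offset for p in prev]
    x = noisy
    for _ in range(steps):  # a few refinement steps per frame
        x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return x


def stream_frames(commands: Iterable[str], frame_size: int = 4,
                  context_len: int = 8, steps: int = 4,
                  seed: int = 0) -> Iterator[Frame]:
    """Yield one frame per incoming command, conditioned on recent frames."""
    rng = random.Random(seed)
    # Assumption: a bounded window keeps memory constant for long sessions.
    history: Deque[Frame] = deque(maxlen=context_len)
    for command in commands:
        noisy = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
        frame = toy_denoise(noisy, list(history), command, steps)
        history.append(frame)
        yield frame  # emitted before the next command is consumed


if __name__ == "__main__":
    for i, frame in enumerate(stream_frames(["wave", "wave", "blink"])):
        print(i, [round(v, 3) for v in frame])
```

Because `stream_frames` is a generator, a changed command affects only the frames produced after it; earlier frames are already emitted and nothing is restarted.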
Inference acceleration
TurboDiffusion serves as the inference framework. It reduces per‑frame compute cost through:
Low‑step generation
8‑bit SageAttention for accurate low‑precision attention [2]
SLA, a fine-tunable sparse-linear attention, for efficient attention [3]
SpargeAttention for training‑free sparse attention acceleration [4]
These optimizations allow Vidu S1 to run at 540p (960×540) and 25 FPS (max 42 FPS) on consumer GPUs.
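To put those frame rates in perspective, the per-frame time budget follows directly from the numbers above. The helper names below are illustrative only; the resolution and frame rates are the ones quoted in this article.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Raw output pixel rate at the given resolution and frame rate."""
    return width * height * fps


for fps in (25, 42):
    print(f"{fps} FPS -> {frame_budget_ms(fps):.1f} ms per frame, "
          f"{pixels_per_second(960, 540, fps):,.0f} pixels per second")
```

At 25 FPS every frame, including all of its denoising steps, has to be ready within 40 ms; at 42 FPS the budget shrinks to about 23.8 ms. That is why reducing the step count and the attention cost per step matters for real-time output.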
Streaming deployment
TurboServe provides a streaming deployment engine that schedules inference requests, records user inputs, character states, and visual history, and dynamically allocates compute resources to maintain low latency.
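The article does not describe TurboServe's actual scheduling policy, so the following is only a rough sketch of the idea of per-session state plus deadline-aware scheduling. `Session`, `Scheduler`, and the earliest-deadline-first rule are assumptions made for illustration, not TurboServe's API or algorithm.

```python
import heapq
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class Session:
    """Hypothetical per-user state: inputs, character state, visual history."""
    session_id: str
    fps: float = 25.0
    inputs: List[str] = field(default_factory=list)
    character_state: str = "idle"
    visual_history: Deque[int] = field(default_factory=lambda: deque(maxlen=8))
    frames_emitted: int = 0

    def next_deadline(self) -> float:
        """Time (seconds since session start) by which the next frame is due."""
        return (self.frames_emitted + 1) / self.fps


class Scheduler:
    """Earliest-deadline-first: always serve the session whose frame is due soonest."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}
        self._heap: List[Tuple[float, str]] = []

    def add(self, session: Session) -> None:
        self._sessions[session.session_id] = session
        heapq.heappush(self._heap, (session.next_deadline(), session.session_id))

    def record_input(self, session_id: str, text: str) -> None:
        session = self._sessions[session_id]
        session.inputs.append(text)
        session.character_state = f"responding:{text}"

    def step(self) -> str:
        """Generate one frame for the most urgent session and requeue it."""
        _, session_id = heapq.heappop(self._heap)
        session = self._sessions[session_id]
        session.visual_history.append(session.frames_emitted)  # placeholder frame id
        session.frames_emitted += 1
        heapq.heappush(self._heap, (session.next_deadline(), session_id))
        return session_id


if __name__ == "__main__":
    scheduler = Scheduler()
    scheduler.add(Session("a", fps=25.0))
    scheduler.add(Session("b", fps=42.0))
    scheduler.record_input("a", "blink")
    print([scheduler.step() for _ in range(5)])
```

In this sketch a session running at a higher frame rate is served more often simply because its deadlines arrive sooner. A production engine would also have to account for GPU batching and measured latency, which the sketch leaves out.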
Interactive capabilities
Vidu S1 can process spoken commands, recognize scene elements from the camera, and generate matching facial expressions, gestures, and body movements. Demonstrations include:
Uploading a popular “negative squirrel” meme image and obtaining a talking squirrel that responds in a Tianjin accent and follows commands such as “like”, “touch nose”, or “blink”.
Executing actions like “raise a tennis racket” or “place both hands over the chest and make a heart” with fluid animation.
Perceiving emotional cues and producing micro‑expressions such as anger.
Long‑duration tests show stable character identity and motion over several hours of continuous interaction.
Customization workflow
Users upload a single reference image and optionally a custom voice. The model instantly creates an interactive avatar that preserves the visual identity and voice timbre throughout the session.
Technical specifications
Resolution: 960×540 px
Frame rate: 25 FPS (max 42 FPS)
Hardware: consumer-grade GPUs
Generation mode: autoregressive diffusion (AR + Diffusion)
Inference framework: TurboDiffusion (low-step generation, SageAttention, SLA, SpargeAttention) [1-4]
Streaming engine: TurboServe [5]
References
TurboDiffusion: Accelerating Video Diffusion Models by 100‑200× [1]
SageAttention: Accurate 8‑Bit Attention for Plug‑and‑play Inference Acceleration [2]
SLA: Beyond Sparsity in Diffusion Transformers via Fine‑Tunable Sparse‑Linear Attention [3]
SpargeAttention: Accurate and Training‑free Sparse Attention Accelerating Any Model Inference [4]
TurboServe: Serving Streaming Video Generation Efficiently and Economically [5]
Technical report: https://jt-zhang.github.io/files/Vidu_S1.pdf
Public demo (no authentication required): https://www.vidu.cn/vidu-stream
