ByteDance’s AI Video Generation Model Goku, Streamer‑Sales Live‑Selling Model, and MimicTalk 3D Talking‑Head Project
ByteDance and partners open‑source three AI projects—Goku for high‑quality text‑to‑video generation, Streamer‑Sales for multimodal live‑selling LLMs, and MimicTalk for rapid 3D talking‑head creation—detailing their core features, underlying transformer‑based architectures, training pipelines, and public repositories.
ByteDance, in collaboration with The University of Hong Kong, released the open‑source AI video generation model Goku, built on a Rectified Flow Transformer architecture that generates high‑quality videos from text or images and excels in virtual digital‑human and advertising scenarios.
Core features of Goku include text/image‑to‑video generation (clips of animation, landscapes, and animal behavior up to 20 seconds long), virtual digital‑human creation via the Goku+ sub‑model for realistic livestream and customer‑service avatars, automated advertising video synthesis that cuts production cost to roughly 1% of traditional methods, and a state‑of‑the‑art VBench score of 84.85.
Technical highlights comprise a Rectified Flow‑based Transformer paired with a joint image‑video VAE, a training regime that replaces diffusion with rectified flow to improve efficiency and eliminate flicker, and a massive dataset of 36 million videos and 160 million images filtered by aesthetic scoring and OCR. The source code is available at https://github.com/Saiyan-World/goku, and the project homepage is at https://saiyan-world.github.io/goku/.
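To make the architecture concrete, here is a minimal, hypothetical sketch of the rectified‑flow training objective the article refers to; it is not Goku's actual code, and the `model` callable, tensor shapes, and conditioning interface are all assumptions. The key idea is that the Transformer learns a constant velocity field along straight paths between VAE latents and noise, rather than a multi‑step diffusion denoiser:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, text_emb):
    """One training step of a rectified-flow objective (illustrative sketch).

    x0:       clean VAE latents of a video clip, shape (B, T, C, H, W)
    text_emb: conditioning embeddings from a text encoder (hypothetical)
    """
    noise = torch.randn_like(x0)                  # x1 ~ N(0, I)
    t = torch.rand(x0.size(0), device=x0.device)  # uniform timesteps in [0, 1]
    t_ = t.view(-1, 1, 1, 1, 1)                   # broadcast over (T, C, H, W)
    xt = (1.0 - t_) * x0 + t_ * noise             # straight-line interpolation
    v_pred = model(xt, t, text_emb)               # Transformer predicts velocity
    target = noise - x0                           # constant velocity of the line
    return F.mse_loss(v_pred, target)
```

At inference time, sampling integrates the learned velocity field from pure noise back to a latent, for example with a handful of Euler steps, which is where rectified flow's efficiency advantage over many‑step diffusion samplers comes from.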
The second project, Streamer‑Sales, is an open‑source large language model (LLM) system designed for live‑selling. It generates persuasive product copy using RAG‑enhanced retrieval, supports multimodal interaction (TTS/ASR for voice synthesis and recognition), integrates virtual digital‑human avatars via Xiling technology, and includes an agent for real‑time queries such as logistics status or price trends. Its repository is at https://github.com/PeterH0323/Streamer-Sales.
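As an illustration of the retrieval‑augmented pattern described above, here is a small, self‑contained Python sketch; it is not Streamer‑Sales's actual code, and the product facts, the lexical‑overlap scorer (standing in for a real embedding model), and the prompt format are all invented for the example:

```python
import re

# Minimal RAG sketch: ground sales copy in retrieved product facts.
product_facts = [
    "Material: 100% cotton, machine washable.",
    "Ships within 24 hours from the Hangzhou warehouse.",
    "30-day no-questions-asked return policy.",
]

def tokens(text: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query: str, doc: str) -> float:
    """Toy lexical-overlap score; a real system would use vector similarity."""
    q = tokens(query)
    return len(q & tokens(doc)) / (len(q) or 1)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k facts most relevant to the viewer's question."""
    return sorted(product_facts, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(question: str) -> str:
    """Constrain the LLM to answer from retrieved facts, curbing hallucination."""
    context = "\n".join(retrieve(question))
    return (
        "You are a live-stream sales host. Using only these facts:\n"
        f"{context}\n"
        f"Answer the viewer's question: {question}"
    )

print(build_prompt("What is your return policy?"))
```

The same retrieve‑then‑prompt loop is what lets the agent answer live questions about logistics or pricing from a frequently updated knowledge base rather than from the model's frozen weights.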
Finally, the MimicTalk project, a joint effort between Zhejiang University and ByteDance, needs only about three minutes of training to produce a personalized 3D talking head from a two‑minute video clip. Leveraging NeRF‑based rendering, it delivers natural facial expressions and precise lip‑sync while requiring only consumer‑grade video data. The code is released at https://github.com/yerfor/MimicTalk.
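For readers unfamiliar with the NeRF rendering that MimicTalk builds on, the sketch below shows the classic volume‑rendering step at its core: densities and colors predicted along a camera ray are composited into a single pixel. This is generic NeRF math rather than MimicTalk's implementation, and the sample values are made up:

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Composite N samples along one camera ray into a pixel (NeRF-style).

    densities: (N,) volume density sigma at each sample
    colors:    (N, 3) RGB predicted at each sample
    deltas:    (N,) distance between consecutive samples
    """
    alpha = 1.0 - np.exp(-densities * deltas)        # opacity of each segment
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = alpha * trans                          # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0)   # final pixel color

# Toy example: three samples along a single ray.
sigma = np.array([0.1, 2.0, 5.0])
rgb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
dist = np.full(3, 0.1)
print(render_ray(sigma, rgb, dist))
```

Personalizing a talking head then amounts to rapidly adapting the network that predicts these densities and colors, conditioned on audio‑driven facial motion, to a few minutes of footage of the target speaker.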