ByteDance’s AI Video Generation Model Goku, Streamer‑Sales Live‑Selling Model, and MimicTalk 3D Talking‑Head Project
ByteDance and partners open‑source three AI projects—Goku for high‑quality text‑to‑video generation, Streamer‑Sales for multimodal live‑selling LLMs, and MimicTalk for rapid 3D talking‑head creation—detailing their core features, underlying transformer‑based architectures, training pipelines, and public repositories.
ByteDance, in collaboration with The University of Hong Kong, released the open‑source AI video generation model Goku, built on a Rectified Flow Transformer architecture that generates high‑quality videos from text or images and excels in virtual digital‑human and advertising scenarios.
Core features of Goku include text/image‑to‑video generation (clips of animation, landscapes, and animal behavior up to 20 seconds long), virtual digital‑human creation via the Goku+ sub‑model for realistic livestream and customer‑service avatars, automated advertising video synthesis that cuts production cost to roughly 1% of traditional methods, and a state‑of‑the‑art VBench score of 84.85.
Technical highlights comprise a Rectified Flow‑based Transformer paired with a joint image‑video VAE, a training regime that replaces diffusion with rectified flow to improve efficiency and eliminate flicker, and a massive dataset of 36 million videos and 160 million images filtered by aesthetic scoring and OCR. The source code is available at https://github.com/Saiyan-World/goku, and the project homepage is at https://saiyan-world.github.io/goku/.
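To make the architecture concrete, here is a minimal, hypothetical sketch of the rectified‑flow training objective the article refers to; it is not Goku's actual code, and the `model` callable, tensor shapes, and conditioning interface are all assumptions. The key idea is that the Transformer learns a constant velocity field along straight paths between VAE latents and noise, rather than a multi‑step diffusion denoiser:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, text_emb):
    """One training step of a rectified-flow objective (illustrative sketch).

    x0:       clean VAE latents of a video clip, shape (B, T, C, H, W)
    text_emb: conditioning embeddings from a text encoder (hypothetical)
    """
    noise = torch.randn_like(x0)                  # x1 ~ N(0, I)
    t = torch.rand(x0.size(0), device=x0.device)  # uniform timesteps in [0, 1]
    t_ = t.view(-1, 1, 1, 1, 1)                   # broadcast over (T, C, H, W)
    xt = (1.0 - t_) * x0 + t_ * noise             # straight-line interpolation
    v_pred = model(xt, t, text_emb)               # Transformer predicts velocity
    target = noise - x0                           # constant velocity of the line
    return F.mse_loss(v_pred, target)
```

At inference time, sampling integrates the learned velocity field from pure noise back to a latent, for example with a handful of Euler steps, which is where rectified flow's efficiency advantage over many‑step diffusion samplers comes from.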
The second project, Streamer‑Sales, is an open‑source large language model (LLM) system designed for live‑selling. It generates persuasive product copy using RAG‑enhanced retrieval, supports multimodal interaction (TTS/ASR for voice synthesis and recognition), integrates virtual digital‑human avatars via Xiling technology, and includes an agent for real‑time queries such as logistics status or price trends. Its repository is at https://github.com/PeterH0323/Streamer-Sales.
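As an illustration of the retrieval‑augmented pattern described above, here is a small, self‑contained Python sketch; it is not Streamer‑Sales's actual code, and the product facts, the lexical‑overlap scorer (standing in for a real embedding model), and the prompt format are all invented for the example:

```python
import re

# Minimal RAG sketch: ground sales copy in retrieved product facts.
product_facts = [
    "Material: 100% cotton, machine washable.",
    "Ships within 24 hours from the Hangzhou warehouse.",
    "30-day no-questions-asked return policy.",
]

def tokens(text: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query: str, doc: str) -> float:
    """Toy lexical-overlap score; a real system would use vector similarity."""
    q = tokens(query)
    return len(q & tokens(doc)) / (len(q) or 1)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k facts most relevant to the viewer's question."""
    return sorted(product_facts, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(question: str) -> str:
    """Constrain the LLM to answer from retrieved facts, curbing hallucination."""
    context = "\n".join(retrieve(question))
    return (
        "You are a live-stream sales host. Using only these facts:\n"
        f"{context}\n"
        f"Answer the viewer's question: {question}"
    )

print(build_prompt("What is your return policy?"))
```

The same retrieve‑then‑prompt loop is what lets the agent answer live questions about logistics or pricing from a frequently updated knowledge base rather than from the model's frozen weights.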
Finally, the MimicTalk project, a joint effort between Zhejiang University and ByteDance, needs only about three minutes of training to produce a personalized 3D talking head from a two‑minute video clip. Leveraging NeRF‑based rendering, it delivers natural facial expressions and precise lip‑sync while requiring only consumer‑grade video data. The code is released at https://github.com/yerfor/MimicTalk.
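For readers unfamiliar with the NeRF rendering that MimicTalk builds on, the sketch below shows the classic volume‑rendering step at its core: densities and colors predicted along a camera ray are composited into a single pixel. This is generic NeRF math rather than MimicTalk's implementation, and the sample values are made up:

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Composite N samples along one camera ray into a pixel (NeRF-style).

    densities: (N,) volume density sigma at each sample
    colors:    (N, 3) RGB predicted at each sample
    deltas:    (N,) distance between consecutive samples
    """
    alpha = 1.0 - np.exp(-densities * deltas)        # opacity of each segment
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = alpha * trans                          # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0)   # final pixel color

# Toy example: three samples along a single ray.
sigma = np.array([0.1, 2.0, 5.0])
rgb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
dist = np.full(3, 0.1)
print(render_ray(sigma, rgb, dist))
```

Personalizing a talking head then amounts to rapidly adapting the network that predicts these densities and colors, conditioned on audio‑driven facial motion, to a few minutes of footage of the target speaker.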