BodyDance: Kuaishou Y‑Tech’s One‑Shot Human Motion Transfer Technology
BodyDance is Kuaishou Y‑Tech’s self‑developed human motion transfer system. It converts a single static portrait into a dancing video using 3D pose reconstruction, semantic parsing, and style‑based generation. It produces high‑quality results without online fine‑tuning and outperforms existing methods in realism, hand fidelity, and speed.
Abstract
BodyDance, developed by Kuaishou Y‑Tech, is a human motion transfer technique that turns a single user photo containing a face into a dance video matching a driving template. Processing takes only a few seconds, and the feature has delivered strong business results.
Background
Traditional motion transfer often demands multiple images or videos and online fine‑tuning, which raises user interaction cost and deployment difficulty. BodyDance aims to reduce this cost by leveraging advances in human reconstruction, segmentation, and generative modeling.
Technical Process
The overall pipeline consists of three modules:
3D human pose reconstruction: a SMPL‑based 3D mesh is generated from the input image and rendered.
Semantic target‑pose generation: a parsing model extracts the source pose, then predicts the target‑pose parsing to provide a semantic prior.
Target‑pose human generation: style encoders extract user style features, which are combined with the target‑pose semantics to synthesize the final human.
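The data flow through the three modules can be sketched as below. This is only an illustrative skeleton: every class and function name is hypothetical, the bodies are stubs rather than real models, and it assumes the driving template supplies one target pose per output frame, which the source does not spell out.

```python
from dataclasses import dataclass
from typing import List

# All names here are hypothetical placeholders, not BodyDance's actual API.


@dataclass
class PoseRender:
    """Rendered SMPL-style mesh in one target pose (stub: only a frame index)."""
    frame_index: int


@dataclass
class Parsing:
    """Semantic parsing map (stub: a text label instead of a label image)."""
    label: str


@dataclass
class Frame:
    """One synthesized output frame (stub)."""
    parsing: Parsing
    style: str


def render_target_pose(source_image: str, frame_index: int) -> PoseRender:
    # Module 1: reconstruct a 3D body mesh from the photo and render it
    # in the pose of the given template frame (assumed behaviour).
    return PoseRender(frame_index=frame_index)


def parse_source(source_image: str) -> Parsing:
    # Module 2a: semantic parsing of the user's photo.
    return Parsing(label=f"source:{source_image}")


def predict_target_parsing(source: Parsing, render: PoseRender) -> Parsing:
    # Module 2b: predict the parsing in the target pose as a semantic prior.
    return Parsing(label=f"target:{render.frame_index}")


def encode_style(source_image: str) -> str:
    # Module 3a: extract the user's style (appearance) features.
    return f"style:{source_image}"


def generate_frame(style: str, target: Parsing) -> Frame:
    # Module 3b: combine style features with target-pose semantics.
    return Frame(parsing=target, style=style)


def transfer_motion(source_image: str, num_template_frames: int) -> List[Frame]:
    source_parsing = parse_source(source_image)
    style = encode_style(source_image)  # computed once, reused for every frame
    frames = []
    for i in range(num_template_frames):
        render = render_target_pose(source_image, i)
        target_parsing = predict_target_parsing(source_parsing, render)
        frames.append(generate_frame(style, target_parsing))
    return frames
```

In this sketch the source photo is parsed and style‑encoded once, and only the target‑pose steps repeat per frame.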
Key Innovations
High‑quality generation without fine‑tuning: a robust style‑based semanteme module regularizes region features into a Gaussian latent space and uses style biasing to fill invisible textures.
Bidirectional flow constraint: dual‑direction training provides pixel‑level supervision for detail and feature‑level supervision for overall consistency, improving temporal stability.
Adaptive completion for complex inputs: a Priori Semanteme Assist Module predicts target‑pose parsing from the source parsing, reducing difficulty for challenging inputs such as half‑body or occluded poses.
Model inference acceleration: knowledge distillation, template‑specific training, implicit semantic learning, and TensorRT FP16 inference reduce per‑frame generation time from 43.25 s to 8.32 s on a Tesla T4.
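To make the "Gaussian latent space" idea in the first innovation more concrete, here is a minimal sketch of one common way to regularize features toward a standard normal distribution: a KL‑divergence penalty on a predicted mean and log‑variance. The source does not say which regularizer BodyDance uses, so the KL term, the function names, and the fallback of sampling from the prior for an invisible region are all assumptions made for illustration.

```python
import math
import random
from typing import List, Optional, Tuple


def kl_to_standard_normal(mu: List[float], logvar: List[float]) -> float:
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def sample_style(
    encoded: Optional[Tuple[List[float], List[float]]],
    dim: int,
    rng: random.Random,
) -> List[float]:
    """Sample a region's style code (hypothetical helper).

    If the region is visible, `encoded` holds (mu, logvar) and we use the
    reparameterization mu + sigma * eps. If the region is invisible
    (`encoded` is None), we fall back to the N(0, 1) prior, which is only
    meaningful because the encoder was regularized toward that prior.
    """
    if encoded is None:
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]
    mu, logvar = encoded
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]
```

The penalty is zero exactly when the encoder outputs a standard normal, so minimizing it alongside the reconstruction loss keeps visible‑region codes close to the distribution that invisible regions are drawn from.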
Results
Compared with state‑of‑the‑art methods such as Liquid Warping GAN with Attention [2], BodyDance achieves better realism, hand generation, and robustness on complex cases, while requiring only a single image and no online fine‑tuning.
Conclusion & Outlook
The system demonstrates leading performance in one‑shot motion transfer, but its current 2D‑based generation limits the realism of lighting, depth, and clothing. Future work will explore fully 3D‑based generation to further close this quality gap.
References
[1] Chan C, Ginosar S, Zhou T, et al. Everybody dance now. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 5933‑5942.
[2] Liu W, Piao Z, Tu Z, et al. Liquid warping GAN with attention: A unified framework for human image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
