How TaoAvatar Powers 3D Real‑Human Avatars in Taobao Vision

The Taobao Meta technology team’s CCIG 2026 presentation details TaoAvatar, a 3D Gaussian‑splatting‑based pipeline that captures multi‑view video, generates high‑quality static and dynamic digital humans, drives them with voice, text, and gestures, and powers the immersive Taobao Vision shopping experience across XR devices.

DaTaobao Tech

At the China Image & Graphics Conference 2026 (CCIG 2026) in Guangzhou, the Taobao Meta technology team presented a full report titled “3D Real‑Human Digital Avatar in Taobao Vision”. The talk attracted dozens of academic and industry participants who expressed interest in deeper collaboration.

Digital humans come in many styles: lightweight 2D avatars, video‑driven avatars (e.g., Seedance, LPM), and emerging 3D avatars such as Meta’s Codec Avatar. Their applications span film, games, communication, and e‑commerce.

The team highlighted three core capabilities. TaoModel provides fast 3DGS static human modeling via calibration, segmentation, and Gaussian reconstruction. TaoAvatar combines multi‑view capture, SMPLX++ geometry, and 3DGS dynamic reconstruction to yield a drivable 3D human asset. TaoVideo produces 4DGS volumetric video that records complex dynamics such as flowing skirts without relying on explicit body models.

TaoAvatar is built on 3D Gaussian Splatting (3DGS) and targets three weaknesses of traditional 3D modeling: high computational cost, insufficient detail, and mobile‑device constraints. Reported technical indicators include PSNR > 35 dB, stability, real‑time facial and body driving at 90 FPS, 2K binocular resolution, multimodal interaction latency < 2 s, memory < 2.5 GB, production cost < 20k CNY, and delivery time < 1 week.
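For reference, PSNR is conventionally computed from the mean squared error between a rendered image and its ground‑truth capture. The sketch below assumes 8‑bit pixel values stored as flat sequences; the function name `psnr` and the sample values are illustrative and do not come from the talk.

```python
import math
from typing import Sequence


def psnr(rendered: Sequence[float], reference: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 4 gray levels, so the MSE is 16.
score = psnr([100, 104], [104, 100])
```

With a uniform error of 4 gray levels the score is roughly 36.1 dB, which gives a feel for how small the per‑pixel error must be to clear a 35 dB bar.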

For body capture, multi‑view video is processed into SMPLX++ motion‑capture data. The pipeline uses layered supervision (EMOCA for the face, HaMeR for hands, and SAM‑3DBody for the body), followed by per‑frame SMPLX initialization, mesh tracking, and learning of non‑rigid clothing deformation. Combining small‑studio and large‑studio capture (≥5 full‑body views, ≥3 hand views) yields an average per‑vertex error (PVE) of 6‑7 mm, outperforming the open‑source EasyMocap baseline.
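Per‑vertex error is usually the mean Euclidean distance between corresponding vertices of the fitted mesh and a reference mesh. The sketch below assumes both meshes share the same vertex ordering and are expressed in meters; whether the reported figure involves any prior alignment step is not stated in the source, so none is applied here. The name `per_vertex_error_mm` is hypothetical.

```python
import math
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]


def per_vertex_error_mm(predicted: Sequence[Vertex],
                        ground_truth: Sequence[Vertex]) -> float:
    """Mean distance between corresponding vertices (meters in, millimeters out)."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("meshes must be non-empty and share vertex ordering")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return 1000.0 * total / len(predicted)


# Two vertices displaced by 6 mm and 8 mm respectively.
fitted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.006), (1.0, 0.008, 0.0)]
error_mm = per_vertex_error_mm(fitted, reference)
```

Here the two displacements average to about 7 mm, the upper end of the range quoted above.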

Dynamic reconstruction moves from a lightweight teacher‑student design to a unified “single‑ID, multi‑clothing, multi‑action” model. Multi‑view studio data capture diverse outfits and actions and are then parsed into fine‑grained body parts. Shared weights enable cross‑clothing reuse, and PCA‑baked deformation fields (≈400 MB) allow real‑time inference at 90 FPS on consumer GPUs.
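The source does not describe how the PCA‑baked deformation fields are laid out. As a minimal sketch of the general idea, a PCA basis lets the runtime rebuild a full deformation vector from a handful of coefficients with a single linear combination instead of a network forward pass. All names and the toy data below are hypothetical.

```python
from typing import List, Sequence


def decode_pca_offsets(mean: Sequence[float],
                       basis: Sequence[Sequence[float]],
                       coeffs: Sequence[float]) -> List[float]:
    """Reconstruct a flattened offset vector as mean + sum_k coeffs[k] * basis[k]."""
    if len(basis) != len(coeffs):
        raise ValueError("one coefficient is required per basis component")
    offsets = list(mean)
    for weight, component in zip(coeffs, basis):
        if len(component) != len(offsets):
            raise ValueError("basis component has the wrong dimension")
        for i, value in enumerate(component):
            offsets[i] += weight * value
    return offsets


# Toy example: two points (x, y, z each) and two principal components.
mean = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
basis = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # moves point 0 along x
    [0.0, 0.0, 0.0, 0.0, 2.0, 0.0],  # moves point 1 along y
]
offsets = decode_pca_offsets(mean, basis, [0.5, -0.25])
```

Because decoding costs one multiply‑add per basis entry, it fits comfortably inside a per‑frame budget; the trade‑off is that the baked basis can only express deformations spanned by the training captures.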

Voice and gesture driving rely on two components: GestureDiT, which generates base gestures conditioned on speech and text, and Qwen‑LLM, which injects semantic gestures retrieved from a pre‑defined library. The system predicts one second of motion per inference step, runs at a real‑time factor (RTF) below 1 on an RTX 3090, and produces consistent hand gestures for both Chinese and English inputs.
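The real‑time factor is the ratio of processing time to the duration of content produced, so with one‑second chunks an RTF below 1 means every inference step has to finish in under one second for playback to keep up. A small sketch of that check follows; the function names and latency values are illustrative, not measurements from the talk.

```python
from typing import Iterable


def real_time_factor(processing_seconds: float, generated_seconds: float) -> float:
    """Processing time divided by the duration of motion it produced."""
    if generated_seconds <= 0:
        raise ValueError("generated duration must be positive")
    return processing_seconds / generated_seconds


def keeps_up(step_latencies: Iterable[float], chunk_seconds: float = 1.0) -> bool:
    """True if every inference step is faster than the chunk it generates."""
    return all(real_time_factor(t, chunk_seconds) < 1.0 for t in step_latencies)


# Hypothetical per-step latencies in seconds for one-second motion chunks.
smooth = keeps_up([0.4, 0.9])
stalls = keeps_up([0.4, 1.2])
```

In the second case a single 1.2‑second step is enough to make the generator fall behind playback, which is why the worst step matters and not only the average.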

TaoVideo extends the approach to 4DGS volumetric video. Gaussian points are animated with spline curves for both motion and appearance, and adaptive control points balance quality against storage. The pipeline achieves a PSNR of 31‑35 dB, an LPIPS of 0.06‑0.1, and assets of ~400 MB that render at 90 FPS.
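The talk summary does not say which spline family is used. As an illustration only, the sketch below animates a single Gaussian center with a uniform Catmull‑Rom spline, which passes through every control point; the same interpolation could in principle be applied to appearance channels such as color. Function names and control points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def catmull_rom(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Uniform Catmull-Rom segment between p1 and p2 for t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def gaussian_center_at(control_points: Sequence[Point], time: float) -> Point:
    """Evaluate the trajectory at a normalized time in [0, 1] across the clip."""
    n = len(control_points)
    if n < 2:
        raise ValueError("need at least two control points")
    # Duplicate the endpoints so the first and last segments are defined.
    pts: List[Point] = [control_points[0], *control_points, control_points[-1]]
    scaled = min(max(time, 0.0), 1.0) * (n - 1)
    segment = min(int(scaled), n - 2)
    return catmull_rom(pts[segment], pts[segment + 1],
                       pts[segment + 2], pts[segment + 3], scaled - segment)


trajectory = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 0.0, 0.0)]
midpoint = gaussian_center_at(trajectory, 0.5)
```

Storing a few control points per Gaussian instead of a position per frame is what makes the quality‑versus‑storage trade‑off adjustable: more control points follow fast motion more closely and cost more bytes.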

These technologies power Taobao Vision, an XR‑based shopping platform that won the 2025 Apple Design Award. The online app runs on Vision Pro, delivering immersive 3D product views, voice and image search, and multimodal interaction. Offline flagship stores in multiple Chinese cities use the same AI‑3D‑XR stack to showcase thousands of SKUs in limited space, with brand cases such as Baxi virtual fitting rooms and a 3D fashion showcase at the Milan Winter Olympics.

Future work includes low‑cost single‑image or sparse‑view digital‑human generation and the CVPR‑presented FHAvatar, which reconstructs composable 3D Gaussian heads from arbitrary viewpoints, enabling rapid facial and hair editing.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
