Advances in Virtual Humans, Multimodal Technology, and General AI – Insights from OPPO
The article presents OPPO's latest research on audio-driven lip sync and RGB-driven animation for virtual humans, multimodal learning breakthroughs such as CETNETs and cross-modal matching, and a reflective discussion of the challenges and future directions of general artificial intelligence, highlighting the interconnections among these three domains.
Introduction
The talk, titled "The Connection Between Virtual Humans, Multimodal, and General AI," was delivered by Zheng Zhitong, head of multimodal learning at OPPO, and organized by the DataFun community.
01 Virtual Human Technology Progress
1. Voice-driven virtual humans
OPPO has developed Audio2Lip and Sing2Lip for on-device avatar driving. Audio2Lip supports seven avatars and delivers industry-leading results in energy consumption, latency, lip-sync accuracy, and MOS scores. Sing2Lip adds rhythm information for more precise lip movements; its cloud version, Audio2Mesh, drives full-face expressions directly from speech.
For simple cartoon avatars, a one‑to‑one algorithm is used on‑device; for realistic human avatars, a many‑to‑many contextual algorithm enables richer facial micro‑expressions.
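The internals of Audio2Lip and Audio2Mesh are not public, so the following is only a minimal sketch of the many-to-many contextual pattern described above: a window of audio frames is mapped to per-frame lip blendshape coefficients. The module names, mel-spectrogram input, and all dimensions are illustrative assumptions, not OPPO's models.

```python
# Minimal sketch of a many-to-many audio-to-lip model (PyTorch).
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AudioToLip(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_blendshapes=52, context=5):
        super().__init__()
        # A temporal convolution aggregates +/- `context` frames of audio,
        # giving each output frame a window of acoustic context.
        self.encoder = nn.Conv1d(n_mels, hidden, kernel_size=2 * context + 1,
                                 padding=context)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_blendshapes)

    def forward(self, mel):                  # mel: (batch, frames, n_mels)
        x = self.encoder(mel.transpose(1, 2)).transpose(1, 2)
        x, _ = self.rnn(torch.relu(x))
        return self.head(x)                  # (batch, frames, n_blendshapes)

model = AudioToLip()
mel = torch.randn(1, 100, 80)                # ~1 s of mel frames
coeffs = model(mel)                          # per-frame lip blendshape weights
```

A one-to-one variant for cartoon avatars would simply drop the context window and recurrent layer, predicting each frame's mouth shape from that frame's audio alone.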
2. RGB-driven virtual humans
A single camera captures a person's image to drive an avatar. Initial attempts at full-body reconstruction suffered from drift and clipping, which were mitigated using physical models, end-to-end algorithms, and motion retargeting.
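As an illustration of the retargeting step only, the sketch below transfers local joint rotations from the estimated skeleton to the avatar and rescales the root trajectory by body height, with a simple ground-plane clamp standing in for a physical constraint. The function and its inputs are assumptions, not OPPO's pipeline.

```python
# Hedged sketch of motion retargeting: copy local joint rotations and
# rescale root translation by the skeleton-height ratio so the avatar's
# feet stay planted instead of drifting.
import numpy as np

def retarget(src_rotations, src_root_pos, src_height, tgt_height):
    """src_rotations: (frames, joints, 3) local joint angles from the
    single-camera pose estimator; src_root_pos: (frames, 3) root path."""
    scale = tgt_height / src_height
    tgt_rotations = src_rotations.copy()      # rotations transfer directly
    tgt_root_pos = src_root_pos * scale       # translations must be rescaled
    # Simple physical constraint: clamp the root above the ground plane
    # so the avatar does not clip through the floor.
    tgt_root_pos[:, 1] = np.maximum(tgt_root_pos[:, 1], 0.0)
    return tgt_rotations, tgt_root_pos
```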
3. Virtual human creation
A 4D scanning pipeline creates realistic avatars, accelerated by custom algorithms to achieve acceptable processing times, followed by artistic refinement.
4. NeRF exploration
OPPO investigated real-time NeRF algorithms to generate environment assets, addressing the traditional challenges of low speed and visual artifacts.
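For context, the quadrature at the heart of every NeRF renderer is standard: real-time variants mainly change how densities and colors are produced, not how they are composited along a ray. The sketch below shows the usual alpha-compositing step; the function name and shapes are illustrative.

```python
# Standard NeRF volume rendering along one camera ray: convert sampled
# densities to per-segment opacities, accumulate transmittance, and
# alpha-composite the sampled colors into a single pixel color.
import numpy as np

def composite_ray(sigma, rgb, deltas):
    """sigma: (n,) densities, rgb: (n, 3) colors, deltas: (n,) distances
    between consecutive samples along the ray."""
    alpha = 1.0 - np.exp(-sigma * deltas)                  # segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]  # transmittance
    weights = trans * alpha                                # per-sample weight
    return (weights[:, None] * rgb).sum(axis=0)            # final pixel color
```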
02 Multimodal Technology Progress
1. CETNETs
A paper presented at ECCV introduces macro-level convolutional embedding and micro-level transformer-block innovations on top of the Vision Transformer backbone, achieving state-of-the-art performance.
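The exact CETNet blocks are in the paper; the sketch below only illustrates the general pattern it builds on, replacing ViT's single large-stride patch projection with a stack of small convolutions that feeds tokens to transformer blocks. Layer sizes are illustrative, not the published configuration.

```python
# Hedged sketch: a convolutional embedding stem in front of a standard
# transformer block, the macro-level pattern described above.
import torch
import torch.nn as nn

class ConvEmbed(nn.Module):
    def __init__(self, dim=96):
        super().__init__()
        self.stem = nn.Sequential(                            # 4x downsampling
            nn.Conv2d(3, dim // 2, 3, stride=2, padding=1),   # via small kernels
            nn.BatchNorm2d(dim // 2), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1),
        )

    def forward(self, x):                          # x: (B, 3, H, W)
        x = self.stem(x)                           # (B, dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)        # (B, H*W/16, dim) tokens

embed = ConvEmbed()
block = nn.TransformerEncoderLayer(d_model=96, nhead=4, batch_first=True)
tokens = block(embed(torch.randn(1, 3, 224, 224)))  # (1, 3136, 96)
```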
2. Cross-modal matching
Using a dual-tower architecture for cross-modal retrieval, OPPO's models surpass Wukong under the same parameter budget, with further gains from data augmentation and label smoothing.
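A plausible reading of this setup is CLIP-style symmetric contrastive training between the two towers; the sketch below shows such a loss with the label smoothing mentioned above. The temperature value and function names are assumptions, not OPPO's implementation.

```python
# Minimal dual-tower contrastive loss (symmetric InfoNCE) with label
# smoothing; matched image/text pairs sit on the diagonal of the
# similarity matrix.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07, smoothing=0.1):
    """img_emb, txt_emb: (batch, dim) outputs of the two towers."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0))      # matched pairs on the diagonal
    # label_smoothing softens the one-hot targets, as mentioned above.
    loss_i = F.cross_entropy(logits, targets, label_smoothing=smoothing)
    loss_t = F.cross_entropy(logits.t(), targets, label_smoothing=smoothing)
    return (loss_i + loss_t) / 2
```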
3. AIGC
A library combining GANs, VAEs, and diffusion models supports various scenarios; recent projects include generating 2D digital-employee photos and improving facial realism by re-generating problematic regions.
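One common way to "re-generate problematic regions" is diffusion inpainting, where known pixels are held fixed at each denoising step and only the masked region is redrawn. The sketch below assumes a generic denoiser and noise schedule and is not a description of OPPO's system.

```python
# Illustrative diffusion inpainting loop (RePaint-style): keep unmasked
# pixels consistent with the sampler's noise level, denoise only under
# the mask. `denoise_step` and `alphas_cumprod` stand in for a real
# diffusion model and schedule; both are assumptions.
import torch

def inpaint(image, mask, denoise_step, alphas_cumprod, steps=50):
    """image: (1, 3, H, W) photo; mask: 1 where the region must be redrawn."""
    x = torch.randn_like(image)                       # start from pure noise
    for t in reversed(range(steps)):
        a = alphas_cumprod[t]
        # Noise the original image to the current step so known pixels
        # match the sampler's current noise level.
        known = a.sqrt() * image + (1 - a).sqrt() * torch.randn_like(image)
        x = mask * x + (1 - mask) * known             # keep unmasked content
        x = denoise_step(x, t)                        # one reverse-diffusion step
    return x
```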
03 Views on General AI
AI has reached a bottleneck where engineering challenges dominate: extensive patching, real‑time monitoring, and data labeling consume over 60% of costs. Robust AI requires system‑level engineering, large‑scale pre‑training followed by fine‑tuning, automated model compression, and human‑centered ethical design.
Current massive models resemble the Ptolemaic system—excellent at data fitting but lacking physical insight. To break the bottleneck, AI must integrate physical and logical understanding, ensuring out‑of‑distribution (OOD) generalization aligns with underlying causal structures.
04 The Connection Among the Three
Virtual humans embody the extraction and re-creation of the 3D modality, serving as a gateway for multimodal alignment. Multimodal perception (vision and audio) underpins robust general AI, while multimodal generation provides data augmentation. Ultimately, virtual humans and multimodal technology are essential to the physical-understanding component of general intelligence.
Thank you for listening.