Inside Alibaba’s AliPlayStudio: Real-Time AI Video Interaction Techniques
This article describes how Alibaba's AliPlayStudio delivers engaging real-time video interaction across online and offline marketing scenarios. It combines several computer-vision algorithms (human semantic segmentation, gesture and pose detection, controllable style transfer, and face fusion), all optimised for low-power mobile and embedded devices.
Background
Alibaba Search Business Unit and Zhejiang University professor Song Ming‑Li’s team jointly built the AliPlayStudio video‑interactive platform, deploying it in multiple online (Taobao app’s photo‑capture, scan, keyword search) and offline (mall and cinema large‑screen) scenarios. By integrating resources from Tmall brands, Alibaba Pictures, Youku IP, and Taobao influencers, the platform creates AI‑driven interactive marketing that converts engaged users into actual consumers through rewards, shop follow‑ups, and product recommendations.
Human Semantic Segmentation
Semantic segmentation assigns a class label to every pixel; for people, human parsing goes further and separates body parts such as face, hair, and limbs. We improved the data, the models, and the framework. On the data side, synthetic samples generated with color transfer and realistic placement greatly increased the number of training samples. On the model side, the high-accuracy model uses an Inception backbone with ASPP (atrous spatial pyramid pooling), while the real-time model adopts a lightweight encoder, fast down-sampling, and a UNet-style decoder. The resulting model is about 1.7 MB (0.5 MB after quantisation), achieves 0.94 mIoU, and runs at 25 FPS on a Qualcomm 625 device with 320×240 input. Joint training with pose estimation improves accuracy further. Real-time segmentation with gesture-controlled background replacement has been demonstrated on mobile devices and mall screens.
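The mIoU figure quoted above is the mean, over classes, of intersection over union between predicted and ground-truth label maps. The following is a minimal sketch of that metric in plain Python; the function name `mean_iou` and the nested-list label maps are assumptions made for illustration and are not taken from the AliPlayStudio code.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are 2D lists of integer class ids with the same shape.
    """
    inter = [0] * num_classes  # pixels where both maps agree on class c
    union = [0] * num_classes  # pixels where either map says class c
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                union[p] += 1
                union[t] += 1
    # Skip classes that are absent from both maps to avoid dividing by zero.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: class 0 is background, class 1 is person.
pred = [[0, 0, 1],
        [0, 1, 1]]
target = [[0, 1, 1],
          [0, 1, 1]]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case the background IoU is 2/3 and the person IoU is 3/4, so the mean is their average.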
Rock‑Paper‑Scissors Game: Gesture Recognition
During Double-11 2018, the “Celebrity Rock-Paper-Scissors” game launched on Taobao as the first real-time hand-gesture game on mobile. The system detects hand gestures in four classes (scissors, rock, paper, and other) frame by frame, using an SSD-based detector with an MNasNet-derived backbone, enhanced feature pyramid network (FPN) fusion, and knowledge distillation. The final model is 1.9 MB, runs in 17 ms on iOS devices, and reaches 0.984 AP (IoU = 0.5) on the test set.
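The "AP (IoU = 0.5)" criterion means a predicted hand box counts as correct only if it has the right gesture label and overlaps a ground-truth box with intersection over union of at least 0.5. The sketch below shows that matching rule only, not the full AP computation; `box_iou`, `is_true_positive`, and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical choices for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_label, gt_box, gt_label, threshold=0.5):
    """A detection matches when the label agrees and IoU reaches the threshold."""
    return pred_label == gt_label and box_iou(pred_box, gt_box) >= threshold


gt = (0, 0, 10, 10)
hit = is_true_positive((2, 0, 12, 10), "rock", gt, "rock")    # IoU = 80 / 120
miss = is_true_positive((2, 0, 12, 10), "paper", gt, "rock")  # wrong label
```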
Human Pose Detection
We built a real-time, high-accuracy pose detection model for RGB images and video. It uses an encoder-decoder architecture with a MobileNet backbone and a decoder that upsamples via transposed convolutions, combined with OpenPose-style Part Affinity Fields. The 2.5 MB model takes 11 ms per 320×320 frame on a Snapdragon 845 and reaches 15 FPS on an RK3399 embedded chip.
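In OpenPose-style methods, a Part Affinity Field (PAF) is a 2D vector field per limb type, and two candidate keypoints are linked when the field along the segment between them points in the segment's direction. A rough sketch of that scoring step follows; the function `paf_score`, the nested-list field layout, and the sample count are assumptions for illustration and do not describe the AliPlayStudio implementation.

```python
import math


def paf_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Average alignment between a PAF and the segment p1 -> p2.

    paf_x and paf_y are 2D lists indexed as [y][x]; p1 and p2 are (x, y).
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1) if num_samples > 1 else 0.0
        x = int(round(p1[0] + t * dx))
        y = int(round(p1[1] + t * dy))
        # Dot product of the field vector with the limb direction.
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / num_samples


# Toy 5x5 field pointing along +x everywhere.
field_x = [[1.0] * 5 for _ in range(5)]
field_y = [[0.0] * 5 for _ in range(5)]
aligned = paf_score(field_x, field_y, (0, 2), (4, 2))
perpendicular = paf_score(field_x, field_y, (2, 0), (2, 4))
```

A score near 1 suggests the two keypoints belong to the same limb, while a score near 0 or below suggests they do not.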
Image Stylization
Fast style transfer renders a content image in the style of a reference image while preserving the content's semantics. To make brush-stroke size controllable, we introduced a Stroke Pyramid that splits the network into multiple stroke branches, each trained with a different receptive field and style scale. Interpolating between strokes in feature space then allows continuous control of brush-stroke size. The model (≈0.99 MB) processes a 1024×1024 image in 0.09 s on an NVIDIA Quadro M6000; the work was presented at ECCV 2018.
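One simple way to picture stroke interpolation in feature space is a linear blend between the feature maps of two stroke branches before decoding. The text does not spell out the exact scheme, so the linear blend, the function `interpolate_strokes`, and the 2D-list feature maps below are assumptions made purely for illustration.

```python
def interpolate_strokes(feat_small, feat_large, alpha):
    """Blend two same-shaped feature maps; alpha=0 gives the small stroke."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [(1.0 - alpha) * s + alpha * l for s, l in zip(row_s, row_l)]
        for row_s, row_l in zip(feat_small, feat_large)
    ]


# Hypothetical 2x2 feature maps from a fine-stroke and a coarse-stroke branch.
fine = [[0.0, 2.0], [4.0, 6.0]]
coarse = [[1.0, 3.0], [5.0, 7.0]]
halfway = interpolate_strokes(fine, coarse, 0.5)
```

Sweeping `alpha` from 0 to 1 would move the blended features continuously from the fine-stroke branch to the coarse-stroke branch.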
Face Fusion
Face fusion merges a user's selfie with a template face, preserving the template's accessories while adapting the facial geometry. The main challenges are variation in selfie angle, lighting, and device quality. Our pipeline detects facial landmarks, performs geometry-aware warping, and applies a LUT-based skin-tone correction to handle illumination and highlight problems, producing realistic “face-swap” results.
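A LUT (lookup table) correction precomputes an output value for each of the 256 possible 8-bit intensities, so each pixel is corrected with a single table lookup. The sketch below builds a per-channel LUT that scales the selfie's mean intensity toward the template's; this gain-based mapping and the names `build_gain_lut` and `apply_lut` are assumptions for illustration, since the text does not say how the actual LUT is derived.

```python
def build_gain_lut(src_mean, dst_mean):
    """256-entry table that scales intensities so src_mean maps to dst_mean."""
    gain = dst_mean / src_mean if src_mean > 0 else 1.0
    # Clamp to the valid 8-bit range after scaling.
    return [min(255, max(0, round(v * gain))) for v in range(256)]


def apply_lut(channel, lut):
    """Map every 8-bit value in a flat list of pixels through the table."""
    return [lut[v] for v in channel]


# Hypothetical skin-region means for one colour channel.
lut = build_gain_lut(src_mean=100, dst_mean=120)
corrected = apply_lut([0, 100, 255], lut)
```

The same table can be reused for every frame of a session, which keeps the per-pixel cost low on mobile devices.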
Conclusion
Since March 2018, more than ten AI-driven visual interaction campaigns have launched on the Taobao app (e.g., “The Richest Man in the World”, “Mission Impossible 6”, Double-11 “Celebrity Rock-Paper-Scissors”, Tmall International Black Friday face-scan, Double-12 “AI Fortune-Telling”, and New-Year “Taobao Dolls”). Offline large-screen interactions (e.g., “Yellow Deer” in malls) have also been deployed. Turning these generic interaction capabilities into a platform lets smaller brands quickly configure AI-interactive marketing activities, driving user engagement and platform growth.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
