
How AI-Powered Hand Gesture Detection Drove a Double‑11 Celebrity Rock‑Paper‑Scissors Game

This article describes how Alibaba used AI hand‑gesture detection, built on a lightweight SSD‑based object detection model, to create an interactive rock‑paper‑scissors game for Double‑11. It covers three challenges: loosely defined gestures, real‑time performance on mobile devices, and data collection. The game drew more than 16 million page views, and the model reached a mean average precision of 0.984 on internal test data.

Alibaba Cloud Developer

Dec 20, 2019

Project Background

Alibaba sought new ways to connect merchants and consumers as traditional marketing became less effective and users grew resistant to frequent promotions. Interactive mini‑games that are easy to play can boost engagement, and advances in computer‑vision AI provide novel interaction methods.

Problem Definition

The rock‑paper‑scissors (RPS) game requires real‑time hand‑gesture recognition, which we treat as an object detection task rather than pure image classification. Classification alone cannot cope with multiple hands or with variation in hand position, size, and background clutter, so the model has to localize each hand as well as classify it.

Challenges

Uncertain number of hands and potential cheating gestures.

Hands appear small amid faces and background, making classification difficult.

No clear definition for each gesture; variations in shape, angle, and user style make exhaustive labeling impossible.

Dynamic timing: only gestures within a specific time window after the cue should be counted.

Model must be small, fast, and run on diverse mobile devices, especially Android.
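
The timing constraint in the list above can be illustrated with a small sketch. It assumes the detector emits one (timestamp, label) pair per frame and that the round's gesture is decided by majority vote inside the window. The function name, the one-second default, and the voting rule are illustrative assumptions, not details of the original system.

```python
from collections import Counter


def gesture_in_window(frames, cue_time, window=1.0):
    """Pick the gesture shown most often within `window` seconds after the cue.

    frames: list of (timestamp_seconds, label) per-frame predictions.
    Returns the winning label, or None if no gesture was seen in the window.
    """
    votes = Counter(
        label
        for timestamp, label in frames
        # Ignore frames before the cue, after the window, or with no gesture.
        if cue_time <= timestamp <= cue_time + window and label != "background"
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


frames = [
    (0.2, "rock"),      # shown before the cue, ignored
    (1.1, "paper"),
    (1.3, "paper"),
    (1.6, "scissors"),
    (2.5, "rock"),      # shown after the window closed, ignored
]
result = gesture_in_window(frames, cue_time=1.0)
```

Voting over several frames also smooths out single-frame misclassifications, at the cost of a short delay before the result is known.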

Algorithm Overview

We adopted a one‑stage object detection framework (SSD) for its speed and memory efficiency, added a feature pyramid network (FPN) for multi‑scale detection, and used a lightweight backbone (MnasNet) optimized for mobile.

Object Detection

A detector outputs both a class and bounding‑box coordinates for each hand. One‑stage models such as SSD predict classes and boxes in a single forward pass over shared convolutional features, which makes them faster than two‑stage methods.
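
One‑stage detectors typically emit many overlapping boxes for the same hand, so the raw output is usually pruned with non‑maximum suppression (NMS) based on intersection over union (IoU). The sketch below is a generic, class‑agnostic version in plain Python. The (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions, not values taken from the article.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thresh=0.5):
    """Greedy class-agnostic NMS.

    detections: list of (box, score, label).
    Returns the kept detections, highest score first.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not heavily overlap a better-scoring one.
        if all(iou(det[0], k[0]) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


detections = [
    ((0, 0, 10, 10), 0.9, "rock"),
    ((1, 1, 11, 11), 0.8, "paper"),       # overlaps the first box heavily
    ((50, 50, 60, 60), 0.7, "scissors"),  # a second, separate hand
]
kept = nms(detections)
```

Suppressing across classes is a deliberate choice in this sketch: one physical hand can only show one gesture, so two overlapping boxes with different labels are treated as competing guesses about the same hand.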

SSD Details

Multi‑scale feature maps predict objects at different resolutions.

Anchor boxes of various sizes and aspect ratios serve as priors for bounding‑box regression.

All‑convolutional design reduces memory usage.
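
The anchor priors described above can be sketched as follows. This follows the general SSD recipe: one set of boxes is centered on every feature‑map cell, with width and height derived from a scale and an aspect ratio. The feature‑map sizes, scales, and ratios used here are illustrative assumptions, not the configuration used for the game.

```python
import math


def make_anchors(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Generate anchors for one square feature map.

    Each anchor is (cx, cy, w, h) in coordinates normalized to [0, 1].
    `scale` is the anchor size relative to the image side.
    """
    anchors = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Center of the cell in normalized image coordinates.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ratio in aspect_ratios:
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                anchors.append((cx, cy, w, h))
    return anchors


# A fine map with small anchors plus a coarse map with large anchors.
fine = make_anchors(fmap_size=4, scale=0.2)
coarse = make_anchors(fmap_size=2, scale=0.5)
all_anchors = fine + coarse
```

The network then predicts, for every anchor, a set of class scores and four offsets that adjust the anchor into a final box.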

Backbone Network

We replaced the original VGG backbone with MnasNet, a mobile‑friendly architecture discovered via neural architecture search that balances accuracy against latency.
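
Part of what makes mobile backbones of this kind cheap is that they rely on depthwise separable convolutions instead of standard ones. The arithmetic below compares weight counts for a single layer. The layer sizes are made‑up examples, biases are ignored, and the article itself does not give a per‑layer breakdown.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 conv that mixes channels
    return depthwise + pointwise


standard = conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
```

For a 3×3 layer with 64 input and 128 output channels this gives 73,728 weights for the standard convolution versus 8,768 for the separable one, roughly an 8x reduction.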

Feature Fusion

FPN combines shallow high‑resolution features with deep semantic features, improving detection of small hands.
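
The top‑down fusion step can be sketched on toy single‑channel feature maps. This is a simplification: a real FPN also applies 1×1 lateral convolutions before the addition and a 3×3 convolution after it, both of which are omitted here. The function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))  # duplicate the row to double the height
    return out


def fuse(shallow, deep):
    """Add the upsampled deep map to the shallow, higher-resolution map."""
    up = upsample2x(deep)
    return [
        [s + u for s, u in zip(shallow_row, up_row)]
        for shallow_row, up_row in zip(shallow, up)
    ]


deep = [[10, 20],
        [30, 40]]                        # coarse map with strong semantics
shallow = [[1] * 4 for _ in range(4)]    # fine map with spatial detail
fused = fuse(shallow, deep)
```

The fused map keeps the resolution of the shallow layer while carrying the semantic signal of the deep one, which is what helps with small hands.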

Loss Function

The loss combines smooth L1 for localization with sigmoid‑based binary cross‑entropy for classification. We dropped the ambiguous “other” class and treated the problem as four classes (scissors, rock, paper, background). Focal loss was added so that training concentrates on hard examples.
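
A minimal sketch of the two loss terms, written in plain Python for readability. The alpha and gamma defaults are the commonly used focal‑loss values, not values reported in the article, and the relative weighting of the two terms is not given in the source, so none is assumed here.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss on independent sigmoid outputs, summed over classes.

    logits: raw per-class scores; targets: 0 or 1 per class.
    Not hardened against extreme logits (math.exp can overflow).
    """
    total = 0.0
    for x, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))
        p_t = p if y == 1 else 1.0 - p          # probability of the true label
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total


easy = sigmoid_focal_loss([4.0], [1])    # confident and correct
hard = sigmoid_focal_loss([-4.0], [1])   # confident and wrong
box_loss = smooth_l1([0.5, 2.0], [0.0, 0.0])
```

With gamma set to 0 the focal term reduces to alpha‑weighted binary cross‑entropy, which is a convenient sanity check when wiring it into a training loop.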

Data Collection & Annotation

Hundreds of short videos were crowdsourced, each showing users performing RPS gestures. A pre‑trained hand detector extracted hand crops, which were then labeled as rock, paper, scissors, other, or uncertain. Uncertain samples were assigned a weight of zero during training.
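
Giving uncertain samples a weight of zero can be sketched as a weighted average over per‑sample losses. Normalizing by the sum of weights is an assumption made for this example; the article only states that uncertain samples receive zero weight.

```python
def weighted_mean_loss(losses, labels):
    """Average per-sample losses, skipping samples labeled 'uncertain'."""
    weights = [0.0 if label == "uncertain" else 1.0 for label in labels]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # nothing trustworthy in this batch
    return sum(loss * w for loss, w in zip(losses, weights)) / total_weight


batch_loss = weighted_mean_loss(
    [1.0, 3.0, 100.0],
    ["rock", "paper", "uncertain"],
)
```

The uncertain sample stays in the batch but contributes nothing to the loss, so a questionable label cannot pull the model in either direction.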

Results

The final model is 1.9 MB, runs inference in about 17 ms on iOS devices, and reaches an mAP@0.5 of 0.984 on internal test data. During the Double‑11 event, the game generated more than 16 million page views and more than 10 million unique visitors, and merchant feedback was strong.

Future Work

Handling crowded offline scenes with many hands.

Improving detection of very small hands in full‑body shots.

Further optimizing inference speed on low‑end devices.

Overall, the RPS game demonstrates how a well‑engineered, mobile‑friendly object detection pipeline can turn a simple interactive concept into a high‑impact commercial solution.
