How Alibaba Digitizes In‑Store Foot Traffic with AI and RFID Fusion

This article details Alibaba's end‑to‑end solution for digitizing offline retail foot traffic, combining existing surveillance cameras, RFID tags, and advanced AI techniques such as lightweight YOLO detection, knowledge distillation, and multi‑level pedestrian re‑identification to capture, analyze, and act on shopper behavior for both business operations and personalized in‑store experiences.

Alibaba Cloud Developer

Overall Solution

The system covers the entire customer journey: product‑selection guidance outside the store, in‑store traffic attraction, crowd profiling, trajectory tracking, collection of product‑person interaction data, smart fitting‑room recommendations, and online re‑engagement after the visit. Together these form a complete pipeline that links offline behavior to online traffic distribution.

Foot Traffic Digitization Exploration

The hardware reuses each store's existing surveillance cameras and RFID tags, and the video and radio‑frequency signals are processed on edge GPU terminals. Facial attribute detection, pedestrian detection, and cross‑camera re‑identification produce heatmaps and movement tracks, while fusing camera and RFID data pinpoints product browsing and try‑on events.

Pedestrian Detection

Detecting pedestrians accurately, and estimating their gender, age, and visit frequency, is the foundation of the whole analysis. YOLO‑based detectors run in near real time on high‑end GPUs but are too heavy for edge terminals, so we optimized the model for the retail environment.

Network Structure Simplification and Optimization

Building on the YOLO framework, we designed a lightweight real‑time detector with a Tiny DarkNet backbone, Spatial Pyramid Pooling, and feature fusion. The resulting model is one‑tenth the size of the original, loses less than 2% in detection accuracy, and runs at 268% higher FPS on a Tesla P4, which makes deployment on mobile and chip‑level devices feasible.
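The article does not list its SPP settings. As a rough, pure‑Python illustration, the sketch below follows the public YOLOv3‑SPP variant: the input is max‑pooled with kernels 5, 9, and 13 at stride 1 with "same" padding, and the pooled maps are stacked with the input along the channel axis, so the block widens the receptive field without changing the spatial size. The kernel sizes and function names are assumptions; a real detector would use a framework layer rather than nested lists.

```python
from typing import List, Sequence

Channel = List[List[float]]  # H x W
FeatureMap = List[Channel]   # C x H x W


def max_pool_same(ch: Channel, k: int) -> Channel:
    """k x k max pooling, stride 1; clipping the window at the borders
    is equivalent to padding with -inf, so H and W are preserved."""
    h, w = len(ch), len(ch[0])
    r = k // 2
    return [
        [
            max(
                ch[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            )
            for j in range(w)
        ]
        for i in range(h)
    ]


def spp(fmap: FeatureMap, kernels: Sequence[int] = (5, 9, 13)) -> FeatureMap:
    """SPP block with assumed YOLOv3-SPP kernels: the input channels, then the
    channels pooled with each kernel size, stacked along the channel axis."""
    pooled = [max_pool_same(ch, k) for k in kernels for ch in fmap]
    return fmap + pooled


if __name__ == "__main__":
    x = [[[float(i * 4 + j) for j in range(4)] for i in range(4)]]  # 1 x 4 x 4
    y = spp(x)
    print(len(y), len(y[0]), len(y[0][0]))  # 4 4 4
```

With C input channels the block outputs 4C channels, so a 1x1 convolution usually follows to bring the width back down.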

Knowledge Distillation Further Optimization

We then applied knowledge distillation, with a full‑size YOLOv3 as the teacher and our compact model as the student. We designed a hint‑layer loss on intermediate features and concentrated the training signal on unstable regression predictions, which effectively performs online hard‑example mining. This gave further compression without sacrificing accuracy.
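The article does not publish its distillation loss. The sketch below shows two published ingredients of the kind it describes: a FitNets‑style hint loss [3] that pulls the student's intermediate features toward the teacher's, and the teacher‑bounded regression loss of Chen et al. [4], which penalizes a student box only while its error exceeds the teacher's. The latter is one concrete way to focus on hard regression cases; whether the authors used it is an assumption, as are the weights `hint_weight` and `bound_weight` and the helper names.

```python
from typing import Iterable, Sequence, Tuple

Box = Sequence[float]  # e.g. (x, y, w, h) regression outputs


def hint_loss(student_feat: Sequence[float], teacher_feat: Sequence[float]) -> float:
    """FitNets-style hint loss: mean squared distance between the (already adapted)
    student hint-layer features and the teacher's guided-layer features."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("adapt student features to the teacher's size first")
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)


def sq_err(pred: Box, target: Box) -> float:
    return sum((p - y) ** 2 for p, y in zip(pred, target))


def teacher_bounded_reg_loss(student: Box, teacher: Box, gt: Box, margin: float = 0.0) -> float:
    """Bounded regression loss from Chen et al. [4] for one box: the student's
    L2 error counts only while it (plus margin) exceeds the teacher's error."""
    s_err, t_err = sq_err(student, gt), sq_err(teacher, gt)
    return s_err if s_err + margin > t_err else 0.0


def distill_loss(student_feat: Sequence[float], teacher_feat: Sequence[float],
                 boxes: Iterable[Tuple[Box, Box, Box]],
                 hint_weight: float = 1.0, bound_weight: float = 0.5) -> float:
    """Extra loss added to the usual detection loss; boxes are (student, teacher, gt).
    hint_weight and bound_weight are hypothetical values."""
    boxes = list(boxes)
    reg = sum(teacher_bounded_reg_loss(s, t, g) for s, t, g in boxes) / max(len(boxes), 1)
    return hint_weight * hint_loss(student_feat, teacher_feat) + bound_weight * reg
```

Boxes that the student already regresses at least as well as the teacher contribute nothing, so the extra gradient goes to the boxes it still gets wrong, similar in spirit to the online hard‑example mining described above.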

Pedestrian Re‑identification

Re‑identification links the same shopper across multiple cameras, enabling analysis of movement between store zones. Direct global feature extraction struggles with pose variation, occlusion, and similar clothing.

Pedestrian Feature Extraction

Local‑feature based methods divide the image into multiple parts, allowing the model to focus on discriminative details such as clothing patterns, which improves robustness against viewpoint changes and lighting differences.
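Part‑based pooling in the style of PCB [5] is easy to show concretely. In this sketch a backbone feature map, given as H rows by W columns of C‑dimensional vectors, is cut into horizontal stripes and each stripe is average‑pooled into its own part descriptor. `parts=6` is PCB's default and only an assumption here; the article does not say how many parts it uses.

```python
from typing import List

Vector = List[float]


def stripe_pool(fmap: List[List[Vector]], parts: int = 6) -> List[Vector]:
    """Split an H x W x C feature map into `parts` horizontal stripes and
    average-pool each stripe into one C-dimensional part descriptor."""
    h = len(fmap)
    if h < parts:
        raise ValueError("feature map has fewer rows than requested parts")
    c = len(fmap[0][0])
    descriptors = []
    for p in range(parts):
        # Integer division spreads any leftover rows across the stripes.
        top, bottom = p * h // parts, (p + 1) * h // parts
        acc = [0.0] * c
        n = 0
        for row in fmap[top:bottom]:
            for vec in row:
                for k in range(c):
                    acc[k] += vec[k]
                n += 1
        descriptors.append([a / n for a in acc])
    return descriptors
```

Each part descriptor is then typically fed to its own classifier during training, which forces every stripe to be discriminative on its own.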

Multi‑level Local Feature Fusion

Our architecture combines multi‑scale local branches with a global branch, so the network learns fine‑grained part details and a holistic representation at the same time. This fusion raises Rank@1 on the Market‑1501 dataset to 96.19%, compared with 89.9% for a purely global baseline and 92.5% for a purely local one.
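To make the fusion and the reported metric concrete, the sketch below builds a descriptor by concatenating a global vector with part vectors from several granularities and L2‑normalizing the result (an MGN‑style [9] choice; the article does not say how its branches are combined), then computes Rank@1 under the Market‑1501 [8] rule that gallery images showing the query's person from the query's own camera are excluded. The benchmark's junk and distractor images are not modelled, and all names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid dividing by zero
    return [x / norm for x in v]


def fuse(global_feat: Sequence[float], part_feats: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the global vector with every part vector, then L2-normalise."""
    fused = list(global_feat)
    for part in part_feats:
        fused.extend(part)
    return l2_normalize(fused)


Entry = Tuple[List[float], int, int]  # (fused descriptor, person_id, camera_id)


def rank1(queries: List[Entry], gallery: List[Entry]) -> float:
    """Fraction of queries whose most similar valid gallery image has the same id."""
    hits = valid = 0
    for q_vec, q_id, q_cam in queries:
        # Market-1501 convention: same person seen by the same camera is not a valid match.
        candidates = [(g_vec, g_id) for g_vec, g_id, g_cam in gallery
                      if not (g_id == q_id and g_cam == q_cam)]
        if not any(g_id == q_id for _, g_id in candidates):
            continue  # query has no valid true match; skip it
        valid += 1
        # For unit vectors, cosine similarity is just the dot product.
        _, best_id = max(candidates, key=lambda c: sum(a * b for a, b in zip(q_vec, c[0])))
        if best_id == q_id:
            hits += 1
    return hits / valid if valid else 0.0
```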

Cross‑Dataset Re‑identification Exploration

To adapt models to diverse store environments without per‑store labeling, we incorporated spatio‑temporal patterns using a mixed model that combines visual classifier scores with camera transition probabilities, enabling unsupervised domain adaptation.
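The spatio‑temporal idea can be sketched as follows, loosely after [7]. Pairs that the visual model matches with high confidence in a new store are used to histogram how long shoppers take to move between each pair of cameras; at query time the visual score is multiplied by the smoothed probability of the observed time gap. This product rule is a simplification chosen for illustration, not the exact Bayesian fusion of [7] or the authors' model, and the bin width, bin count, and smoothing constant are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


class TransitionModel:
    """Histogram of transfer times per ordered camera pair, with Laplace smoothing.

    bin_seconds, num_bins and alpha are hypothetical defaults.
    """

    def __init__(self, bin_seconds: float = 10.0, num_bins: int = 60, alpha: float = 1.0):
        self.bin_seconds = bin_seconds
        self.num_bins = num_bins
        self.alpha = alpha  # keeps unseen delays at a small non-zero probability
        self.hist: Dict[Tuple[int, int], Counter] = defaultdict(Counter)

    def _bin(self, dt: float) -> int:
        # Delays beyond the last bin are lumped into it.
        return min(int(abs(dt) // self.bin_seconds), self.num_bins - 1)

    def fit(self, pairs: Iterable[Tuple[int, int, float]]) -> "TransitionModel":
        """pairs: (camera_a, camera_b, seconds between sightings) from confident visual matches."""
        for cam_a, cam_b, dt in pairs:
            self.hist[(cam_a, cam_b)][self._bin(dt)] += 1
        return self

    def prob(self, cam_a: int, cam_b: int, dt: float) -> float:
        counts = self.hist.get((cam_a, cam_b), Counter())
        total = sum(counts.values())
        return (counts[self._bin(dt)] + self.alpha) / (total + self.alpha * self.num_bins)


def fused_score(visual: float, model: TransitionModel, cam_a: int, cam_b: int, dt: float) -> float:
    """High only if the two images look alike AND the time gap is plausible for the camera pair."""
    return visual * model.prob(cam_a, cam_b, dt)
```

No labels from the new store are needed: the histograms come from the visual model's own confident matches, which is what makes this style of adaptation unsupervised.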

Human‑Object Action Detection

Beyond trajectories, we capture shopper‑product interactions. By fusing visual cues with changes in the RFID signal (RSSI and phase), we detect when a tagged item is moved. Frequency‑domain features from an FFT of the RFID signal, together with the Jensen‑Shannon (JS) divergence between consecutive samples, raise motion‑detection accuracy to 94%.
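The RFID features named above can be sketched in a few lines. The example computes DFT magnitudes of a mean‑removed window of readings (a direct DFT stands in for an FFT so the code needs only the standard library; the magnitudes are the same) and the JS divergence between the RSSI histograms of two consecutive windows. Window length, the histogram range (assumed -80 to -30 dBm), the bin count, and the `motion_features` helper are hypothetical; the article does not give them.

```python
import cmath
import math
from typing import List, Sequence


def dft_magnitudes(window: Sequence[float]) -> List[float]:
    """|X_k| for k = 0..n/2 of the mean-removed window (same values an FFT gives)."""
    n = len(window)
    mean = sum(window) / n
    x = [v - mean for v in window]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def histogram(window: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalised histogram of readings; values outside [lo, hi) go to the edge bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in window:
        i = int((v - lo) / width)
        counts[min(max(i, 0), bins - 1)] += 1
    return [c / len(window) for c in counts]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (natural log): 0 for identical inputs, at most ln 2."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def motion_features(prev: Sequence[float], cur: Sequence[float],
                    lo: float = -80.0, hi: float = -30.0, bins: int = 10) -> List[float]:
    """Hypothetical classifier input: spectrum of the current window (DC dropped)
    plus how far the RSSI distribution moved since the previous window."""
    js = js_divergence(histogram(prev, lo, hi, bins), histogram(cur, lo, hi, bins))
    return dft_magnitudes(cur)[1:] + [js]
```

A stationary tag gives a flat spectrum and a near‑zero divergence; a tag being handled produces energy at non‑zero frequencies and a shifted RSSI distribution.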

For visual action classification, we fine‑tuned MobileNet on crops of detected pedestrians and tuned the output logits to raise recall on positive examples while maintaining precision.
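The article says the logits were tuned to raise recall while keeping precision, but gives no procedure. One simple way to do that, sketched here as an assumption, is to add a constant offset to the positive‑class logit and pick, on a labelled validation set, the offset with the highest recall whose precision stays above a floor. `best_offset` and `min_precision` are hypothetical names.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(logits: Sequence[float], labels: Sequence[int], offset: float) -> Tuple[float, float]:
    """Precision and recall when predicting positive iff logit + offset > 0."""
    tp = fp = fn = 0
    for z, y in zip(logits, labels):
        predicted = z + offset > 0.0  # shifting the logit == moving the decision threshold
        if predicted and y == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0  # no positives predicted: vacuously precise
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_offset(logits: Sequence[float], labels: Sequence[int],
                min_precision: float = 0.9) -> Optional[float]:
    """Offset with the highest validation recall whose precision stays >= min_precision.
    Returns None if no candidate meets the floor."""
    # Each candidate places the boundary just below one observed logit,
    # so that logit and every larger one become positive.
    candidates = sorted({-z + 1e-6 for z in logits})
    best, best_recall = None, -1.0
    for off in candidates:  # ascending: most conservative offsets first
        precision, recall = precision_recall(logits, labels, off)
        if precision >= min_precision and recall > best_recall:
            best, best_recall = off, recall
    return best
```

Because ties in recall keep the first (most conservative) offset, the chosen setting adds no more positives than it needs to.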

Foot Traffic Digitization Applications

The collected data supports both offline operations and offline‑to‑online traffic distribution. By integrating with existing interactive screens, we deliver personalized coupons and recommendations, guiding shoppers from entry points to specific stores and from stores to tailored product suggestions.

External Attraction Screens

These screens engage users with interactive games and personalized offers, directing them toward target stores based on demographic matching and store traffic statistics.
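The article does not describe how the screens pick a target store. Purely as an illustration, the hypothetical rule below scores each store by the share of its past visitors who fall in the shopper's estimated demographic segment and discounts stores that are currently crowded, so traffic is steered toward stores that suit the shopper and still have room. The `Store` fields, segment labels, and `crowd_weight` are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # hypothetical (gender, age band) label, e.g. ("female", "25-34")


@dataclass
class Store:
    name: str
    segment_share: Dict[Segment, float]  # fraction of past visitors in each segment
    occupancy: float  # current visitors / capacity, 0..1


def rank_stores(visitor: Segment, stores: List[Store], crowd_weight: float = 0.5) -> List[str]:
    """Order stores by demographic affinity, discounted by how crowded they are now."""
    def score(s: Store) -> float:
        affinity = s.segment_share.get(visitor, 0.0)
        return affinity * (1.0 - crowd_weight * min(s.occupancy, 1.0))
    return [s.name for s in sorted(stores, key=score, reverse=True)]
```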

In‑Store Fitting Screens

Smart mirrors display product details, discounts, and complementary recommendations while shoppers try on items, using the captured interaction data to personalize suggestions without relying on facial ID.
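One way the mirrors could suggest complementary items without identifying anyone, sketched here as an assumption rather than the deployed method, is item‑to‑item co‑occurrence: each fitting‑room session is simply the set of RFID‑tagged SKUs detected together, and the mirror suggests the SKUs most often tried on alongside what is currently in the room. The session definition and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set


def build_cooccurrence(sessions: Iterable[Set[str]]) -> Dict[str, Counter]:
    """Count how often each pair of SKUs appeared in the same fitting-room session."""
    co: Dict[str, Counter] = defaultdict(Counter)
    for skus in sessions:
        for a, b in combinations(sorted(skus), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(co: Dict[str, Counter], current: Set[str], k: int = 3) -> List[str]:
    """Top-k SKUs most often co-tried with what is in the room, excluding those items."""
    scores: Counter = Counter()
    for sku in current:
        scores.update(co.get(sku, Counter()))
    for sku in current:
        scores.pop(sku, None)
    return [sku for sku, _ in scores.most_common(k)]
```

Sessions are keyed by room and time window, so no face or shopper identity is ever stored.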

Engineering Implementation

AI inference originally ran as a serial pipeline that left the CPU and GPU underutilized. We redesigned it as a multi‑process pipeline that exchanges data through shared memory (mmap + ctypes). Inter‑process transfer became 300 times faster than with pipes, and a single GTX 1070 can now process up to 16 video streams simultaneously.
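A minimal sketch of the mmap + ctypes approach follows. The article gives no code, so the single‑slot layout, the `Header` fields, the tiny frame size, and the busy‑wait are assumptions made to keep it short. The producer writes a frame into an anonymous shared mapping through ctypes views, and a forked consumer reads it straight from the mapping, so no bytes travel through a pipe. A production pipeline would keep a ring of slots per stream and signal with a semaphore instead of polling.

```python
import ctypes
import mmap
import multiprocessing as mp

H, W, C = 4, 4, 3  # hypothetical tiny frame so the demo stays small
FRAME_BYTES = H * W * C


class Header(ctypes.Structure):
    # Hypothetical layout: a "frame ready" flag plus a frame counter.
    _fields_ = [("ready", ctypes.c_int32), ("frame_id", ctypes.c_int64)]


SHM_SIZE = ctypes.sizeof(Header) + FRAME_BYTES


def views(buf: mmap.mmap):
    """ctypes objects that alias the shared buffer; creating them copies nothing."""
    header = Header.from_buffer(buf)
    pixels = (ctypes.c_uint8 * FRAME_BYTES).from_buffer(buf, ctypes.sizeof(Header))
    return header, pixels


def write_frame(buf: mmap.mmap, frame_id: int, data: bytes) -> None:
    header, pixels = views(buf)
    ctypes.memmove(pixels, data, FRAME_BYTES)  # payload first ...
    header.frame_id = frame_id
    header.ready = 1  # ... then publish it


def read_frame(buf: mmap.mmap):
    """Return (frame_id, frame bytes) if a frame is waiting, else None."""
    header, pixels = views(buf)
    if not header.ready:
        return None
    frame = bytes(pixels)  # copy out so the slot can be reused
    header.ready = 0
    return header.frame_id, frame


def consumer(buf: mmap.mmap, out) -> None:
    while (item := read_frame(buf)) is None:
        pass  # busy-wait keeps the sketch short; use a semaphore in practice
    frame_id, frame = item
    out.put((frame_id, len(frame)))


if __name__ == "__main__":
    shm = mmap.mmap(-1, SHM_SIZE)  # anonymous MAP_SHARED mapping on POSIX
    ctx = mp.get_context("fork")  # forked children inherit the mapping
    results = ctx.Queue()
    proc = ctx.Process(target=consumer, args=(shm, results))
    proc.start()
    write_frame(shm, 42, bytes(range(FRAME_BYTES)))
    print(results.get())  # (42, 48)
    proc.join()
```

`fork` is used because an anonymous `mmap` is inherited only by forked children; on platforms without `fork`, a named file mapping or `multiprocessing.shared_memory` would be needed instead.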

Business Effect

The digitized foot‑traffic data is now integrated into Alibaba's retail products, providing merchants with daily visitor counts, hourly heatmaps, and demographic breakdowns. Stores can adjust layout, product placement, and promotional strategies based on real‑time crowd density and movement patterns, leading to more efficient operations and higher conversion rates.
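The aggregation behind these reports can be sketched as follows, assuming re‑identification has already produced track points `(visitor_id, timestamp, x, y)` in floor‑plan metres. The 1 m grid, the floor‑size parameters, and bucketing by UTC hour are illustrative assumptions; a store would use its local time zone.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Set, Tuple

Point = Tuple[str, float, float, float]  # (visitor_id, unix timestamp, x metres, y metres)


def aggregate(points: Iterable[Point], width_m: float, depth_m: float, cell_m: float = 1.0):
    """Return (unique visitors per day, {hour: rows x cols grid of sample counts})."""
    cols, rows = int(width_m // cell_m), int(depth_m // cell_m)
    heat: Dict[int, List[List[int]]] = {}
    visitors: Dict[str, Set[str]] = defaultdict(set)
    for vid, ts, x, y in points:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        visitors[t.date().isoformat()].add(vid)  # re-ID lets repeat sightings count once
        grid = heat.setdefault(t.hour, [[0] * cols for _ in range(rows)])
        r = min(max(int(y // cell_m), 0), rows - 1)
        c = min(max(int(x // cell_m), 0), cols - 1)
        grid[r][c] += 1  # counts track samples, so longer dwell weighs more
    daily_counts = {day: len(ids) for day, ids in visitors.items()}
    return daily_counts, heat
```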

References

[1] Redmon J, Farhadi A. YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767, 2018.

[2] Hinton G, Vinyals O, Dean J. Distilling the Knowledge in a Neural Network. NIPS Workshop, 2014.

[3] Romero A, Ballas N, Kahou S E, et al. FitNets: Hints for Thin Deep Nets. arXiv preprint arXiv:1412.6550, 2014.

[4] Chen G, Choi W, Yu X, et al. Learning Efficient Object Detection Models with Knowledge Distillation. Advances in Neural Information Processing Systems, 2017.

[5] Sun Y, Zheng L, Yang Y, Tian Q, Wang S. Beyond Part Models: Person Retrieval with Refined Part Pooling. ECCV, 2018.

[6] Li W, Zhu X, Gong S. Person Re‑Identification by Deep Joint Learning of Multi‑Loss Classification. IJCAI, 2017.

[7] Lv J, Chen W, Li Q, Yang C. Unsupervised Cross‑Dataset Person Re‑identification by Transfer Learning of Spatial‑Temporal Patterns. CVPR, 2018.

[8] Zheng L, Shen L, Tian L, Wang S, Wang J, Tian Q. Scalable Person Re‑identification: A Benchmark. ICCV, 2015.

[9] Wang G, Yuan Y, Chen X, Li J, Zhou X. Learning Discriminative Features with Multiple Granularities for Person Re‑Identification. ACM Multimedia, 2018.

[10] Liu S, Qi L, Qin H, et al. Path Aggregation Network for Instance Segmentation. CVPR, 2018.

[11] Liu T, Yang L, Li X‑Y, Huang H, Liu Y. TagBooth: Deep Shopping Data Acquisition Powered by RFID Tags. INFOCOM, 2015.

[12] MobileNet v1 documentation: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md
