How AI and RFID Combine to Track Customer‑Product Interactions in Retail

This article presents an AI‑driven framework that fuses video‑based customer action detection, RFID‑based product flip detection, and bipartite graph matching to determine when, where, and which customer interacts with which SKU in a retail environment. It covers the algorithms, their optimizations, and experimental results.

Alibaba Cloud Developer

Business Analysis

As AI technology merges with new‑retail scenarios, many creative applications have emerged, such as pedestrian‑based foot‑traffic counting, face‑based entry statistics, and image‑classification product recognition. To extract more value from "person‑product‑place" digitalization, we aim to understand customer‑product interaction by answering four questions: when, where, who, and which product?

1. Image‑Based Customer Action Detection Algorithm

The algorithm evolved from video‑level action detection to single‑frame image detection to meet deployment constraints.

1.1 Video Action Detection

Customer‑product interactions are temporal (e.g., picking up, flipping, trying on) and thus constitute a video action classification problem. Classic models such as CNN (2014), LRCN (2015), and I3D (2017) were considered. I3D uses 3D convolutions to capture spatio‑temporal features but suffers from high parameter count and over‑fitting in retail scenes.

We built a custom retail‑action dataset, pre‑filtered videos with Lucas–Kanade optical flow to increase positive sample ratio from 0.4% to 5%, and trained models on a distributed GPU cluster.

Model Optimization

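Hand‑region video crops: OpenPose estimates body keypoints and DeepSort tracks each person across frames, so the classifier sees only the area around the hands.

Hand‑action classifier: an I3D network (Inception‑V1 inflated to 3D convolutions), pruned by removing unnecessary Inception modules and halving channel depths.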

The pruned model improved accuracy from 80.5% to 87.0%, and further hyper‑parameter tuning reached 92%.

Conclusion

Challenges include high model complexity, massive video data size, and deployment latency.

1.2 Single‑Frame Image Action Detection

To reduce computation, we adopted MobileNet‑V1 (depth multiplier 0.5) as a lightweight binary classifier for suspicious actions on single images. After balancing the dataset, the model achieved ~89% accuracy with 90% recall.

Processing tens of thousands of images per store takes about ten minutes, dramatically lowering server load compared with video‑based methods.

2. RFID‑Based Product Flip Detection Algorithm

When a customer moves a tagged item, the RFID reader records changes in the tag's received signal strength (RSSI) and phase. Supervised models trained on 400×10 feature matrices (8 s windows sampled at 50 Hz, giving 400 rows) achieved 91.9% accuracy but generalized poorly when the environment changed.
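The window arithmetic can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the article does not say what the 10 columns are or whether windows overlap, so the per-sample row layout, the `sliding_windows` helper, and its `step` parameter are all assumptions.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 50   # sampling rate stated in the article
WINDOW_SECONDS = 8    # window length stated in the article
WINDOW_ROWS = SAMPLE_RATE_HZ * WINDOW_SECONDS  # 50 * 8 = 400 rows per window


def sliding_windows(
    samples: Sequence[Sequence[float]], step: int = WINDOW_ROWS
) -> List[Sequence[Sequence[float]]]:
    """Cut a stream of per-sample feature rows into windows of WINDOW_ROWS rows.

    `step` is a hypothetical parameter: step == WINDOW_ROWS gives
    non-overlapping windows, a smaller step gives overlapping ones.
    Trailing samples that do not fill a whole window are dropped.
    """
    last_start = len(samples) - WINDOW_ROWS
    return [samples[s:s + WINDOW_ROWS] for s in range(0, last_start + 1, step)]
```

With 20 s of readings (1000 rows of 10 features), the default step yields two full windows, each a 400×10 matrix, and the remaining 200 rows are discarded.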

We therefore introduced an unsupervised approach based on the Jensen–Shannon (JS) divergence between frequency‑domain representations of the RSSI and phase signals, which achieved up to 94% accuracy and generalized better across stores.
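One way to read this idea is sketched below, using only the Python standard library. The article does not describe how the spectra are built or compared, so everything here is an assumption: the normalized magnitude spectrum, the comparison against a "still" baseline window, the `is_flip` helper, and its threshold of 0.5 are hypothetical choices for illustration.

```python
import cmath
import math
from typing import List, Sequence


def magnitude_spectrum(signal: Sequence[float], eps: float = 1e-9) -> List[float]:
    """Normalized magnitude spectrum of a mean-removed signal (naive DFT).

    The result sums to 1, so it can be treated as a probability distribution.
    `eps` keeps every bin strictly positive and makes a flat signal uniform.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    mags = []
    for k in range(1, n // 2 + 1):  # skip the DC bin
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        mags.append(abs(coeff) + eps)
    total = sum(mags)
    return [m / total for m in mags]


def _kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence; symmetric and within [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def is_flip(baseline: Sequence[float], window: Sequence[float],
            threshold: float = 0.5) -> bool:
    """Flag a window whose spectrum diverges from a still-tag baseline.

    The threshold is a placeholder, not a value from the article.
    """
    return js_divergence(magnitude_spectrum(baseline),
                         magnitude_spectrum(window)) > threshold
```

Because the comparison is between distributions rather than raw signal levels, no labels are needed. Only the threshold has to be chosen, which matches the article's later note that refining detection thresholds is future work.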

3. Bipartite Graph Matching for Person‑Product Association

RFID detection provides product flip timestamps, while image detection supplies suspected customer actions. We link events that occur within a 5–15 s window of each other in a weighted bipartite graph (edge weight = sigmoid(time similarity) × MobileNet action probability × RFID action probability) and apply the Hungarian algorithm to obtain the optimal person‑product pairs.
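The matching step can be sketched in plain Python as below. The edge-weight formula follows the article, but the definition of "time similarity" is not given there, so the linear ramp in `edge_weight`, the 15 s cutoff, and the `match_events` helper with its `(timestamp, probability)` event tuples are assumptions. `hungarian_min` is a standard potentials-based formulation of the Hungarian algorithm, not the authors' code.

```python
import math
from typing import List, Tuple

MAX_GAP_S = 15.0  # upper end of the article's 5-15 s window (assumed cutoff)
Event = Tuple[float, float]  # (timestamp in seconds, detection probability)


def edge_weight(dt_s: float, p_image: float, p_rfid: float) -> float:
    """sigmoid(time similarity) * image probability * RFID probability.

    Assumption: similarity falls linearly from +3 at dt = 0 to -3 at the
    cutoff; pairs further apart than the cutoff have no edge (weight 0).
    """
    if abs(dt_s) > MAX_GAP_S:
        return 0.0
    similarity = 3.0 - 6.0 * abs(dt_s) / MAX_GAP_S
    return p_image * p_rfid / (1.0 + math.exp(-similarity))


def hungarian_min(cost: List[List[float]]) -> List[int]:
    """Minimum-cost assignment; needs rows <= columns. Returns column per row."""
    n, m = len(cost), len(cost[0])
    assert n <= m
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)  # row and column potentials
    p = [0] * (m + 1)    # p[j]: 1-based row matched to column j, 0 if free
    way = [0] * (m + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv = [math.inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], math.inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip matches back along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    row_to_col = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            row_to_col[p[j] - 1] = j - 1
    return row_to_col


def match_events(rfid: List[Event], image: List[Event]) -> List[Tuple[int, int]]:
    """Return (rfid_index, image_index) pairs maximizing total edge weight."""
    if not rfid or not image:
        return []
    w = [[edge_weight(ti - tr, pi, pr) for ti, pi in image] for tr, pr in rfid]
    transposed = len(rfid) > len(image)
    if transposed:  # the solver needs rows <= columns
        w = [list(col) for col in zip(*w)]
    assignment = hungarian_min([[-x for x in row] for row in w])  # max -> min
    pairs = [(c, r) if transposed else (r, c)
             for r, c in enumerate(assignment) if w[r][c] > 0]
    return sorted(pairs)
```

Negating the weights turns the maximum-weight problem into the minimum-cost form the solver expects, and pairs whose weight is 0 are dropped afterwards so that an event with no plausible partner stays unmatched.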

Splitting the graph into its connected components, which share no edges and can therefore be matched independently, reduced matching time from hours to minutes for thousands of daily events.
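A union-find pass is one simple way to perform this split. The sketch below assumes edges are given as `(rfid_index, image_index)` pairs with nonzero weight; the `split_components` name and that input format are hypothetical, and the article does not say which method the authors used.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Node = Tuple[str, int]  # ("rfid", index) or ("image", index)
Edge = Tuple[int, int]  # (rfid_index, image_index)


def split_components(edges: List[Edge]) -> List[List[Edge]]:
    """Group bipartite edges into connected components using union-find."""
    parent: Dict[Node, Node] = {}

    def find(x: Node) -> Node:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the two endpoints of every edge.
    for r, c in edges:
        parent[find(("rfid", r))] = find(("image", c))

    # Collect edges under the root of their component.
    groups: Dict[Node, List[Edge]] = defaultdict(list)
    for r, c in edges:
        groups[find(("rfid", r))].append((r, c))
    return sorted(groups.values())
```

The Hungarian algorithm runs in cubic time in the number of events, so solving k components of roughly n/k events each costs on the order of n³/k² instead of n³. Because edges exist only between events a few seconds apart, a day of events naturally breaks into many small components.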

Conclusion

The integrated system (image‑based action detection, RFID‑based flip detection, and bipartite matching) achieves up to 85.8% accuracy in linking customers to specific SKU interactions, demonstrating that person‑product association at SKU granularity and with time resolution on the order of seconds is feasible in retail.

4. Overall Conclusions and Future Work

Fine‑grained person‑product‑place association is practical.

Unsupervised RFID detection eases large‑scale deployment.

Lightweight image‑based action detection can be deployed at scale.

Future directions include expanding the image dataset, improving hand‑level localization, enhancing RFID hardware capacity, refining unsupervised detection thresholds, and incorporating coarse spatial cues into the matching process.

References

[1] Karpathy et al., "Large‑scale video classification with convolutional neural networks," CVPR 2014.

[2] Donahue et al., "Long‑term recurrent convolutional networks for visual recognition and description," CVPR 2015.

[3] Carreira and Zisserman, "Quo vadis, action recognition? a new model and the kinetics dataset," CVPR 2017.

[4] Cao et al., "Realtime multi‑person 2D pose estimation using part affinity fields," arXiv 2016.

[5] Wojke et al., "Simple online and realtime tracking with a deep association metric," ICIP 2017.

[6] Howard et al., "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv 2017.

[7] Liu et al., "Tagbooth: Deep shopping data acquisition powered by RFID tags," INFOCOM 2015.

[8] Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics 1955.
