Can LLMs Truly Mimic Human Shopping Behavior? The OPeRA Dataset and Evaluation

The paper introduces OPeRA, a step‑wise online‑shopping dataset capturing observations, personas, rationales, and actions from real users, and uses it to benchmark LLMs on next‑action prediction. Even top models like GPT‑4.1 achieve only about 20 % accuracy on fine‑grained actions; persona information offers limited benefit, while rationales prove crucial.

Machine Heart

OPeRA dataset

The OPeRA (Observation, Persona, Rationale, Action) dataset records step‑wise online‑shopping interactions from real users via a browser plugin. It contains 698 shopping sessions from 51 participants, totaling 28,904 timestamped actions. Each record includes the page observation (content and screenshots), the user's action, the rationale entered at the decision point, and persona information (age, gender, shopping preferences, etc.).
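One way to picture a record is as a bundle of the four components the dataset's name abbreviates. The sketch below is illustrative only; the field names and types are assumptions, not the dataset's actual schema on Hugging Face:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Persona:
    # Demographic and preference attributes collected per participant
    age: int
    gender: str
    shopping_preferences: list[str]

@dataclass
class StepRecord:
    # One timestamped step in a shopping session
    timestamp: float                # seconds into the session
    observation_html: str           # page content at decision time
    screenshot_path: Optional[str]  # rendered page screenshot
    action: str                     # e.g. "type_search", "click", "terminate"
    rationale: Optional[str]        # free-text reason entered by the user
    persona: Persona

# A session is simply an ordered list of such steps
session = [
    StepRecord(0.0, "<html>...</html>", "step0.png",
               "type_search", "looking for a budget laptop",
               Persona(29, "female", ["electronics", "deals"]))
]
```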

Next Action Prediction task

Given a user’s historical action sequence, observations, rationales, and persona within a current session, the task asks a model to predict the next action.
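Concretely, the setup amounts to serializing the session history into a prompt and asking the model for the next step. The helper below is a minimal sketch of that assembly; the prompt format and step encoding are assumptions, not the paper's exact protocol:

```python
def build_prompt(persona, history, current_observation):
    """Assemble a next-action prediction prompt from session context.

    persona: dict of user attributes
    history: list of (action, rationale) tuples from earlier steps
    current_observation: text summary of the current page
    """
    lines = [f"User persona: {persona}"]
    for i, (action, rationale) in enumerate(history, 1):
        lines.append(f"Step {i}: action={action}; rationale={rationale}")
    lines.append(f"Current page: {current_observation}")
    lines.append("Predict the user's next action:")
    return "\n".join(lines)

prompt = build_prompt(
    {"age": 29, "preferences": ["electronics"]},
    [("type_search", "looking for a budget laptop"),
     ("click_result", "this one is in my price range")],
    "Product page for a 14-inch laptop, $499",
)
```

The returned string would then be sent to the LLM under evaluation, and its completion parsed as the predicted action.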

Evaluation results

GPT‑4.1 achieves roughly 20 % accuracy on fine‑grained next‑action prediction.

Coarse‑grained metrics such as action‑type classification reach 40 %–50 % F1.

Other mainstream LLMs perform below these levels.
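The two levels of scoring above can be made concrete: fine‑grained accuracy requires an exact match on the full action, while coarse‑grained evaluation computes F1 per action type and averages. A stdlib‑only sketch (the action strings are invented examples):

```python
def exact_match_accuracy(preds, golds):
    # Fine-grained: prediction must match the full action string exactly
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def macro_f1(preds, golds):
    # Coarse-grained: F1 per action type, averaged over all types seen
    labels = set(golds) | set(preds)
    f1s = []
    for label in labels:
        tp = sum(p == g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(g == label and p != label for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

golds = ["click:buy_button", "search:laptop", "terminate"]
preds = ["click:buy_button", "search:notebook", "click:next_page"]
acc = exact_match_accuracy(preds, golds)  # 1/3: only the first matches
```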

Impact of input components

Adding persona information yields inconsistent gains: it can help coarse tasks but often adds noise for precise action prediction. Removing historical rationales causes a noticeable drop across most metrics, especially for high‑level outcomes, indicating that rationales provide essential intent signals.

Error analysis

Over 60 % of mistakes stem from clicking the wrong button, showing difficulty in locating the correct UI element. Models also frequently generate incorrect search inputs and fail to predict termination behaviors, reflecting a bias toward task completion rather than faithful human imitation.

Conclusion

OPeRA offers a high‑quality, step‑wise dataset for systematic evaluation of LLMs’ ability to simulate human decision‑making. Current models fall short in fine‑grained, persona‑aware prediction, highlighting the need for stronger individualization, multimodal perception, and reinforcement‑learning approaches.

Paper: https://arxiv.org/pdf/2506.05606

Dataset: https://huggingface.co/datasets/NEU-HAI/OPeRA

Tags: AI, LLM, evaluation, dataset, human behavior simulation, online shopping
Written by Machine Heart, a professional AI media and industry service platform.