Can LLMs Truly Mimic Human Shopping Behavior? The OPeRA Dataset and Evaluation
The paper introduces OPeRA, a step-wise online-shopping dataset that captures observations, personas, rationales, and actions from real users, and uses it to benchmark LLMs on next-action prediction. Even top models such as GPT-4.1 reach only about 20% accuracy on fine-grained actions; persona information offers limited benefit, while historical rationales prove crucial.
OPeRA dataset
The OPeRA (Observation, Persona, Rationale, Action) dataset records step-wise online-shopping interactions from real users via a browser plugin. It contains 698 shopping sessions from 51 participants, totaling 28,904 timestamped actions. Each record includes the page observation (content and screenshots), the user’s action, the rationale entered at the decision point, and persona information (age, gender, shopping preferences, etc.).
Next Action Prediction task
Given a user’s historical action sequence, observations, rationales, and persona within the current session, the model must predict the next action.
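One plausible way to pose this task to an LLM is to serialize the session history into a prompt. The function below is a hypothetical sketch, not the paper's actual prompt format; the dict keys and the `include_rationales` toggle are assumptions (the toggle mirrors the ablation discussed later):

```python
def build_prompt(steps, persona=None, include_rationales=True):
    """Serialize session history into a next-action-prediction prompt.

    `steps` is a list of dicts with 'observation', 'action', and optional
    'rationale' keys -- an assumed record layout, not OPeRA's actual one.
    """
    lines = []
    if persona is not None:
        lines.append(f"User profile: {persona}")
    for i, step in enumerate(steps):
        # Truncate page content to keep the prompt within context limits
        lines.append(f"Step {i} observation: {step['observation'][:200]}")
        if include_rationales and step.get("rationale"):
            lines.append(f"Step {i} rationale: {step['rationale']}")
        lines.append(f"Step {i} action: {step['action']}")
    lines.append("Predict the user's next action:")
    return "\n".join(lines)
```

Dropping `persona` or setting `include_rationales=False` yields the ablated input variants whose effects the evaluation examines.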
Evaluation results
GPT‑4.1 achieves roughly 20 % accuracy on fine‑grained next‑action prediction.
Coarse‑grained metrics such as action‑type classification reach 40 %–50 % F1.
Other mainstream LLMs perform below these levels.
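The gap between the ~20% fine-grained accuracy and the 40–50% coarse F1 comes down to what counts as a match. A minimal sketch of the two metric granularities, assuming a hypothetical `type(args)` action syntax (the paper's exact action format and scoring may differ):

```python
def action_type(action: str) -> str:
    # Coarse label: the operation name before the first '(', e.g. "click"
    return action.split("(", 1)[0]

def fine_accuracy(preds, golds):
    # Fine-grained: full action string must match, target element included
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def coarse_macro_f1(preds, golds):
    # Coarse-grained: macro-averaged F1 over action types only
    types = set(map(action_type, preds + golds))
    f1s = []
    for t in types:
        tp = sum(action_type(p) == t and action_type(g) == t for p, g in zip(preds, golds))
        fp = sum(action_type(p) == t and action_type(g) != t for p, g in zip(preds, golds))
        fn = sum(action_type(p) != t and action_type(g) == t for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

A model that always clicks but on the wrong element scores perfectly on the coarse metric and zero on the fine one, which is exactly the wrong-button failure mode the error analysis below describes.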
Impact of input components
Adding persona information yields inconsistent gains: it can help coarse tasks but often adds noise for precise action prediction. Removing historical rationales causes a noticeable drop across most metrics, especially for high‑level outcomes, indicating that rationales provide essential intent signals.
Error analysis
Over 60 % of mistakes stem from clicking the wrong button, showing difficulty in locating the correct UI element. Models also frequently generate incorrect search inputs and fail to predict termination behaviors, reflecting a bias toward task completion rather than faithful human imitation.
Conclusion
OPeRA offers a high‑quality, step‑wise dataset for systematic evaluation of LLMs’ ability to simulate human decision‑making. Current models fall short in fine‑grained, persona‑aware prediction, highlighting the need for stronger individualization, multimodal perception, and reinforcement‑learning approaches.
Paper: https://arxiv.org/pdf/2506.05606
Dataset: https://huggingface.co/datasets/NEU-HAI/OPeRA