Introducing LifeSim: The First Long‑Horizon User Life Simulator Redefining Personalized LLM Evaluation

LifeSim introduces a long‑horizon user life simulation framework that jointly models user cognition via a BDI engine and external environment, enabling realistic evaluation of personalized LLM assistants through the LifeSim‑Eval benchmark, which reveals current models excel at explicit intents but struggle with hidden intents and long‑term user understanding.

Machine Heart

Recent advances in large language models (LLMs) have accelerated progress on personalized assistant tasks, but existing benchmarks remain detached from real‑world user‑assistant interactions. Two factors they fail to capture are identified: (1) the complex external environment (time, location, weather, life events) that shapes user needs, and (2) dynamic user cognition, in which intentions are influenced by long‑term preferences, personality traits, recent experiences, and current mental state. Public long‑term interaction data are scarce because of privacy and ethical constraints, making realistic evaluation difficult.

To address these challenges, researchers from Fudan University and Shanghai Chuangzhi Institute propose LifeSim, a long‑horizon user life simulation framework for personalized assistant evaluation. LifeSim models both internal user cognition and the external physical environment, generating coherent life trajectories, event sequences, and multi‑turn interactions. The framework consists of four components:

User profile pool: millions of synthetic users, each with demographic attributes, Big‑Five personality traits, and long‑term preferences.

BDI‑based cognitive engine:

Belief – long‑term profile combined with short‑term situational awareness.

Desire – current needs matched from a real‑user demand library.

Intention – actionable goals derived from profile, recent experiences, and current context.

Environment‑driven event engine: grounds simulations in real travel trajectories, injecting time and location factors to produce plausible life events.

User‑behavior engine: generates multi‑turn dialogue by modeling memory perception, emotional reasoning, and action selection, ensuring responses are natural and consistent with the user profile and context. Both automatic and human evaluations confirm the behavior engine’s effectiveness.
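The Belief–Desire–Intention update described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's implementation; the names `bdi_step`, `demand_library`, and the trigger-matching rule are assumptions introduced here for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    demographics: dict   # e.g. {"age": 30, "city": "Shanghai"}
    big_five: dict       # Big-Five personality scores
    preferences: dict    # long-term preferences, e.g. {"food": "spicy"}

@dataclass
class BDIState:
    belief: dict = field(default_factory=dict)  # profile + situational awareness
    desire: str = ""                            # need matched from a demand library
    intention: str = ""                         # actionable goal for this scene

def bdi_step(profile: UserProfile, situation: dict, demand_library: dict) -> BDIState:
    """One cognitive update: fuse long-term profile with the current situation,
    match a desire from a (hypothetical) demand library, derive an intention."""
    state = BDIState()
    # Belief: long-term preferences combined with short-term situational awareness.
    state.belief = {**profile.preferences, **situation}
    # Desire: first library entry whose trigger conditions all hold in the belief.
    for need, triggers in demand_library.items():
        if all(state.belief.get(k) == v for k, v in triggers.items()):
            state.desire = need
            break
    # Intention: turn the matched desire into an actionable goal.
    if state.desire:
        state.intention = f"ask assistant to help with: {state.desire}"
    return state
```

For example, a user whose belief includes `{"time": "noon"}` and a library entry `{"find lunch": {"time": "noon"}}` would yield the desire `"find lunch"` and a corresponding intention.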

Building on LifeSim, the authors introduce LifeSim‑Eval, an evaluation suite that tests models on 120 users across 1,200 scenarios covering eight common life domains. Two evaluation modes are defined:

Single‑scene mode: a dialogue of up to 20 turns with the simulated user, confined to the current scene.

Long‑horizon mode: responses must take the entire interaction history into account.

Core metrics include intent recognition, intent completion, preference reconstruction, profile alignment, and response naturalness/coherence.
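The practical difference between the two modes is what context the assistant under test receives. A minimal sketch, assuming a chat-style message list; the function name, message format, and the exact truncation rule are illustrative assumptions, not details from the paper:

```python
def build_context(history: list, current_scene: list, mode: str) -> list:
    """Assemble the messages an assistant sees under the two evaluation modes.

    history:       turns from all earlier scenes, e.g. {"role": ..., "content": ...}
    current_scene: turns of the ongoing scene
    """
    if mode == "single_scene":
        # Only the ongoing scene, capped at 20 turns.
        return current_scene[-20:]
    if mode == "long_horizon":
        # Full interaction history plus the ongoing scene.
        return history + current_scene
    raise ValueError(f"unknown mode: {mode}")
```

Under this framing, long‑horizon evaluation stresses exactly the capabilities the metrics target: preference reconstruction and profile alignment require reasoning over `history`, not just the current scene.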

Experiments evaluate a range of mainstream LLMs (GPT‑5, GPT‑4o, Claude Sonnet 4.5, DeepSeek‑V3.2, Qwen, Llama, gpt‑oss). Key findings:

Models handle explicit intents well but score more than 20 points lower on hidden intents in single‑scene settings.

In long‑horizon mode, explicit‑intent performance remains stable, while hidden‑intent performance degrades further as history length grows.

Simple memory mechanisms provide limited gains; augmenting models with explicit preference‑updating yields inconsistent improvements.

Three typical failure patterns emerge: rigid reasoning that sticks to early solutions, insufficient proactive questioning, and under‑utilization of available user profiles.
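The "explicit preference‑updating" augmentation mentioned in the findings can be illustrated with a minimal sketch: a store of extracted preferences that is refreshed after each turn and injected into the assistant's prompt. The `PreferenceStore` class and its interface are hypothetical, not the authors' method.

```python
class PreferenceStore:
    """Minimal explicit preference-updating memory (illustrative only):
    newly extracted (key, value) preferences overwrite stale ones, and the
    current store is rendered into the assistant's system prompt."""

    def __init__(self):
        self.prefs = {}

    def update(self, extracted: dict) -> None:
        # Newer observations overwrite earlier, possibly outdated preferences.
        self.prefs.update(extracted)

    def to_prompt(self) -> str:
        # Render the store as a prompt fragment for the assistant.
        if not self.prefs:
            return "No known user preferences."
        lines = [f"- {k}: {v}" for k, v in sorted(self.prefs.items())]
        return "Known user preferences:\n" + "\n".join(lines)
```

Even with such a mechanism, the reported gains are inconsistent, which suggests the bottleneck is less about storing preferences than about when and how models consult them.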

These results indicate that while current LLMs can address surface‑level user requests, they struggle with implicit intent inference, long‑term preference reasoning, and dynamic adaptation in realistic, multi‑day life scenarios. LifeSim and LifeSim‑Eval thus offer a more faithful simulation environment and a new avenue for generating personalized synthetic data.

Paper: "LifeSim: Long‑Horizon User Life Simulator for Personalized Assistant Evaluation" (arXiv:2603.12152). Repository: https://github.com/dfy37/lifesim

