Why Multi‑Turn LLM Evaluation Fails and How a User‑Simulator Can Fix It

The article explains that large language models can lose up to 35% of their performance in multi-turn conversations, critiques static single-turn evaluation methods, and proposes a dynamic user simulator, trained with loss masking, to generate realistic test turns and improve assessment reliability.


Performance drop in multi‑turn dialogue

Recent studies show that large language models (LLMs) lose about 35% of their accuracy when the conversation extends beyond a single turn. For example, a model that achieves 95% accuracy on isolated questions drops to roughly 73% after five turns.

Why conventional evaluation is insufficient

Typical evaluation splits a multi-turn session (Q1 A1 / Q2 A2 / Q3 A3) into independent single-turn samples: Q1→A1, Q1 A1 Q2→A2, Q1 A1 Q2 A2 Q3→A3. This ignores a causal dependency: the model's answer at turn t shapes the user's input at turn t+1, which is essential in role-play, phone-call, and agent scenarios. Consequently, static datasets cannot capture the session collapse caused by early errors.
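
To make the problem concrete, here is a minimal sketch of the static split (turn contents are hypothetical placeholders): each logged answer becomes an independent sample whose context is frozen text, not whatever the model under test would actually have said.

```python
# A minimal sketch of the conventional "static" split: one logged session
# becomes three independent evaluation samples with frozen context.
session = [
    ("user", "Q1"), ("assistant", "A1"),
    ("user", "Q2"), ("assistant", "A2"),
    ("user", "Q3"), ("assistant", "A3"),
]

samples = []
for i, (role, text) in enumerate(session):
    if role == "assistant":
        # Context = everything before this answer; target = the answer itself.
        context = [t for _, t in session[:i]]
        samples.append({"prompt": context, "target": text})

# samples[1] is always "Q1 A1 Q2 -> A2", regardless of what the model under
# test would actually have answered for A1: the causal link is frozen.
for s in samples:
    print(" ".join(s["prompt"]), "->", s["target"])
```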

User simulator as a dynamic test generator

A user simulator is a model trained to generate the next user utterance conditioned on the current model answer. Training focuses on the "input" side of the dialogue: real user utterances serve as the targets, or synthetic question-asking data is constructed. The core idea is to reverse the usual input-output direction and treat the LLM's answer as part of the next prompt.
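
One simple way to build such training data is to swap the roles in logged sessions so that user utterances become the generation targets. The sketch below assumes a generic chat-message schema, not any specific library's format.

```python
# A minimal sketch of reversing the input-output direction: relabel a logged
# session so a causal LM learns to speak as the user.
def to_simulator_example(session):
    """Swap roles so standard chat fine-tuning supervises the user side."""
    swapped = []
    for role, text in session:
        # The original assistant answer becomes context for the simulator,
        # and the original user utterance becomes the supervised target.
        swapped.append({
            "role": "assistant" if role == "user" else "user",
            "content": text,
        })
    return swapped

session = [("user", "Hi, I need to reset my password."),
           ("assistant", "Sure, what is your account email?"),
           ("user", "It's alice@example.com.")]
print(to_simulator_example(session))
```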

Training with multi‑turn loss masking

One effective technique is multi-turn loss masking: during fine-tuning, the loss is computed only on the tokens the model should learn to produce, while all other tokens are masked and serve purely as context. For a user simulator this reverses the usual recipe: the assistant's answers are masked, and the user-side utterances contribute to the loss. The same token-level control over which tokens enter the loss is also useful in policy-gradient training, where some tokens (e.g., tool-generated responses) should be excluded from updates.
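
A minimal sketch of the masking step, using the Hugging Face convention that label positions set to -100 are ignored by the cross-entropy loss; the helper function and its per-token role annotations are illustrative assumptions.

```python
# A minimal sketch of multi-turn loss masking. Which side is masked depends
# on what you are training; for a user simulator, assistant-answer tokens
# are masked and user-side tokens are supervised.
import torch

def build_masked_labels(input_ids, token_roles, supervise_role="user"):
    """token_roles[i] is the role that produced token i ('user'/'assistant')."""
    labels = input_ids.clone()
    for i, role in enumerate(token_roles):
        if role != supervise_role:
            labels[i] = -100  # excluded from the loss, still visible as context
    return labels

input_ids = torch.tensor([101, 7592, 102, 2129, 2024, 102])  # toy token ids
roles = ["assistant", "assistant", "assistant", "user", "user", "user"]
print(build_masked_labels(input_ids, roles))
# tensor([-100, -100, -100, 2129, 2024,  102])
```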

Evaluation workflow

1. Fine-tune a user simulator on user-side data.
2. During evaluation, let the simulator generate a user utterance, feed it to the target LLM, obtain the answer, and feed that answer back to the simulator to produce the next turn (a minimal loop is sketched below).
3. Vary the simulator's prompt template or seed data to produce thousands of diverse simulated users, far exceeding the coverage of a handful of human annotators.
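
A minimal sketch of that loop, assuming two chat models with a hypothetical generate(messages) interface; nothing here is a real library API.

```python
# A minimal sketch of simulator-in-the-loop evaluation: each new user turn
# depends on what the model under test actually answered.
def run_session(simulator, target_llm, seed_prompt, max_turns=5):
    history = []
    # The simulator speaks first, conditioned on a persona/goal seed prompt;
    # internally it sees roles swapped, as in the training sketch above.
    user_msg = simulator.generate([("system", seed_prompt)])
    for _ in range(max_turns):
        history.append(("user", user_msg))
        answer = target_llm.generate(history)     # model under test
        history.append(("assistant", answer))
        user_msg = simulator.generate(history)    # next turn depends on answer
    return history
```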

Adversarial RLHF potential

The same framework can be extended to an adversarial RLHF loop in which the user simulator and the target model are trained jointly as opponents. The SGLang project already provides multi-turn RLHF support; the repository https://github.com/microsoft/lost_in_conversation contains an implementation of the loss-masking mechanism and example scripts.
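
As a rough illustration of what the adversarial loop could look like (reusing run_session from the sketch above; judge() and rl_step() are placeholder callables for a session scorer and an RL update such as PPO, not SGLang APIs):

```python
# A hypothetical sketch of the adversarial extension: the target is rewarded
# for surviving the session, the simulator for breaking it.
def adversarial_round(simulator, target_llm, seed_prompts, judge, rl_step):
    for seed in seed_prompts:
        session = run_session(simulator, target_llm, seed)
        score = judge(session)                    # e.g., task success in [0, 1]
        rl_step(target_llm, session, score)       # target rewarded for success
        rl_step(simulator, session, 1.0 - score)  # simulator rewarded for failure
```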

Tags: LLM, RLHF, multi-turn dialogue, AI testing, user simulator, loss masking
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.