Validity of LLM-simulated users as proxies for real users

Determine whether interactions between large language model (LLM) agents and LLM-simulated users accurately reflect and predict interactions between those agents and real human users in agentic evaluations, thereby establishing whether LLM-simulated users are a valid substitute for benchmarking multi-turn, tool-using conversational agents.

Background

Agentic benchmarks often replace human participants with LLM-simulated users to enable scalable, automated evaluation of multi-turn, tool-using conversational agents. However, without direct validation against human interactions, there is a risk that outcomes measured with simulated users do not generalize to real users, potentially leading to miscalibrated assessments of agent capabilities.

The paper studies this concern using τ-Bench retail tasks and a cross-national user study: it introduces a human–LLM calibration metric and documents systematic miscalibration across difficulty levels and demographic groups, motivating the need to rigorously determine whether simulated interactions reliably predict real human outcomes.
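The paper's exact metric definition is not reproduced here. The sketch below shows one plausible way such a human–LLM calibration check could be computed, assuming per-task binary success outcomes are available from both simulated-user and human-user runs; all names (`TaskOutcome`, `calibration_gap`, the example task IDs) are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class TaskOutcome:
    task_id: str
    difficulty: str        # e.g. "easy" / "hard"
    sim_success: bool      # agent outcome with the LLM-simulated user
    human_success: bool    # agent outcome with a real human user

def calibration_gap(outcomes):
    """Absolute difference between the agent's success rate with
    simulated users and its success rate with human users."""
    if not outcomes:
        return 0.0
    sim_rate = sum(o.sim_success for o in outcomes) / len(outcomes)
    human_rate = sum(o.human_success for o in outcomes) / len(outcomes)
    return abs(sim_rate - human_rate)

def per_task_agreement(outcomes):
    """Fraction of tasks where the simulated and human runs agree
    on whether the agent succeeded."""
    if not outcomes:
        return 0.0
    return sum(o.sim_success == o.human_success for o in outcomes) / len(outcomes)

def gaps_by_group(outcomes, key):
    """Calibration gap within each subgroup, e.g. difficulty level
    or a demographic attribute of the human participants."""
    groups = {}
    for o in outcomes:
        groups.setdefault(key(o), []).append(o)
    return {group: calibration_gap(members) for group, members in groups.items()}

# Hypothetical outcomes for two retail-style tasks.
data = [
    TaskOutcome("return_item_01", "easy", sim_success=True, human_success=True),
    TaskOutcome("exchange_item_02", "hard", sim_success=True, human_success=False),
]
print(per_task_agreement(data))                     # 0.5
print(gaps_by_group(data, lambda o: o.difficulty))  # {'easy': 0.0, 'hard': 1.0}
```

Reporting the gap per subgroup matters because an aggregate success rate can look well calibrated while masking large miscalibration on hard tasks or for particular user groups, which is the kind of systematic error the paper documents.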

References

Second, without validation with actual users \citep{salaudeen2025measurementmeaningvaliditycenteredframework}, it remains unclear whether interactions between agents and LLM-simulated users accurately reflect and predict interactions between agents and real people (validity, Figure \ref{fig:fig1}).

Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations (2601.17087 - Seshadri et al., 23 Jan 2026) in Introduction