Validity of LLM-simulated users as proxies for real users
Determine whether interactions between large language model (LLM) agents and LLM-simulated users accurately reflect and predict interactions between those agents and real human users in agentic evaluations. Answering this question would establish whether LLM-simulated users are a valid basis for benchmarking multi-turn, tool-using conversational agents.
References
Second, without validation with actual users \citep{salaudeen2025measurementmeaningvaliditycenteredframework}, it remains unclear whether interactions between agents and LLM-simulated users accurately reflect and predict interactions between agents and real people (validity, Figure \ref{fig:fig1}).
— Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations
(2601.17087 - Seshadri et al., 23 Jan 2026) in Introduction