MIRRORBENCH Evaluation Framework
- MIRRORBENCH is an extensible benchmarking framework that rigorously evaluates LLM user proxies solely based on their human-likeness in dialogue.
- It employs a modular architecture with typed interfaces, metadata registries, and multi-backend support to ensure reproducible and scalable assessments.
- The framework integrates diverse metrics, including lexical diversity and LLM-judge realism, to provide comprehensive, variance-aware evaluations of conversational simulations.
MIRRORBENCH is an extensible benchmarking framework designed to rigorously evaluate user-proxy agents—particularly LLMs prompted to simulate human users—on the sole criterion of human-likeness in dialogue. It is architected to support reproducible, scalable, variance-aware assessment across arbitrary tasks, datasets, proxies, and metrics, with explicit separation from downstream task success. MIRRORBENCH’s design emphasizes modularity, typed interfaces, metadata-driven registries, multi-backend support, and rich observability, enabling principled evaluation and comparison of user simulators in conversational AI (Hathidara et al., 13 Jan 2026).
1. Core Objectives and Evaluation Philosophy
MIRRORBENCH is predicated on three foundational requirements: reproducibility, extensibility, and scalability/observability. Every run is replayable with bit-identical results due to explicit manifest emission capturing all proxies, datasets, metrics, seeds, prompts, and decoder arguments. The framework’s typed interfaces and modular registry enable plug-and-play integration of new LLM providers, proxy adapters, datasets, tasks, or metrics without core code modification. By treating user proxies as opaque agents evaluated solely for human-likeness—defined as the statistical and stylistic congruence of simulated utterances with real user data—the platform avoids conflation with task-performance metrics and exposes systematic deviations in simulators caused by naïve prompting or over-optimization (Hathidara et al., 13 Jan 2026).
2. Modular Architecture and Execution Engine
The architecture comprises a six-layer stack detailed as follows:
- Layer 1: Execution backends (synchronous, asynchronous, Ray distributed runners) and persistence (SQLite WAL database for run, unit, episode, metric, and aggregate statistics).
- Layer 2: Core engine with typed message, episode, metric, and configuration artifacts (TypedDict/pydantic), and a registry builder with metadata for component compatibility.
- Layer 3: Orchestration via pipeline planner, run controller, cache deduplication (content-based keys per LLM/judge call), and structured logging/telemetry via OpenTelemetry.
- Layer 4: Pluggable model clients, adapters, datasets, and metrics. All declare prerequisites and interfaces for orchestration and reporting.
- Layer 5: Task drivers supporting single-turn and multi-turn “mirror conversation” evaluation (alternating user proxy and assistant up to reference dialogue length).
- Layer 6: Python API and CLI supporting plan/dryrun/run/report/cache and direct inspection or deletion of runs. Manifest.json encapsulates configuration for precise reproducibility.
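The content-based cache keys in Layer 3 can be sketched as follows. This is an illustrative implementation, not the framework's actual API; the function name and payload fields are assumptions. The idea is that identical (model, prompt, decoding parameters) triples hash to the same key, so repeated LLM/judge calls are deduplicated.

```python
import hashlib
import json

def cache_key(model: str, prompt: str, params: dict) -> str:
    """Content-based cache key: identical (model, prompt, params) inputs
    always produce the same key, deduplicating repeated LLM/judge calls."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,  # canonical ordering so dict key order cannot change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key("judge-model", "Rate this reply.", {"temperature": 0.0, "seed": 7})
k2 = cache_key("judge-model", "Rate this reply.", {"seed": 7, "temperature": 0.0})
assert k1 == k2  # same content yields the same key regardless of param order
```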
A manifest is generated per evaluation plan, enumerating all (proxy, dataset, metric, seed) units and prompt texts. This design enforces compatibility—e.g., blocking multi-turn drivers with single-turn-only proxies—and guarantees reproducibility across compute environments.
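The unit enumeration described above amounts to a Cartesian product over configured components. A minimal sketch, with illustrative (not actual) component names:

```python
from itertools import product

# Hypothetical component lists; real names come from the YAML config/registry.
proxies = ["gpt-4o", "claude-4-sonnet"]
datasets = ["qulac", "clariq"]
metrics = ["mattr", "gteval"]
seeds = [0, 1, 2]

# A plan manifest enumerates every (proxy, dataset, metric, seed) unit.
units = [
    {"proxy": p, "dataset": d, "metric": m, "seed": s}
    for p, d, m, s in product(proxies, datasets, metrics, seeds)
]
assert len(units) == 2 * 2 * 2 * 3  # 24 evaluation units
```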
3. Evaluation Protocol and Metrics
MIRRORBENCH implements two metric families: lexical diversity and LLM-judge realism.
Lexical Diversity Metrics:
- MATTR (Moving-Average Type-Token Ratio): quantifies type-token richness over a fixed-size sliding window, normalized to the human reference distribution via z-scoring.
- Yule’s K: measures concentration of word usage (higher K = lower diversity), normalized against human references.
- HD-D (Hypergeometric Distribution Diversity): computes the probability of observing each word type in random samples of s tokens drawn without replacement, normalized against human references.
All scores are z-scored against empirical human statistics (per-dataset human mean and standard deviation).
LLM-Judge-Based Metrics (using chain-of-thought prompts and optional self-consistency):
- GTEval: average judge score comparing user proxy and human reference.
- Pairwise Indistinguishability (PI): the judge selects between a proxy and a human conversation; the win-advantage statistic A_w centers at 0 for perfect indistinguishability.
- Rubric-and-Reason (RNR): reference-free evaluation using a fixed rubric; mean binary score per episode.
Metrics optionally incorporate judge-model calibration via HH (human-human) and PP (proxy-proxy) control conditions, with affine rescaling between these control endpoints.
4. Supported Datasets and Tasks
MIRRORBENCH ships four preprocessed English datasets containing authentic human user turns:
- QULAC (query clarification)
- ClariQ (clarifying questions in information-seeking)
- OASST1 (open instruction dialogues, English subset)
- ChatbotArena (arena dialogues with LLMs and humans)
Each episode consists of reference dialogue, user goal, and metadata. An auxiliary LLM goal generator is used when explicit goals are absent, ensuring comparability of user proxy to human reference. Task drivers roll out single or multi-turn mirror conversations; proxy utterances are compared to references using the full metric suite.
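The multi-turn mirror rollout described above can be sketched as an alternation between proxy and assistant, bounded by the reference dialogue length. The agent interface here (a callable taking the history) is an assumption for illustration:

```python
from typing import Callable

Agent = Callable[[list[str]], str]

def mirror_rollout(reference_dialogue: list[str],
                   user_proxy: Agent,
                   assistant: Agent) -> list[str]:
    """Multi-turn 'mirror conversation': user proxy and assistant alternate
    until the rollout matches the reference dialogue's length, so proxy
    turns align positionally with the human reference turns."""
    history: list[str] = []
    while len(history) < len(reference_dialogue):
        speak = user_proxy if len(history) % 2 == 0 else assistant
        history.append(speak(history))
    return history

# Toy agents standing in for real LLM-backed components.
proxy = lambda h: f"user-turn-{len(h)}"
bot = lambda h: f"assistant-turn-{len(h)}"
rollout = mirror_rollout(["u0", "a0", "u1", "a1"], proxy, bot)
assert rollout == ["user-turn-0", "assistant-turn-1", "user-turn-2", "assistant-turn-3"]
```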
5. Experimental Design, Quantitative Results, and Insights
Variance and confidence intervals are estimated by repeating runs over multiple random seeds, aggregating per-unit means, sample standard deviations, and 95% confidence intervals via Student’s t-distribution. Key findings (Hathidara et al., 13 Jan 2026):
- Across four datasets and five user-proxy LLMs (GPT-4o, GPT-5, GPT-OSS-120B, Claude-4-Sonnet, Gemini-2.5-Pro), judge-based realism metrics (GTEval, PI Aw, RNR) yield a consistent ordering:
- Gemini-2.5-Pro ≈ Claude-4-Sonnet > GPT-4o > GPT-5, GPT-OSS-120B
- Lexical diversity metrics (z-scores): proxies overshoot human diversity on ClariQ (positive MATTR z), undershoot on QULAC (negative MATTR, positive Yule’s K), with ChatbotArena and OASST1 intermediate.
- There exists a tension: proxy agents scoring highly on judge realism metrics do not always match human diversity on surface statistics, confirming the necessity of both metric families.
- LLM judge sensitivity analysis indicates verdicts can shift with the choice of judge model, motivating calibration and ensemble approaches.
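The per-unit confidence intervals underlying these findings can be sketched with the standard Student's t construction (mean, sample standard deviation, and a t-critical half-width over per-seed scores). The scores below are invented for illustration:

```python
import math
import statistics

def t_confidence_interval(scores: list[float], t_crit: float) -> tuple[float, float]:
    """Mean and CI half-width over per-seed scores using Student's t.
    t_crit is the two-sided 97.5% quantile for len(scores)-1 degrees of
    freedom (e.g. 2.776 for 5 seeds); passed in to stay stdlib-only."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n-1 denominator)
    half_width = t_crit * sd / math.sqrt(len(scores))
    return mean, half_width

# Five hypothetical per-seed scores for one (proxy, dataset, metric) unit.
mean, half = t_confidence_interval([0.61, 0.58, 0.64, 0.60, 0.57], t_crit=2.776)
```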
6. CLI Workflow and Reproducibility Protocols
A full experiment is executed as follows:
- Create a YAML config specifying proxies, datasets, metrics, drivers, and parameters.
- Run `mirrorbench plan -c config.yaml` to validate the plan and emit manifest.json.
- Execute `mirrorbench dryrun -c config.yaml` for compatibility checks.
- Launch `mirrorbench run -c config.yaml` to dispatch evaluation units (locally or distributed); calls are cached and results persist in SQLite.
- Generate aggregated reports with `mirrorbench report json <run_id> -o report.json`, including calibration controls, episode-level scores, and telemetry.
All experiment prompts, model names, seeds, and decoding parameters are captured in the manifest, guaranteeing identical reproduction across environments.
7. Broader Significance and Limitations
MIRRORBENCH exposes systematic deviations between simulated and authentic user utterances, emphasizing that human-likeness evaluation for user proxies must not be conflated with downstream chatbot performance. Its plugin architecture, variance-controlled analytics, and rigorous metric suite provide a standardized experimental harness for future research and benchmarking in conversational AI. The approach remains sensitive to judge-model selection and calibration; extensions may integrate new metrics, multi-judge ensembles, or expanded task domains.
By unifying reproducibility, extensibility, and explicit metric-driven analysis in a modular framework, MIRRORBENCH delivers a robust foundation for evaluating and improving the fidelity of user-proxy simulations in dialogue-centric AI systems (Hathidara et al., 13 Jan 2026).