
FURINA-Bench: Customizable RP Benchmark

Updated 8 December 2025
  • FURINA-Bench is a fully customizable role-playing benchmark that systematically evaluates LLMs' dialogue and persona simulation using both literary and synthesized character data.
  • It employs a multi-agent pipeline with dynamic scheduling and fine-grained evaluation criteria across dimensions like context reliance, factual recall, and conversational ability.
  • Experimental results reveal trade-offs between reasoning fluency and hallucination rates while highlighting scaling effects and reliability challenges in modern LLMs.

FURINA-Bench is a comprehensive, fully customizable role-playing (RP) benchmark designed for the rigorous evaluation of LLMs in synthetic character dialogue and persona simulation. Built atop the FURINA-Builder multi-agent pipeline, FURINA-Bench enables systematic, dimension-specific assessment of both established literary/persona characters and fully synthesized personas across diverse scenarios and prompt formats. The benchmark incorporates fine-grained evaluation criteria and facilitates large-scale, bilingual (English–Chinese) RP task analysis, offering deep insights into model performance, reliability, scaling effects, and the fundamental trade-offs in RP reasoning and hallucination (Wu et al., 8 Oct 2025).

1. Pipeline Architecture and Dialogue Simulation

FURINA-Bench is generated via FURINA-Builder, an LLM-driven multi-agent system that simulates authentic, multi-turn RP dialogues through orchestrated agent collaboration. The pipeline consists of:

  • Test-Character agent: Accepts persona attribute dictionaries $\{(k_i, v_i, \tau_i)\}$, distinguishing between public and private fields to simulate realistic persona boundaries. Synthesized characters are iteratively constructed using LLMs to form multidimensional profiles and prototype exchanges.
  • Character–Scene Pool ($S$): Comprising 6,556 scenario fragments from 100 English and 80 Chinese literary sources, each scenario encapsulates background ($B$), motivation ($M$), original dialogue excerpt ($D_{\mathrm{orig}}$), and participating characters ($C_{\mathrm{scene}}$). The pool is user-extendable.
  • Director & Scene-Character agents: Scenario management is handled by a Director (LLM) that schedules turns, with scene characters role-played via prompt-rich context injection.
  • Judge agent (GPT-4.1): Employs chain-of-thought reasoning to select evaluation dimensions and adjudicate between candidate responses. Each turn is dynamically scheduled for coverage balance using Dynamically Weighted Random Selection (DWRS):

$$w_i = c_{\max} - c_i + 1, \qquad P(d_i) = \frac{w_i}{\sum_j w_j}$$

Selected replies update dialogue history, and superior turns are recorded as benchmark test items.
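The DWRS rule above can be sketched in Python. This is a minimal illustration of the weighting scheme; the dimension names and coverage counts are invented for the example:

```python
import random

def dwrs_pick(counts):
    """Dynamically Weighted Random Selection: dimensions covered less
    often so far receive proportionally higher selection probability.
    counts maps each evaluation dimension to its coverage count c_i;
    weights are w_i = c_max - c_i + 1."""
    c_max = max(counts.values())
    weights = {d: c_max - c + 1 for d, c in counts.items()}
    total = sum(weights.values())
    probs = {d: w / total for d, w in weights.items()}
    # Sample one dimension according to P(d_i) = w_i / sum_j w_j
    dims = list(probs)
    chosen = random.choices(dims, weights=[probs[d] for d in dims], k=1)[0]
    return chosen, probs

# Example coverage counts for the five evaluation dimensions:
counts = {"CR": 5, "FR": 2, "RR": 0, "CA": 3, "PA": 5}
dim, probs = dwrs_pick(counts)
```

With these counts, the least-covered dimension (RR) receives the largest selection probability, steering the schedule toward balanced coverage.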

Simulation proceeds iteratively, alternating between scene-character and test-character turns, with prompt construction reflecting persona, scenario context, and dimension-specific strategies $S(d^*)$. Only responses rated superior ($\sigma \leq 3$) by the Judge are retained for benchmark inclusion.
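The persona attribute triples $(k_i, v_i, \tau_i)$ described above can be pictured as key–value pairs with a visibility tag. The field names below are illustrative, not the benchmark's actual schema:

```python
# Hypothetical sketch of a Test-Character persona as attribute triples
# (k_i, v_i, tau_i): attribute key, value, and a public/private tag.
persona = [
    ("name", "Lin Wei", "public"),
    ("occupation", "tea merchant", "public"),
    ("secret_motivation", "searching for a lost sibling", "private"),
]

def visible_profile(attrs):
    """Only public fields are exposed to other agents' prompts,
    simulating realistic persona boundaries."""
    return {k: v for k, v, tau in attrs if tau == "public"}
```

Private fields remain available to the Test-Character agent itself, so the model can act on hidden motivations without other agents seeing them.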

2. Dataset Construction and Customization Features

FURINA-Bench comprises:

  • 20 Test Characters: Evenly split (10 English, 10 Chinese) between “Established” (pretrained, literary/persona-based) and “Synthesized” (created de novo), enabling comparative analysis of memorized knowledge versus in-context instruction following.
  • 1,459 Group-chat Dialogues: Average 19.8 turns per session, featuring 1,471 unique scene characters. Each test turn is mapped to one of five evaluation dimensions, ensuring ≥500 utterances per dimension per language.
  • Customization: Users can override character profiles, the scenario pool $S$, privacy toggles, source/base models, evaluation dimensions ($D_{\mathrm{eval}}$), and the coverage threshold ($\tau$). The builder supports single-player, multi-party, and multi-format benchmarks in any language.
  • Scalability: LLM-driven and parallelizable, supporting flexible scenario sampling and scaling. Dataset generation consumed ~6.5K scenario fragments and ~4,000 simulation hours, yielding a filtered set of ~11.5K candidate turns.

Post-processing involves (i) LLM-based quality filtering to remove incoherent turns and (ii) rule-based formatting checks. Only superior (source > base) utterances are retained to control difficulty and ensure baseline competitiveness.
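The customization points listed above might be gathered into a single run configuration. This is a hypothetical sketch; the keys mirror the options described in the text but are not the tool's actual API:

```python
# Hypothetical FURINA-Builder run configuration (illustrative keys only).
config = {
    "characters": ["established/char_en_01.json", "synthesized/char_zh_01.json"],
    "scene_pool": "scenes/custom_pool.jsonl",      # overrides the pool S
    "privacy_toggles": {"expose_private_fields": False},
    "source_model": "gpt-4.1",                     # candidate under test
    "base_model": "gpt-4o-mini",                   # baseline for pairing
    "eval_dimensions": ["CR", "FR", "RR", "CA", "PA"],  # D_eval
    "coverage_threshold": 500,                     # tau: min utterances per dimension
    "languages": ["en", "zh"],
}
```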

3. Evaluation Methodology and Metrics

Each FURINA-Bench test instance is structured as $\langle H_i, u_i, d_i \rangle$, where $H_i$ is the dialogue history, $u_i$ the test utterance, and $d_i$ the evaluation dimension. Evaluation consists of:

  • Test vs. Base Model Pairing: Both models respond to the same prompt; the Judge rates the pair twice to control for order bias, using a 5-point Likert scale ($\sigma \in \{1, \ldots, 5\}$).
  • Score Mapping: The unbalanced mapping $f(\sigma)$ is:

$$f(1) = 3, \quad f(2) = 1, \quad f(3) = 0.5, \quad f(4) = 0, \quad f(5) = 0$$

Per-instance score:

$$\mathrm{Score}_i = \tfrac{1}{2}\big[f(\sigma_{i,1}) + f(6 - \sigma_{i,2})\big]$$

  • Normalization: Overall RP performance is computed as:

$$\mathrm{Performance} = \frac{\sum_{i=1}^{N} \mathrm{Score}_i}{3N}$$

  • Five Evaluation Dimensions:
  1. Context Reliance (CR)
  2. Factual Recall (FR)
  3. Reflective Reasoning (RR)
  4. Conversational Ability (CA)
  5. Preference Alignment (PA)

Reliability is rigorously assessed: GPT-4.1 reaches 89.2% accuracy in human-labeled dimension selection, and judge-to-human Pearson correlations average $r = 0.63$ over 400 comparisons.
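The scoring scheme above can be expressed compactly. A minimal sketch, directly implementing the mapping $f$, the per-instance score, and the normalization:

```python
def instance_score(sigma1, sigma2):
    """Per-instance score from the two order-swapped Judge ratings.
    f maps the 5-point Likert rating to an unbalanced reward; the
    second rating is flipped (6 - sigma) because the response order
    is reversed in the second pass."""
    f = {1: 3.0, 2: 1.0, 3: 0.5, 4: 0.0, 5: 0.0}
    return 0.5 * (f[sigma1] + f[6 - sigma2])

def performance(scores):
    """Normalized overall RP performance: sum of per-instance scores
    divided by the maximum attainable total, 3N."""
    return sum(scores) / (3 * len(scores))
```

A model rated clearly superior in both passes ($\sigma_1 = 1$, $\sigma_2 = 5$) earns the maximum per-instance score of 3, so a perfect run normalizes to a performance of 1.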

4. Experimental Results and Empirical Insights

Major findings from FURINA-Bench evaluations include:

  • Language-Specific Model Superiority: DeepSeek-R1 achieves a 73.38% normalized score on Chinese RP tasks; o3 leads English with 43.98%. Qwen3-32B/235B (“thinking”) models excel on FR, CA, and PA in Chinese.
  • Scaling Effects: The Qwen3 family exhibits monotonic performance gains with increasing parameter count (8B → 32B → 235B) in Chinese, illustrating the impact of targeted pretraining.
  • Established vs. Synthesized Persona Disparity: Established characters consistently outperform synthesized ones (absolute gap of 0.04–19.0 points), with reflective/reasoning variants amplifying this difference (up to ~12 points). This demonstrates the prominence of in-model memorized priors over in-context synthetic persona construction.
  • Reasoning–Hallucination Trade-off: Chain-of-thought (“thinking” mode) elevates RP scores (+10–12 points for Qwen3-32B) but systematically raises hallucination rates (1.5×–2× for both established-character and synthesized-character categories). Model scaling does not monotonically reduce hallucination; in some cases, 8B models hallucinate less than 32B models.
  • Pareto Frontier of Performance vs. Reliability: High-performance models occupy the upper left of the RP score vs. reliability plot (reliability computed as $100/\text{hallucination rate}$), while conservative models (e.g., GPT-4o) favor reliability at the expense of RP fluency. Reasoning variants define a Pareto frontier, with improved performance offset by decreased reliability.

5. Theoretical and Practical Implications

FURINA-Bench exposes fundamental considerations for RP benchmark and LLM evaluation:

  • Multifaceted Benchmarking: Measuring fluency, reasoning, persona fidelity, and hallucination simultaneously is essential; single metrics obscure complex trade-offs.
  • Limitations of In-Context Synthesis: Pretrained knowledge still dominates over pure synthetic persona construction. A plausible implication is that hybrid approaches—combining fine-tuned synthetic personas with memory-augmented reasoning—may advance RP fidelity.
  • Performance–Reliability Trade-off: The observed Pareto frontier suggests instruction-sensitive hallucination mitigation and controlled chain-of-thought verbosity are unresolved challenges.
  • Modularity for Task-Specificity: The builder’s design supports rapid adaptation to different RP domains (e.g., educational tutors, narrative NPCs, expert agents) via modular swaps of scene-pools, persona profiles, and evaluation sets.

6. Directions for LLM and Benchmark Development

Future progress in RP LLMs and benchmarks, informed by FURINA-Bench, is expected to depend on:

  • Balanced, Diverse Pretraining Data: Integrating factual grounding with rich persona variation.
  • Post-Training Hallucination Control: Developing penalties or constraints on RP hallucinations that do not impede reasoning.
  • Adaptive Evaluation Pipelines: Maintaining relevance as LLM capabilities evolve by leveraging dynamic, scalable evaluation frameworks.
  • Public Accessibility: Open availability of both the builder and benchmark is positioned to accelerate research into reliable, high-fidelity RP agents.

FURINA-Bench sets a precedent for customizable, nuanced RP assessment and reveals critical technical bottlenecks and strategic directions for the next generation of LLM-based role-playing agents (Wu et al., 8 Oct 2025).
