HumDial: Emotional Intelligence Benchmark
- The HumDial benchmark is a comprehensive, multi-turn evaluation framework that measures emotional tracking, causal reasoning, and empathetic response generation in spoken dialogue systems.
- It leverages rigorously annotated, actor-recorded speech data to model evolving user emotions and their causal triggers across dialogue turns.
- The benchmark combines objective metrics with LLM-based evaluations to provide actionable insights for advancing emotionally intelligent SDS.
The Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence Benchmark is a systematic evaluation framework for assessing the capacity of spoken dialogue systems (SDS) to recognize, reason about, and respond to emotions in a human-like manner over multi-turn conversations. Its technical depth, structured datasets, rigorous annotation, and diverse assessment protocols position it as a central platform for both diagnosing weaknesses in contemporary SDS models and advancing research toward authentic emotional intelligence in voice-based agents.
1. Benchmark Scope and Objectives
The HumDial Emotional Intelligence Benchmark is designed to address the limitations of prior single-turn or text-only evaluation schemes by providing a multi-turn, speech-centric, and context-rich testbed. Its primary objectives are:
- To evaluate long-term tracking of users' emotional state trajectories across dialogue turns.
- To test systems' causal emotional reasoning, i.e., their ability to infer, explain, and justify changes in emotional state.
- To assess the production of empathetic spoken responses, measuring both linguistic and prosodic alignment with user affect.
- To support rigorous, reproducible benchmarking with fine-grained human annotation and LLM-based evaluation for both objective and subjective system performance (Zhao et al., 9 Jan 2026).
This benchmark reflects a shift in focus: rather than simply classifying isolated emotional cues, it foregrounds the sustained modeling of user affect, the explanation of emotional cause/effect, and the synthesis of contextually congruent, emotionally resonant spoken replies.
2. Dataset Construction and Annotation Protocols
HumDial’s Emotional Intelligence track corpus is composed of recordings and scripts that explicitly model evolving user emotions and their triggers within multi-turn spoken dialogue scenarios:
- Data Composition: A total of 4,800 multi-turn dialogues (3/4/5 turns, evenly split) for training, 600 for development, and 588 for testing, alongside additional critical emotional utterances. In total, the benchmark provides 38,400 training utterances, 2,400 dev utterances, and 2,332 test utterances.
- Data Generation: Seed scripts are created with Gemini 2.5 Pro, targeting plausible user cognitive-emotional flows. Scripts model both external events (for trajectory detection) and implicit user intentions (for reasoning).
- Actor Rendering: Professional actors record all utterances, ensuring naturalistic prosody, turn-taking, and affective variation (Zhao et al., 9 Jan 2026).
- Annotation: Each utterance is labeled with one of six balanced emotion categories (joy, sadness, anger, fear, surprise, neutral). Dialogue-level trajectory labels and explicit causal triggers for transitions are annotated, and peak emotional segments are marked at fine temporal resolution to support empathy scoring.
- Task Examples: For emotional trajectory, label sequences must capture plausible development (e.g., neutral → frustrated → resigned); for emotional reasoning, annotators provide human-understandable causal justifications tied to preceding dialogue context.
This construction protocol yields a dataset with dense, multi-resolution affective and causal annotation, supporting analysis of both subtle and extreme emotional dynamics.
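The annotation layers described above (utterance-level emotion, peak-segment spans, dialogue-level trajectory, causal triggers) can be pictured as a small record type. The following is a minimal sketch, not the benchmark's released schema: the class names, field names, role strings, and file path are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# The six emotion categories named in the annotation protocol.
EMOTIONS = ("joy", "sadness", "anger", "fear", "surprise", "neutral")


@dataclass
class Utterance:
    speaker: str       # "user" or "system" (assumed role names)
    text: str          # script text for the utterance
    audio_path: str    # hypothetical path to the actor recording
    emotion: str       # one of EMOTIONS
    # (start_s, end_s) of the peak emotional segment, if one was marked
    peak_span: Optional[Tuple[float, float]] = None

    def __post_init__(self) -> None:
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {self.emotion}")


@dataclass
class Dialogue:
    dialogue_id: str
    utterances: List[Utterance]
    # One free-text causal trigger per annotated emotional transition (assumed layout)
    triggers: List[str] = field(default_factory=list)

    def trajectory(self) -> List[str]:
        """Emotion labels of the user utterances, in order."""
        return [u.emotion for u in self.utterances if u.speaker == "user"]


# Illustrative record; the content is invented for this sketch.
example = Dialogue(
    dialogue_id="demo-0001",
    utterances=[
        Utterance("user", "My flight got cancelled.", "demo/u1.wav", "anger", (0.4, 1.9)),
        Utterance("system", "That sounds frustrating.", "demo/s1.wav", "neutral"),
        Utterance("user", "I might miss the wedding.", "demo/u2.wav", "sadness"),
    ],
    triggers=["realizes the cancellation threatens a personal event"],
)
print(example.trajectory())
```

Validating labels at construction time keeps downstream metrics from silently counting labels outside the six-category scheme.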
3. Task Specifications
HumDial's Emotional Intelligence track operationalizes its goals through three primary tasks:
| Task | Input | Output | Objective |
|---|---|---|---|
| Trajectory | Multi-turn speech-text dialogue | Sequence of emotion labels per turn | Track/recognize long-term emotional arc |
| Reasoning | Multi-turn speech-text dialogue | Sequence of labels, causal explanation per turn | Infer and justify emotion transitions |
| Empathy | Final turns + peak segment spans | Spoken and textual empathetic response | Generate emotionally appropriate reply |
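- Emotional Trajectory Detection: Given a multi-turn dialogue context, the system predicts one emotion label per turn and should maximize per-turn label accuracy.
- Causal Emotional Reasoning: The system predicts an emotion label and a causal explanation for each turn; explanations are evaluated for factual coherence and causal appropriateness.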
- Empathetic Response Generation: Systems must produce responses matching both content and vocal-emotional qualities to the input’s affective signals.
All tasks are formulated for speech and text modalities, supporting assessment of joint understanding and generation abilities (Zhao et al., 9 Jan 2026). The design emphasizes end-to-end, audio-centric evaluation rather than intermediate ASR-NLU-TTS decomposition, which is still permitted for controlled ablation.
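To make the three input/output contracts concrete, here is a minimal sketch of an interface that a submitted system could be wrapped in. Everything in it is hypothetical: the class and method names are not part of any official HumDial toolkit, and turns are represented as plain strings (for example transcripts or audio file paths) purely to keep the example self-contained.

```python
from typing import Dict, List, Protocol, Tuple


class EmotionalDialogueSystem(Protocol):
    """Hypothetical interface mirroring the three tasks; not an official API."""

    def trajectory(self, turns: List[str]) -> List[str]:
        """Task 1: one emotion label per turn."""
        ...

    def reasoning(self, turns: List[str]) -> List[Tuple[str, str]]:
        """Task 2: (emotion label, causal explanation) per turn."""
        ...

    def empathy(self, final_turns: List[str]) -> Tuple[str, bytes]:
        """Task 3: (response text, synthesized audio bytes)."""
        ...


class AlwaysNeutral:
    """Trivial stand-in that satisfies the interface, for shape-checking only."""

    def trajectory(self, turns: List[str]) -> List[str]:
        return ["neutral"] * len(turns)

    def reasoning(self, turns: List[str]) -> List[Tuple[str, str]]:
        return [("neutral", "no emotional change detected") for _ in turns]

    def empathy(self, final_turns: List[str]) -> Tuple[str, bytes]:
        return ("I hear you.", b"")  # empty bytes stand in for audio


def run_all(system: EmotionalDialogueSystem, turns: List[str]) -> Dict[str, object]:
    """Run the three tasks on one dialogue and collect the outputs."""
    return {
        "trajectory": system.trajectory(turns),
        "reasoning": system.reasoning(turns),
        "empathy": system.empathy(turns[-2:]),  # final turns only, as in the task table
    }


outputs = run_all(AlwaysNeutral(), ["turn 1", "turn 2", "turn 3"])
print(outputs["trajectory"])
```

An end-to-end audio model and a cascaded ASR-NLU-TTS pipeline could both sit behind the same three methods, which is one way to keep ablations comparable.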
4. Evaluation Protocols and Metrics
HumDial employs a weighted composite scoring system that integrates automated metrics with ratings from expert human raters:
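- Core Metrics:
  - Classification Accuracy (Task 1): the fraction of turns whose predicted emotion label matches the reference label, $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$, where $N$ is the number of evaluated turns.
  - Empathy Score (Task 3): an aggregate of $\mathrm{sim}(r_i, \hat{r}_i)$ over test items, where $\mathrm{sim}$ is a similarity function between reference and system responses measured via LLM-based judgment.
- Composite Score: Weighted sum of the trajectory, reasoning, textual empathy, emotional appropriateness, and audio naturalness scores, i.e., of the form $S = \sum_k w_k s_k$ over these components.
- Perceptual Scores: 1–5 Likert scale for emotional appropriateness and naturalness, averaged over 20 experts.
- LLM-based Scoring: Qwen3-Omni-30B is used for automated evaluation of label accuracy, causal reasoning quality, and empathy.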
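As a rough illustration of the two simplest aggregations in this list, per-turn label accuracy and the averaging of 1–5 Likert ratings across raters, consider the sketch below. The function names are assumptions, and the official scoring scripts may differ in details such as how ties or missing ratings are handled.

```python
from statistics import mean
from typing import Sequence


def per_turn_accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of turns whose predicted emotion label equals the reference."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    if not gold:
        raise ValueError("at least one turn is required")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def mean_likert(ratings: Sequence[int]) -> float:
    """Average of 1-5 Likert ratings given by several raters to one response."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("Likert ratings must lie in 1..5")
    return float(mean(ratings))


# Illustrative values only; they are not taken from the benchmark.
acc = per_turn_accuracy(
    ["joy", "neutral", "anger", "fear"],
    ["joy", "neutral", "sadness", "fear"],
)
score = mean_likert([4, 5, 3, 4])
print(acc, score)
```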
Additional Measures: For system analysis, the benchmark encourages computation of cross-turn emotional reasoning scores, Concordance Correlation Coefficient (CCC), Macro-F1 for discrete emotion labels, and turn-level Mean Opinion Score (MOS) for naturalness (Liu et al., 25 Aug 2025).
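Two of these additional measures have standard closed forms, sketched here in plain Python. Macro-F1 averages per-class F1 over a fixed label set, and CCC is computed as $2\,\mathrm{cov}(x,y) / (\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2)$ with population statistics. Applying CCC to continuous affect ratings is an assumption here, since the text does not say which quantities it is computed over; treating a class that never appears as having F1 of 0 is likewise a choice of this sketch.

```python
from typing import Sequence


def macro_f1(pred: Sequence[str], gold: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over `labels`."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    f1s = []
    for lab in labels:
        tp = sum(p == lab and g == lab for p, g in zip(pred, gold))
        fp = sum(p == lab and g != lab for p, g in zip(pred, gold))
        fn = sum(p != lab and g == lab for p, g in zip(pred, gold))
        denom = 2 * tp + fp + fn
        # A class absent from both pred and gold contributes 0 (assumption).
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Concordance Correlation Coefficient using population variance/covariance."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mx - my) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2 * cov / denom


# Illustrative values only.
f1 = macro_f1(
    ["joy", "joy", "anger", "neutral"],
    ["joy", "anger", "anger", "neutral"],
    labels=("joy", "anger", "neutral"),
)
agreement = ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(f1, agreement)
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and references, which is why the shifted series in the example scores below 1.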
Leaderboard and Held-out Test Sets: Held-out scoring is enforced to reduce overfitting to any single task or metric, especially cross-turn emotion modeling.
No thresholding or calibration is performed; overall rankings are determined by the linear combination of component metrics.
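A ranking by linear combination of component metrics can be sketched as follows. The component names follow the five listed for the composite score, but the equal weights and the two example systems are hypothetical placeholders: the text does not state the official weights, and the scores below are invented.

```python
from typing import Dict, List

COMPONENTS = (
    "trajectory",
    "reasoning",
    "textual_empathy",
    "emotional_appropriateness",
    "audio_naturalness",
)

# Hypothetical equal weights; the official weighting is not given in the text.
WEIGHTS: Dict[str, float] = {name: 0.2 for name in COMPONENTS}


def composite_score(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of component scores, with no thresholding or calibration."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


def rank_systems(systems: Dict[str, Dict[str, float]]) -> List[str]:
    """System names ordered from highest to lowest composite score."""
    return sorted(systems, key=lambda name: composite_score(systems[name]), reverse=True)


# Invented scores on a 1-5 scale, for illustration only.
systems = {
    "system_a": {name: 4.0 for name in COMPONENTS},
    "system_b": {name: 3.0 for name in COMPONENTS},
}
ranking = rank_systems(systems)
print(ranking)
```

Because the combination is linear, a system that is strong on tracking and reasoning can outrank one with better audio empathy unless the weights favor the perceptual components.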
5. System Performance and Principal Findings
Baseline and leading system results on HumDial’s Emotional Intelligence track demonstrate key trends:
- Baseline (ASR + GPT-3.5 + Tacotron 2/WaveGlow):
  - Task 1 (trajectory): 2.62 / 5.00
  - Task 2 (reasoning): 2.73 / 5.00
  - Task 3 (empathy + naturalness): 2.82 / 5.00
  - Error patterns: Baseline systems capture coarse sentiment (neutral ↔ negative) but misclassify fine-grained transitions and consistently underperform in emotional prosody and in synthesizing congruent empathetic replies (Zhao et al., 9 Jan 2026).
- Top-ranked Systems: Approaches based on unified SLMs with explicit emotional-attribution modeling (e.g., IEAT) achieve near-human performance on emotional trajectory (4.97/5.00), high scores on causal reasoning (4.98/5.00), but only moderate (3.85–4.14/5.00) on expressive empathy in audio (Wang et al., 8 Jan 2026).
- Identified Failure Modes: The most common include:
  - Emotional "overshooting" (excessive or abrupt changes)
  - Failure to smoothly decay or correctly reset emotional state across dialogue thematic shifts
  - Incomplete integration of affective prosody and linguistic content
- Human/LLM Evaluation Gap: Current LLMs are robust for trajectory/causal reasoning evaluation but still less reliable than expert raters for nuanced, audio-level empathy and naturalness.
Collectively, these results indicate that while logical affective reasoning and high-level emotion tracking approach ceiling, prosody-aligned, contextually empathetic audio generation remains a critical research challenge.
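The size of the remaining gap can be read directly off the reported numbers. The short sketch below computes, per task, the gain of the top-ranked systems over the baseline and the headroom left to the 5.00 ceiling. It uses only the scores quoted above; note that pairing the baseline's Task 3 score with the top systems' audio-empathy range is an assumption of this sketch, since the two may not be measured identically.

```python
from typing import Dict, Tuple

CEILING = 5.00

# Scores quoted in the text (out of 5.00).
baseline: Dict[str, float] = {"trajectory": 2.62, "reasoning": 2.73, "empathy": 2.82}
# (low, high) ranges for top-ranked systems; single values are repeated.
top: Dict[str, Tuple[float, float]] = {
    "trajectory": (4.97, 4.97),
    "reasoning": (4.98, 4.98),
    "empathy": (3.85, 4.14),
}


def gain_over_baseline(task: str) -> Tuple[float, float]:
    """(low, high) improvement of top systems over the baseline on one task."""
    low, high = top[task]
    return round(low - baseline[task], 2), round(high - baseline[task], 2)


def headroom(task: str) -> Tuple[float, float]:
    """(smallest, largest) distance from the top systems' scores to the ceiling."""
    low, high = top[task]
    return round(CEILING - high, 2), round(CEILING - low, 2)


for task in baseline:
    print(task, gain_over_baseline(task), headroom(task))
```

Under these numbers, trajectory and reasoning sit within 0.03 of the ceiling, while audio empathy still has roughly 0.86 to 1.15 points of headroom.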
6. Comparative Analysis and Benchmark Integration
HumDial incorporates principles and infrastructures from contemporary EI benchmarks (EMO-Reasoning (Liu et al., 25 Aug 2025), DeepDialogue (Koudounas et al., 26 May 2025), EchoMind (Zhou et al., 26 Oct 2025), MULTI-Bench (Deng et al., 2 Nov 2025)) and offers important distinctions:
- Versus EMO-Reasoning: HumDial adopts EMO-Reasoning’s multi-turn, speech-based evaluation and Cross-turn Emotion Reasoning Score, but adds human-acted data, richer emotion/trigger annotation, and empathetic response production tasks (Liu et al., 25 Aug 2025).
- Versus DeepDialogue: DeepDialogue contributes large-scale, multi-domain, LLM-generated, and TTS-synthesized emotional data, supporting similar emotion progression modeling and emotional consistency criteria (Koudounas et al., 26 May 2025). HumDial, however, centers on professional acted speech and explicit causal annotation.
- Versus EchoMind: EchoMind advocates a pipeline from content understanding → vocal cue perception → integrated reasoning → empathetic response, using neutral scripts and parametric vocal-attribute control vectors. While EchoMind foregrounds discrete non-lexical cues and environmental factors, HumDial’s design is organized around emotional trajectory and reasoning with rich, real-world conversational contexts (Zhou et al., 26 Oct 2025).
- On Metric Families: All leading benchmarks converge on the need for both objective measures (accuracy, F1, CCC, BLEU, structured consistency rates) and subjective, multidimensional, human or LLM-based ratings (context fit, empathy, prosodic appropriateness), with continual emphasis on speech-centric evaluation.
- Extensions for HumDial recommended by these works include: three-way prompt control (neutral/target/alternative), upper-bound evaluation via ideal attribute input, and formalization of vocal attribute parameterizations (e.g., one-hot or continuous prosody controls as in EchoMind).
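To illustrate what a three-way prompt control with one-hot or continuous prosody parameterization might look like, here is a minimal sketch. It is not EchoMind's or HumDial's actual format: the function names, the choice of HumDial's six emotion categories as the one-hot axis, and the three continuous attributes with a [-1, 1] range are all assumptions.

```python
from typing import Dict, List

EMOTIONS = ("joy", "sadness", "anger", "fear", "surprise", "neutral")


def one_hot(emotion: str) -> List[float]:
    """One-hot control vector over the six emotion categories."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {emotion}")
    return [1.0 if e == emotion else 0.0 for e in EMOTIONS]


def continuous_control(pitch: float, rate: float, energy: float) -> List[float]:
    """Hypothetical continuous prosody controls, each clamped to [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in (pitch, rate, energy)]


def three_way_conditions(target: str, alternative: str) -> Dict[str, List[float]]:
    """Control vectors for the neutral / target / alternative conditions."""
    if target == alternative:
        raise ValueError("target and alternative must differ")
    return {
        "neutral": one_hot("neutral"),
        "target": one_hot(target),
        "alternative": one_hot(alternative),
    }


conditions = three_way_conditions(target="sadness", alternative="joy")
print(conditions["target"], continuous_control(0.3, -1.5, 0.0))
```

Rendering the same script under all three conditions gives a controlled comparison: any change in a system's response between conditions can be attributed to the vocal cue instead of the words.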
7. Implications and Future Directions
HumDial establishes a new standard for evaluating emotionally intelligent spoken agents, but it also exposes core technical barriers and opportunities:
- Empathetic Speech Generation: Fine-grained prosody control and multimodal end-to-end optimization are required to close the gap in natural-sounding, affect-aligned response synthesis.
- Contextual Reasoning: Integration of external event triggers, latent user goals, and causal chains remains an area where learning-based models benefit from explicit supervision, curriculum learning, and attribution-injection strategies (e.g., IEAT) (Wang et al., 8 Jan 2026).
- Robustness and Bias: Analysis of model robustness to domain, demographic factors, and linguistic diversity is ongoing, with bias/fairness probes embedded in the evaluation pipelines, echoing recommendations in DeepDialogue.
- Unified Audio-LLMs: Joint modeling of understanding and generation, with shared semantic core across modalities and layered cross-attention, is a promising direction.
- Multi-level, Multi-modal Task Structure: Insights from EchoMind suggest extending HumDial to additional stages—such as integrated reasoning with environmental and paralinguistic cues—and systematically quantifying the upper bound via ideal-cue control.
- Interactive and Human-in-the-Loop Evaluation: Online A/B testing, incremental prompt sensitivity analysis, and reinforcement learning from user feedback are identified as crucial for real-world deployment calibration.
By integrating large-scale, fully-annotated, actor-recorded data with hierarchical and perceptually validated metrics, the HumDial Emotional Intelligence Benchmark provides a stringent, reproducible reference for advancing the state of the art in emotionally intelligent spoken dialogue systems (Zhao et al., 9 Jan 2026).