HumDial: Human-like Dialogue Benchmark
- HumDial is a benchmark for evaluating spoken dialogue systems that emulate human communication through emotional reasoning and real-time, full-duplex interaction.
- It features dual tracks—emotional intelligence and interruption handling—with authentic, annotated datasets to test multi-turn dialogues.
- The challenge employs multi-modal evaluation protocols and innovative model architectures to enhance conversational fluency, empathy, and system robustness.
Human-like Spoken Dialogue Systems Challenge (HumDial) defines a rigorous, empirical benchmark for spoken dialogue models that aspire to match human communicative capabilities, as measured along emotional intelligence, conversational fluidity, and multimodal paralinguistic expressivity. Launched at ICASSP 2026 to address the frontier challenges introduced by LLMs and Audio-LLMs, HumDial converges state-of-the-art research interests in emotional reasoning, end-to-end speech understanding/generation, and real-time, full-duplex interaction. It establishes dual tracks—Emotional Intelligence and Full-Duplex Interaction—anchored by authentic human conversation datasets with fine-grained annotations, multi-level evaluation protocols, and public infrastructure for leaderboard-based benchmarking (Zhao et al., 9 Jan 2026).
1. Scope and Benchmark Structure
HumDial is structured around two principal tracks reflecting complementary aspects of human-like dialogue:
- Track I: Emotional Intelligence focuses on multi-turn emotion recognition, attribution reasoning, and empathetic system response generation. It operationalizes human-like emotional engagement as:
- Task 1: Emotional Trajectory Detection—predicts user emotional state for each turn and the complete trajectory over the dialogue.
- Task 2: Emotional Reasoning—infers underlying causes (“attributions”) for emotion shifts, generating logical “thinking traces.”
- Task 3: Empathy Assessment—requires the system to generate congruent, context-sensitive responses in both text and audio forms.
- Track II: Full-Duplex Interaction evaluates systems in real-time, interruption-prone conversational settings, modeling human practices of simultaneous listening and speaking. Subtasks entail:
- Interruption Handling (timely, context-appropriate agent response to user “barge-in” events),
- Rejection Handling (correct identification and ignoring of non-instructional user utterances such as backchannels and side-talk).
Datasets in both tracks feature hybrid synthetic-human recording pipelines. Emotional tracks use LLM-generated dialogue scripts recorded by professional actors to preserve natural prosody and causal transitions; full-duplex tracks embed interruption, hesitation, and backchannels, annotated with dense time-stamped markers and labeled turn types (Zhao et al., 9 Jan 2026, Koudounas et al., 26 May 2025).
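For concreteness, the annotation structure described above can be sketched as a simple record type. The field names below are illustrative assumptions, not HumDial's official schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    """One annotated dialogue turn (field names are illustrative)."""
    speaker: str                        # "user" or "agent"
    text: str
    start: float                        # time-stamped turn marker (seconds)
    end: float
    emotion: Optional[str] = None       # one of six emotion categories (Track I)
    cause: Optional[str] = None         # causal event tag for an emotion shift
    turn_type: Optional[str] = None     # e.g. "interruption", "backchannel" (Track II)

@dataclass
class Dialogue:
    dialogue_id: str
    turns: List[Turn] = field(default_factory=list)

    def trajectory(self) -> List[str]:
        """Per-turn emotional trajectory, as used in Task 1."""
        return [t.emotion for t in self.turns if t.emotion is not None]
```

A dialogue built from such records exposes both per-turn labels and the complete trajectory that Task 1 asks systems to predict.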
2. Task Definitions, Data Design, and Annotation Protocols
Emotional Intelligence Track
- Dialogue corpus: 4,800 multi-turn dialogues for each of Tasks 1 and 2, with turn counts systematically balanced (3–5 turns).
- Empathy Assessment task: 38,400 contextually sampled utterances (Zhao et al., 9 Jan 2026).
- Dialogue scripting: Gemini 2.5-pro generates scripts embedding coherent emotional trajectories and implicit causal chains (Zhao et al., 9 Jan 2026).
- Annotations: Six emotional categories, explicit turn demarcation, and causal event tags per turn.
- IEAT data construction: Injects emotion and attribution labels into internal model reasoning traces, producing (context, emotion, cause, response) tuples. Instruction audios extend the diversity and complexity of training scenarios (Wang et al., 8 Jan 2026).
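The IEAT tuple construction can be sketched as follows; the template wording and tag format are illustrative assumptions, not the exact prompt used by Wang et al.:

```python
def build_ieat_example(context: str, emotion: str, cause: str, response: str) -> dict:
    """Assemble one IEAT training example: emotion and attribution labels
    are injected into the model's reasoning trace ahead of the response,
    so the model learns to reason about affect before generating.
    The <think> template below is illustrative, not the paper's verbatim format."""
    thinking_trace = f"<think>The user feels {emotion} because {cause}.</think>"
    return {
        "context": context,
        "emotion": emotion,
        "cause": cause,
        "target": thinking_trace + " " + response,
    }
```

Each resulting (context, emotion, cause, response) tuple supervises both the injected reasoning trace and the final reply.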
Full-Duplex Interaction Track
- Scenarios: Interruption (e.g., follow-up question, topic-switch, silence), rejection (e.g., user real-time backchannel, ambient pauses).
- Size: 6,356 interruption utterances (train), 4,842 rejection utterances (train) (Zhao et al., 9 Jan 2026).
- Annotation: Time-stamped turn-taking markers, interruption/rejection types, and dialogue roles.
- Synthetic elements: DeepSeek scripts for barge-ins, side-talk; human actors maintain authentic overlap and hesitation dynamics.
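A full-duplex annotation entry might look like the following; the keys and values are hypothetical examples of the time-stamped markers and turn types described above, not the official release format:

```python
# One annotated barge-in event (schema is illustrative):
interruption_event = {
    "dialogue_id": "fd_000123",
    "event": "interruption",        # vs. "rejection"
    "subtype": "topic_switch",      # e.g. follow-up question, silence
    "user_onset": 4.82,             # second at which the user barges in
    "agent_cutoff": 5.10,           # second at which the agent should stop
    "expected_behavior": "yield_and_respond",
}

# For rejection cases (backchannels, side-talk) the agent should keep talking:
rejection_event = {
    "dialogue_id": "fd_000124",
    "event": "rejection",
    "subtype": "backchannel",       # e.g. "uh-huh" while the agent speaks
    "user_onset": 2.35,
    "expected_behavior": "continue_speaking",
}
```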
3. Evaluation Metrics and Protocols
HumDial employs composite multi-modal evaluation protocols combining automatic LLM-based, human-rated, and objective acoustic scoring.
Track I (Emotional Intelligence):
| Metric | Formula/Scale | Assesses |
|---|---|---|
| Emotion recognition | Accuracy | Trajectory detection |
| Reasoning | F1 | Attribution cause matching |
| Empathy Score | Weighted avg. (see below) | Response congruence |
Final Track I score:
$S = 0.2\,S_{T1} + 0.2\,S_{T2} + 0.1\,S_{\mathrm{text}} + 0.25\,S_{\mathrm{emo}} + 0.25\,S_{\mathrm{nat}}$
Where $S_{T1}$ is the trajectory-detection score, $S_{T2}$ the attribution-reasoning score, and $S_{\mathrm{text}}$, $S_{\mathrm{emo}}$, $S_{\mathrm{nat}}$ the textual, emotional, and naturalness empathy components ($1$–$5$ LLM-based empathy ratings and human Likert-scale judgments) (Zhao et al., 9 Jan 2026, Wang et al., 8 Jan 2026).
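A minimal implementation of this weighted combination, assuming all component scores are pre-normalized to a common $[0,1]$ range (the challenge's exact normalization may differ):

```python
def track1_score(s_t1: float, s_t2: float,
                 s_text: float, s_emo: float, s_nat: float) -> float:
    """Final Track I score: weighted sum of trajectory detection (S_T1),
    attribution reasoning (S_T2), and the three empathy components.
    Inputs are assumed normalized to [0, 1]; weights sum to 1.0."""
    return 0.2 * s_t1 + 0.2 * s_t2 + 0.1 * s_text + 0.25 * s_emo + 0.25 * s_nat
```

For example, `track1_score(0.9, 0.8, 0.6, 0.7, 0.75)` yields `0.7625`, showing how the empathy components ($0.25 + 0.25 + 0.1 = 0.6$ of the weight) dominate the final ranking.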
Track II (Full-Duplex):
| Metric | Formula/Interpretation | Assesses |
|---|---|---|
| Interruption success rate | (%) | Appropriate interruption |
| Rejection success rate | (%) | Silence, ignored input |
| Turn-taking latency | (sec) | Response speed |
| Overlap ratio | Fraction of time | Simultaneous speaking |
| Dialogue coherence | LLM-based, $1$–$100$ | Flow and consistency |
Combined score: the Track II metrics above are aggregated into a single leaderboard score (Zhao et al., 9 Jan 2026).
Evaluation uses Dockerized APIs on RTX A6000s, streaming audio I/O, and real-time injection of interruptions/rejections. A paired Student's $t$-test establishes statistical significance for system comparisons (Zhao et al., 9 Jan 2026).
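The Track II success-rate metrics and the paired $t$-statistic used for significance testing can be sketched dependency-free (computing a p-value would additionally require a $t$ distribution, e.g. `scipy.stats`):

```python
import math

def success_rate(outcomes) -> float:
    """Fraction of events handled correctly, as a percentage.
    `outcomes` is a list of 1 (correct) / 0 (incorrect) per event."""
    return 100.0 * sum(outcomes) / len(outcomes)

def paired_t_statistic(a, b) -> float:
    """Paired Student's t statistic over per-dialogue scores of two systems
    (degrees of freedom = n - 1). Assumes the difference variance is nonzero."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

For instance, a system handling 3 of 4 barge-ins correctly scores a 75% interruption success rate; the $t$-statistic then tests whether one system's per-dialogue scores reliably exceed another's.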
4. Model Innovations and Baselines
Leading entries leverage prefix-tuned Audio-LLMs (e.g., Qwen3-Omni-30B, GLM-4-Voice) and multi-task contrastive objectives for trajectory consistency and prosodic emotional encoding (Wang et al., 8 Jan 2026, Zhao et al., 9 Jan 2026). IEAT (Injected Emotional-Attribution Thinking) introduces a regularized loss aligning internal hidden states with emotion-cause vectors, forcing the model to “think emotionally” prior to response generation.
Stage-wise training:
- Phase 1: speech-text alignment, emotional self-distillation
- Phase 2: cross-modal joint optimization (text/speech/label heads) (Wang et al., 8 Jan 2026).
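An IEAT-style regularizer can be illustrated with a cosine-alignment penalty added to the generation loss; the cosine form and the weight $\lambda$ are assumptions for illustration, since the paper's exact loss formulation is not reproduced here:

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity between two vectors (assumed nonzero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ieat_regularized_loss(ce_loss, hidden_state, emotion_cause_vec, lam=0.1):
    """Total loss = cross-entropy + lambda * (1 - cos(h, e)): the penalty
    vanishes when the hidden state aligns with the injected emotion-cause
    vector, encouraging 'emotional thinking' before response generation.
    Cosine form and lambda are illustrative assumptions."""
    return ce_loss + lam * (1.0 - cosine(hidden_state, emotion_cause_vec))
```

With perfectly aligned vectors the penalty is zero and the loss reduces to plain cross-entropy; misalignment adds up to $\lambda$ of extra loss.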
Baselines include frozen LLM front-ends with retrieval-based empathy generators and modular supervised pipelines (ASR + dialog manager + TTS + VAD-based interruption detection). Both tracks demonstrate statistically significant human-like gains when employing emotional reasoning modules, full-duplex streaming dialogue engines, and data augmentation variants (Lin et al., 2022, Arora et al., 31 May 2025).
5. Data Resources and Challenge Datasets
HumDial datasets are sourced from authentic recordings (actors, scripted by LLMs) with multi-lingual, multi-domain, and multi-emotion coverage (Zhao et al., 9 Jan 2026, Koudounas et al., 26 May 2025). Notable supporting resources include:
- DeepDialogue: 40,150 multimodal emotion-rich multi-turn dialogues with voice synthesis, spanning 41 domains and 20 emotions (Koudounas et al., 26 May 2025).
- EChat-200K, EChat-eval: 200K empathetic speech-to-speech dialogues and an empathy evaluation framework integrating real/synthetic cues (Geng et al., 13 Aug 2025).
- C³ Benchmark: 1,079 bilingual instances addressing semantic/phonological ambiguity, omission, coreference, and multi-turn dependencies—automatically evaluated by LLM-judge templates with demonstrated alignment to human annotation (Ma et al., 30 Jul 2025).
- DialogBench: 12-task probe for coherence, consistency, knowledge use, personalization, and safety in LLM-driven dialogue, with up-to-date multicategory accuracy and robust bias-control (Ou et al., 2023).
6. Analysis, Lessons Learned, and Error Profiles
Empirical analysis reveals several critical insights:
- Emotional Tasks: LLM-based systems (TeleAI) achieve near-ceiling accuracy in multi-turn emotion tracking, causal inference, and trajectory detection (Wang et al., 8 Jan 2026, Zhao et al., 9 Jan 2026). However, empathetic generation lags by a measurable margin on the 5-point scale, signifying a gap between emotional reasoning and expressive empathy (Zhao et al., 9 Jan 2026, Geng et al., 13 Aug 2025). Removing the IEAT loss drops final scores by $0.33$ (bootstrap significance test) (Wang et al., 8 Jan 2026).
- Full-Duplex: Leading full-duplex models (e.g., Cookie_asr) optimize interruption and rejection success rates (79.3%, 72.2%), latency ($1.26$ s), and coherence (79.9/100) (Lin et al., 2022, Zhao et al., 9 Jan 2026). However, direct tradeoffs are observed: systems may emphasize barge-in detection at the expense of silence maintenance (e.g., Badcat: 89.7% interruption success, but reduced rejection accuracy and higher latency).
- Ambiguity and Context: Error analyses from C³ reveal persistent challenges in rare sense recognition, coreference resolution, and retention of dialogue context beyond 2–3 turns (Ma et al., 30 Jul 2025). Emotional ambiguities, paralinguistic events, and domain adaptation continue to stress generalization capacity of SDMs.
- Biases and Tuning Effects: Instruction tuning yields overall accuracy increases on human-likeness dimensions. However, emotional perception and persona tasks remain challenging, with several instruction-tuned models underperforming their base counterparts on emotion detection and daily-life fluency (Ou et al., 2023). Domain-specific prompts, multi-turn CoT memory, and emotion inference modules enhance domain adaptation and dialogue expressivity (Arora et al., 31 May 2025, Geng et al., 13 Aug 2025).
7. Impact, Infrastructure, and Future Directions
HumDial provides transparent, community-facing infrastructure (Dockerized eval suites, public leaderboard), detailed task definitions, and reproducible benchmarks for system-level progress (Zhao et al., 9 Jan 2026). Open recommendations include:
- Extension of datasets to spontaneous, multi-party and cross-lingual scenarios (beyond dyadic, English/Chinese dialogues).
- Fine-grained annotation of sub-emotional states, adversarial noise, and far-field recordings.
- Unified Audio-LLM objectives optimizing for emotion perception, causal reasoning, and real-time low-latency generation.
- Shared annotation schemas, standardized LLM-judge templates, and open-source baselines (OSUM-EChat, DeepDialogue, CHATS) (Geng et al., 13 Aug 2025, Koudounas et al., 26 May 2025, Mitsui et al., 2023).
By systematically benchmarking and catalyzing the development of human-like spoken dialogue systems, HumDial establishes the empirical and methodological backbone for evaluating advances in emotional intelligence, interactional robustness, and multimodal expressivity in the LLM era.