NC-Bench: Natural Conversation Benchmark
- NC-Bench is a family of evaluation protocols and datasets that measure conversational competence in language, speech, and ASR models.
- It draws on a catalog of over 120 dialogue patterns from conversation analysis to assess models’ abilities to manage, repair, and close interactions.
- The benchmark includes text, audio, and ASR variants with tailored metrics such as accuracy and word error rate to uncover specific dialogic deficits.
The Natural Conversation Benchmark (NC-Bench) is a family of evaluation protocols and datasets designed to rigorously assess conversational competence in LLMs, speech models, and automatic speech recognition (ASR) systems. Unlike benchmarks that emphasize content accuracy or narrow task completion, NC-Bench isolates the structural, sequential, and dialogic behaviors that are fundamental to natural conversation, enabling quantification and comparison of models' abilities to manage, repair, and coordinate dialogue using the normative patterns observed in human interaction (Moore et al., 10 Jan 2026, Zhang et al., 27 Jun 2025, Castillo-Bolado et al., 2024, Maheshwari et al., 2024).
1. Theoretical Foundations and Rationale
NC-Bench is grounded in Conversation Analysis and the IBM Natural Conversation Framework (NCF), which models human conversation as a set of over 120 generic patterns, each formalized as a prototypical sequence of dialogue acts (e.g., “inquiry→answer,” “repair→paraphrase,” “closing→pre-closing”) (Moore et al., 10 Jan 2026). The framework is domain-invariant, capturing behaviors underlying both spoken and written human exchanges. NC-Bench operationalizes representative subsets of NCF patterns to probe whether models can execute sequence-management actions such as answering, repairing, and closing, which collectively characterize "knowing how to talk"—not just "what to say." This approach contrasts with benchmarks focused on task, topic, or factual recall, foregrounding conversational form and structure over content.
2. Core Constructs and Evaluation Methodologies
Each NC-Bench variant presents conversation fragments ending with a user turn and prompts a model to generate the next turn. The test defines the set of acceptable dialogue acts for continuation, as specified by formal NCF patterns:
Let $P = (a_1, \ldots, a_n)$ be a partially instantiated NCF pattern, where each $a_i$ is a dialogue act (e.g., Inquiry, Answer, Repair, SequenceCloser).
The model's response is classified by a judge (automatic or LLM-based) into a dialogue-act label $\hat{a}$, and scored as $s = \mathbb{1}[\hat{a} \in A]$ for the set $A$ of acceptable next acts. Overall accuracy over a set of prompts $Q$ is then $\mathrm{Acc} = \frac{1}{|Q|} \sum_{q \in Q} s_q$.
Dialogue acts include but are not limited to: Answer, NonAnswer, Repeat, Paraphrase, Definition, Example, RepeatRequest, ParaphraseRequest, DefinitionRequest, ExampleRequest, SequenceCloser, SequenceAbort (Moore et al., 10 Jan 2026).
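This binary sequence-management scoring can be sketched as follows; the prompt IDs and acceptable-act sets are hypothetical, and the judged act is assumed to come from the judge LLM described below:

```python
# Sketch of NC-Bench-style binary scoring: a response counts as correct
# iff its judged dialogue-act label falls in the acceptable set A.
# Prompt IDs and acceptable sets here are illustrative, not from the paper.

ACCEPTABLE = {
    "inquiry-003": {"Answer", "NonAnswer"},
    "repair-017": {"Paraphrase", "Repeat"},
}

def score(prompt_id: str, judged_act: str) -> int:
    """Binary score s = 1 if the judged act is in the acceptable set A."""
    return int(judged_act in ACCEPTABLE[prompt_id])

def accuracy(results: list[tuple[str, str]]) -> float:
    """Overall accuracy: mean of binary scores over all prompts Q."""
    scores = [score(pid, act) for pid, act in results]
    return sum(scores) / len(scores)

print(accuracy([("inquiry-003", "Answer"), ("repair-017", "SequenceAbort")]))  # 0.5
```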
3. Benchmark Structure and Variants
NC-Bench encompasses multiple evaluation suites, each targeting distinct conversational modalities and challenges. The major text-based variant, as introduced by IBM researchers, consists of:
Basic Conversation Competence:
Evaluates fundamental sequence management in ordinary conversation. It covers nine patterns (Inquiry, Incremental Request, Self-Correction, four repair signals, Sequence Closer, Sequence Abort) with 20 instances each (180 items total). Prompts are short transcripts, typically two to three turns, and the model must produce the correct next agent action.
RAG Set (Retrieval-Augmented Generation):
Adds the requirement to ground answers in an external document. Patterns mirror the Basic set, but prompts also provide a context passage; the Inquiry pattern is subdivided into Grounded (answerable from the passage) and Ungrounded (the passage does not cover the answer, so the correct act is NonAnswer). The instruction mandates abstention when information is absent. Scoring focuses solely on sequence management, independent of factual accuracy.
Complex Request Set:
Probes the complexity of multi-slot, multi-turn business-process and service dialogues. This set includes 11 patterns (e.g., slot-filling detail requests, multi-step recommendations, expansion and repair choices, plus standard repairs and closings) totaling 360 items. Patterns involve intricate management of missing information, preliminary screening turns, and expansion variants, with complexity estimated as the sum of the number of turns ($t$) and required slots ($s$): $c = t + s$ (Moore et al., 10 Jan 2026).
| Set | Patterns | Total Items |
|---|---|---|
| Basic | 9 × 20 | 180 |
| RAG | 9 × 20 | 180 |
| Complex | 11 (varied counts) | 360 |
Variants exist for spoken audio LLMs (Zhang et al., 27 Jun 2025), and naturalistic ASR (Maheshwari et al., 2024), each adapting the benchmark structure as appropriate (e.g., audio queries and paralinguistic features, detailed disfluency annotation).
4. Dataset Construction and Annotation Protocols
The four-stage test item generation pipeline for LLM-focused NC-Bench proceeds as follows (Moore et al., 10 Jan 2026):
- Pattern Selection: Choose one NCF pattern for testing.
- Example Creation: Author at least 20 concrete transcripts per pattern. Sources include DailyDialog for Basic, Wikipedia for RAG, and diverse synthetic business scenarios for Complex Request sets. Each example omits the final agent turn, which is to be generated by the model.
- Generation: Prompt each of six target LLMs to continue the transcript using greedy decoding (max 128 tokens).
- Judgment & Scoring: A “judge LLM” (Mistral-Large-Instruct-2411) classifies each response, which is then scored via the binary matching scheme above.
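The four stages above reduce to a simple evaluation loop. In this minimal sketch, `generate_next_turn` and `judge_dialogue_act` are stand-ins (assumptions, not the paper's code) for calls to a target LLM with greedy decoding and to the judge LLM, respectively:

```python
# Sketch of the generation + judgment stages of the NC-Bench pipeline.
# The two callables are placeholders for the target LLM and judge LLM.

def run_benchmark(items, generate_next_turn, judge_dialogue_act):
    """items: dicts with 'transcript' (final agent turn omitted) and
    'acceptable' (set of acceptable next dialogue acts)."""
    correct = 0
    for item in items:
        response = generate_next_turn(item["transcript"])       # generation stage
        act = judge_dialogue_act(item["transcript"], response)  # judgment stage
        correct += act in item["acceptable"]                    # binary scoring
    return correct / len(items)

# Toy stand-ins to show the call shape:
items = [{"transcript": "U: What time do you open?", "acceptable": {"Answer"}}]
gen = lambda t: "We open at 9am."
judge = lambda t, r: "Answer"
print(run_benchmark(items, gen, judge))  # 1.0
```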
Audio-based NC-Bench construction (Zhang et al., 27 Jun 2025) uses the WildChat corpus and synthetic “paralinguistic-featured” recordings, followed by TTS voice cloning, noise injection into both synthetic and human speech, and balancing over speaker types and paralinguistic phenomena.
Conversational ASR NC-Bench construction (Maheshwari et al., 2024) employs real phone-call corpora (TalkBank), extracting, annotating, and pre-processing natural conversations. Detailed annotation covers [PAUSE], [FILLER], [INTERRUPT], [EVENT] markers, using timestamped inline tags and dual-channel diarization.
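Transcripts carrying such inline tags must be stripped before computing disfluency-normalized error rates. A minimal sketch, assuming the tag set named above and a space-delimited inline format (the tag payloads shown are hypothetical):

```python
import re

# Remove inline [PAUSE]/[FILLER]/[INTERRUPT]/[EVENT] annotations (with any
# payload, e.g. a timestamp) and collapse the leftover whitespace.
TAG = re.compile(r"\[(?:PAUSE|FILLER|INTERRUPT|EVENT)[^\]]*\]")

def normalize(transcript: str) -> str:
    """Strip annotation tags so only spoken words remain."""
    return re.sub(r"\s+", " ", TAG.sub(" ", transcript)).strip()

print(normalize("so [FILLER um] I was [PAUSE 0.8] thinking"))
# so I was thinking
```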
5. Evaluation Metrics, Experimental Protocols, and Results
Text-based LLM NC-Bench:
Judgment is performed by an LLM classifier, with binary scoring. Initial experiments with six open-source LLMs (granite-2B/8B, llama-3B/8B, qwen-3B/7B) report:
| Model | Basic (%) | RAG (%) | Complex (%) |
|---|---|---|---|
| granite-2B | 72.22 | 76.11 | 80.15 |
| granite-8B | 76.11 | 77.77 | 77.04 |
| llama-3B | 66.66 | 60.00 | 67.80 |
| llama-8B | 68.88 | 68.88 | 71.06 |
| qwen-3B | 82.22 | 75.55 | 62.19 |
| qwen-7B | 80.55 | 73.88 | 76.06 |
Answering tasks (Inquiry, Incremental, Self-Correction) achieve near-ceiling scores. Repair (especially Repeat Request), paraphrase, and closing acts remain challenging, particularly under complexity. In RAG, ungrounded queries often yield hallucinated answers instead of appropriate NonAnswer moves (Moore et al., 10 Jan 2026).
Audio LLM NC-Bench:
Scoring uses a hybrid of ASR-transcribed response grading and acoustic measures (UTMOS). Checklists are dynamically generated per-query, and automatic evaluation is enhanced by query-aware prompts to GPT-4o mini. Human-model correlation improves (Pearson up by 10–15%) when using multi-round ASR and query-aware checklists. Models reveal strong variance in performance for paralinguistic features; pipeline approaches typically lose prosodic cues, while end-to-end models (GPT-4o-Audio) perform best (Zhang et al., 27 Jun 2025).
Conversational ASR NC-Bench:
Key metric is Word Error Rate (WER), with optional disfluency-normalized and pause-insensitive variants: $\mathrm{WER} = \frac{S + D + I}{N}$, where $S$ = substitutions, $D$ = deletions, $I$ = insertions, and $N$ = reference words.
Zero-shot results show, e.g., Whisper’s WER jumps from 0.11 (LibriSpeech) to 0.54 (TalkBank conversational), with longer segments and speaker switches further degrading performance. Disfluency marker density exhibits a positive correlation ($r \approx +0.20$) with WER (Maheshwari et al., 2024).
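The WER formula reduces to a word-level Levenshtein alignment. A generic sketch (not the benchmark's evaluation code):

```python
# WER = (S + D + I) / N via dynamic-programming edit distance over words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("we open at nine", "we open nine today"))  # 0.5 (2 edits over 4 reference words)
```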
6. Extensibility, Limitations, and Comparative Positioning
NC-Bench’s modular test-of-pattern design allows extensibility: new NCF patterns, domains, and modalities augment the suite without reengineering core logic. Over 120 NCF patterns are available for future inclusion. The lightweight, open-source framework invites community contribution for emerging conversational challenges, such as embodied reference, storytelling, and non-English dialog structures (Moore et al., 10 Jan 2026).
Contrast with topical or task-oriented benchmarks (e.g., math reasoning, factual QA) is central: NC-Bench uniquely isolates the infrastructure of conversation (repair, escalation, closing, elicitation). By operating on form, not task or content, it complements suites like MT-Bench or MT-RAG and provides direct measurement of generic conversational competence.
7. Applications and Impact
NC-Bench serves as a diagnostic for LLMs, speech models, and ASR systems, exposing specific dialogic deficits such as lack of repair, improper closing, or inability to handle information-seeking adjacency pairs under complexity or retrieval constraints. Recent dynamic benchmarking extensions (e.g., (Castillo-Bolado et al., 2024)) further probe long-term memory, continual learning, and information integration in interleaved multi-task settings, simulating more naturalistic and challenging user–agent interactions.
A notable implication is that models achieving strong topical or factual coverage may still lack basic "conversational intelligence," as formalized by mastery of NCF patterns. NC-Bench thereby offers a high-bar, theory-grounded regime for next-generation evaluation, targeting coordination and interactional structure at the heart of human language use (Moore et al., 10 Jan 2026, Zhang et al., 27 Jun 2025, Castillo-Bolado et al., 2024, Maheshwari et al., 2024).