IBM Natural Conversation Framework
- IBM NCF is a systematic taxonomy of over 120 defined patterns that models diverse conversational interactions in AI agents.
- NC-Bench operationalizes NCF by evaluating inquiry, repair, and closing sequences across multiple dialogue scenarios.
- Empirical insights show high performance in basic patterns but reveal challenges in handling complex multi-turn and repair sequences.
The IBM Natural Conversation Framework (NCF) is a comprehensive, theory-grounded system for modeling and evaluating conversational competence in artificial agents, with specific instantiation in LLMs. Developed from foundational work in Conversation Analysis and refined through extensive empirical and application-driven research, NCF organizes conversational interaction into a taxonomy of recurrent patterns that structure agent behavior across diverse domains. Its operationalization in resources such as NC-Bench provides standardized methodologies for diagnosing conversational strengths and deficits in model architectures, tracing precise conversational moves, and supporting extensible evaluation protocols (Moore et al., 10 Jan 2026).
1. Theoretical Foundations
NCF's architecture draws heavily on Conversation Analysis, specifically work by Sacks, Schegloff, and Jefferson (1974, 2007), detailing the micro-structure of authentic, naturally occurring talk-in-interaction. Conversation is conceptualized not merely as a channel for information transfer but as a social achievement characterized by orderly coordination in turn-taking, adjacency pairs, repairs, closings, and preliminary moves. NCF operationalizes this by defining a library of over 120 generic conversational patterns, such as adjacency-pairs, expansions, repair sequences, and closers, extracted from cross-domain conversational data and research.
First articulated by Moore (2018), and developed further by Moore, An, & Ren (2023), the NCF is designed to diagnose conversational competence in agents beyond mere question-answering: encompassing repair, closing, elicitation, recommendation, and other interactional phenomena. The pattern-based approach enables designers and evaluators to instantiate nuanced UX flows for conversational applications and to systematically analyze agents’ facility with requisite patterns.
2. Taxonomy of Sequence-Management Patterns
NCF structures conversational activity into a taxonomy of sequence-management patterns. In NC-Bench, these patterns are grouped into three principal activity types: Answering Sequences, Repair Sequences, and Closing Sequences.
Answering Sequences
- Inquiry:
− USER: INQUIRY → AGENT: ANSWER
- Incremental Request:
− USER: INQUIRY → AGENT: ANSWER → USER: INCREMENTAL_REQUEST → AGENT: SUBSEQUENT_ANSWER
- Self-Correction:
− USER: INQUIRY → AGENT: ANSWER → USER: SELF_CORRECTION → AGENT: ALTERNATIVE_RESPONSE
Repair Sequences
- Definition Request / Definition:
− USER: DEFINITION_REQUEST → AGENT: DEFINITION
- Paraphrase Request / Paraphrase:
− USER: PARAPHRASE_REQUEST → AGENT: PARAPHRASE
- Repeat Request / Repeat:
− USER: REPEAT_REQUEST → AGENT: REPEAT
- Example Request / Example:
− USER: EXAMPLE_REQUEST → AGENT: EXAMPLE
Closing Sequences
- Sequence Closer:
− USER: ACKNOWLEDGMENT / ASSESSMENT / LAUGHTER → AGENT: LAST_TOPIC_CHECK or NO_RESPONSE
- Sequence Abort:
− USER: ABORT → AGENT: ACKNOWLEDGMENT
Appendices in NC-Bench include example transcripts for each pattern, facilitating empirical annotation and benchmarking.
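The taxonomy above lends itself to a simple machine-readable encoding. A minimal sketch follows; the dictionary layout and the `expected_final_act` helper are illustrative assumptions, not the official NCF schema:

```python
# Illustrative encoding (assumed layout, not the official NCF schema):
# each pattern is an ordered list of (speaker, dialogue_act) turn slots.
PATTERNS = {
    "Inquiry": [("USER", "INQUIRY"), ("AGENT", "ANSWER")],
    "Incremental Request": [
        ("USER", "INQUIRY"), ("AGENT", "ANSWER"),
        ("USER", "INCREMENTAL_REQUEST"), ("AGENT", "SUBSEQUENT_ANSWER"),
    ],
    "Repeat Request": [("USER", "REPEAT_REQUEST"), ("AGENT", "REPEAT")],
    "Sequence Abort": [("USER", "ABORT"), ("AGENT", "ACKNOWLEDGMENT")],
}

def expected_final_act(pattern_name):
    """Dialogue act the agent's final turn should realize for a pattern."""
    speaker, act = PATTERNS[pattern_name][-1]
    if speaker != "AGENT":
        raise ValueError("pattern does not end with an agent turn")
    return act
```

Encoding patterns this way makes the evaluation target explicit: every benchmark item reduces to checking whether the agent's generated turn realizes the final slot.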
3. Instantiation in NC-Bench
NC-Bench operationalizes the NCF taxonomy into three distinct evaluation suites, each targeting specific sequence-management competencies:
- Basic Conversation Competence Suite — Patterns: Inquiry, Incremental Request, Self-Correction, Definition Request, Paraphrase Request, Repeat Request, Example Request, Sequence Closer, Sequence Abort — Data: Everyday dialogues (adapted from DailyDialogue) — Prompting: Role-play instructions, transcript context truncation, final user inquiry
- Retrieval-Augmented Generation (RAG) Suite — Patterns: Inquiry (Grounded: answer required from the passage), Inquiry (Ungrounded: refusal required when the passage lacks the answer), Incremental Request / Self-Correction (combined for pragmatic similarity) — Focus: Whether sequence structure is maintained when responses must be grounded in external context
- Complex Request Suite — Patterns encompassing multi-turn business-process flows: Preliminary, Recommendation, Detail Request, Expansion (Choices / Repair) — Data: Synthetic scenarios (travel booking, insurance quoting) — Focus: Competence with business-process management and multi-turn sequence management
Each suite uses ~20 concrete transcripts per pattern, with the final agent turn omitted for the model under evaluation to generate.
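This truncation step can be sketched as follows; the transcript encoding and the prompt layout are assumptions for illustration, not NC-Bench's exact prompting format:

```python
def make_test_item(transcript, role_play_instruction):
    """Withhold the final agent turn of a transcript as the test target.

    `transcript` is a list of (speaker, utterance) pairs whose last pair
    is the gold agent turn; the returned prompt ends where the model
    under evaluation must generate that turn.
    """
    speaker, gold = transcript[-1]
    if speaker != "AGENT":
        raise ValueError("transcript must end with an agent turn")
    lines = [role_play_instruction]
    lines += [f"{s}: {u}" for s, u in transcript[:-1]]
    lines.append("AGENT:")  # generation slot for the model under test
    return "\n".join(lines), gold
```

The gold turn is kept aside as a reference; under the benchmark's design it is not string-matched but instead the generated turn is classified by a judge model.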
4. Evaluation Metrics and Scoring Scheme
Evaluation in NC-Bench is conducted via automatic dialogue-act classification using a judge LLM (Mistral-Large-Instruct-2411). Each agent-generated turn receives a label from a predefined taxonomy (~30 dialogue acts), mapped to pattern-acceptable labels:
- Turn-level scoring: for each test item $i$, the score is $s_i = 1$ if the produced label matches any label in the acceptable set for the targeted pattern, and $s_i = 0$ otherwise.
- Aggregate suite score: $\mathrm{Score} = \frac{1}{N}\sum_{i=1}^{N} s_i$, where $N$ is the count of examples in the suite (Basic: 180, RAG: 180, Complex: 360).
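The judge step reduces to a single classification prompt per generated turn. A sketch follows; the instruction wording and the (truncated) act list are assumptions, and the actual call to the judge model (Mistral-Large-Instruct-2411) is not shown:

```python
# Subset of the ~30-act taxonomy, listed for illustration only.
DIALOGUE_ACTS = [
    "Answer", "NonAnswer", "Definition", "Paraphrase", "Repeat",
    "PartialRepeat", "Example", "Acknowledgment", "ChoiceGiving",
]

def judge_prompt(dialogue_context, agent_turn):
    """Build a classification prompt for the judge LLM (wording assumed)."""
    return (
        "Label the agent's final turn with exactly one dialogue act from: "
        + ", ".join(DIALOGUE_ACTS)
        + ".\n\nDialogue:\n" + dialogue_context
        + "\nAGENT: " + agent_turn
        + "\n\nLabel:"
    )
```

The single-label constraint keeps the judge's output trivially parseable, so scoring is a set-membership check rather than free-text comparison.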
Label sets are tightly coupled to each pattern:
- Inquiry/Incremental/Self-Correction: {Answer, NonAnswer, Definition, RepeatRequest, ParaphraseRequest, ExampleRequest}
- Ungrounded Inquiry: {NonAnswer}
- Repeat Request: {Repeat, PartialRepeat}
- Paraphrase Request: {Paraphrase, Definition, Example}
- Sequence Closer: {PreClosing, Silence, NewTopic, Acknowledgment, Assessment, AppreciationReceipt}
- Detail Request: {DetailRequestGrounded}
- Expansion-Choices: {ChoiceGiving}
- Expansion-Repair: {Repeat, Paraphrase, Definition, Example}
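Combining these label sets with the turn-level rule yields the suite score directly. A minimal sketch, using a few of the mappings above (the dictionary keys are shorthand pattern names chosen here for illustration):

```python
# Pattern -> acceptable judge labels (a few entries from the scheme above).
ACCEPTABLE = {
    "Inquiry": {"Answer", "NonAnswer", "Definition", "RepeatRequest",
                "ParaphraseRequest", "ExampleRequest"},
    "UngroundedInquiry": {"NonAnswer"},
    "RepeatRequest": {"Repeat", "PartialRepeat"},
    "ParaphraseRequest": {"Paraphrase", "Definition", "Example"},
    "SequenceCloser": {"PreClosing", "Silence", "NewTopic",
                       "Acknowledgment", "Assessment", "AppreciationReceipt"},
}

def suite_score(results):
    """Mean turn-level score over (pattern, judged_label) pairs."""
    hits = sum(1 for pattern, label in results
               if label in ACCEPTABLE[pattern])
    return hits / len(results)
```

For example, a Repeat Request answered with a verbatim repeat scores 1, while an ungrounded inquiry answered with anything other than a NonAnswer (i.e. a refusal) scores 0.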
5. Model Performance and Empirical Insights
Initial NC-Bench results span six open-source models: granite-2B, granite-8B, llama-3B, llama-8B, qwen-3B, and qwen-7B. Key findings include:
- Answering patterns: High scores (almost 100%) for Inquiry, Incremental, and Self-Correction across all models.
- Repair patterns: Strong performance in Definition, Paraphrase, Example; however, consistent weakness in Repeat Request (models tend to paraphrase or elaborate instead of verbatim repeat).
- Closing patterns: Mixed results; less capable models over-extend, producing more content where a close or acknowledgment is required.
- RAG suite: Grounded inquiries perform robustly when gold answer exists; ungrounded queries elicit hallucinations rather than refusals.
- Complex requests: Preliminaries and detail requests show moderate accuracy; multi-turn slot-filling and recommendation/expansion sequences are much harder than single-turn answers.
Top-performing models by suite:
- Basic: qwen-3B (82.22%)
- RAG: granite-8B (77.77%)
- Complex: granite-2B (80.15%)
A plausible implication is that, while foundational sequence-management skills are tractable for current LLMs, finer-grained repairs, responsible refusal under external context, and multi-turn flow management remain substantial challenges.
6. Extensibility and Systematic Evaluation
NCF’s pattern library and the NC-Bench design are expressly extensible. Additional patterns, such as storytelling sequences or embodied referencing, can be systematically incorporated. The framework's reliance on prompt continuation and automated LLM-based judging permits lightweight deployment across tasks and domains, supporting ongoing evaluative research and dialogue system optimization. By decomposing conversational competence into granular moves and aligning these with research-based interactional theory, NCF facilitates both granular diagnosis and targeted enhancement in conversational agents (Moore et al., 10 Jan 2026).