
DialogGraph-LLM: Audio Dialogue Intent Framework

Updated 17 November 2025
  • DialogGraph-LLM is an end-to-end system that fuses multi-relational graph neural architectures with multimodal large language models to accurately capture speaker intents from audio dialogues.
  • Its MR-DAN component models temporal, speaker-specific, and semantic relations between utterances; the full framework improves accuracy by up to 13.7 percentage points over the strongest baseline.
  • The framework uses adaptive semi-supervised learning with dynamic confidence thresholds to stay robust when labeled data is limited.

DialogGraph-LLM is an end-to-end framework for audio dialogue intent recognition that integrates multi-relational graph neural architectures with multimodal foundation LLMs. It is designed to capture the complex inter-dependencies in multi-speaker audio dialogues and to perform well under limited supervision. The central innovation is the combination of a novel Multi-Relational Dialogue Attention Network (MR-DAN) with adaptive semi-supervised learning, which supports inference directly from audio to speaker intent classification. On both a real-world dataset and a public benchmark, DialogGraph-LLM outperforms prominent audio- and text-based baselines.

1. Pipeline Architecture and Preprocessing

DialogGraph-LLM’s pipeline begins with raw audio input from multi-speaker dialogues. Speaker diarization decomposes the audio $A$ into a sequence of utterance segments $\{a_j\}_{j=1}^L$, each assigned a speaker ID $s_j$. The audio encoder $\Phi$ of the Qwen2.5-Omni-7B backbone generates:

  • Utterance representations: $h_j = \Phi(a_j)$ for each segment.
  • Dialogue-level representation: $G = \Phi(A)$ incorporating the global acoustic overview.

These embeddings serve as node and global features for subsequent graph-based relational modeling.

2. Multi-Relational Dialogue Attention Network (MR-DAN)

MR-DAN constructs a graph over utterances, with each node $v_j$ initialized as $x_j^{(0)} = W_p\bigl[h_j;\,e_{s_j}\bigr]$, where $e_{s_j}$ is a learnable speaker embedding, $h_j$ encodes the acoustic content, and $W_p$ projects their concatenation.
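The node initialization above can be illustrated with a minimal pure-Python sketch. The dimensions, random values, and the names `init_node_features` and `speaker_table` are assumptions for illustration; in the actual system the utterance embeddings would come from the audio encoder and the parameters would be learned.

```python
import random

def init_node_features(utterance_embs, speaker_ids, speaker_table, W_p):
    """Compute x_j^(0) = W_p [h_j ; e_{s_j}] for every utterance j."""
    nodes = []
    for h_j, s_j in zip(utterance_embs, speaker_ids):
        concat = h_j + speaker_table[s_j]  # list concatenation [h_j ; e_{s_j}]
        nodes.append([sum(w * v for w, v in zip(row, concat)) for row in W_p])
    return nodes

random.seed(0)
d_h, d_s, d_x = 4, 2, 3  # toy dimensions (assumed)
# Stand-ins for encoder outputs h_j; three utterances from two speakers.
h = [[random.gauss(0, 1) for _ in range(d_h)] for _ in range(3)]
speakers = [0, 1, 0]
# One embedding per speaker ID (learnable in practice, random here).
speaker_table = {s: [random.gauss(0, 1) for _ in range(d_s)] for s in set(speakers)}
# Projection matrix of shape d_x x (d_h + d_s).
W_p = [[random.gauss(0, 1) for _ in range(d_h + d_s)] for _ in range(d_x)]

x0 = init_node_features(h, speakers, speaker_table, W_p)
```

Utterances 0 and 2 share a speaker, so they reuse the same speaker embedding while keeping distinct acoustic content.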

The dialogue graph uses four hand-designed edge types, denoted $\mathcal{R}$:

  • Temporal adjacency (sequential utterances)
  • Speaker-history adjacency (utterances by the same speaker)
  • Cross-utterance semantic adjacency (cross-turn pairs whose semantic similarity exceeds a threshold $\theta$)
  • Self-loops.

An adjacency matrix $A^{(r)}$ is constructed for each edge type $r \in \mathcal{R}$, determining the permitted attention neighborhoods.
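As a rough illustration of the four edge types, the sketch below builds one binary adjacency matrix per relation. Several details are assumptions not stated in the text: cosine similarity as the semantic measure, symmetric (undirected) edges, a temporal window of one turn, and the names `build_adjacency`, `window`, and `sim_threshold`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_adjacency(speaker_ids, embs, window=1, sim_threshold=0.9):
    """Return a dict mapping each relation name to an n x n 0/1 matrix."""
    n = len(speaker_ids)
    adj = {r: [[0] * n for _ in range(n)]
           for r in ("temporal", "speaker", "semantic", "self")}
    for i in range(n):
        adj["self"][i][i] = 1
        for j in range(n):
            if i == j:
                continue
            if abs(i - j) <= window:                  # nearby turns
                adj["temporal"][i][j] = 1
            if speaker_ids[i] == speaker_ids[j]:      # same speaker
                adj["speaker"][i][j] = 1
            if cosine(embs[i], embs[j]) >= sim_threshold:  # similar content
                adj["semantic"][i][j] = 1
    return adj

# Three utterances: speakers 0, 1, 0; the first and last are near-duplicates.
adj = build_adjacency([0, 1, 0], [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
```

In this toy dialogue the temporal relation links neighboring turns, while the speaker and semantic relations both link utterances 0 and 2.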

Relation-aware multi-head attention is employed:

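  • The $H$ attention heads are partitioned into $|\mathcal{R}|$ groups, each dedicated to one relation type $r \in \mathcal{R}$.
  • Attention heads for relation $r$ only attend to the neighbors $\mathcal{N}_r(i) = \{j : A^{(r)}_{ij} = 1\}$ of node $i$, as given by the adjacency matrix $A^{(r)}$.
  • For each head $m$ assigned to relation $r$, a compatibility score $e_{ij}^{(m)}$ between nodes $i$ and $j$ is normalized over the permitted neighborhood and used to aggregate projected neighbor features:

$$\alpha_{ij}^{(m)} = \frac{\exp\bigl(e_{ij}^{(m)}\bigr)}{\sum_{j' \in \mathcal{N}_r(i)} \exp\bigl(e_{ij'}^{(m)}\bigr)}$$

$$z_i^{(m)} = \sum_{j \in \mathcal{N}_r(i)} \alpha_{ij}^{(m)}\, W_V^{(m)} x_j^{(\ell)}$$

Heads are concatenated per relation to give $z_i^{(r)}$, followed by a relation-informed update that combines the per-relation outputs into the next-layer node state $x_i^{(\ell+1)}$.

An alternative update form aggregates learnable relation bias matrices.

After $K$ iterations, the graph-level embedding $h_{\mathcal{G}}$ is obtained by mean pooling over the final node states.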
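The relation-restricted attention and the final mean pooling can be sketched for a single head as follows. The scaled dot-product score is an assumption, since the text does not specify the score function, and `masked_attention_head` and `mean_pool` are hypothetical helper names rather than the authors' code.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def masked_attention_head(X, adj, Wq, Wk, Wv):
    """One attention head that only attends where adj[i][j] == 1."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d_k = len(K[0])
    out = []
    for i in range(len(X)):
        nbrs = [j for j in range(len(X)) if adj[i][j]]
        if not nbrs:  # isolated node under this relation
            out.append([0.0] * len(V[0]))
            continue
        # Assumed score: scaled dot product between query i and key j.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
                  for j in nbrs]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        out.append([sum(a * V[j][d] for a, j in zip(alphas, nbrs))
                    for d in range(len(V[0]))])
    return out

def mean_pool(X):
    """Average node states into one graph-level vector."""
    return [sum(col) / len(X) for col in zip(*X)]

I2 = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
self_loops = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
only_next = [[0, 1, 0], [0, 1, 0], [0, 0, 1]]  # node 0 sees only node 1

Z_self = masked_attention_head(X, self_loops, I2, I2, I2)
Z_next = masked_attention_head(X, only_next, I2, I2, I2)
graph_emb = mean_pool(Z_self)
```

With identity projections and self-loops only, each node attends solely to itself, so the output equals the input; under `only_next`, node 0 copies the value of node 1. A full layer would run several such heads per relation and concatenate them.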

3. LLM Integration and Input Fusion

DialogGraph-LLM leverages multimodal LLMs via customized input fusion:

  • Two lightweight adapters map $G$ (global audio embedding) and $h_{\mathcal{G}}$ (graph embedding) to the LLM input space:

$$e_{\text{audio}} = f_{\text{audio}}(G), \qquad e_{\text{graph}} = f_{\text{graph}}(h_{\mathcal{G}})$$

  • A prompt template (e.g., “Intent?”) is tokenized, and its audio and graph placeholder tokens are replaced with their respective adapted embeddings.
  • These three input streams are concatenated at layer zero and processed through the LLM, producing intent label probabilities $p(y \mid A)$ over the intent classes.
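The placeholder-replacement step can be sketched as below. This is a simplified illustration under stated assumptions: the adapters are taken to be single linear layers, the placeholder token names `<audio>` and `<graph>` are hypothetical, and a small dictionary stands in for the LLM's token embedding table.

```python
def linear_adapter(W, b, x):
    """Assumed adapter form: one affine map into the LLM input space."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def fuse_inputs(prompt_tokens, token_embeddings, adapted):
    """Swap placeholder tokens for adapted embeddings; look up the rest."""
    return [adapted[t] if t in adapted else token_embeddings[t]
            for t in prompt_tokens]

# Toy 2-dimensional "LLM input space".
token_embeddings = {
    "Intent": [0.1, 0.2],
    "?": [0.3, 0.4],
    "<audio>": [0.0, 0.0],   # hypothetical placeholder tokens
    "<graph>": [0.0, 0.0],
}
G = [1.0, 2.0, 3.0]          # stand-in global audio embedding
h_graph = [0.5, -0.5]        # stand-in graph embedding

e_audio = linear_adapter([[1, 0, 0], [0, 1, 1]], [0.0, 0.0], G)
e_graph = linear_adapter([[2, 0], [0, 2]], [0.0, 0.0], h_graph)

fused = fuse_inputs(
    ["<audio>", "<graph>", "Intent", "?"],
    token_embeddings,
    {"<audio>": e_audio, "<graph>": e_graph},
)
```

The resulting sequence `fused` is what would be fed to the first LLM layer in place of ordinary token embeddings.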

4. Adaptive Semi-Supervised Learning and Pseudo-Labeling

Addressing limited supervision, DialogGraph-LLM incorporates an adaptive semi-supervised learning (SSL) strategy comprising dual-threshold filtering and entropy-based sample selection:

  • For each unlabeled instance $x_u$, obtain the posterior $p_u(c) = p(c \mid x_u)$ over intent classes.
  • Maintain a global confidence threshold $\tau_t$ via exponential moving average (EMA) with decay $\beta$, where $\hat{\tau}_t$ is the confidence estimate from the current batch:

$$\tau_t = \beta\,\tau_{t-1} + (1-\beta)\,\hat{\tau}_t$$

  • Estimate class marginals $\tilde{p}_t(c)$ by EMA, forming per-class thresholds $\tau_t(c)$ from $\tau_t$ and $\tilde{p}_t(c)$.
  • Filter instances: accept $x_u$ iff $p_u(\hat{c}_u) \ge \tau_t$ and $p_u(\hat{c}_u) \ge \tau_t(\hat{c}_u)$, where $\hat{c}_u = \arg\max_c p_u(c)$.
  • Compute entropy:

$$H(p_u) = -\sum_{c} p_u(c)\,\log p_u(c)$$

  • Rank eligible samples by entropy, augmenting the training set with high-information pseudo-labels.
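The filtering and ranking steps above can be sketched in a few lines. The threshold values here are toy numbers supplied by hand, because the exact rule for deriving per-class thresholds is not reproduced in this sketch. Ranking higher-entropy samples first is one reading of "high-information" and is exposed as the hypothetical flag `prefer_high_entropy`.

```python
import math

def ema(prev, new, beta=0.9):
    """Exponential moving average update."""
    return beta * prev + (1 - beta) * new

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(q * math.log(q) for q in p if q > 0)

def select_pseudo_labels(probs, tau, class_tau, top_k=None,
                         prefer_high_entropy=True):
    """Dual-threshold filter followed by entropy ranking.

    Returns a list of (sample_index, pseudo_label, entropy) tuples.
    """
    eligible = []
    for idx, p in enumerate(probs):
        c_hat = max(range(len(p)), key=p.__getitem__)  # predicted class
        if p[c_hat] >= tau and p[c_hat] >= class_tau[c_hat]:
            eligible.append((idx, c_hat, entropy(p)))
    eligible.sort(key=lambda t: t[2], reverse=prefer_high_entropy)
    return eligible[:top_k]  # top_k=None keeps every eligible sample

probs = [
    [0.90, 0.05, 0.05],  # confident, class 0
    [0.40, 0.35, 0.25],  # below the global threshold
    [0.10, 0.85, 0.05],  # confident, class 1
    [0.10, 0.75, 0.15],  # passes global but not the class-1 threshold
]
tau = 0.7                    # toy global threshold
class_tau = [0.7, 0.8, 0.6]  # toy per-class thresholds

selected = select_pseudo_labels(probs, tau, class_tau)
new_tau = ema(tau, 0.8)      # example EMA step toward a batch estimate of 0.8
```

Samples 0 and 2 pass both thresholds; sample 2 is ranked first because its posterior has the higher entropy of the two.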

5. Optimization Objectives and Loss Functions

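Training optimizes a unified objective over the labeled set ($\mathcal{D}_L$) and selected pseudo-labeled samples ($\mathcal{D}_P$):

  • Supervised cross-entropy loss:

$$\mathcal{L}_{\text{sup}} = -\frac{1}{|\mathcal{D}_L|} \sum_{(x,\,y) \in \mathcal{D}_L} \log p(y \mid x)$$

  • Unsupervised loss over pseudo-labels:

$$\mathcal{L}_{\text{unsup}} = -\frac{1}{|\mathcal{D}_P|} \sum_{(x_u,\,\hat{c}_u) \in \mathcal{D}_P} \log p(\hat{c}_u \mid x_u)$$

  • Regularized joint objective:

$$\mathcal{L} = \mathcal{L}_{\text{sup}} + \lambda_u\,\mathcal{L}_{\text{unsup}} + \gamma\,\lVert \Theta \rVert_2^2$$

where $\lambda_u$ scales the unsupervised contribution, $\gamma$ is the $L_2$ regularization coefficient, and $\Theta$ denotes the trainable parameters.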
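A small numeric sketch of the joint objective follows, assuming plain cross-entropy for both the labeled and pseudo-labeled terms and a squared L2 penalty. The names `lam_u` and `gamma` and all numbers are illustrative, not values from the paper.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the given labels."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def joint_objective(sup_probs, sup_labels, pseudo_probs, pseudo_labels,
                    params, lam_u=0.5, gamma=1e-4):
    """Supervised CE + lam_u * pseudo-label CE + gamma * squared L2 norm."""
    l_sup = cross_entropy(sup_probs, sup_labels)
    # With no accepted pseudo-labels the unsupervised term is zero.
    l_unsup = cross_entropy(pseudo_probs, pseudo_labels) if pseudo_labels else 0.0
    l2 = sum(w * w for w in params)
    return l_sup + lam_u * l_unsup + gamma * l2

loss = joint_objective(
    sup_probs=[[0.5, 0.5]], sup_labels=[0],
    pseudo_probs=[[0.5, 0.5]], pseudo_labels=[1],
    params=[1.0, 2.0], lam_u=0.5, gamma=0.1,
)
```

Here both cross-entropy terms equal ln 2 and the squared norm is 5, so the total is 1.5 ln 2 + 0.5.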

6. Empirical Evaluation and Results

DialogGraph-LLM is evaluated on two datasets:

  • MarketCalls: 8,770 real Mandarin sales calls, annotated across four hierarchical intent levels (A–D), with diarized speaker turns.
  • MIntRec 2.0: public multimodal benchmark for intent recognition (in-scope/out-of-scope) in audio/text dialogues.

Metrics include accuracy, macro-F1, and per-class F1. Baselines comprise Llama3.1-8B, GLM-4-9B, Gemini1.5-Pro, Qwen2.5-Omni, as well as multimodal methods MAG-BERT, MulT, TCL-MAP, and A-MESS.

Performance Summary

| Dataset | Best Baseline | DialogGraph-LLM | Gain |
|---|---|---|---|
| MarketCalls | 63.6% / 63.1% (Qwen) | 77.3% / 76.8% | +13.7% |
| MIntRec 2.0 | 56.8% / 49.3% (A-MESS) | 64.3% / 58.1% | +7.5% |

Ablation studies show that omitting MR-DAN causes a sharp performance drop and that fixed-threshold SSL is suboptimal (73.6% accuracy vs. 77.3% for the full model). Performance peaks with 8 attention heads, together with tuned settings of the temporal window size and the cross-turn similarity threshold.

7. Contributions, Limitations, and Prospects

DialogGraph-LLM delivers:

  • An audio-to-intent pipeline integrating raw audio, relational graph structure, and LLMs.
  • MR-DAN, a multi-relational attention GNN modeling temporal, speaker, and semantic dialogue relations with fixed edge types.
  • Adaptive semi-supervised learning with dynamic thresholds and entropy-based instance selection.

Limitations include evaluation on a single backbone (Qwen2.5-Omni-7B), manual edge type specification, and residual pseudo-label noise. Suggested future work includes broadening the set of foundation model backbones (e.g., GPT-4o, AudioPaLM), end-to-end edge-type learning, advanced SSL schemes (consistency regularization, co-training), and extension to streaming/long-context inference.

A plausible implication is that integrating explicit dialogue graph structure with foundation LLMs under limited supervision yields substantial gains in intent recognition accuracy and data efficiency for audio-rich domains.
