DialogGraph-LLM: Audio Dialogue Intent Framework

Updated 17 November 2025

DialogGraph-LLM is an end-to-end system that fuses multi-relational graph neural architectures with multimodal large language models to accurately capture speaker intents from audio dialogues.
Its MR-DAN component leverages temporal, speaker-specific, and semantic relations, yielding accuracy gains of up to 13.7 percentage points over the strongest baseline.
The framework employs adaptive semi-supervised learning with dynamic thresholds, ensuring robust performance and scalable inference in real-world applications.

DialogGraph-LLM is an end-to-end framework for audio dialogue intent recognition that integrates multi-relational graph neural architectures with multimodal foundation LLMs, specifically designed to address complex inter-dependencies in multi-speaker audio dialogues and excel under limited supervision. The central innovation is the composition of a novel Multi-Relational Dialogue Attention Network (MR-DAN) with adaptive semi-supervised learning mechanisms, yielding robust, scalable inference directly from audio to speaker intent classification. DialogGraph-LLM demonstrates competitive performance in real-world and benchmark scenarios, outperforming prominent audio- and text-based baselines.

1. Pipeline Architecture and Preprocessing

DialogGraph-LLM’s pipeline starts with raw audio input from multi-speaker dialogues. Speaker diarization decomposes the audio $A$ into a sequence of utterance segments $\{a_j\}_{j=1}^L$, each assigned a speaker ID $s_j$. The Qwen2.5-Omni-7B backbone’s audio encoder $\Phi$ generates:

Utterance representations: $h_j = \Phi(a_j)$ for each segment.
Dialogue-level representation: $G = \Phi(A)$ incorporating the global acoustic overview.

These embeddings serve as node and global features for subsequent graph-based relational modeling.

2. Multi-Relational Dialogue Attention Network (MR-DAN)

Fundamentally, MR-DAN constructs a graph over utterances, with each node $v_j$ initialized as: $x_j^{(0)} = W_p\bigl[h_j;\,e_{s_j}\bigr]$ where $e_{s_j}$ is a learnable speaker embedding and $h_j$ encodes the acoustic content.
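The node initialization $x_j^{(0)} = W_p[h_j;\,e_{s_j}]$ can be illustrated with a minimal pure-Python sketch. The dimensions, the random values, and the helper name `init_node_features` are assumptions made for illustration only; in the actual system $h_j$ comes from the audio encoder and $W_p$ and the speaker table are learned.

```python
import random

def init_node_features(utter_embs, speaker_ids, speaker_table, W_p):
    """Return x_j^(0) = W_p [h_j ; e_{s_j}] for every utterance j."""
    nodes = []
    for h_j, s_j in zip(utter_embs, speaker_ids):
        concat = list(h_j) + list(speaker_table[s_j])  # [h_j ; e_{s_j}]
        # Matrix-vector product with the projection W_p (one row per output dim).
        nodes.append([sum(w * c for w, c in zip(row, concat)) for row in W_p])
    return nodes

random.seed(0)
d_h, d_s, d_x = 4, 2, 3  # toy dimensions, chosen only for illustration
utter_embs = [[random.gauss(0, 1) for _ in range(d_h)] for _ in range(3)]
speaker_ids = [0, 1, 0]  # diarized speaker ID per utterance
speaker_table = {s: [random.gauss(0, 1) for _ in range(d_s)] for s in (0, 1)}
W_p = [[random.gauss(0, 0.1) for _ in range(d_h + d_s)] for _ in range(d_x)]

x0 = init_node_features(utter_embs, speaker_ids, speaker_table, W_p)
print(len(x0), len(x0[0]))
```

Utterances 0 and 2 share a speaker embedding here, so any difference between their node features comes from the acoustic part alone.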

The dialogue graph uses four hand-designed edge types ($|\mathcal{R}| = 4$):

Temporal adjacency (sequential utterances)
Speaker-history adjacency (utterances by the same speaker)
Cross-utterance semantic adjacency (cross-turn relations whose semantic similarity exceeds a threshold $\theta_{\text{sim}}$)
Self-loops.

An adjacency matrix $M^{(r)}$ is constructed for each edge type $r$, determining the permitted attention neighborhoods.
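A minimal sketch of how the four per-relation adjacency matrices might be built is given below. Several details are assumptions, not taken from the source: the temporal relation uses a symmetric window, speaker-history edges point from an utterance to earlier utterances by the same speaker, semantic edges use cosine similarity between utterance embeddings, and the names `build_adjacency`, `window`, and `sim_threshold` are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_adjacency(embs, speakers, window=1, sim_threshold=0.8):
    """Return one 0/1 adjacency matrix per edge type (assumed definitions)."""
    n = len(embs)
    relations = ("temporal", "speaker", "semantic", "self")
    adj = {r: [[0] * n for _ in range(n)] for r in relations}
    for i in range(n):
        adj["self"][i][i] = 1  # self-loop
        for j in range(n):
            if i == j:
                continue
            if abs(i - j) <= window:  # sequential neighbors
                adj["temporal"][i][j] = 1
            if j < i and speakers[i] == speakers[j]:  # same speaker, earlier turn
                adj["speaker"][i][j] = 1
            if cosine(embs[i], embs[j]) > sim_threshold:  # similar content
                adj["semantic"][i][j] = 1
    return adj

embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
speakers = ["A", "B", "A"]
adj = build_adjacency(embs, speakers, window=1, sim_threshold=0.9)
for name, matrix in adj.items():
    print(name, matrix)
```

In this toy dialogue, utterances 0 and 2 are linked by both the speaker-history and the semantic relation, while utterance 1 is connected to them only through temporal adjacency.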

Relation-aware multi-head attention is employed:

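$H$ attention heads are partitioned into $|\mathcal{R}| = 4$ groups, each dedicated to a relation type $r$.
Attention heads for relation $r$ only attend to neighbors under $r$, according to the adjacency matrix $M^{(r)}$.
For each head, attention scores are computed between node $i$ and its relation-$r$ neighbors $\mathcal{N}_r(i) = \{j : M^{(r)}_{ij} = 1\}$, normalized by a softmax over that neighborhood, and used to form a weighted sum of neighbor features.

Heads are concatenated per relation to give a relation-specific output $z_i^{(r)}$, followed by a relation-informed update that combines the outputs of all relations into the next node state $x_i^{(l+1)}$.

An alternative update form aggregates learnable relation bias matrices $B_r$.

After $T$ iterations, the graph-level embedding is acquired via mean pooling over the final node states: $z_{\text{graph}} = \frac{1}{L}\sum_{j=1}^{L} x_j^{(T)}$.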
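The relation-restricted attention and the final mean pooling can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses scaled dot-product scoring, one head per relation, identity query/key/value projections, and plain concatenation in place of the learned update; the function names are hypothetical.

```python
import math

def masked_attention(X, mask):
    """Node i attends only to nodes j with mask[i][j] == 1.
    Assumption: identity projections (no learned W_Q, W_K, W_V)."""
    d = len(X[0])
    out = []
    for i, q in enumerate(X):
        nbrs = [j for j, m in enumerate(mask[i]) if m]
        if not nbrs:  # node has no neighbor under this relation
            out.append([0.0] * d)
            continue
        scores = [sum(a * b for a, b in zip(q, X[j])) / math.sqrt(d) for j in nbrs]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * X[j][k] for w, j in zip(weights, nbrs))
                    for k in range(d)])
    return out

def mr_attention_layer(X, masks):
    """One head per relation; per-relation outputs are concatenated per node."""
    per_rel = [masked_attention(X, masks[r]) for r in sorted(masks)]
    return [sum((rel[i] for rel in per_rel), []) for i in range(len(X))]

def mean_pool(X):
    """Graph-level embedding as the mean of node states."""
    n = len(X)
    return [sum(row[k] for row in X) / n for k in range(len(X[0]))]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
masks = {
    "self": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    "temporal": [[0, 1, 0], [1, 0, 1], [0, 1, 0]],
}
Z = mr_attention_layer(X, masks)
graph_emb = mean_pool(Z)
print(Z[0], len(graph_emb))
```

Because each relation has its own mask, the same node receives a different aggregated view per relation; here node 0 sees only itself under the self relation and only node 1 under the temporal relation.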

3. LLM Integration and Input Fusion

DialogGraph-LLM leverages multimodal LLMs via customized input fusion:

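Two lightweight adapters map $G$ (global audio embedding) and $z_{\text{graph}}$ (graph embedding) to the LLM input space:

$\tilde{G} = f_{\text{audio}}(G), \qquad \tilde{z}_{\text{graph}} = f_{\text{graph}}(z_{\text{graph}})$

A prompt template (e.g., “Intent?”) is tokenized, and its audio and graph placeholder tokens are replaced with their respective adapted embeddings.
These three input streams are concatenated at layer zero and processed through the LLM, producing intent label probabilities:

$p(y \mid A) = \mathrm{softmax}\bigl(\mathrm{LLM}([\tilde{G};\,\tilde{z}_{\text{graph}};\,E_{\text{prompt}}])\bigr)$

$\hat{y} = \arg\max_{y}\, p(y \mid A)$

where $E_{\text{prompt}}$ denotes the prompt token embeddings.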
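The placeholder substitution step can be sketched as follows. This is an illustrative example only: the adapters are assumed to be single linear layers, the placeholder strings `<audio>` and `<graph>` are hypothetical names, and the toy embedding table stands in for the LLM's real token embeddings.

```python
def linear_adapter(vec, W, b):
    """Assumed adapter form: a single linear layer W @ vec + b."""
    return [sum(w * v for w, v in zip(row, vec)) + bias for row, bias in zip(W, b)]

def fuse_inputs(prompt_tokens, token_embeddings, audio_emb, graph_emb,
                audio_tok="<audio>", graph_tok="<graph>"):
    """Swap placeholder tokens for adapted embeddings; look up all other tokens."""
    fused = []
    for tok in prompt_tokens:
        if tok == audio_tok:
            fused.append(audio_emb)
        elif tok == graph_tok:
            fused.append(graph_emb)
        else:
            fused.append(token_embeddings[tok])
    return fused

# Toy values; the LLM hidden size is 3 in this sketch.
G = [1.0, 2.0]        # global audio embedding
z_graph = [2.0]       # graph embedding
G_adapted = linear_adapter(G, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
z_adapted = linear_adapter(z_graph, [[1.0], [2.0], [3.0]], [0.5, 0.5, 0.5])

prompt = ["<audio>", "<graph>", "Intent", "?"]
table = {"Intent": [0.1, 0.2, 0.3], "?": [0.0, 0.0, 0.0]}
fused = fuse_inputs(prompt, table, G_adapted, z_adapted)
print(fused)
```

The fused sequence has one vector per prompt position, all in the LLM's hidden size, so it can be passed to the model as input embeddings in place of token IDs.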

4. Adaptive Semi-Supervised Learning and Pseudo-Labeling

Addressing limited supervision, DialogGraph-LLM incorporates an adaptive semi-supervised learning (SSL) strategy comprising dual-threshold filtering and entropy-based sample selection:

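For each unlabeled instance $u$, obtain the posterior $p_u$ over intent classes.
Maintain global confidence threshold $\tau_t$ via exponential moving average (EMA):

$\tau_t = \rho\,\tau_{t-1} + (1-\rho)\,\frac{1}{|\mathcal{B}_t|}\sum_{u \in \mathcal{B}_t} \max_c p_u(c)$

where $\rho \in [0,1)$ is the EMA momentum and $\mathcal{B}_t$ is the unlabeled batch at step $t$.

Estimate class marginals $\tilde{p}_t(c)$ by EMA:

$\tilde{p}_t(c) = \rho\,\tilde{p}_{t-1}(c) + (1-\rho)\,\frac{1}{|\mathcal{B}_t|}\sum_{u \in \mathcal{B}_t} p_u(c)$

Per-class thresholds $\tau_t(c)$ are then formed from these marginals and the global threshold.

Filter instances: accept $u$ iff $\max_c p_u(c) \ge \tau_t$ and $\max_c p_u(c) \ge \tau_t(\hat{y}_u)$, where $\hat{y}_u = \arg\max_c p_u(c)$.
Compute entropy:

$H(p_u) = -\sum_c p_u(c)\,\log p_u(c)$

Rank eligible samples by entropy, augmenting the training set with high-information pseudo-labels.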
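The filtering and selection steps above can be sketched as below. The exact formulas are assumptions: the global threshold tracks the mean top-class confidence, the per-class threshold scales the global one by the class marginal relative to a uniform prior (capped at 1), and "high-information" is read as highest entropy among eligible samples. The class name `AdaptiveThreshold`, the momentum value, and `top_k` are hypothetical.

```python
import math

class AdaptiveThreshold:
    def __init__(self, num_classes, momentum=0.9):
        self.m = momentum
        self.tau = 1.0 / num_classes                      # global threshold
        self.marginals = [1.0 / num_classes] * num_classes

    def update(self, batch_probs):
        """EMA update of the global threshold and the class marginals."""
        n = len(batch_probs)
        mean_conf = sum(max(p) for p in batch_probs) / n
        self.tau = self.m * self.tau + (1 - self.m) * mean_conf
        for c in range(len(self.marginals)):
            mean_c = sum(p[c] for p in batch_probs) / n
            self.marginals[c] = self.m * self.marginals[c] + (1 - self.m) * mean_c

    def class_threshold(self, c):
        """Assumed form: frequent classes get a higher bar than rare ones."""
        return min(1.0, self.tau * self.marginals[c] * len(self.marginals))

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def select_pseudo_labels(batch_probs, thr, top_k):
    eligible = []
    for idx, p in enumerate(batch_probs):
        conf = max(p)
        label = p.index(conf)
        # Dual-threshold filter: global check and class-specific check.
        if conf >= thr.tau and conf >= thr.class_threshold(label):
            eligible.append((entropy(p), idx, label))
    eligible.sort(reverse=True)  # highest entropy first
    return [(idx, label) for _, idx, label in eligible[:top_k]]

batch = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.1, 0.8, 0.1], [0.34, 0.33, 0.33]]
thr = AdaptiveThreshold(num_classes=3, momentum=0.5)
thr.update(batch)
selected = select_pseudo_labels(batch, thr, top_k=2)
print(selected)
```

Samples 1 and 3 fall below the global threshold and are discarded; of the two survivors, the less peaked prediction ranks first under the highest-entropy reading.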

5. Optimization Objectives and Loss Functions

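Training optimizes a unified objective over the labeled set ($\mathcal{D}_L$) and selected pseudo-labeled samples ($\mathcal{D}_P$):

Supervised cross-entropy loss:

$\mathcal{L}_{\text{sup}} = -\frac{1}{|\mathcal{D}_L|}\sum_{(x,y)\in\mathcal{D}_L} \log p_\Theta(y \mid x)$

Unsupervised loss over pseudo-labels:

$\mathcal{L}_{\text{unsup}} = -\frac{1}{|\mathcal{D}_P|}\sum_{(x,\hat{y})\in\mathcal{D}_P} \log p_\Theta(\hat{y} \mid x)$

Regularized joint objective:

$\mathcal{L} = \mathcal{L}_{\text{sup}} + \lambda_u\,\mathcal{L}_{\text{unsup}} + \beta\,\lVert\Theta\rVert_2^2$

where $\Theta$ denotes the trainable parameters, $\lambda_u$ scales the unsupervised contribution, and $\beta$ is the $L_2$ regularization coefficient.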
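As a worked illustration of the joint objective, the sketch below combines a supervised cross-entropy term, a pseudo-label cross-entropy term, and an $L_2$ penalty. The averaging over each set, the default weights, and the function names are assumptions for illustration; the source does not give concrete values.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the given labels."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def joint_loss(lab_probs, labels, pl_probs, pseudo_labels, params,
               lambda_u=0.5, beta=1e-4):
    sup = cross_entropy(lab_probs, labels)
    # No pseudo-labels selected yet means no unsupervised contribution.
    unsup = cross_entropy(pl_probs, pseudo_labels) if pseudo_labels else 0.0
    l2 = sum(w * w for w in params)  # squared L2 norm of the parameters
    return sup + lambda_u * unsup + beta * l2

loss = joint_loss(
    lab_probs=[[0.5, 0.5]], labels=[0],
    pl_probs=[[0.25, 0.75]], pseudo_labels=[1],
    params=[1.0, 2.0], lambda_u=1.0, beta=0.1,
)
print(round(loss, 4))
```

Early in training, when few samples pass the pseudo-label filter, the unsupervised term is small or zero and the objective reduces to regularized supervised learning.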

6. Empirical Evaluation and Results

DialogGraph-LLM is evaluated on two datasets:

MarketCalls: 8,770 real Mandarin sales calls, annotated across four hierarchical intent levels (A–D), with diarized speaker turns.
MIntRec 2.0: public multimodal benchmark for intent recognition (in-scope/out-of-scope) in audio/text dialogues.

Metrics include accuracy, macro-F1, and per-class F1. Baselines comprise Llama3.1-8B, GLM-4-9B, Gemini1.5-Pro, Qwen2.5-Omni, as well as multimodal methods MAG-BERT, MulT, TCL-MAP, and A-MESS.

Performance Summary

Dataset	Best Baseline (Accuracy / Macro-F1)	DialogGraph-LLM (Accuracy / Macro-F1)	Accuracy Gain (points)
MarketCalls	63.6% / 63.1% (Qwen2.5-Omni)	77.3% / 76.8%	+13.7
MIntRec 2.0	56.8% / 49.3% (A-MESS)	64.3% / 58.1%	+7.5

Ablation studies show that omitting MR-DAN results in sharp performance drops and that fixed-threshold SSL is suboptimal (73.6% accuracy vs. 77.3% with adaptive thresholds). Tuning MR-DAN hyperparameters (8 attention heads, together with the window size and the cross-turn similarity threshold) produces consistent empirical peaks.

7. Contributions, Limitations, and Prospects

DialogGraph-LLM delivers:

An audio-to-intent pipeline integrating raw audio, relational graph structure, and LLMs.
MR-DAN, a multi-relational attention GNN modeling temporal, speaker, and semantic dialogue relations with fixed edge types.
Adaptive semi-supervised learning with dynamic thresholds and entropy-based instance selection.

Limitations include exclusive evaluation on Qwen2.5-Omni-7B, manual edge type specification, and residual pseudo-label noise. Suggested future work involves broadening foundation model backbones (e.g., GPT-4o, AudioPaLM), end-to-end edge-type learning, advanced SSL schemes (consistency regularization, co-training), and extension to streaming/long-context inference.

A plausible implication is that integrating explicit dialogue graph structure with foundation LLMs under limited supervision yields substantial gains in intent recognition accuracy and data efficiency for audio-rich domains.
