
SayNext-Bench: Multimodal Utterance Prediction

Updated 6 February 2026
  • SayNext-Bench is a multimodal benchmark that evaluates next-utterance prediction by integrating text, gestures, facial expressions, and prosodic tones.
  • It introduces novel datasets, protocols, and a dual-route predictive processing architecture to rigorously assess multimodal AI dialogue performance.
  • Baseline evaluations show that cognitive-inspired priming tokens improve semantic, lexical, and emotional alignment compared to traditional models.

SayNext-Bench is a multimodal benchmark specifically designed to evaluate the ability of LLMs and multimodal LLMs (MLLMs) to perform next-utterance prediction in human dialogue. Unlike traditional next-token prediction tasks, next-utterance prediction is intended as a more human-like test, requiring integration of both verbal and non-verbal cues (gesture, gaze, facial expression, prosodic tone) to anticipate a speaker’s immediate response in real-world conversational settings. The benchmark introduces novel datasets, protocols, evaluation metrics, and a cognitively inspired model architecture to systematically probe the predictive processing capabilities of contemporary multimodal AI systems (Yang et al., 30 Jan 2026).

1. Theoretical Motivation and Task Formulation

Human conversation is fundamentally predictive in nature: listeners routinely anticipate a speaker’s next utterance not only from prior lexical content but also from multimodal, affective, and contextual cues. SayNext-Bench explicitly frames next-utterance prediction as a test of whether a model possesses active, human-like predictive processing—exceeding mere next-token completion found in standard language modeling. Robust performance in this task requires:

  • Multimodal social perception;
  • Inference of latent intentions and affective states;
  • Activation of top-down predictive priors to inform generative processes.

Formally, given the prior context $C = T_A$ (the interviewer's question text), visual frames $V$ capturing the interviewee's non-verbal signals, and the ground-truth response $u^* = T_R$, the model $f_\theta$ seeks to approximate

$$P(u^* \mid C, V) \approx f_\theta(C, V).$$

Training optimizes a joint objective,

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{joint}} + \lambda \cdot \mathcal{L}_{\mathrm{priming}},$$

where $\mathcal{L}_{\mathrm{joint}}$ is the autoregressive cross-entropy loss and $\mathcal{L}_{\mathrm{priming}} = \| \hat{p} - p^* \|_2^2$ aligns learned priming vectors with semantic/affective priors (Yang et al., 30 Jan 2026).
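The joint objective above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the tensor shapes, function names, and the default $\lambda$ are assumptions for the sketch.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Autoregressive token-level cross-entropy, L_joint.

    logits: (T, V) unnormalized scores; targets: (T,) token ids.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def priming_loss(p_hat, p_star):
    """L_priming = || p_hat - p* ||_2^2, aligning the learned priming vector."""
    return float(((p_hat - p_star) ** 2).sum())

def joint_loss(logits, targets, p_hat, p_star, lam=0.1):
    """L(theta) = L_joint + lambda * L_priming."""
    return cross_entropy(logits, targets) + lam * priming_loss(p_hat, p_star)
```

In a real training loop both terms would be computed on the same batch and backpropagated together; here they are shown as plain functions to make the weighting explicit.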

2. Benchmark Protocols and Experimental Design

SayNext-Bench encompasses four primary evaluation protocols, each designed to stress-test different generalization axes:

  • Subject-dependent: Train/test splits include shared speakers to evaluate within-individual generalization.
  • Subject-independent: Test subjects are unseen in training, stratified geographically across five continents, assessing cross-personal generalization.
  • Cross-scenario transfer: Zero-shot evaluation on IEMOCAP dataset spanning diverse conversational genres (workplace, family, romantic dialogue), probing domain transfer.
  • Scalability: Benchmarks from the smaller SayNext-PC2K (2,092 min, 5,432 turns, 72 subjects) to the large-scale SayNext-PC19K (20,766 min, 38,540 turns, 474 subjects), quantifying scaling effects.

For all protocols, the system is presented with the interviewer’s question text and a temporally synchronized video segment capturing the interviewee’s non-verbal pre-response behavior; the target is generation of the actual speaker response (Yang et al., 30 Jan 2026).
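Concretely, each evaluation item pairs the question text and a synchronized pre-response video window with the target response. A minimal sketch of such a record follows; the field names, paths, and example strings are illustrative, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SayNextExample:
    """One question-response turn as presented to the model."""
    question_text: str    # interviewer's question (context C = T_A)
    video_path: str       # synchronized pre-response video segment (V)
    video_start_s: float  # segment boundaries within the source video
    video_end_s: float
    target_response: str  # ground-truth speaker response (u* = T_R)

# Hypothetical example record for illustration only.
ex = SayNextExample(
    question_text="How did you handle the pressure in the tiebreak?",
    video_path="pc2k/match_0001/turn_07.mp4",
    video_start_s=12.4,
    video_end_s=18.9,
    target_response="Honestly, I just tried to stay in the moment.",
)
```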

3. The SayNext-PC Dataset and Multimodal Annotation

SayNext-PC underpins the benchmark and is derived from post-match tennis press-conference videos, offering rich, real-world multimodal data. Key construction attributes include:

  • Source: Grand Slam and major tournament interviews (2017–2024), inspired by iMiGUE.
  • Resolution and corpus size: PC2K at 1280×720 (2,092 minutes, 5,432 turns); PC19K adds 3,463 videos, 38,540 turns, at 640×360 or 1280×720 resolutions.
  • Modalities: text transcripts generated by Whisper (WER ≈ 4.11%); video segments capturing expressions, gestures, gaze, posture, and tone; and micro-body annotations from iMiGUE for high-granularity gesture and affective cues.

  • Segmentation: Speaker diarization isolates question–response pairs; manual transcript verification ensures accuracy.

This design supports fine-grained study of how multimodal cues inform anticipatory language generation (Yang et al., 30 Jan 2026).
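The diarization-based segmentation step can be sketched as pairing each interviewer segment with the interviewee segment that immediately follows it. This is a simplified illustration under assumed speaker labels; the actual pipeline's diarization output format and filtering rules are not specified here.

```python
def pair_turns(segments):
    """Pair each interviewer question with the following interviewee response.

    segments: chronological list of (speaker, text) tuples from diarization,
    with speaker assumed to be "interviewer" or "interviewee".
    Returns a list of (question, response) pairs.
    """
    pairs = []
    for (spk_a, txt_a), (spk_b, txt_b) in zip(segments, segments[1:]):
        if spk_a == "interviewer" and spk_b == "interviewee":
            pairs.append((txt_a, txt_b))
    return pairs
```

In the real pipeline each paired turn would additionally carry the timestamps used to cut the synchronized pre-response video window.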

4. Model Architectures and Baseline Evaluation

Evaluation includes MLLMs such as GPT-4o, Gemini 2.5-Flash, InternVL2-8B, VideoLLaMA3-7B, LLaVA-NeXT-Video-7B, InstructBLIP-7B, and Emotion-LLaMA-7B. Despite advances, zero-shot results reveal pronounced deficiencies:

  • Lexical Overlap: BLEU-4 < 1%, ROUGE-L < 15%
  • Semantic Similarity: BERTScore-F1 ≈ 0.40–0.55, Sentence-BERT ≈ 0.15–0.45
  • Emotion Consistency: Valence/Arousal ≈ 0.45–0.80

Documented limitations include:

  • Ignoring or underutilizing non-verbal cues,
  • Absence of mechanisms for instantiating top-down predictive priors,
  • Difficulty with subtle affective phenomena, e.g., sarcasm and humor,
  • Significant performance drops when generalizing to unseen speakers.

These results highlight a gap between next-token statistical fitting and the anticipatory, context-sensitive nature of real human interaction (Yang et al., 30 Jan 2026).

5. SayNext-Chat: Dual-Route Predictive Processing Model

SayNext-Chat introduces a cognitively inspired “dual-route prediction” architecture comprising:

A. Fast Route

  • Visual encoder (InternViT-300M) computes frame embeddings $h_v = E_V(V)$;
  • Text tokenizer/embedding produces $h_t = E_T(C)$;
  • Early fusion via an MLP: $h_{ft} = \mathrm{MLP}([h_v; h_t])$;
  • LoRA-tuned InternLM2.5-7B autoregressively generates responses incorporating $h_{ft}$.

B. Deep Route (Predictive Priors)

  • A non-verbal feature extractor derives an embedding $e_p$;
  • $e_p$ is projected to a priming vector $\hat{p} = W_p e_p + b_p$, with MSE supervision toward $p^*$;
  • A "priming token" $\mathrm{token}_p$ is embedded in the LLM's context, biasing generation toward anticipated semantic/affective themes.

Training jointly optimizes the language-modeling and priming objectives, with an adaptive weight $\lambda(t)$. Fine-tuning uses AdamW (LoRA rank 16, 2× A100 GPUs, learning rate $1 \times 10^{-4}$, cosine decay) (Yang et al., 30 Jan 2026).
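The forward computation of the two routes can be sketched in a few lines of NumPy. The dimensions and random weights below are illustrative assumptions, far smaller than the real InternViT-300M and InternLM2.5-7B components, and serve only to show how the fused embedding and the priming vector are formed.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_fusion(h_v, h_t, W1, W2):
    """Fast route: early fusion h_ft = MLP([h_v; h_t]) with one ReLU layer."""
    h = np.concatenate([h_v, h_t])
    return W2 @ np.maximum(W1 @ h, 0.0)

def priming_vector(e_p, W_p, b_p):
    """Deep route: project non-verbal embedding e_p to priming vector p_hat."""
    return W_p @ e_p + b_p

# Illustrative (assumed) dimensions.
d_v, d_t, d_h, d_m, d_e, d_p = 8, 8, 16, 12, 6, 4
h_v, h_t = rng.normal(size=d_v), rng.normal(size=d_t)
W1 = rng.normal(size=(d_h, d_v + d_t))
W2 = rng.normal(size=(d_m, d_h))
e_p = rng.normal(size=d_e)
W_p, b_p = rng.normal(size=(d_p, d_e)), np.zeros(d_p)

h_ft = mlp_fusion(h_v, h_t, W1, W2)    # conditions autoregressive decoding
p_hat = priming_vector(e_p, W_p, b_p)  # embedded as the priming token
```

In the full model, $h_{ft}$ enters the LLM as fused context and $\hat{p}$ is injected as the priming token; here both are just vectors whose shapes follow the equations above.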

6. Metrics and Quantitative Results

Performance is assessed along three dimensions using six metrics:

| Dimension | Metrics | Brief Description |
| --- | --- | --- |
| Lexical Overlap (LO) | BLEU-4, ROUGE-L | $n$-gram overlap; LCS-based sequence alignment |
| Semantic Similarity (SS) | BERTScore-F1, SBERT | Embedding-based phrase- and sentence-level similarity |
| Emotion Consistency (EC) | Valence, Arousal | NRC-VAD lexicon-based affective signal alignment |

Key Equations:

  • BLEU-4: modified $n$-gram precision $P_n = \dfrac{\sum_{g \in n\text{-grams}(\hat{u})} \min(\mathrm{count}_{\hat{u}}(g), \max_{r \in R} \mathrm{count}_r(g))}{\sum_{g \in n\text{-grams}(\hat{u})} \mathrm{count}_{\hat{u}}(g)}$, combined with the standard brevity penalty.
  • BERTScore-F1 measures token-level embedding cosine similarity.
  • Emotion consistency is computed as $\Delta_k = 1 - |S_k(r) - S_k(\hat{u})| - 0.8\,|C_k(r) - C_k(\hat{u})|$, with $k \in \{\text{Valence}, \text{Arousal}\}$.
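Both quantities can be sketched directly. This is a simplified illustration: the precision below is the single-reference modified $n$-gram precision, and the affect scores $S_k$, $C_k$ are passed in as plain numbers rather than looked up in the NRC-VAD lexicon.

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=4):
    """Clipped n-gram precision P_n for a single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / total

def emotion_consistency(s_ref, s_hyp, c_ref, c_hyp):
    """Delta_k = 1 - |S_k(r) - S_k(u_hat)| - 0.8 * |C_k(r) - C_k(u_hat)|."""
    return 1.0 - abs(s_ref - s_hyp) - 0.8 * abs(c_ref - c_hyp)
```

An identical candidate and reference give a precision of 1.0, and identical affect scores give $\Delta_k = 1$, matching the intended upper bounds of both metrics.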

Quantitatively, SayNext-Chat yields superior results. On the SayNext-PC2K subject-dependent split, its BLEU-4 (2.31%) and ROUGE-L (17.96%) exceed GPT-4o (1.08%, 14.62%) and InternVL2 (0.77%, 13.94%), and its BERTScore-F1 (0.5651) outperforms both (0.5489, 0.5468). In cross-scenario transfer (IEMOCAP), it reaches BLEU-4 = 5.44% (next best 0.91%) with enhanced affective alignment. Ablation studies confirm that learnable priming tokens drive +2–3 point gains in valence/arousal and generally improve both semantic and lexical metrics (Yang et al., 30 Jan 2026).

7. Findings, Implications, and Future Directions

SayNext-Bench establishes next-utterance prediction as a stringent, cognitively salient benchmark for human-like dialogue intelligence. Empirical results indicate:

  • Multimodal cues are essential for realistic response anticipation; text-only models are fundamentally limited.
  • Embedding “priming factors” as learnable tokens enables LLMs to pre-activate latent semantic and affective dimensions, enhancing both content and emotional congruency.
  • Passive statistical next-token modeling cannot replicate predictive processing and thus remains insufficient for human-centered AI interaction.

This suggests that cognitively inspired model design—specifically dual-route predictive processing—addresses key limitations highlighted by Moravec’s Paradox. A plausible implication is that further research should target pragmatic and stylistic expansions (including sarcasm and humor), multi-turn context modeling, and advanced cognitive evaluation techniques to foster genuinely empathetic and anticipatory AI partners (Yang et al., 30 Jan 2026).
