SayNext-Bench: Multimodal Utterance Prediction
- SayNext-Bench is a multimodal benchmark that evaluates next-utterance prediction by integrating text, gestures, facial expressions, and prosodic tones.
- It introduces novel datasets, protocols, and a dual-route predictive processing architecture to rigorously assess multimodal AI dialogue performance.
- Baseline evaluations show that cognitive-inspired priming tokens improve semantic, lexical, and emotional alignment compared to traditional models.
SayNext-Bench is a multimodal benchmark specifically designed to evaluate the ability of LLMs and multimodal LLMs (MLLMs) to perform next-utterance prediction in human dialogue. Unlike traditional next-token prediction tasks, next-utterance prediction is intended as a more human-like test, requiring integration of both verbal and non-verbal cues (gesture, gaze, facial expression, prosodic tone) to anticipate a speaker’s immediate response in real-world conversational settings. The benchmark introduces novel datasets, protocols, evaluation metrics, and a cognitively inspired model architecture to systematically probe the predictive processing capabilities of contemporary multimodal AI systems (Yang et al., 30 Jan 2026).
1. Theoretical Motivation and Task Formulation
Human conversation is fundamentally predictive in nature: listeners routinely anticipate a speaker’s next utterance not only from prior lexical content but also from multimodal, affective, and contextual cues. SayNext-Bench explicitly frames next-utterance prediction as a test of whether a model possesses active, human-like predictive processing—exceeding mere next-token completion found in standard language modeling. Robust performance in this task requires:
(i) Multimodal social perception, (ii) Inference of latent intentions and affective states, (iii) Activation of top-down predictive priors to inform generative processes.
Formally, given the prior context $T$ (the interviewer's text), visual frames $V = \{v_1, \dots, v_K\}$ capturing non-verbal signals, and the ground-truth response $R = (r_1, \dots, r_n)$, the model seeks to approximate

$$p_\theta(R \mid T, V) = \prod_{t=1}^{n} p_\theta(r_t \mid r_{<t}, T, V).$$

Training optimizes a joint objective,

$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda\,\mathcal{L}_{\mathrm{prime}},$$

where $\mathcal{L}_{\mathrm{LM}}$ is the autoregressive cross-entropy and $\mathcal{L}_{\mathrm{prime}}$ aligns learned priming vectors with semantic/affective priors (Yang et al., 30 Jan 2026).
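The joint objective can be sketched in plain Python; the function names, the fixed `lam` weight (the paper's $\lambda$ is adaptive), and the use of per-token probabilities as input are illustrative assumptions, not the paper's implementation:

```python
import math

def lm_cross_entropy(token_probs):
    """Autoregressive cross-entropy: mean negative log-likelihood of the
    ground-truth response tokens under the model (toy per-token probabilities)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def priming_mse(pred_vec, prior_vec):
    """MSE between the learned priming vector and the semantic/affective prior target."""
    return sum((p - q) ** 2 for p, q in zip(pred_vec, prior_vec)) / len(pred_vec)

def joint_loss(token_probs, pred_vec, prior_vec, lam=0.1):
    """L = L_LM + lambda * L_prime; `lam` is a fixed stand-in for the adaptive weight."""
    return lm_cross_entropy(token_probs) + lam * priming_mse(pred_vec, prior_vec)
```

With perfect token probabilities and a priming vector that matches its prior, the loss is zero; any mismatch in either term raises it.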
2. Benchmark Protocols and Experimental Design
SayNext-Bench encompasses four primary evaluation protocols, each designed to stress-test different generalization axes:
- Subject-dependent: Train/test splits include shared speakers to evaluate within-individual generalization.
- Subject-independent: Test subjects are unseen in training, stratified geographically across five continents, assessing cross-personal generalization.
- Cross-scenario transfer: Zero-shot evaluation on the IEMOCAP dataset, spanning diverse conversational genres (workplace, family, romantic dialogue), probing domain transfer.
- Scalability: Benchmarks from the smaller SayNext-PC2K (2,092 min, 5,432 turns, 72 subjects) to the large-scale SayNext-PC19K (20,766 min, 38,540 turns, 474 subjects), quantifying scaling effects.
For all protocols, the system is presented with the interviewer’s question text and a temporally synchronized video segment capturing the interviewee’s non-verbal pre-response behavior; the target is generation of the actual speaker response (Yang et al., 30 Jan 2026).
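The per-turn input/output structure described above can be sketched as a minimal record type; the field names and the split-validity check are illustrative assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass

@dataclass
class SayNextExample:
    question_text: str   # interviewer's question (Whisper transcript)
    video_segment: str   # path to the synchronized pre-response video clip
    response_text: str   # ground-truth interviewee utterance (generation target)
    subject_id: str      # speaker identity, used to construct the splits

def is_subject_independent(train_set, test_set):
    """A split is subject-independent iff no speaker appears on both sides."""
    return {ex.subject_id for ex in train_set}.isdisjoint(
           {ex.subject_id for ex in test_set})
```

Under the subject-dependent protocol the check above would return `False` (speakers are shared); under the subject-independent protocol it must return `True`.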
3. The SayNext-PC Dataset and Multimodal Annotation
SayNext-PC underpins the benchmark and is derived from post-match tennis press-conference videos, offering rich, real-world multimodal data. Key construction attributes include:
- Source: Grand Slam and major tournament interviews (2017–2024), inspired by iMiGUE.
- Resolution and corpus size: PC2K at 1280×720 (2,092 minutes, 5,432 turns); PC19K adds 3,463 videos, 38,540 turns, at 640×360 or 1280×720 resolutions.
- Modalities:
  - Text transcripts generated by Whisper (WER ≈ 4.11%),
  - Video segments capturing expressions, gestures, gaze, posture, and tone,
  - Micro-body annotations from iMiGUE for high-granularity gesture and affective cues.
- Segmentation: Speaker diarization isolates question–response pairs; manual transcript verification ensures accuracy.
This design supports fine-grained study of how multimodal cues inform anticipatory language generation (Yang et al., 30 Jan 2026).
4. Model Architectures and Baseline Evaluation
Evaluation includes MLLMs such as GPT-4o, Gemini 2.5-Flash, InternVL2-8B, VideoLLaMA3-7B, LLaVA-NeXT-Video-7B, InstructBLIP-7B, and Emotion-LLaMA-7B. Despite advances, zero-shot results reveal pronounced deficiencies:
- Lexical Overlap: BLEU-4 and ROUGE-L remain low (e.g., GPT-4o reaches only 1.08% BLEU-4 and 14.62% ROUGE-L on the subject-dependent split),
- Semantic Similarity: BERTScore-F1 at most ≈$0.55$, Sentence-BERT at most ≈$0.45$,
- Emotion Consistency: Valence/Arousal alignment at most ≈$0.80$.
Documented limitations include:
- Ignoring or underutilizing non-verbal cues,
- Absence of mechanisms for instantiating top-down predictive priors,
- Difficulty with subtle affective phenomena, e.g., sarcasm and humor,
- Significant performance drops when generalizing to unseen speakers.
These results highlight a gap between next-token statistical fitting and the anticipatory, context-sensitive nature of real human interaction (Yang et al., 30 Jan 2026).
5. SayNext-Chat: Dual-Route Predictive Processing Model
SayNext-Chat introduces a cognitively inspired “dual-route prediction” architecture comprising:
A. Fast Route
- Visual encoder (InternViT-300M) computes frame embeddings $f_v$;
- A text tokenizer/embedding layer produces token embeddings $f_t$;
- Early fusion via an MLP: $h = \mathrm{MLP}([f_v; f_t])$;
- A LoRA-tuned InternLM2.5-7B autoregressively generates responses conditioned on the fused representation $h$.
B. Deep Route (Predictive Priors)
- A non-verbal feature extractor derives an embedding $z$ from the visual stream;
- $z$ is projected to a priming vector $p$, with MSE supervision aligning $p$ to semantic/affective prior targets;
- The "priming token" $p$ is embedded in the LLM's context, biasing generation toward the anticipated semantic/affective themes.
Training jointly optimizes the language-modeling and priming objectives with an adaptive weight $\lambda$. Fine-tuning uses AdamW (LoRA rank 16, A100 GPUs, cosine-decayed learning rate) (Yang et al., 30 Jan 2026).
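The dual-route forward pass can be sketched with toy linear projections standing in for InternViT-300M, the text embedding, and the priming projector; all function names, matrix shapes, and the list-based "context" are hypothetical simplifications:

```python
def mlp_fuse(visual_emb, text_emb, W):
    """Fast route: early fusion -- concatenate frame and token embeddings,
    then project with a linear layer standing in for the MLP."""
    x = visual_emb + text_emb  # list concatenation = embedding concat
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def priming_vector(nonverbal_emb, P):
    """Deep route: project non-verbal features to a priming vector
    (MSE-supervised toward semantic/affective priors during training)."""
    return [sum(p * zi for p, zi in zip(row, nonverbal_emb)) for row in P]

def build_context(fused, prime):
    """Prepend the priming token so it biases generation before decoding starts."""
    return [prime, fused]
```

With identity-like projection matrices the routes pass their inputs through unchanged, which makes the data flow easy to trace.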
6. Metrics and Quantitative Results
Performance is assessed along three dimensions using six metrics:
| Dimension | Metric | Brief Description |
|---|---|---|
| Lexical Overlap (LO) | BLEU-4, ROUGE-L | $n$-gram overlap, LCS-based sequence alignment |
| Semantic Similarity (SS) | BERTScore-F1, SBERT | Embedding-based phrase and sentence-level similarity |
| Emotion Consistency (EC) | Valence, Arousal | NRC-VAD lexicon-based affective signal alignment |
Key Equations:
- BLEU-4: $\mathrm{BLEU\text{-}4} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right)$, where $\mathrm{BP}$ is the brevity penalty and $p_n$ the modified $n$-gram precisions.
- BERTScore-F1 measures token-level embedding cosine similarity.
- Emotion consistency computed as $\mathrm{EC}_d = 1 - |d(\hat{R}) - d(R)|$, with $d \in \{\text{Valence}, \text{Arousal}\}$ scored via the NRC-VAD lexicon.
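A minimal emotion-consistency sketch, assuming the metric is the complement of the absolute valence/arousal difference between prediction and reference; the lexicon values below are invented for illustration and are not actual NRC-VAD entries:

```python
# Toy NRC-VAD-style lexicon: word -> (valence, arousal); values are illustrative.
VAD = {"great": (0.9, 0.7), "sad": (0.2, 0.3), "match": (0.5, 0.4)}

def vad_score(text, dim):
    """Average valence (dim=0) or arousal (dim=1) over lexicon-covered words."""
    vals = [VAD[w][dim] for w in text.lower().split() if w in VAD]
    return sum(vals) / len(vals) if vals else 0.5  # neutral fallback

def emotion_consistency(pred, ref, dim):
    """EC_d = 1 - |d(pred) - d(ref)|, d in {valence, arousal}."""
    return 1.0 - abs(vad_score(pred, dim) - vad_score(ref, dim))
```

Identical texts score 1.0; affectively divergent texts score lower, which is the behavior the EC dimension is meant to capture.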
Quantitatively, SayNext-Chat yields superior results. On the SayNext-PC2K subject-dependent split, its BLEU-4 (2.31%) and ROUGE-L (17.96%) exceed GPT-4o (1.08%, 14.62%) and InternVL2 (0.77%, 13.94%), and its BERTScore-F1 (0.5651) outperforms both (0.5489, 0.5468). Cross-scenario evaluation on IEMOCAP shows BLEU-4 = 5.44% (next best: 0.91%) and enhanced affective alignment. Ablation studies confirm that learnable priming tokens drive gains of up to $3$ points in valence/arousal and generally improve both semantic and lexical metrics (Yang et al., 30 Jan 2026).
7. Findings, Implications, and Future Directions
SayNext-Bench establishes next-utterance prediction as a stringent, cognitively salient benchmark for human-like dialogue intelligence. Empirical results indicate:
- Multimodal cues are essential for realistic response anticipation; text-only models are fundamentally limited.
- Embedding “priming factors” as learnable tokens enables LLMs to pre-activate latent semantic and affective dimensions, enhancing both content and emotional congruency.
- Passive statistical next-token modeling cannot replicate predictive processing and thus remains insufficient for human-centered AI interaction.
This suggests that cognitively inspired model design—specifically dual-route predictive processing—addresses key limitations highlighted by Moravec’s Paradox. A plausible implication is that further research should target pragmatic and stylistic expansions (including sarcasm and humor), multi-turn context modeling, and advanced cognitive evaluation techniques to foster genuinely empathetic and anticipatory AI partners (Yang et al., 30 Jan 2026).