SayNext-PC: Multimodal Next-Utterance Prediction
- SayNext-PC is a multimodal dataset and personalized prediction system designed for next-utterance prediction by integrating synchronized text, video, and micro-expression data.
- It leverages dyadic interactions from Grand Slam press conferences, offering multiple versions like SayNext-PC2K and SayNext-PC19K for scale and diverse contextual analysis.
- The system employs both prompt-based and fine-tuning techniques such as LoRA and prefix tuning to optimize lexical, semantic, and affective metrics while ensuring user privacy.
SayNext-PC is a collection of multimodal datasets and personalization methodologies developed for next-utterance prediction in dialogue, with a particular focus on capturing individual- and context-specific word choice at the token level. It serves as the foundational dataset for the SayNext-Bench benchmark, enabling systematic evaluation of LLMs and multimodal LLMs (MLLMs) in predicting human dialogue responses by integrating both linguistic history and fine-grained nonverbal cues. SayNext-PC also refers to a personalized next-token prediction system, grounded in psychological traits and privacy-aware engineering practices, enabling the modeling of user-aligned linguistic behavior in both synthetic and real-world conversational scenarios (Ding et al., 16 Oct 2025; Yang et al., 30 Jan 2026).
1. Dataset Composition and Construction
SayNext-PC exists in several versions, notably SayNext-PC2K and SayNext-PC19K, designed around three construction criteria: (1) dyadic, one-to-one interactions, (2) synchronized text and video modalities, and (3) fixed camera view on the responder. These constraints ensure clean speaker attribution and alignment of micro-expressions with utterances.
Key statistics for the datasets include:
| Subset | # Videos | Duration (min) | # Turns | # Responders | Years |
|---|---|---|---|---|---|
| SayNext-PC2K | 359 | 2,092 | 5,432 | 72 | 2017–2019 |
| SayNext-PC19K | 3,463 | 20,766 | 38,540 | 474 | 2017–2024 |
Modalities are exhaustively annotated: video (ViT-300M features, frame rate 25 fps, standardized resolutions), text (Whisper ASR, manual segmentation, mean WER 4.11%), and micro-body-language labels (aligned using iMiGUE protocol, with features such as lip movements, gaze direction, hand gestures). Record-level metadata includes clip/video IDs, subject/nationality, textual transcriptions, start/end timestamps, emotion annotations (Valence, Arousal, Dominance), micro-expression arrays, and a 20-dimensional cognitive priming vector (for modeling latent emotional/cognitive state activation during response) (Yang et al., 30 Jan 2026).
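The record-level metadata described above can be pictured as a simple typed record. This is a hypothetical sketch: the field names below are illustrative and may not match those in the released files.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout mirroring the metadata fields described above;
# actual column/field names in the released CSV/JSON files may differ.
@dataclass
class SayNextRecord:
    clip_id: str
    video_id: str
    nationality: str
    transcript: str
    start_s: float                 # start timestamp (seconds)
    end_s: float                   # end timestamp (seconds)
    vad: List[float]               # emotion annotation: [Valence, Arousal, Dominance]
    micro_expressions: List[str]   # iMiGUE-style labels (lip movements, gaze, gestures)
    priming: List[float] = field(default_factory=lambda: [0.0] * 20)  # 20-dim cognitive priming vector

    def duration(self) -> float:
        """Clip duration derived from the aligned timestamps."""
        return self.end_s - self.start_s

rec = SayNextRecord("c001", "v001", "ESP", "It was a tough match.",
                    12.0, 15.5, [0.62, 0.41, 0.55], ["gaze_down", "lip_press"])
print(rec.duration())    # 3.5
print(len(rec.priming))  # 20
```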
Data collection derives from large-scale post-match press conference videos from all four Grand Slam tennis tournaments, re-scraped and re-annotated for scale and continuous speaker focus.
2. Problem Formulation and Theoretical Foundations
SayNext-PC supports several task formulations:
- Next-Utterance Prediction (Multimodal): Given a question utterance $Q$ and a sequence of synchronized video frames $V = (v_1, \dots, v_T)$, predict the respondent’s next utterance $R$. Formally, the system estimates $P(R \mid Q, V)$.
- Personalized Next-Token Prediction (YNTP paradigm): For user $u$ with context $(x, y_{<t})$ (where $x$ is the NPC input and $y_{<t}$ are the user’s previous output tokens), plus a user profile $p_u$ (MBTI type, dialogue history), the system predicts the distribution $P(y_t \mid x, y_{<t}, p_u)$ across candidate next tokens. The global learning objective is minimization of the token-level cross-entropy:

$$\mathcal{L}(u) = -\sum_{t} \log P\big(y_t \mid x, y_{<t}, p_u\big)$$
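The token-level cross-entropy objective can be sketched directly. The per-step probabilities below are illustrative stand-ins for the model distribution over gold tokens, not values from the paper.

```python
import math

# Minimal sketch of the token-level cross-entropy objective for personalized
# next-token prediction. `step_probs` stands in for the probability the model
# assigned to each gold token y_t; the values are illustrative.
def token_cross_entropy(step_probs):
    """Average negative log-likelihood of the gold token at each step."""
    return -sum(math.log(p) for p in step_probs) / len(step_probs)

# Probabilities assigned to the gold tokens of a hypothetical 4-token reply.
gold_token_probs = [0.40, 0.25, 0.10, 0.55]
ce = token_cross_entropy(gold_token_probs)
print(round(ce, 3))  # 1.301
```

Lower cross-entropy means the model concentrated more probability on the user's actual word choices, which is exactly what the personalization methods in Section 4 optimize.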
The data supports analyses of anticipatory processing, affective alignment, and style-personalized language modeling grounded in psychologically interpretable parameters.
3. Evaluation Metrics and Benchmarking
Empirical assessment incorporates lexical, semantic, and affective metrics:
- Lexical Overlap: BLEU-4, ROUGE-L.
- Semantic Similarity: BERTScore-F1, SBERT cosine similarity.
- Emotion Consistency: NRC-VAD-derived metrics along Valence, Arousal, and Dominance dimensions.
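One plausible way to score emotion consistency along the VAD dimensions is cosine similarity between the predicted and reference utterances' VAD vectors; the exact scoring function and the VAD values below are assumptions for illustration (in practice the vectors would come from NRC-VAD-style lexicon lookups).

```python
import math

# Illustrative VAD emotion-consistency score: cosine similarity between the
# (Valence, Arousal, Dominance) vectors of a predicted and reference utterance.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

pred_vad = [0.70, 0.45, 0.50]  # hypothetical predicted-utterance VAD
ref_vad = [0.65, 0.50, 0.55]   # hypothetical reference-utterance VAD
print(round(cosine(pred_vad, ref_vad), 3))  # 0.996
```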
For personalized next-token prediction, cross-entropy loss and perplexity are the primary metrics, being sensitive to deviation from user-specific word-choice distributions. Macro-averaged English results for baseline and personalized adaptation methods are shown below:
| Method | Cross-Ent ↓ | Perplexity ↓ | Sentence Sim. (M2) ↑ | Style Sim. (M5) ↑ | History Sim. (M6) ↑ |
|---|---|---|---|---|---|
| Zero-Shot Prompting | 1.82 | 6.17 | 0.43 | 2.88 | 0.43 |
| Few-Shot Prompting | 1.65 | 5.21 | 0.46 | 3.28 | 0.45 |
| Few-Shot+MBTI | 1.58 | 4.85 | 0.44 | 3.32 | 0.42 |
| LoRA Adapter FT | 1.72 | 5.60 | 0.35 | 3.49 | 0.45 |
Few-shot prompting substantially reduces perplexity. MBTI priming in prompts further improves style alignment. LoRA-driven fine-tuning yields sustained user-specific style and history resemblance, though it incurs costs due to per-user adaptation overhead (Ding et al., 16 Oct 2025).
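Perplexity is simply the exponential of the token-level cross-entropy, so the two columns of the table above can be cross-checked against each other; they agree to within rounding.

```python
import math

# Sanity-check that perplexity = exp(cross-entropy) for the reported results.
reported = {
    "Zero-Shot Prompting": (1.82, 6.17),
    "Few-Shot Prompting": (1.65, 5.21),
    "Few-Shot+MBTI": (1.58, 4.85),
    "LoRA Adapter FT": (1.72, 5.60),
}
for method, (ce, ppl) in reported.items():
    print(f"{method}: exp({ce}) = {math.exp(ce):.2f} (reported {ppl})")
```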
SayNext-Chat, a dual-route MLLM, achieves state-of-the-art results on SayNext-PC2K (BLEU-4 = 2.31%, ROUGE-L = 17.96%, BERTScore = 0.565, Valence EC = 0.801), exceeding strong baselines such as GPT-4o by a factor of 2–6× in word-level match and by 0.02–0.04 absolute in semantic/emotion metrics (Yang et al., 30 Jan 2026).
4. Personalization and Adaptation Methodologies
SayNext-PC provides a rich testbed for both external (prompt-based) and internal (fine-tuning-based) adaptation approaches:
- Prompt-Based (External):
- Zero-shot: System prompt specifies desired user imitation.
- Few-shot: Prompt includes (input, output) exemplars from the user.
- MBTI Augmentation: Explicit MBTI summaries appended to prompt.
- Fine-Tuning-Based (Internal):
- LoRA (PEFT): Low-rank adapters are injected into transformer projections, $W' = W + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$; only $A$ and $B$ are optimized.
- Prefix Tuning: Trainable prefix tokens prepended to each attention layer.
Recommended hyperparameters include a LoRA rank of up to $16$, prefix lengths of $20$–$50$ tokens, separate learning rates for adapters and prefixes, batch sizes of $8$–$16$, and 3–5 training epochs per user (Ding et al., 16 Oct 2025).
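The LoRA update can be sketched in a few lines of numpy: the frozen projection $W$ is augmented with a low-rank product $BA$, and zero-initializing $B$ makes the adapted layer start out identical to the base model. Dimensions and initialization scales below are illustrative.

```python
import numpy as np

# Minimal numpy sketch of a LoRA forward pass on one projection matrix:
# h = W x + B (A x), with only A and B trainable. Shapes are illustrative.
rng = np.random.default_rng(0)
d, k, r = 64, 64, 8                  # output dim, input dim, LoRA rank (r << min(d, k))

W = rng.normal(size=(d, k))          # frozen pretrained projection
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x):
    """Adapted projection: base output plus the low-rank update B(Ax)."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=(k,))
# With B initialized to zero, the adapted layer matches the frozen layer.
assert np.allclose(lora_forward(x), W @ x)
print(lora_forward(x).shape)  # (64,)
```

Storing only $A$ and $B$ per user (here $r(d + k) = 1024$ parameters versus $dk = 4096$ for the full matrix) is what keeps per-user adaptation lightweight, at the cost of a separate fine-tuning run per user.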
The dataset supports multi-lingual experiments (English, Chinese, Japanese) and ablation studies on user history priming, MBTI conditioning, and scalability from PC2K to PC19K.
5. Annotation Protocols and Data Access
Annotation utilizes a combination of automated tools and human-in-the-loop validation:
- Transcription: Whisper ASR, manual WER verification (mean WER 4.11%).
- Turn Segmentation: Audio diarization plus manual correction for high-precision speaker boundary assignment.
- Micro-annotations: Adoption of iMiGUE-labeled micro-expressions; feature alignment to 25 fps video frames.
- Priming vector extraction: LLMs (GPT-4.1) cluster cognitive/emotional states, with each response annotated by a 20-dimensional cluster-activation vector.
- File formats: Metadata in CSV; video/text/annotation in standard formats (.mp4, .txt, .json, .npy). See https://saynext.github.io/ for download and documentation (Yang et al., 30 Jan 2026).
Licensing is CC BY 4.0, enabling open research use.
6. Privacy and Engineering Considerations
SayNext-PC-derived systems, such as “SayNext-PC” as a next-token predictor, incorporate privacy-oriented practices:
- User Adapter Storage: Adapters or personalized prompts are stored strictly on-device to minimize leakage risk.
- Profile Token Security: MBTI/profile tokens are to be encrypted or hashed at rest.
- Differential Privacy: Optional DP-SGD is recommended for adapter optimization, bounding per-user information leakage.
- Data Split Discipline: Strict separation of train (first 80% of turns by user) and test (last 20%), mirroring realistic online adaptation constraints (Ding et al., 16 Oct 2025).
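The DP-SGD recommendation above boils down to two operations per step: clip each per-example gradient to a norm bound $C$, then add Gaussian noise scaled by $\sigma C$ before averaging. This is a bare-bones sketch with illustrative gradients and noise parameters, not a full DP accountant.

```python
import numpy as np

# Core DP-SGD step: per-example gradient clipping plus Gaussian noise,
# bounding each user's influence on the update. Values are illustrative.
def dp_sgd_step(per_example_grads, clip_norm=1.0, sigma=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Rescale so every example's gradient has norm at most clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        scale=sigma * clip_norm, size=clipped[0].shape)
    return noisy_sum / len(per_example_grads)

grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]  # norms 5.0 and 0.5
update = dp_sgd_step(grads)
print(update.shape)  # (2,)
```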
This design enables privacy-compliant deployment of personalized models in user-aligned dialogue applications.
7. Empirical Challenges and Future Insights
SayNext-PC experiments reveal persistent modeling challenges for LLMs and MLLMs:
- Lexical overlap remains low (under 20% ROUGE-L and under 3% BLEU-4), even for top-performing models, indicating the intrinsic difficulty of verbatim utterance anticipation.
- Integration of nonverbal cues (gaze, gesture, micro-expressions) is essential but remains a limiting factor for model generalization and pragmatic nuance expression.
- Dual-route, cognitively inspired architectures demonstrably improve emotion consistency (by ∼3%) and facilitate affective/perspective alignment.
- Satisfactory generalization to unseen speakers or spontaneous scenarios remains an open challenge; pragmatic subtleties (sarcasm, metaphor) are often flattened to neutral predictions (Yang et al., 30 Jan 2026).
SayNext-PC thus defines a new research standard for anticipatory dialogue prediction systems, enabling advances toward context-sensitive and user-specialized AI interaction.