SayNext-PC: Multimodal Next-Utterance Prediction
- SayNext-PC is a multimodal dataset and personalized prediction system designed for next-utterance prediction by integrating synchronized text, video, and micro-expression data.
- It leverages dyadic interactions from Grand Slam press conferences, offering multiple versions like SayNext-PC2K and SayNext-PC19K for scale and diverse contextual analysis.
- The system employs both prompt-based and fine-tuning techniques such as LoRA and prefix tuning to optimize lexical, semantic, and affective metrics while ensuring user privacy.
SayNext-PC is a collection of multimodal datasets and personalization methodologies developed for next-utterance prediction in dialogue, with a particular focus on capturing individual- and context-specific word choice at the token level. It serves as the foundational dataset for the SayNext-Bench benchmark, enabling systematic evaluation of LLMs and multimodal LLMs (MLLMs) in predicting human dialogue responses by integrating both linguistic history and fine-grained nonverbal cues. SayNext-PC also refers to a personalized next-token prediction system, grounded in psychological traits and privacy-aware engineering practices, enabling the modeling of user-aligned linguistic behavior in both synthetic and real-world conversational scenarios (Ding et al., 16 Oct 2025; Yang et al., 30 Jan 2026).
1. Dataset Composition and Construction
SayNext-PC exists in several versions, notably SayNext-PC2K and SayNext-PC19K, designed around three construction criteria: (1) dyadic, one-to-one interactions, (2) synchronized text and video modalities, and (3) fixed camera view on the responder. These constraints ensure clean speaker attribution and alignment of micro-expressions with utterances.
Key statistics for the datasets include:
| Subset | # Videos | Duration (min) | # Turns | # Responders | Years |
|---|---|---|---|---|---|
| SayNext-PC2K | 359 | 2,092 | 5,432 | 72 | 2017–2019 |
| SayNext-PC19K | 3,463 | 20,766 | 38,540 | 474 | 2017–2024 |
Modalities are exhaustively annotated: video (ViT-300M features, frame rate 25 fps, standardized resolutions), text (Whisper ASR, manual segmentation, mean WER 4.11%), and micro-body-language labels (aligned using iMiGUE protocol, with features such as lip movements, gaze direction, hand gestures). Record-level metadata includes clip/video IDs, subject/nationality, textual transcriptions, start/end timestamps, emotion annotations (Valence, Arousal, Dominance), micro-expression arrays, and a 20-dimensional cognitive priming vector (for modeling latent emotional/cognitive state activation during response) (Yang et al., 30 Jan 2026).
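The record-level metadata described above can be pictured as a simple typed record. This is a hypothetical sketch: the field names below are illustrative and may not match those in the released files.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout mirroring the metadata fields described above;
# actual column/field names in the released CSV/JSON files may differ.
@dataclass
class SayNextRecord:
    clip_id: str
    video_id: str
    nationality: str
    transcript: str
    start_s: float                 # start timestamp (seconds)
    end_s: float                   # end timestamp (seconds)
    vad: List[float]               # emotion annotation: [Valence, Arousal, Dominance]
    micro_expressions: List[str]   # iMiGUE-style labels (lip movements, gaze, gestures)
    priming: List[float] = field(default_factory=lambda: [0.0] * 20)  # 20-dim cognitive priming vector

    def duration(self) -> float:
        """Clip duration derived from the aligned timestamps."""
        return self.end_s - self.start_s

rec = SayNextRecord("c001", "v001", "ESP", "It was a tough match.",
                    12.0, 15.5, [0.62, 0.41, 0.55], ["gaze_down", "lip_press"])
print(rec.duration())    # 3.5
print(len(rec.priming))  # 20
```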
Data collection derives from large-scale post-match press conference videos from all four Grand Slam tennis tournaments, re-scraped and re-annotated for scale and continuous speaker focus.
2. Problem Formulation and Theoretical Foundations
SayNext-PC supports several task formulations:
- Next-Utterance Prediction (Multimodal): Given a question utterance $Q$ and a sequence of synchronized video frames $V = (v_1, \dots, v_T)$, predict the respondent’s next utterance $R$. Formally, the system estimates $P(R \mid Q, V)$.
- Personalized Next-Token Prediction (YNTP paradigm): For user $u$ with context $(x, y_{<t})$ (where $x$ is the NPC input and $y_{<t}$ are the user’s previous output tokens), plus a user profile $p_u$ (MBTI type, dialogue history), the system predicts the distribution $P(y_t \mid x, y_{<t}, p_u)$ across candidate next tokens. The global learning objective is minimization of the token-level cross-entropy:

$$\mathcal{L}(u) = -\sum_{t} \log P\big(y_t \mid x, y_{<t}, p_u\big)$$
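The token-level cross-entropy objective can be sketched directly. The per-step probabilities below are illustrative stand-ins for the model distribution over gold tokens, not values from the paper.

```python
import math

# Minimal sketch of the token-level cross-entropy objective for personalized
# next-token prediction. `step_probs` stands in for the probability the model
# assigned to each gold token y_t; the values are illustrative.
def token_cross_entropy(step_probs):
    """Average negative log-likelihood of the gold token at each step."""
    return -sum(math.log(p) for p in step_probs) / len(step_probs)

# Probabilities assigned to the gold tokens of a hypothetical 4-token reply.
gold_token_probs = [0.40, 0.25, 0.10, 0.55]
ce = token_cross_entropy(gold_token_probs)
print(round(ce, 3))  # 1.301
```

Lower cross-entropy means the model concentrated more probability on the user's actual word choices, which is exactly what the personalization methods in Section 4 optimize.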
The data supports analyses of anticipatory processing, affective alignment, and style-personalized language modeling grounded in psychologically interpretable parameters.
3. Evaluation Metrics and Benchmarking
Empirical assessment incorporates lexical, semantic, and affective metrics:
- Lexical Overlap: BLEU-4, ROUGE-L.
- Semantic Similarity: BERTScore-F1, SBERT cosine similarity.
- Emotion Consistency: NRC-VAD-derived metrics along Valence, Arousal, and Dominance dimensions.
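One plausible way to score emotion consistency along the VAD dimensions is cosine similarity between the predicted and reference utterances' VAD vectors; the exact scoring function and the VAD values below are assumptions for illustration (in practice the vectors would come from NRC-VAD-style lexicon lookups).

```python
import math

# Illustrative VAD emotion-consistency score: cosine similarity between the
# (Valence, Arousal, Dominance) vectors of a predicted and reference utterance.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

pred_vad = [0.70, 0.45, 0.50]  # hypothetical predicted-utterance VAD
ref_vad = [0.65, 0.50, 0.55]   # hypothetical reference-utterance VAD
print(round(cosine(pred_vad, ref_vad), 3))  # 0.996
```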
For personalized next-token prediction, cross-entropy loss and perplexity are the primary metrics, being sensitive to deviation from user-specific word-choice distributions. Macro-averaged English results for baseline and personalized adaptation methods are shown below:
| Method | Cross-Ent ↓ | Perplexity ↓ | Sentence Sim. (M2) ↑ | Style Sim. (M5) ↑ | History Sim. (M6) ↑ |
|---|---|---|---|---|---|
| Zero-Shot Prompting | 1.82 | 6.17 | 0.43 | 2.88 | 0.43 |
| Few-Shot Prompting | 1.65 | 5.21 | 0.46 | 3.28 | 0.45 |
| Few-Shot+MBTI | 1.58 | 4.85 | 0.44 | 3.32 | 0.42 |
| LoRA Adapter FT | 1.72 | 5.60 | 0.35 | 3.49 | 0.45 |
Few-shot prompting substantially reduces perplexity. MBTI priming in prompts further improves style alignment. LoRA-driven fine-tuning yields sustained user-specific style and history resemblance, though it incurs costs due to per-user adaptation overhead (Ding et al., 16 Oct 2025).
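Perplexity is simply the exponential of the token-level cross-entropy, so the two columns of the table above can be cross-checked against each other; they agree to within rounding.

```python
import math

# Sanity-check that perplexity = exp(cross-entropy) for the reported results.
reported = {
    "Zero-Shot Prompting": (1.82, 6.17),
    "Few-Shot Prompting": (1.65, 5.21),
    "Few-Shot+MBTI": (1.58, 4.85),
    "LoRA Adapter FT": (1.72, 5.60),
}
for method, (ce, ppl) in reported.items():
    print(f"{method}: exp({ce}) = {math.exp(ce):.2f} (reported {ppl})")
```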
SayNext-Chat, a dual-route MLLM, achieves state-of-the-art results on SayNext-PC2K (BLEU-4 = 2.31%, ROUGE-L = 17.96%, BERTScore = 0.565, Valence EC = 0.801), exceeding strong baselines such as GPT-4o by a factor of 2–6× in word-level match and by 0.02–0.04 absolute in semantic/emotion metrics (Yang et al., 30 Jan 2026).
4. Personalization and Adaptation Methodologies
SayNext-PC provides a rich testbed for both external (prompt-based) and internal (fine-tuning-based) adaptation approaches:
- Prompt-Based (External):
- Zero-shot: System prompt specifies desired user imitation.
- Few-shot: Prompt includes (input, output) exemplars from the user.
- MBTI Augmentation: Explicit MBTI summaries appended to prompt.
- Fine-Tuning-Based (Internal):
- LoRA (PEFT): Low-rank adapters are injected into transformer projections, $W' = W + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$; only $A$ and $B$ are optimized.
- Prefix Tuning: Trainable prefix tokens prepended to each attention layer.
Recommended hyperparameters include a LoRA rank of up to $16$, prefix lengths of $20$–$50$ tokens, separate learning rates for adapters and prefixes, batch sizes of $8$–$16$, and 3–5 training epochs per user (Ding et al., 16 Oct 2025).
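The LoRA update can be sketched in a few lines of numpy: the frozen projection $W$ is augmented with a low-rank product $BA$, and zero-initializing $B$ makes the adapted layer start out identical to the base model. Dimensions and initialization scales below are illustrative.

```python
import numpy as np

# Minimal numpy sketch of a LoRA forward pass on one projection matrix:
# h = W x + B (A x), with only A and B trainable. Shapes are illustrative.
rng = np.random.default_rng(0)
d, k, r = 64, 64, 8                  # output dim, input dim, LoRA rank (r << min(d, k))

W = rng.normal(size=(d, k))          # frozen pretrained projection
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x):
    """Adapted projection: base output plus the low-rank update B(Ax)."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=(k,))
# With B initialized to zero, the adapted layer matches the frozen layer.
assert np.allclose(lora_forward(x), W @ x)
print(lora_forward(x).shape)  # (64,)
```

Storing only $A$ and $B$ per user (here $r(d + k) = 1024$ parameters versus $dk = 4096$ for the full matrix) is what keeps per-user adaptation lightweight, at the cost of a separate fine-tuning run per user.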
The dataset supports multi-lingual experiments (English, Chinese, Japanese) and ablation studies on user history priming, MBTI conditioning, and scalability from PC2K to PC19K.
5. Annotation Protocols and Data Access
Annotation utilizes a combination of automated tools and human-in-the-loop validation:
- Transcription: Whisper ASR, manual WER verification (mean WER 4.11%).
- Turn Segmentation: Audio diarization plus manual correction for high-precision speaker boundary assignment.
- Micro-annotations: Adoption of iMiGUE-labeled micro-expressions; feature alignment to 25 fps video frames.
- Priming vector extraction: LLMs (GPT-4.1) cluster cognitive/emotional states, with each response annotated by a 20-dimensional cluster-activation vector.
- File formats: Metadata in CSV; video/text/annotation in standard formats (.mp4, .txt, .json, .npy). See https://saynext.github.io/ for download and documentation (Yang et al., 30 Jan 2026).
Licensing is CC BY 4.0, enabling open research use.
6. Privacy and Engineering Considerations
SayNext-PC-derived systems, such as “SayNext-PC” as a next-token predictor, incorporate privacy-oriented practices:
- User Adapter Storage: Adapters or personalized prompts are stored strictly on-device to minimize leakage risk.
- Profile Token Security: MBTI/profile tokens are to be encrypted or hashed at rest.
- Differential Privacy: Optional DP-SGD is recommended for adapter optimization, bounding per-user information leakage.
- Data Split Discipline: Strict separation of train (first 80% of turns by user) and test (last 20%), mirroring realistic online adaptation constraints (Ding et al., 16 Oct 2025).
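The DP-SGD recommendation above boils down to two operations per step: clip each per-example gradient to a norm bound $C$, then add Gaussian noise scaled by $\sigma C$ before averaging. This is a bare-bones sketch with illustrative gradients and noise parameters, not a full DP accountant.

```python
import numpy as np

# Core DP-SGD step: per-example gradient clipping plus Gaussian noise,
# bounding each user's influence on the update. Values are illustrative.
def dp_sgd_step(per_example_grads, clip_norm=1.0, sigma=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Rescale so every example's gradient has norm at most clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        scale=sigma * clip_norm, size=clipped[0].shape)
    return noisy_sum / len(per_example_grads)

grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]  # norms 5.0 and 0.5
update = dp_sgd_step(grads)
print(update.shape)  # (2,)
```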
This design enables privacy-compliant deployment of personalized models in user-aligned dialogue applications.
7. Empirical Challenges and Future Insights
SayNext-PC experiments reveal persistent modeling challenges for LLMs and MLLMs:
- Lexical overlap remains low (under 20% ROUGE-L and under 3% BLEU-4), even for top-performing models, indicating the intrinsic difficulty of verbatim utterance anticipation.
- Integration of nonverbal cues (gaze, gesture, micro-expressions) is essential but remains a limiting factor for model generalization and pragmatic nuance expression.
- Dual-route, cognitively inspired architectures demonstrably improve emotion consistency (by ∼3%) and facilitate affective/perspective alignment.
- Satisfactory generalization to unseen speakers or spontaneous scenarios remains an open challenge; pragmatic subtleties (sarcasm, metaphor) are often flattened to neutral predictions (Yang et al., 30 Jan 2026).
SayNext-PC thus defines a new research standard for anticipatory dialogue prediction systems, enabling advances toward context-sensitive and user-specialized AI interaction.