
SayNext-PC: Multimodal Next-Utterance Prediction

Updated 6 February 2026
  • SayNext-PC is a multimodal dataset and personalized prediction system designed for next-utterance prediction by integrating synchronized text, video, and micro-expression data.
  • It leverages dyadic interactions from Grand Slam press conferences, offering multiple versions like SayNext-PC2K and SayNext-PC19K for scale and diverse contextual analysis.
  • The system employs both prompt-based and fine-tuning techniques such as LoRA and prefix tuning to optimize lexical, semantic, and affective metrics while ensuring user privacy.

SayNext-PC is a collection of multimodal datasets and personalization methodologies developed for next-utterance prediction in dialogue, with a particular focus on capturing individual- and context-specific word choice at the token level. It serves as the foundational dataset for the SayNext-Bench benchmark, enabling systematic evaluation of LLMs and multimodal LLMs (MLLMs) in predicting human dialogue responses by integrating both linguistic history and fine-grained nonverbal cues. SayNext-PC also refers to a personalized next-token prediction system, grounded in psychological traits and privacy-aware engineering practices, enabling the modeling of user-aligned linguistic behavior in both synthetic and real-world conversational scenarios (Ding et al., 16 Oct 2025, Yang et al., 30 Jan 2026).

1. Dataset Composition and Construction

SayNext-PC exists in several versions, notably SayNext-PC2K and SayNext-PC19K, designed around three construction criteria: (1) dyadic, one-to-one interactions, (2) synchronized text and video modalities, and (3) fixed camera view on the responder. These constraints ensure clean speaker attribution and alignment of micro-expressions with utterances.

Key statistics for the datasets include:

| Subset | # Videos | Duration (min) | # Turns | # Responders | Years |
|---|---|---|---|---|---|
| SayNext-PC2K | 359 | 2,092 | 5,432 | 72 | 2017–2019 |
| SayNext-PC19K | 3,463 | 20,766 | 38,540 | 474 | 2017–2024 |

Modalities are exhaustively annotated: video (ViT-300M features, frame rate 25 fps, standardized resolutions), text (Whisper ASR, manual segmentation, mean WER 4.11%), and micro-body-language labels (aligned using iMiGUE protocol, with features such as lip movements, gaze direction, hand gestures). Record-level metadata includes clip/video IDs, subject/nationality, textual transcriptions, start/end timestamps, emotion annotations (Valence, Arousal, Dominance), micro-expression arrays, and a 20-dimensional cognitive priming vector (for modeling latent emotional/cognitive state activation during response) (Yang et al., 30 Jan 2026).
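The record-level metadata described above can be organized as a simple schema; the following sketch uses illustrative field names (not the dataset's published schema) to show how one annotated turn might be represented:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SayNextRecord:
    """One annotated turn, mirroring the record-level metadata described above.
    Field names are illustrative, not the dataset's actual schema."""
    clip_id: str
    video_id: str
    subject: str
    nationality: str
    transcript: str
    start_s: float            # start timestamp (seconds)
    end_s: float              # end timestamp (seconds)
    valence: float            # emotion annotations (VAD)
    arousal: float
    dominance: float
    micro_expressions: List[str] = field(default_factory=list)  # iMiGUE-style labels
    priming: List[float] = field(default_factory=lambda: [0.0] * 20)  # 20-dim priming vector

rec = SayNextRecord(
    clip_id="c001", video_id="v017", subject="player_12", nationality="ESP",
    transcript="I just tried to stay aggressive on my serve.",
    start_s=12.4, end_s=16.8, valence=0.62, arousal=0.41, dominance=0.55,
    micro_expressions=["gaze_down", "lip_press"],
)
assert len(rec.priming) == 20  # the cognitive priming vector is 20-dimensional
```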

Data collection derives from large-scale post-match press conference videos from all four Grand Slam tennis tournaments, re-scraped and re-annotated for scale and continuous speaker focus.

2. Problem Formulation and Theoretical Foundations

SayNext-PC supports several task formulations:

  • Next-Utterance Prediction (Multimodal): Given a question utterance $T_A$ and a sequence of synchronized video frames $V$, predict the respondent's next utterance $T_E$. Formally, the system approximates $P(T_E \mid T_A, V)$ with a model $f_e(T_A, V)$ that outputs the prediction $\hat{T}_E$.
  • Personalized Next-Token Prediction (YNTP paradigm): For user $u$ with context $h = (x, y_{<t})$ (where $x$ is the NPC input and $y_{<t}$ are the user's previous output tokens), plus a user profile $U^u = (p^u, H^u)$ (MBTI type and dialogue history), the system predicts the distribution $p_\theta(w_t \mid h, U^u)$ over candidate next tokens. The global learning objective is minimization of the token-level cross-entropy:

$$L_{CE}(\theta) = -\sum_i \sum_{t=1}^{|y^u_i|} \log p_\theta\big(y^u_{i,t} \mid y^u_{i,<t}, x_i, U^u_i\big)$$

(Ding et al., 16 Oct 2025).
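The objective above can be sketched numerically: cross-entropy is the mean negative log-probability the model assigns to each target token, and perplexity is its exponential. The distributions below are toy values, not outputs of a real model:

```python
import math

def token_cross_entropy(pred_dists, target_ids):
    """Token-level cross-entropy L_CE: mean of -log p_theta(y_t | context)
    over the target tokens (toy distributions, not a trained model)."""
    return -sum(math.log(dist[t]) for dist, t in zip(pred_dists, target_ids)) / len(target_ids)

# A 3-token target with hypothetical next-token distributions over a 3-word vocab:
dists = [
    {0: 0.7, 1: 0.2, 2: 0.1},
    {0: 0.1, 1: 0.8, 2: 0.1},
    {0: 0.25, 1: 0.25, 2: 0.5},
]
targets = [0, 1, 2]          # the tokens the user actually produced
ce = token_cross_entropy(dists, targets)
ppl = math.exp(ce)           # perplexity is exp of the mean cross-entropy
```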

The data supports analyses of anticipatory processing, affective alignment, and style-personalized language modeling grounded in psychologically interpretable parameters.

3. Evaluation Metrics and Benchmarking

Empirical assessment incorporates lexical, semantic, and affective metrics:

  • Lexical Overlap: BLEU-4, ROUGE-L.
  • Semantic Similarity: BERTScore-F1, SBERT cosine similarity.
  • Emotion Consistency: NRC-VAD-derived metrics along Valence, Arousal, and Dominance dimensions.
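The lexical-overlap metrics above rest on clipped n-gram precision; the sketch below shows that core computation (it omits BLEU's brevity penalty and geometric mean over n-gram orders, so it is an illustration, not the full BLEU-4 metric):

```python
from collections import Counter

def ngram_precision(cand, ref, n):
    """Clipped n-gram precision, the core of BLEU: each candidate n-gram is
    credited at most as often as it appears in the reference."""
    c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
    total = sum(c_ngrams.values())
    return overlap / total if total else 0.0

ref = "i just tried to stay calm out there".split()   # ground-truth utterance
cand = "i tried to stay calm today".split()           # predicted utterance
p1 = ngram_precision(cand, ref, 1)  # unigram precision: 5 of 6 tokens match
p2 = ngram_precision(cand, ref, 2)  # bigram precision: 3 of 5 bigrams match
```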

For personalized next-token prediction, cross-entropy loss and perplexity are the primary metrics, sensitive to deviations from user-specific word-choice distributions. Macro-averaged English results for baseline and personalized adaptation methods are shown below:

| Method | Cross-Ent ↓ | Perplexity ↓ | Sentence Sim. (M2) ↑ | Style Sim. (M5) ↑ | History Sim. (M6) ↑ |
|---|---|---|---|---|---|
| Zero-Shot Prompting | 1.82 | 6.17 | 0.43 | 2.88 | 0.43 |
| Few-Shot Prompting | 1.65 | 5.21 | 0.46 | 3.28 | 0.45 |
| Few-Shot + MBTI | 1.58 | 4.85 | 0.44 | 3.32 | 0.42 |
| LoRA Adapter FT | 1.72 | 5.60 | 0.35 | 3.49 | 0.45 |

Few-shot prompting substantially reduces perplexity. MBTI priming in prompts further improves style alignment. LoRA-driven fine-tuning yields sustained user-specific style and history resemblance, though it incurs costs due to per-user adaptation overhead (Ding et al., 16 Oct 2025).
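As a consistency check, perplexity is the exponential of cross-entropy, and the reported column pairs above agree with this identity to within rounding:

```python
import math

# (cross-entropy, reported perplexity) pairs from the results table above.
rows = {
    "zero_shot": (1.82, 6.17),
    "few_shot": (1.65, 5.21),
    "few_shot_mbti": (1.58, 4.85),
    "lora": (1.72, 5.60),
}
for name, (ce, ppl) in rows.items():
    # exp(CE) should reproduce the reported perplexity up to rounding.
    assert abs(math.exp(ce) - ppl) < 0.05, name
```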

SayNext-Chat, a dual-route MLLM, achieves state-of-the-art on SayNext-PC2K—BLEU-4 = 2.31%, ROUGE-L = 17.96%, BERTScore = 0.565, and Valence EC = 0.801—exceeding strong baselines such as GPT-4o by a factor of 2–6× in word-level match and 0.02–0.04 absolute in semantic/emotion metrics (Yang et al., 30 Jan 2026).

4. Personalization and Adaptation Methodologies

SayNext-PC provides a rich testbed for both external (prompt-based) and internal (fine-tuning-based) adaptation approaches:

  • Prompt-Based (External):
    • Zero-shot: System prompt specifies desired user imitation.
    • Few-shot: Prompt includes $k$ (input, output) exemplars from the user.
    • MBTI Augmentation: Explicit MBTI summaries appended to prompt.
  • Fine-Tuning-Based (Internal):
    • LoRA (PEFT): Low-rank adapters are injected into transformer projections, $W = W_0 + AB$, where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$, $r \ll d$; only $A$ and $B$ are optimized.
    • Prefix Tuning: Trainable prefix tokens prepended to each attention layer.
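The LoRA decomposition above can be illustrated with tiny matrices in plain Python (toy dimensions for readability; a real layer would use $d$ in the hundreds and the recommended $r = 8$–$16$):

```python
def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lora_effective_weight(W0, A, B):
    """Effective projection W = W0 + A @ B with A (d x r), B (r x d), r << d.
    Only A and B are trained; W0 stays frozen."""
    AB = matmul(A, B)
    return [[W0[i][j] + AB[i][j] for j in range(len(W0[0]))]
            for i in range(len(W0))]

d, r = 4, 1  # toy dimensions
W0 = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base (identity)
A = [[0.5] for _ in range(d)]        # d x r adapter
B = [[0.1, 0.2, 0.3, 0.4]]           # r x d adapter
W = lora_effective_weight(W0, A, B)  # W[0][0] = 1 + 0.5 * 0.1 = 1.05

# The point of the low-rank form: train 2*d*r parameters instead of d*d.
full_params, lora_params = d * d, 2 * d * r
```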

Recommended hyperparameters include LoRA rank $r = 8$–$16$, prefix lengths of 20–50 tokens, a learning rate of $1 \times 10^{-4}$ for adapters/prefixes, batch sizes of 8–16, and 3–5 training epochs per user (Ding et al., 16 Oct 2025).
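These ranges can be collected into per-user adaptation configs; the dicts below are a plain-Python sketch (the keys are illustrative, not a specific library's API):

```python
# Per-user adaptation configs within the recommended ranges above.
lora_user_config = {
    "method": "lora",
    "rank": 8,              # recommended r = 8-16
    "learning_rate": 1e-4,  # for adapters/prefixes
    "batch_size": 8,        # recommended 8-16
    "epochs": 3,            # 3-5 per user
}
prefix_user_config = {
    "method": "prefix_tuning",
    "prefix_length": 20,    # recommended 20-50 tokens
    "learning_rate": 1e-4,
    "batch_size": 16,
    "epochs": 5,
}
assert 8 <= lora_user_config["rank"] <= 16
assert 20 <= prefix_user_config["prefix_length"] <= 50
```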

The dataset supports multi-lingual experiments (English, Chinese, Japanese) and ablation studies on user history priming, MBTI conditioning, and scalability from PC2K to PC19K.

5. Annotation Protocols and Data Access

Annotation utilizes a combination of automated tools and human-in-the-loop validation:

  • Transcription: Whisper ASR, manual WER verification (mean WER 4.11%).
  • Turn Segmentation: Audio diarization plus manual correction for high-precision speaker boundary assignment.
  • Micro-annotations: Adoption of iMiGUE-labeled micro-expressions; feature alignment to 25 fps video frames.
  • Priming vector extraction: LLMs (GPT-4.1) cluster cognitive/emotional states, with each response annotated by a cluster activation vector in $[-1, 1]^{20}$.
  • File formats: Metadata in CSV; video/text/annotation in standard formats (.mp4, .txt, .json, .npy). See https://saynext.github.io/ for download and documentation (Yang et al., 30 Jan 2026).
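Parsing a CSV metadata row of the kind described above takes only the standard library; the column names and values here are illustrative guesses, not the published schema:

```python
import csv
import io
import json

# A made-up metadata row in the CSV layout described above (column names
# are illustrative, not the dataset's actual schema).
csv_text = (
    "clip_id,video_id,transcript,start,end,valence,arousal,dominance,micro_json\n"
    'c001,v017,"I felt good on serve.",12.4,16.8,0.62,0.41,0.55,'
    '"[""gaze_down"", ""lip_press""]"\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
row = rows[0]
micro = json.loads(row["micro_json"])   # micro-expression labels as a JSON list
duration = float(row["end"]) - float(row["start"])
```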

Licensing is CC BY 4.0, enabling open research use.

6. Privacy and Engineering Considerations

Systems built on SayNext-PC, including the SayNext-PC next-token predictor itself, incorporate privacy-oriented practices:

  • User Adapter Storage: Adapters or personalized prompts are stored strictly on-device to minimize leakage risk.
  • Profile Token Security: MBTI/profile tokens are to be encrypted or hashed at rest.
  • Differential Privacy: Optional DP-SGD is recommended for adapter optimization, bounding per-user information leakage.
  • Data Split Discipline: Strict separation of train (first 80% of turns by user) and test (last 20%), mirroring realistic online adaptation constraints (Ding et al., 16 Oct 2025).
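The split discipline above is chronological rather than random, so test turns always postdate training turns, as in online adaptation. A minimal sketch:

```python
def chronological_split(turns, train_frac=0.8):
    """Per-user split described above: first 80% of a user's turns for
    training, last 20% for test, preserving temporal order."""
    cut = int(len(turns) * train_frac)
    return turns[:cut], turns[cut:]

# A toy user history of 10 ordered turns.
turns = [f"turn_{i}" for i in range(10)]
train, test = chronological_split(turns)
assert len(train) == 8 and len(test) == 2
assert train[-1] == "turn_7" and test[0] == "turn_8"  # no temporal leakage
```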

This design enables privacy-compliant deployment of personalized models in user-aligned dialogue applications.

7. Empirical Challenges and Future Insights

SayNext-PC experiments reveal persistent modeling challenges for LLMs and MLLMs:

  • Lexical overlap remains low (<20% BLEU), even for top-performing models, indicating the intrinsic difficulty of verbatim utterance anticipation.
  • Integration of nonverbal cues (gaze, gesture, micro-expressions) is essential but remains a limiting factor for model generalization and pragmatic nuance expression.
  • Dual-route, cognitively inspired architectures demonstrably improve emotion consistency (by ∼3%) and facilitate affective/perspective alignment.
  • Satisfactory generalization to unseen speakers or spontaneous scenarios remains an open challenge; pragmatic subtleties (sarcasm, metaphor) are often flattened to neutral predictions (Yang et al., 30 Jan 2026).

SayNext-PC thus defines a new research standard for anticipatory dialogue prediction systems, enabling advances toward context-sensitive and user-specialized AI interaction.

