Generalization of interaction awareness beyond English and short-horizon tasks
Determine whether interaction awareness, as measured by the genuine-followup rate in user-turn generation, generalizes beyond the English-only conversational domains evaluated in HealthBench and Coval to three untested settings: multilingual interactions, code generation tasks, and longer-horizon multi-turn interactions.
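One way to frame such a study is to compute the genuine-followup rate separately for each language, task type, or conversation length and compare the slices. The sketch below is a minimal illustration, assuming the rate is the fraction of generated user turns that a judge labels as genuine follow-ups; the paper's exact judging procedure is not reproduced here. The function name `followup_rate_by_slice`, the record fields, and the sample records are all hypothetical placeholders, not data or code from the paper.

```python
from collections import defaultdict


def followup_rate_by_slice(records, key):
    """Return {slice_value: fraction of records labeled genuine follow-ups}.

    Assumes each record is a dict carrying a boolean "genuine_followup"
    label (hypothetical field name) plus the slicing field named by `key`.
    """
    counts = defaultdict(lambda: [0, 0])  # slice value -> [genuine, total]
    for rec in records:
        bucket = counts[rec[key]]
        bucket[0] += int(rec["genuine_followup"])
        bucket[1] += 1
    return {value: genuine / total for value, (genuine, total) in counts.items()}


# Placeholder records invented purely to show the shape of the input;
# they are not results from any evaluation.
records = [
    {"language": "en", "task": "chat", "genuine_followup": True},
    {"language": "en", "task": "chat", "genuine_followup": False},
    {"language": "de", "task": "code", "genuine_followup": True},
    {"language": "de", "task": "code", "genuine_followup": True},
]

by_language = followup_rate_by_slice(records, "language")
by_task = followup_rate_by_slice(records, "task")
```

Slicing by a single field keeps the comparison simple; a fuller study would likely also need per-slice sample sizes and confidence intervals before concluding that the rate does or does not generalize.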
References
Our held-out evaluations (HealthBench, Coval) partially address this, but both are English-only conversational domains; generalization to multilingual settings, code generation, or longer-horizon multi-turn interactions remains untested.
— Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
(2604.02315 - Shekkizhar et al., 2 Apr 2026) in Discussion and Conclusion — Limitations