Generalization of interaction awareness beyond English and short-horizon tasks

Determine whether interaction awareness, as measured by the genuine-followup rate in user-turn generation, generalizes to multilingual settings, code generation tasks, and longer-horizon multi-turn interactions beyond the English-only conversational domains evaluated in HealthBench and Coval.

Background

The paper introduces user-turn generation and a genuine-followup evaluation to probe whether LLMs encode interaction awareness. Held-out evaluations on HealthBench and Coval partially validate the approach, but both datasets are English-only and focus on relatively short conversational horizons.

The authors explicitly note that they did not test multilingual, code generation, or longer-horizon multi-turn contexts. Whether the observed interaction awareness, and the evaluation methodology itself, transfer to these settings remains an open question.
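
One way to frame the comparison is to compute the genuine-followup rate separately for each setting and compare the untested settings against the English conversational baseline. The sketch below is illustrative only: it assumes the rate is the fraction of generated user turns that a judge marks as genuine follow-ups, which the summary above does not spell out. The `JudgedTurn` record, the setting labels, and the `followup_rate_by_setting` helper are hypothetical names and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JudgedTurn:
    # Hypothetical setting label, e.g. "en-chat", "multilingual", "code", "long-horizon".
    setting: str
    # Verdict from an external judge (assumed); True if the generated user turn
    # was judged to be a genuine follow-up to the preceding assistant turn.
    is_genuine_followup: bool


def followup_rate_by_setting(turns):
    """Return {setting: fraction of turns judged genuine follow-ups}.

    Assumption: the genuine-followup rate is a simple proportion per setting.
    """
    counts = defaultdict(lambda: [0, 0])  # setting -> [genuine, total]
    for turn in turns:
        counts[turn.setting][1] += 1
        if turn.is_genuine_followup:
            counts[turn.setting][0] += 1
    return {setting: genuine / total for setting, (genuine, total) in counts.items()}
```

Given judged turns from several settings, a large gap between the rate for, say, a code generation setting and the English conversational baseline would suggest that the measured interaction awareness does not transfer cleanly. A real study would also need enough samples per setting to attach confidence intervals to each rate.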

References

Our held-out evaluations (HealthBench, Coval) partially address this, but both are English-only conversational domains; generalization to multilingual settings, code generation, or longer-horizon multi-turn interactions remains untested.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models (2604.02315 - Shekkizhar et al., 2 Apr 2026) in Discussion and Conclusion: Limitations