Behavior of Gemma-2-2b-IT on long multi-turn dialogues

Characterize the consistency behavior of Gemma-2-2b-IT over long multi-turn dialogues by measuring prompt-to-line consistency, line-to-line consistency, and Q&A consistency at extended conversation lengths, which the original study did not evaluate for this model because of token length constraints.
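
A minimal evaluation-harness sketch is given below, assuming the Hugging Face `transformers` checkpoint `google/gemma-2-2b-it` and `sentence-transformers` for scoring. The two metrics computed here are embedding-similarity stand-ins for the paper's actual scoring functions (Q&A consistency, which additionally requires post-dialogue probe questions, is omitted), and the persona prompt is a hypothetical placeholder.

```python
# Sketch of a long-dialogue consistency harness for Gemma-2-2b-IT.
# NOTE: the paper's exact metrics (prompt-to-line, line-to-line, Q&A
# consistency) are not reproduced here; each is approximated with cosine
# similarity over sentence embeddings, and the persona is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

MODEL_ID = "google/gemma-2-2b-it"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
lm = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative persona; the paper uses its own persona prompts.
PERSONA = "You are Alex, a 34-year-old vegetarian chef living in Lisbon."

def generate_reply(history, max_new_tokens=128):
    """Generate one assistant turn given the chat history so far."""
    ids = tok.apply_chat_template(
        history, add_generation_prompt=True, return_tensors="pt"
    ).to(lm.device)
    out = lm.generate(ids, max_new_tokens=max_new_tokens,
                      do_sample=True, temperature=0.7)
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

def cos(a, b):
    """Cosine similarity between the embeddings of two strings."""
    ea, eb = embedder.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(ea, eb).item()

def consistency_over_dialogue(user_turns):
    """Run a multi-turn dialogue and score proxy consistency metrics."""
    # Gemma-2 has no system role, so the persona rides in the first user turn.
    history = [{"role": "user", "content": PERSONA + "\n\n" + user_turns[0]}]
    lines = []
    for i, turn in enumerate(user_turns):
        if i > 0:
            history.append({"role": "user", "content": turn})
        reply = generate_reply(history)
        history.append({"role": "assistant", "content": reply})
        lines.append(reply)
    # Proxy prompt-to-line consistency: each line vs. the persona prompt.
    p2l = sum(cos(PERSONA, l) for l in lines) / len(lines)
    # Proxy line-to-line consistency: adjacent assistant lines vs. each other.
    l2l = sum(cos(a, b) for a, b in zip(lines, lines[1:])) / max(len(lines) - 1, 1)
    return {"prompt_to_line": p2l, "line_to_line": l2l, "lines": lines}
```

Sweeping the number of turns passed to `consistency_over_dialogue` and plotting each score against dialogue length would reproduce the shape of the paper's analysis for this model, under the proxy-metric assumption above.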

Background

The authors analyze how consistency varies with conversation length across several models, reporting results at very long context lengths for some of them. They observe model- and task-specific trends in how consistency degrades and in how well the metrics align with one another.

However, for Gemma-2-2b-IT, they were unable to run long-context experiments because of token length constraints, leaving its long-horizon consistency behavior unassessed.
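
One way to probe longer dialogues despite Gemma-2's 8192-token context window is to truncate the oldest turns while keeping the persona-bearing first turn; the sketch below takes that approach (the truncation strategy is our assumption, not the paper's method, and the pair-wise deletion preserves the strict user/assistant alternation Gemma-2's chat template requires).

```python
# Sketch: keep long dialogues within Gemma-2's 8192-token context window
# by dropping the oldest assistant/user pair (after the persona-bearing
# first turn) until the templated history fits.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
MAX_CONTEXT = 8192  # Gemma-2 context window

def n_tokens(history):
    """Token count of the chat-templated history plus generation prompt."""
    return len(tok.apply_chat_template(history, add_generation_prompt=True))

def truncate_history(history, max_new_tokens=128):
    """Trim middle turns, oldest first, until generation fits in context."""
    kept = list(history)
    while len(kept) > 3 and n_tokens(kept) + max_new_tokens > MAX_CONTEXT:
        # kept[0] is the persona turn; kept[1:3] is the oldest
        # assistant/user pair after it.
        del kept[1:3]
    return kept
```

Truncation of course changes what the model conditions on, so consistency numbers obtained this way measure a sliding-window variant of the task rather than true full-context behavior, and should be reported as such.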

References

Due to token length constraints, we were unable to experiment with long dialogue lengths for gemma-2-2b-it.

Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning (arXiv:2511.00222, Abdulhai et al., 31 Oct 2025), in Appendix: Results, Consistency over dialogue length before fine-tuning (in support of Q2)