Effectively leveraging production user signals for RL training
Investigate how to use production user-behavior signals as differentiable rewards for reinforcement learning (RL) training of large language models deployed in social chat applications. The signals are modeled as binary classifiers that predict events such as conversation continuation and emoji reactions, e.g., p(continue), p(love), p(thumb up), p(thumb down), and p(feedback), each conditioned on the system prompt, character instructions, conversation history, and the current response.
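To make the setup concrete, here is a minimal sketch of one naive way the per-event probabilities described above could be collapsed into a single scalar reward: a fixed weighted sum. The source does not prescribe any combination rule, so the function name `combine_user_signals`, the weights, and the example probabilities are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Mapping

# Hypothetical weights, for illustration only. The source does not specify
# how (or whether) the signals should be weighted against each other.
ILLUSTRATIVE_WEIGHTS: Dict[str, float] = {
    "continue": 1.0,
    "love": 0.5,
    "thumb_up": 0.5,
    "thumb_down": -1.0,  # negative signal, so it lowers the reward
    "feedback": 0.0,     # ambiguous polarity, ignored in this sketch
}


def combine_user_signals(
    probs: Mapping[str, float],
    weights: Mapping[str, float] = ILLUSTRATIVE_WEIGHTS,
) -> float:
    """Collapse per-event classifier probabilities into one scalar reward.

    This is a plain weighted sum, an assumed combination rule rather than
    the method used in the cited paper.
    """
    reward = 0.0
    for name, p in probs.items():
        if name not in weights:
            raise KeyError(f"no weight defined for signal {name!r}")
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {name!r} must lie in [0, 1]")
        reward += weights[name] * p
    return reward


# Made-up classifier outputs for a single candidate response.
example_probs = {
    "continue": 0.8,
    "love": 0.1,
    "thumb_up": 0.2,
    "thumb_down": 0.05,
    "feedback": 0.3,
}
example_reward = combine_user_signals(example_probs)
```

A weighted sum is differentiable with respect to each probability, but choosing the weights, and deciding whether a fixed linear combination is appropriate at all, is part of the open question stated here.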
References
We note that effectively leveraging user signals for RL training remains an open research question, and further investigation is encouraged to unlock their full potential.
— CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production
(2603.01973 - Nie et al., 2 Mar 2026) in Subsection “User Signal Models” within Section “Reward Models”