Effectively leveraging production user signals for RL training

Investigate methods to effectively leverage production user-behavior signals as differentiable rewards for reinforcement learning training of large language models deployed in social chat applications. These signals are modeled as binary classifiers that predict events such as conversation continuation and emoji reactions (e.g., p(continue), p(love), p(thumb up), p(thumb down), and p(feedback)), conditioned on the system prompt, character instructions, conversation history, and the current response.
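As a minimal sketch of how such classifier outputs might be combined into a scalar RL reward: the signal names below follow the problem statement, but the weighting scheme and the specific `SIGNAL_WEIGHTS` values are illustrative assumptions, not the paper's method.

```python
# Hypothetical weights for combining user-signal probabilities into one
# scalar reward; the values here are assumptions for illustration only.
SIGNAL_WEIGHTS = {
    "p_continue": 1.0,
    "p_love": 0.5,
    "p_thumb_up": 0.5,
    "p_thumb_down": -1.0,  # negative signal: penalize likely dislikes
    "p_feedback": 0.2,
}

def signal_reward(signal_probs: dict) -> float:
    """Combine per-signal classifier probabilities into a scalar reward.

    `signal_probs` maps signal names to classifier outputs in [0, 1],
    e.g. from binary heads conditioned on the system prompt, character
    instructions, conversation history, and the current response.
    Signals absent from the dict simply contribute nothing.
    """
    return sum(SIGNAL_WEIGHTS[name] * p for name, p in signal_probs.items())
```

For example, a response with p(continue) = 0.8 and p(thumb down) = 0.1 would score 1.0 * 0.8 - 1.0 * 0.1 = 0.7 under these assumed weights.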

Background

The paper trains a set of user signal models as binary classifiers to predict whether specific user behaviors occur after a model response (e.g., continuing the conversation, reacting with emojis, or providing feedback). These models use the same conversational context as the preference models and are initialized from LLaMA 3.1 checkpoints.

Although many user signals are explored, the authors ultimately found only the p(continue) and p(thumb up) models to be consistently reliable for rejection sampling data selection. They caution that user signals collected from organic interactions are inherently noisy and biased, and they observed failure modes when optimizing directly against such signals. This motivates the open problem of how to use these signals effectively in RL training.
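The rejection-sampling selection described above can be sketched as follows. This is a hypothetical illustration, assuming each candidate response has already been scored by the p(continue) and p(thumb up) classifiers; the threshold value and tie-breaking rule are assumptions, not details from the paper.

```python
from typing import Optional

def select_by_rejection_sampling(
    candidates: list[dict],
    min_continue: float = 0.5,  # assumed threshold, not from the paper
) -> Optional[dict]:
    """Select one candidate response using two user-signal scores.

    Each candidate is a dict with keys "text", "p_continue", and
    "p_thumb_up". Candidates whose p(continue) falls below the threshold
    are rejected; among the survivors, the one with the highest
    p(thumb up) is kept. Returns None if every candidate is rejected.
    """
    viable = [c for c in candidates if c["p_continue"] >= min_continue]
    if not viable:
        return None
    return max(viable, key=lambda c: c["p_thumb_up"])
```

In a data-selection loop, only the returned candidate (if any) would be added to the fine-tuning set, so noisy low-engagement responses are filtered out rather than directly optimized against.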

We note that effectively leveraging user signals for RL training remains an open research question, and further investigation is encouraged to unlock their full potential.

References

CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production  (2603.01973 - Nie et al., 2 Mar 2026) in Subsection “User Signal Models” within Section “Reward Models”