Best general method to retain variation during RL post-training
Determine an effective, general method for retaining variation (diversity) in generated images during reinforcement learning post-training of diffusion-based text-to-image models. Classifier-free guidance is known to reduce diversity, and Kullback–Leibler (KL) regularization toward the reference model is commonly used as a mitigation, but the best strategy for preserving variation remains an open question.
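To make the role of the KL term concrete, here is a minimal sketch of a KL-regularized objective of the kind commonly used in RL post-training. It assumes each denoising step is an isotropic Gaussian policy with a shared variance, so the per-step KL against the frozen reference model has a closed form; all function names and the numeric values are illustrative, not from the paper.

```python
import math

def gaussian_kl(mu_theta, mu_ref, sigma):
    """KL divergence between two isotropic Gaussians with shared variance.

    Under the assumption that each denoising step is Gaussian, the per-step
    KL(p_theta || p_ref) reduces to ||mu_theta - mu_ref||^2 / (2 * sigma^2).
    """
    sq = sum((a - b) ** 2 for a, b in zip(mu_theta, mu_ref))
    return sq / (2.0 * sigma ** 2)

def kl_regularized_objective(reward, step_kls, beta):
    """Reward minus a KL penalty summed over denoising steps.

    beta trades off reward maximization against staying close to the
    reference model; the penalty discourages the policy from collapsing
    onto a narrow set of high-reward samples, which is how KL
    regularization helps retain diversity.
    """
    return reward - beta * sum(step_kls)

# Hypothetical numbers for illustration only.
kls = [gaussian_kl([0.2, -0.1], [0.0, 0.0], sigma=1.0),
       gaussian_kl([0.1, 0.1], [0.0, 0.0], sigma=1.0)]
obj = kl_regularized_objective(reward=1.5, step_kls=kls, beta=0.1)
print(obj)
```

The open problem is precisely that this fixed-coefficient penalty is a blunt instrument: choosing `beta`, the divergence direction, and where in the denoising trajectory to apply it all affect how much variation survives, and no single recipe is known to work well in general.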
References
KL regularization is typically used to better retain variation in the results, but the best way to do this in general remains an open problem.
— Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models
(2603.12893 - McAllister et al., 13 Mar 2026) in Section 6, Discussion and Future Work