Logit Dynamics in Softmax Policy Gradient Methods

Published 15 Jun 2025 in cs.LG, cs.AI, and stat.ML | (2506.12912v1)

Abstract: We analyzes the logit dynamics of softmax policy gradient methods. We derive the exact formula for the L2 norm of the logit update vector: $$ |\Delta \mathbf{z}|_2 \propto \sqrt{1-2P_c + C(P)} $$ This equation demonstrates that update magnitudes are determined by the chosen action's probability ($P_c$) and the policy's collision probability ($C(P)$), a measure of concentration inversely related to entropy. Our analysis reveals an inherent self-regulation mechanism where learning vigor is automatically modulated by policy confidence, providing a foundational insight into the stability and convergence of these methods.