Do maximum-entropy policy learning methods produce or require 'Sheldon' behaviors for generalization?
Determine whether maximum-entropy policy learning (e.g., with DIAYN-style diversity objectives) can produce deterministic 'Sheldon' behaviors, i.e., agents that always navigate to one fixed landmark in the Cooperative Navigation environment, and whether including such 'Sheldon' policies in the partner distribution is necessary for robust generalization of agents trained with MADDPG.
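For concreteness, a minimal sketch of the two objects the question contrasts: a hand-coded deterministic 'Sheldon' policy that always heads for one fixed landmark, and a DIAYN-style intrinsic reward, log q(z|s) - log p(z), whose maximum-entropy diversity objective does not obviously yield such single-landmark policies. All function names and the 2-D continuous-action format are hypothetical illustrations, not the paper's implementation; a uniform skill prior p(z) is assumed.

```python
import numpy as np

def sheldon_policy(agent_pos, landmark_pos):
    """Deterministic 'Sheldon' policy: always move toward one fixed landmark.

    Returns a unit-length 2-D velocity command (a hypothetical action
    format for Cooperative Navigation).
    """
    direction = np.asarray(landmark_pos, dtype=float) - np.asarray(agent_pos, dtype=float)
    norm = np.linalg.norm(direction)
    return direction / norm if norm > 1e-8 else np.zeros(2)

def diayn_intrinsic_reward(discriminator_logits, skill_id, num_skills):
    """DIAYN-style intrinsic reward: log q(z|s) - log p(z).

    discriminator_logits: logits of a learned skill discriminator q(z|s)
    evaluated at the current state; p(z) = 1/num_skills (uniform prior).
    """
    logits = np.asarray(discriminator_logits, dtype=float)
    log_q = logits - np.log(np.sum(np.exp(logits)))  # log-softmax over skills
    return log_q[skill_id] - np.log(1.0 / num_skills)
```

The open question is whether any skill z trained under the diversity objective collapses to something behaviorally equivalent to `sheldon_policy`, and whether MADDPG generalization actually requires such partners.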
References
One proposed approach is to learn maximum entropy policies as done in . However, it is not clear if that process would ever produce ``Sheldon'' policies used in our experiments, or if they are actually needed.
— Do deep reinforcement learning agents model intentions?
(1805.06020 - Matiisen et al., 2018) in Discussion — Generalization in multiagent setups