Do maximum-entropy policy learning methods produce or require 'Sheldon' behaviors for generalization?

Determine whether maximum-entropy policy learning (e.g., DIAYN-style diversity objectives) can produce deterministic 'Sheldon' behaviors, in which an agent always goes to one fixed landmark in the Cooperative Navigation environment, and whether including such 'Sheldon' policies in the training partner distribution is necessary for agents trained with MADDPG to generalize robustly.
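
For concreteness, a 'Sheldon' behavior can be written down directly: a policy that ignores its partners and always heads to one fixed landmark. The sketch below is illustrative only, assuming a Cooperative Navigation observation layout of [self velocity (2), self position (2), relative landmark positions, ...] as in the multi-agent particle environments; the indices and function name are assumptions, not the paper's code.

```python
import numpy as np

def sheldon_policy(obs, landmark_idx=0, speed=1.0):
    """Deterministic policy that always moves toward one fixed landmark.

    Assumes the landmark's position relative to the agent sits at
    obs[4 + 2*landmark_idx : 6 + 2*landmark_idx] (an illustrative layout,
    roughly matching the multi-agent particle envs).
    """
    rel = obs[4 + 2 * landmark_idx : 6 + 2 * landmark_idx]
    dist = np.linalg.norm(rel)
    if dist < 1e-6:            # already at the landmark: stay put
        return np.zeros(2)
    return speed * rel / dist  # constant-speed step toward the landmark
```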

Background

To address co-adaptation and poor generalization, the authors consider diversifying partner behaviors during training. They cite maximum-entropy policy learning (e.g., Eysenbach et al., 2018) as a potential approach for generating diverse behaviors.

However, the authors explicitly note that it is unclear whether such methods would yield the extreme, deterministic 'Sheldon' policies used in their evaluations, or whether such policies are even necessary to drive generalization. This leaves two related uncertainties: whether maximum-entropy methods can produce such behaviors at all, and whether those behaviors are needed in the training mix.
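
To make the first uncertainty concrete: a DIAYN-style objective rewards a skill-conditioned policy for reaching states from which a learned discriminator can recover the skill, via the intrinsic reward log q_phi(z|s) - log p(z). A minimal sketch of that reward follows, with module names and sizes being illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SkillDiscriminator(nn.Module):
    """q_phi(z | s): predicts which skill produced a state (illustrative sizes)."""
    def __init__(self, state_dim, n_skills, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_skills),
        )

    def forward(self, state):
        return self.net(state)  # logits over skills

def diayn_intrinsic_reward(disc, state, skill, n_skills):
    """r = log q_phi(z | s) - log p(z), with p(z) uniform over skills."""
    log_q = torch.log_softmax(disc(state), dim=-1)[..., skill]
    log_p = -torch.log(torch.tensor(float(n_skills)))  # log(1 / n_skills)
    return log_q - log_p
```

Note that the discriminator only requires skills to be distinguishable from one another, not deterministic, which is precisely why it is unclear whether maximizing this objective would ever collapse a skill into an 'always go to landmark k' policy.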

References

One proposed approach is to learn maximum-entropy policies, as done in Eysenbach et al. (2018). However, it is not clear if that process would ever produce the 'Sheldon' policies used in our experiments, or if they are actually needed.

Do deep reinforcement learning agents model intentions? (arXiv:1805.06020, Matiisen et al., 2018) in Discussion — Generalization in multiagent setups