Explain the DR19 random-vs-sorted ordering performance reversal for IC-DQN and IC-IQL
Determine the underlying cause of the observation that IC-DQN (In-Context Twin Deep Q-Network) and IC-IQL (In-Context Implicit Q-Learning) achieve higher performance when trained on randomly ordered trajectories than on trajectories sorted by discounted return in the Dark Room 19x19 environment. The setting uses complete datasets with one learning history per target and no structured learning-history ordering. Additionally, ascertain why this phenomenon appears exclusively in the Dark Room 19x19 environment.
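To make the two data regimes concrete, the following is a minimal sketch of the orderings being compared: trajectories sorted by their discounted return versus a uniform random shuffle. The toy reward sequences, the discount factor, and all function names here are hypothetical illustrations, not taken from the paper's implementation.

```python
import random

def discounted_return(rewards, gamma=0.99):
    """Compute the discounted return G = sum_t gamma^t * r_t of one trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Toy trajectories represented only by their reward sequences (hypothetical data).
trajectories = [[0, 0, 1], [1, 0, 0], [0, 1, 1], [0, 0, 0]]

# Regime 1: sort trajectories by discounted return (ascending),
# mimicking a curriculum from low-return to high-return behavior.
sorted_trajs = sorted(trajectories, key=discounted_return)

# Regime 2: randomly ordered trajectories, with no return-based structure.
random_trajs = trajectories[:]
random.shuffle(random_trajs)
```

The paper's surprising finding is that in DR19 the second regime (random order) outperforms the first, even though return-sorted ordering is the structure one would expect an in-context learner to exploit.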
References
Surprisingly, on DR19, both DQN and IQL perform better with randomly ordered data than with sorted samples. We do not yet have a plausible explanation for this phenomenon or why it appears exclusively in the simpler DR19 environment.
— Yes, Q-learning Helps Offline In-Context RL
(arXiv:2502.17666, Tarasov et al., 24 Feb 2025), Subsection "No Learning Histories Structure" (Experimental Results)