Explain the DR19 random-vs-sorted ordering performance reversal for IC-DQN and IC-IQL

Determine the underlying cause of the following observation: in the Dark Room 19x19 (DR19) environment, using complete datasets with one learning history per target and no structured learning-history ordering, IC-DQN (In-Context Twin Deep Q Network) and IC-IQL (In-Context Implicit Q-Learning) achieve higher performance when trained on randomly ordered trajectories than when trained on trajectories sorted by discounted return. Additionally, ascertain why this phenomenon appears exclusively in the Dark Room 19x19 environment.

Background

Algorithm Distillation relies on ordered, progressively improving learning histories, which may not be available in practice. The authors evaluated RL-based In-Context approaches (IC-DQN, IC-CQL, IC-IQL) when the trajectory data were either randomly shuffled (no inherent improvement order) or sorted by discounted return to impose a heuristic order.
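The two data-ordering conditions compared here can be made concrete with a small sketch. The code below is illustrative only and assumes a hypothetical representation of a trajectory as a list of per-step rewards; the function names (`discounted_return`, `order_trajectories`) and the discount factor are not from the paper.

```python
import random

def discounted_return(rewards, gamma=0.99):
    """Discounted return G = sum_t gamma^t * r_t for one trajectory,
    computed backwards for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def order_trajectories(trajectories, mode, gamma=0.99, seed=0):
    """Produce either of the two orderings under comparison:
    'random'  -- shuffle trajectories, so no improvement structure remains;
    'sorted'  -- ascending discounted return, a heuristic stand-in for a
                 progressively improving learning history."""
    if mode == "random":
        rng = random.Random(seed)
        out = list(trajectories)
        rng.shuffle(out)
        return out
    if mode == "sorted":
        return sorted(trajectories, key=lambda tr: discounted_return(tr, gamma))
    raise ValueError(f"unknown mode: {mode}")
```

The counterintuitive finding is that on DR19, training IC-DQN and IC-IQL on the `"random"` ordering outperforms the `"sorted"` one, whereas the heuristic sort would be expected to better approximate the ordered histories that Algorithm Distillation depends on.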

In Dark Room 19x19 with complete datasets and one history per target, the authors observed that IC-DQN and IC-IQL performed better with randomly ordered trajectories than with sorted-by-return samples—contrary to expectations and unlike other environments tested. They explicitly state that they currently lack an explanation for this effect and why it appears only in the simpler DR19 setting.

References

Surprisingly, on DR19, both DQN and IQL perform better with randomly ordered data than with sorted samples. We do not yet have a plausible explanation for this phenomenon or why it appears exclusively in the simpler DR19 environment.

Yes, Q-learning Helps Offline In-Context RL  (2502.17666 - Tarasov et al., 24 Feb 2025) in Subsection "No Learning Histories Structure" (Experimental Results)