Handling Limited Context Windows in Approximate ICRL

Develop methods that enable the Approximate In-Context Reinforcement Learning (ICRL) algorithm to handle limited language-model context windows, supporting interaction histories that exceed the model's window and thereby allowing robust deployment over extended interactions without relying on unbounded context capacity.

Background

The paper introduces Explorative ICRL to address exploration deficiencies in Naive ICRL by stochastically constructing prompts and focusing on positive-reward episodes. While effective, Explorative ICRL is computationally expensive because each input requires a freshly constructed context, which limits the utility of caching.

To reduce compute, the authors propose Approximate ICRL, which maintains a fixed number of cached contexts that are stochastically expanded. This yields strong performance with reduced token processing but introduces a limitation: the approximation is not designed to handle contexts that extend beyond the model’s window length. This limitation becomes critical in long-running deployments where many interactions accumulate.
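To make the cached-context idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: a fixed pool of contexts, each independently and stochastically expanded with positive-reward episodes so that cached prefixes stay reusable. The class name, `keep_prob` parameter, and expansion rule are assumptions chosen for illustration.

```python
import random

class ApproximateICRLCache:
    """Illustrative sketch (not the paper's code) of Approximate ICRL:
    a fixed number of cached contexts, stochastically expanded with
    positive-reward episodes so LLM prefix caching remains useful."""

    def __init__(self, num_contexts=4, keep_prob=0.5, seed=0):
        self.contexts = [[] for _ in range(num_contexts)]  # cached episode lists
        self.keep_prob = keep_prob  # hypothetical per-context expansion probability
        self.rng = random.Random(seed)

    def add_episode(self, episode, reward):
        # Only positive-reward episodes are candidates for inclusion.
        if reward <= 0:
            return
        for ctx in self.contexts:
            # Stochastic expansion: each cached context independently
            # appends the new episode with probability keep_prob.
            if self.rng.random() < self.keep_prob:
                ctx.append(episode)

    def sample_context(self):
        # At inference time, act using one of the cached contexts.
        return self.rng.choice(self.contexts)
```

Because each context only ever grows by appending, cached key-value prefixes can be reused across calls; the same property, however, means the contexts grow without bound, which is exactly the limitation discussed below.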

The authors explicitly identify this limitation as an open question tied to computational resource usage, highlighting the need for methods that reconcile Approximate ICRL’s efficiency with practical constraints imposed by finite context windows.
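One naive direction for reconciling these constraints, offered purely as a hypothetical baseline and not as the paper's method, is to bound each cached context by a token budget, keeping only the most recent episodes that fit. The function below uses `len` as a stand-in tokenizer; a real system would count model tokens.

```python
def fit_to_window(episodes, max_tokens, count_tokens=len):
    """Hypothetical mitigation sketch (not from the paper): retain the most
    recent episodes whose total token count fits the model's context window.
    `count_tokens` is a stand-in tokenizer; here `len` counts characters."""
    kept, total = [], 0
    for ep in reversed(episodes):  # walk newest episode first
        cost = count_tokens(ep)
        if total + cost > max_tokens:
            break  # window full: drop this and all older episodes
        kept.append(ep)
        total += cost
    return list(reversed(kept))  # restore chronological order
```

Simple truncation like this discards old positive-reward episodes, so the open question is whether a bounded context can be maintained without losing the learning signal those episodes carry.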

References

Our work also leaves open questions regarding the use of computational resources. However, Approximate ICRL leaves open the problem of working with a limited context window, a critical problem for deploying these methods over extended periods with many interactions.

LLMs Are In-Context Bandit Reinforcement Learners  (2410.05362 - Monea et al., 2024) in Section 6 (Discussion and Limitations)