Effect of observation history on longer-horizon tasks

Determine how incorporating observation history affects the performance of LLM-based web agents on tasks with horizons longer than 15 steps, and characterize how performance scales with history length in such settings.

Background

The reported experiments investigate observation history using accessibility-tree inputs on WorkArena L1 tasks, which are at most 15 steps long, and show that history generally improves success rates.

The authors explicitly state that the impact of observation history on longer-horizon tasks is unknown, leaving open how benefits scale as task length increases.
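One way to characterize how performance scales with history length is to group evaluation episodes by history length and by whether the task exceeds the 15-step horizon, then compare success rates across groups. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `success_by_history`, the record fields `history_len`, `num_steps`, and `success`, and the short/long bucketing are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def success_by_history(episodes, horizon_threshold=15):
    """Aggregate success rate per (history length, horizon bucket).

    `episodes` is an iterable of dicts with hypothetical keys:
      - 'history_len': number of past observations given to the agent
      - 'num_steps':   number of steps the task took
      - 'success':     True if the task was solved
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for ep in episodes:
        # Tasks longer than the threshold fall in the "long" bucket.
        bucket = "long" if ep["num_steps"] > horizon_threshold else "short"
        key = (ep["history_len"], bucket)
        counts[key][0] += int(ep["success"])
        counts[key][1] += 1
    return {key: s / n for key, (s, n) in counts.items()}
```

Comparing the "long" bucket across increasing values of `history_len` would indicate whether the benefit of history persists, saturates, or reverses beyond 15 steps.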

References

Furthermore, WorkArena L1 involves tasks of up to 15 steps, and the effect of observation history on longer-horizon tasks remains unknown.

Read More, Think More: Revisiting Observation Reduction for Web Agents (2604.01535 - Enomoto et al., 2 Apr 2026) in Limitation, Section: Observation history with richer representations