XMiniGrid-Stateful Benchmark
- XMiniGrid-Stateful is a procedurally generated, text-based grid environment whose layout is sampled once and then held fixed across episodes, enabling cross-episode memory utilization.
- The benchmark employs hindsight trajectory rewriting, converting failures into alternative subgoals to enhance sample efficiency and reward maximization.
- Empirical evaluations show that techniques like memory compression and counterfactual analysis boost performance by up to 80% over static one-shot approaches.
The XMiniGrid-Stateful benchmark is a procedurally generated, partially-observable text-based navigation and planning environment designed to assess the capacity of language-model (LM) agents for sample-efficient online learning via experience accumulation and hindsight trajectory rewriting. Unlike traditional single-episode paradigms, XMiniGrid-Stateful systematically exposes agents to the same underlying environment across multiple episodes, enabling investigation of how agents can leverage cross-episode memory for improved adaptation and reward maximization. The benchmark forms a core component of the empirical evaluation in "Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting," which introduces the ECHO framework and demonstrates its advances over established baselines through rigorous experimental protocols (Hu et al., 11 Oct 2025).
1. Environment Design and Statefulness
The benchmark's underlying environment is a procedurally generated 2D grid world ("MiniGrid") composed of four connected rooms, each separated by walls and doors. Each room contains exactly one unique “pickup-able” object, for a total of four per environment. The agent's spawn point and the locations of all objects are randomly sampled once and then held fixed for the duration of trials in a given environment, facilitating cross-episode knowledge integration.
Observations at each timestep consist of an egocentric text description of immediate surroundings—e.g., spatial relations to walls, doors, and objects—augmented by explicit valid/invalid action lists for the current state. The discrete action space includes movement (GoForward, TurnLeft, TurnRight) and environment interaction primitives (PickUp, Drop, Toggle for doors). Each executed action deterministically updates the underlying grid state according to a transition function $T$, $s_{t+1} = T(s_t, a_t)$, with the agent perceiving only a partial text-based projection $o_t = O(s_t)$.
Standard XMiniGrid re-samples both environment and goal for each episode, so agent memory cannot compound. In contrast, XMiniGrid-Stateful maintains a constant environment layout and fixed spawn/object positions across episodes. After each episode, the agent retains a “scratchpad” of past interactions—such as hints, workflows, or rewritten trajectories—within its prompt context. Only the goal instruction is re-sampled and the agent's state is reset to the initial configuration for the new episode.
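The stateful evaluation loop described above can be sketched as follows. The `Env`/`Agent` interfaces and the scratchpad representation are illustrative assumptions; only the reset semantics (frozen layout seed, re-sampled goal, memory carried across episodes) come from the benchmark description.

```python
# Sketch of a stateful multi-episode trial: the layout seed is sampled once
# and frozen, while a scratchpad of past interactions survives episode resets.
# Env and Agent interfaces here are hypothetical, not the benchmark's API.
from dataclasses import dataclass, field

@dataclass
class StatefulTrial:
    """One XMiniGrid-Stateful trial: layout fixed, goals re-sampled."""
    layout_seed: int                                 # sampled once, then frozen
    scratchpad: list = field(default_factory=list)   # survives episode resets

    def run_episode(self, env, agent, goal, horizon=64):
        env.reset(seed=self.layout_seed)  # same grid, same spawn/object layout
        obs = env.observe()
        for _ in range(horizon):
            # The agent conditions on the goal, the current observation,
            # and its cross-episode memory (hints, workflows, rewrites).
            action = agent.act(goal, obs, memory=self.scratchpad)
            obs, done = env.step(action)
            if done:
                self.scratchpad.append(f"solved: {goal}")
                return 1.0  # sparse reward: 1 on success within the horizon
        self.scratchpad.append(f"failed: {goal}")
        return 0.0
```

The key design point is that `scratchpad` belongs to the trial, not the episode, so anything written to it is available to every subsequent goal in the same environment.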
2. Task Suite and Hindsight Trajectory Rewriting
The task suite consists of navigation/planning objectives drawn from a goal set $\mathcal{G}$ (one pickup instruction per object in the environment), with a new goal sampled for each episode. Each episode is bounded by a fixed horizon of $H = 64$ steps, and a reward of 1 is awarded for successful completion within $H$ steps; otherwise, the reward is 0.
Failure cases (episodes where the agent fails to pick up the assigned object) are leveraged for learning via ECHO’s hindsight trajectory rewriting. The ECHO framework diagnostically inspects failed trajectories to identify alternative objects that were observed or could feasibly have been picked up, surfacing these as subgoals. For each subgoal, the LM agent synthesizes an optimized high-level workflow from the episode summary. Operationally, this is performed by the LM via:
- Summarizing the episode trajectory,
- Identifying all feasible goals encountered,
- Generating counterfactual trajectories for each subgoal.
As a result, ECHO generalizes standard Hindsight Experience Replay (HER) to the language agent setting by producing rewritten, goal-conditioned trajectories for every reachable subgoal rather than simple goal relabeling.
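The three-step rewriting procedure can be sketched as a small pipeline. The function names (`summarize`, `find_feasible_goals`, `rewrite_for`) are illustrative stand-ins for the LM calls ECHO makes; only the overall structure—summary, feasible-subgoal extraction, per-subgoal counterfactual workflow—follows the source.

```python
# Minimal sketch of ECHO-style hindsight rewriting. Each callable is a
# placeholder for an LM invocation; names are illustrative, not ECHO's API.
def hindsight_rewrite(trajectory, summarize, find_feasible_goals, rewrite_for):
    """Turn one (possibly failed) trajectory into goal-conditioned exemplars.

    summarize(trajectory)        -> str  : episode summary
    find_feasible_goals(summary) -> list : subgoals observed/reachable
    rewrite_for(summary, goal)   -> str  : counterfactual workflow for goal
    """
    summary = summarize(trajectory)
    exemplars = {}
    for subgoal in find_feasible_goals(summary):
        # Unlike plain HER relabeling, a full workflow is re-synthesized
        # for each subgoal rather than merely swapping the goal label.
        exemplars[subgoal] = rewrite_for(summary, subgoal)
    return exemplars
```

One failed episode thus yields multiple memory entries, one per reachable subgoal, which is the generalization of HER's goal relabeling noted above.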
3. Evaluation Protocol and Metrics
Each benchmark trial consists of 10 distinct procedurally generated grid layouts. For each layout, 16 queries (goals) are solved, yielding up to 160 goal-episodes per trial. Every episode is capped at $H = 64$ steps, for a total evaluation budget of up to 10,240 agent steps.
Performance is assessed using three primary metrics:
- Final Average Reward (Success Rate): $\bar{R} = \frac{1}{N}\sum_{i=1}^{N} r_i$,
where $r_i \in \{0, 1\}$ indicates task completion (success) in episode $i$, with $N = 160$.
- Cumulative Average Reward (Sample Efficiency): $C_k = \frac{1}{k}\sum_{i=1}^{k} r_i$,
analyzing how rapidly agents improve over episodes. Rapid early growth in $C_k$ indicates higher sample efficiency.
- Relative Improvement over Baseline: $\Delta = (\bar{R}_{\text{method}} - \bar{R}_{\text{ReAct}}) / \bar{R}_{\text{ReAct}}$.
All performance is reported relative to a static ReAct agent.
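The three metrics are direct functions of the per-episode binary rewards, and can be computed as follows (a straightforward restatement of the formulas; the reward values used are hypothetical):

```python
# The benchmark's three metrics, computed from a list of per-episode
# binary rewards r_i in {0, 1}.
def final_average_reward(rewards):
    """R_bar = (1/N) * sum of r_i over all N episodes."""
    return sum(rewards) / len(rewards)

def cumulative_average_reward(rewards):
    """C_k = (1/k) * sum_{i<=k} r_i, returned for every prefix k."""
    curve, total = [], 0.0
    for k, r in enumerate(rewards, start=1):
        total += r
        curve.append(total / k)
    return curve

def relative_improvement(method_avg, react_avg):
    """Improvement of a method's average reward over the ReAct baseline."""
    return (method_avg - react_avg) / react_avg
```

Sample efficiency shows up as the shape of the `cumulative_average_reward` curve: a method that succeeds early keeps $C_k$ high from small $k$ onward, even if two methods end with the same final average.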
4. Baseline Architectures and Comparative Performance
Five agent architectures are benchmarked:
| Method | Memory/Update Strategy | Reflection/Success Use |
|---|---|---|
| ReAct | No cross-episode memory; chain-of-thought per step | None |
| Reflexion | Generic post-episode notes appended to semantic memory | Notes/Failures |
| AWM | Stores “workflow” summary for successful episodes only | Successes only |
| AWM++ | Applies ECHO’s memory compression to successful episodes | Compressed Successes only |
| ECHO | Stores/rewrites workflows for both successes and failures | Hindsight Rewrites for All Episodes |
Quantitative evaluation in Table 1 of the source demonstrates that ECHO achieves the highest mean success rate of all agents tested, reflecting an approximately 80% improvement in average reward over the ReAct baseline and a 42% improvement over the next best agent. ECHO achieves higher cumulative average reward than all other agents, surpassing the ReAct curve after only three episodes and maintaining the highest sample efficiency across the full evaluation protocol.
5. Mechanisms Underlying Enhanced Sample Efficiency
The sample efficiency observed for ECHO stems from multiple mechanisms:
- Conversion of Failures to Synthetic Successes: ECHO augments its memory by rewriting failed episode trajectories into multiple goal-conditioned exemplars, effectively increasing the density of learnable successes even when the original reward signal is sparse.
- Memory Compression: ECHO’s update protocol retains only the shortest valid workflow per goal, producing a compact, high-quality memory bank for prompt-based retrieval. For comparison, AWM++ applies this compression selectively only to successful episodes, yielding smaller gains.
- Counterfactual Validity: Automated validity analysis confirms that 85% (34/40) of ECHO-synthesized workflows result in actual task success when executed, substantiating the realism and utility of trajectory rewriting.
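The memory-compression rule above—retain only the shortest valid workflow per goal—can be sketched as a simple update. The dictionary representation and the `is_valid` flag are assumptions for illustration; ECHO's actual memory lives in prompt text.

```python
# Sketch of shortest-valid-workflow memory compression. The data shapes
# (dict of goal -> action list) are illustrative assumptions.
def update_memory(memory, goal, workflow, is_valid):
    """Keep at most one workflow per goal: the shortest valid one seen so far.

    memory:   dict mapping goal -> workflow (list of action strings)
    is_valid: whether the candidate workflow actually achieves the goal
    """
    if not is_valid:
        return memory  # invalid rewrites never enter the memory bank
    incumbent = memory.get(goal)
    if incumbent is None or len(workflow) < len(incumbent):
        memory[goal] = workflow  # replace with the shorter valid workflow
    return memory
```

Keeping one compact entry per goal bounds the prompt-context cost of memory retrieval while preserving the highest-quality exemplar for each goal.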
6. Implications and Scalability Considerations
Although XMiniGrid-Stateful experiments are conducted on 4-room grid layouts, the reliance on pretrained LM “world modeling” and natural-language trajectory editing suggests inherent scaling potential to larger grids and more complex procedural environments, provided that subgoal identification remains feasible. A plausible implication is that ECHO’s framework may generalize to other domains where hindsight trajectory rewriting and compact memory bank construction can amortize sparse rewards and accelerate adaptation.
7. Significance for Language Agent Research
The XMiniGrid-Stateful benchmark concretely demonstrates the transformative effect of stateful memory and hindsight trajectory rewriting in converting language-model agents from one-shot solvers into sample-efficient learners. By leveraging natural-language “world modeling” and generalizing off-policy RL paradigms such as HER into the linguistic domain, ECHO-driven agents exhibit markedly superior performance in both reward attainment and adaptability, as evidenced by both average and cumulative reward metrics (Hu et al., 11 Oct 2025).
The rigorous comparison against existing baselines highlights the importance of cross-episode memory management, sophisticated trajectory rewriting, and memory compression for advancing sample efficiency in sequential text-based agent environments.