XMiniGrid-Stateful Benchmark
- XMiniGrid-Stateful is a procedurally generated, text-based grid environment whose layout is sampled once and then held fixed across episodes, enabling cross-episode memory utilization.
- The benchmark employs hindsight trajectory rewriting, converting failures into alternative subgoals to enhance sample efficiency and reward maximization.
- Empirical evaluations show that techniques like memory compression and counterfactual analysis boost performance by up to 80% over static one-shot approaches.
The XMiniGrid-Stateful benchmark is a procedurally generated, partially-observable text-based navigation and planning environment designed to assess the capacity of language-model (LM) agents for sample-efficient online learning via experience accumulation and hindsight trajectory rewriting. Unlike traditional single-episode paradigms, XMiniGrid-Stateful systematically exposes agents to the same underlying environment across multiple episodes, enabling investigation of how agents can leverage cross-episode memory for improved adaptation and reward maximization. The benchmark forms a core component of the empirical evaluation in "Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting," which introduces the ECHO framework and demonstrates its advances over established baselines through rigorous experimental protocols (Hu et al., 11 Oct 2025).
1. Environment Design and Statefulness
The benchmark's underlying environment is a procedurally generated 2D grid world ("MiniGrid") composed of four connected rooms, each separated by walls and doors. Each room contains exactly one unique “pickup-able” object, for a total of four per environment. The agent's spawn point and the locations of all objects are randomly sampled once and then held fixed for the duration of trials in a given environment, facilitating cross-episode knowledge integration.
Observations at each timestep consist of an egocentric text description of immediate surroundings—e.g., spatial relations to walls, doors, and objects—augmented by explicit valid/invalid action lists for the current state. The discrete action space includes movement (GoForward, TurnLeft, TurnRight) and environment interaction primitives (PickUp, Drop, Toggle for doors). Each executed action deterministically updates the underlying grid state according to a transition function $T$, $s_{t+1} = T(s_t, a_t)$, with the agent perceiving only a partial text-based projection $o_t = O(s_t)$.
Standard XMiniGrid re-samples both environment and goal for each episode, so agent memory cannot compound. In contrast, XMiniGrid-Stateful maintains a constant environment layout and fixed spawn/object positions across episodes. After each episode, the agent retains a “scratchpad” of past interactions—such as hints, workflows, or rewritten trajectories—within its prompt context. Only the goal instruction is re-sampled and the agent's state is reset to the initial configuration for the new episode.
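The stateful evaluation loop described above can be sketched as follows. The `Env`/`Agent` interfaces and the scratchpad representation are illustrative assumptions; only the reset semantics (frozen layout seed, re-sampled goal, memory carried across episodes) come from the benchmark description.

```python
# Sketch of a stateful multi-episode trial: the layout seed is sampled once
# and frozen, while a scratchpad of past interactions survives episode resets.
# Env and Agent interfaces here are hypothetical, not the benchmark's API.
from dataclasses import dataclass, field

@dataclass
class StatefulTrial:
    """One XMiniGrid-Stateful trial: layout fixed, goals re-sampled."""
    layout_seed: int                                 # sampled once, then frozen
    scratchpad: list = field(default_factory=list)   # survives episode resets

    def run_episode(self, env, agent, goal, horizon=64):
        env.reset(seed=self.layout_seed)  # same grid, same spawn/object layout
        obs = env.observe()
        for _ in range(horizon):
            # The agent conditions on the goal, the current observation,
            # and its cross-episode memory (hints, workflows, rewrites).
            action = agent.act(goal, obs, memory=self.scratchpad)
            obs, done = env.step(action)
            if done:
                self.scratchpad.append(f"solved: {goal}")
                return 1.0  # sparse reward: 1 on success within the horizon
        self.scratchpad.append(f"failed: {goal}")
        return 0.0
```

The key design point is that `scratchpad` belongs to the trial, not the episode, so anything written to it is available to every subsequent goal in the same environment.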
2. Task Suite and Hindsight Trajectory Rewriting
The task suite consists of navigation/planning objectives drawn from a goal set $\mathcal{G}$ (one pickup instruction per object in the environment), with a new goal sampled for each episode. Each episode is bounded by a fixed horizon of $H = 64$ steps, and a reward of 1 is awarded for successful completion within $H$ steps; otherwise, the reward is 0.
Failure cases (episodes where the agent fails to pick up the assigned object) are leveraged for learning via ECHO’s hindsight trajectory rewriting. The ECHO framework diagnostically inspects failed trajectories to identify alternative objects that were observed or could feasibly have been picked up, surfacing these as subgoals. For each subgoal, the LM agent synthesizes an optimized high-level workflow from the episode summary. Operationally, this is performed by the LM via:
- Summarizing the episode trajectory,
- Identifying all feasible goals encountered,
- Generating counterfactual trajectories for each subgoal.
As a result, ECHO generalizes standard Hindsight Experience Replay (HER) to the language agent setting by producing rewritten, goal-conditioned trajectories for every reachable subgoal rather than simple goal relabeling.
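The three-step rewriting procedure can be sketched as a small pipeline. The function names (`summarize`, `find_feasible_goals`, `rewrite_for`) are illustrative stand-ins for the LM calls ECHO makes; only the overall structure—summary, feasible-subgoal extraction, per-subgoal counterfactual workflow—follows the source.

```python
# Minimal sketch of ECHO-style hindsight rewriting. Each callable is a
# placeholder for an LM invocation; names are illustrative, not ECHO's API.
def hindsight_rewrite(trajectory, summarize, find_feasible_goals, rewrite_for):
    """Turn one (possibly failed) trajectory into goal-conditioned exemplars.

    summarize(trajectory)        -> str  : episode summary
    find_feasible_goals(summary) -> list : subgoals observed/reachable
    rewrite_for(summary, goal)   -> str  : counterfactual workflow for goal
    """
    summary = summarize(trajectory)
    exemplars = {}
    for subgoal in find_feasible_goals(summary):
        # Unlike plain HER relabeling, a full workflow is re-synthesized
        # for each subgoal rather than merely swapping the goal label.
        exemplars[subgoal] = rewrite_for(summary, subgoal)
    return exemplars
```

One failed episode thus yields multiple memory entries, one per reachable subgoal, which is the generalization of HER's goal relabeling noted above.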
3. Evaluation Protocol and Metrics
Each benchmark trial consists of 10 distinct procedurally generated grid layouts. For each layout, 16 queries (goals) are solved, yielding up to 160 goal-episodes per trial. Every episode is capped at $H = 64$ steps, for a total evaluation budget of up to 10,240 agent steps.
Performance is assessed using three primary metrics:
- Final Average Reward (Success Rate): $\bar{R} = \frac{1}{N}\sum_{i=1}^{N} r_i$,
where $r_i \in \{0, 1\}$ indicates task completion (success) in episode $i$, with $N = 160$.
- Cumulative Average Reward (Sample Efficiency): $C_k = \frac{1}{k}\sum_{i=1}^{k} r_i$,
analyzing how rapidly agents improve over episodes. Rapid early growth in $C_k$ indicates higher sample efficiency.
- Relative Improvement over Baseline: $\Delta = (\bar{R}_{\text{method}} - \bar{R}_{\text{ReAct}}) / \bar{R}_{\text{ReAct}}$.
All performance is reported relative to a static ReAct agent.
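The three metrics are direct functions of the per-episode binary rewards, and can be computed as follows (a straightforward restatement of the formulas; the reward values used are hypothetical):

```python
# The benchmark's three metrics, computed from a list of per-episode
# binary rewards r_i in {0, 1}.
def final_average_reward(rewards):
    """R_bar = (1/N) * sum of r_i over all N episodes."""
    return sum(rewards) / len(rewards)

def cumulative_average_reward(rewards):
    """C_k = (1/k) * sum_{i<=k} r_i, returned for every prefix k."""
    curve, total = [], 0.0
    for k, r in enumerate(rewards, start=1):
        total += r
        curve.append(total / k)
    return curve

def relative_improvement(method_avg, react_avg):
    """Improvement of a method's average reward over the ReAct baseline."""
    return (method_avg - react_avg) / react_avg
```

Sample efficiency shows up as the shape of the `cumulative_average_reward` curve: a method that succeeds early keeps $C_k$ high from small $k$ onward, even if two methods end with the same final average.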
4. Baseline Architectures and Comparative Performance
Five agent architectures are benchmarked:
| Method | Memory/Update Strategy | Reflection/Success Use |
|---|---|---|
| ReAct | No cross-episode memory; chain-of-thought per step | None |
| Reflexion | Generic post-episode notes appended to semantic memory | Notes/Failures |
| AWM | Stores “workflow” summary for successful episodes only | Successes only |
| AWM++ | Applies ECHO’s memory compression to successful episodes | Compressed Successes only |
| ECHO | Stores/rewrites workflows for both successes and failures | Hindsight Rewrites for All Episodes |
Quantitative evaluation in Table 1 of the source demonstrates that ECHO achieves the highest mean success rate of all agents tested, reflecting an approximately 80% improvement in average reward over the ReAct baseline and a 42% improvement over the next best agent. ECHO achieves higher cumulative average reward than all other agents, surpassing the ReAct curve after only three episodes and maintaining the highest sample efficiency across the full evaluation protocol.
5. Mechanisms Underlying Enhanced Sample Efficiency
The sample efficiency observed for ECHO stems from multiple mechanisms:
- Conversion of Failures to Synthetic Successes: ECHO augments its memory by rewriting failed episode trajectories into multiple goal-conditioned exemplars, effectively increasing the density of learnable successes even when the original reward signal is sparse.
- Memory Compression: ECHO’s update protocol retains only the shortest valid workflow per goal, producing a compact, high-quality memory bank for prompt-based retrieval. For comparison, AWM++ applies this compression selectively only to successful episodes, yielding smaller gains.
- Counterfactual Validity: Automated validity analysis confirms that 85% (34/40) of ECHO-synthesized workflows result in actual task success when executed, substantiating the realism and utility of trajectory rewriting.
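The memory-compression rule above—retain only the shortest valid workflow per goal—can be sketched as a simple update. The dictionary representation and the `is_valid` flag are assumptions for illustration; ECHO's actual memory lives in prompt text.

```python
# Sketch of shortest-valid-workflow memory compression. The data shapes
# (dict of goal -> action list) are illustrative assumptions.
def update_memory(memory, goal, workflow, is_valid):
    """Keep at most one workflow per goal: the shortest valid one seen so far.

    memory:   dict mapping goal -> workflow (list of action strings)
    is_valid: whether the candidate workflow actually achieves the goal
    """
    if not is_valid:
        return memory  # invalid rewrites never enter the memory bank
    incumbent = memory.get(goal)
    if incumbent is None or len(workflow) < len(incumbent):
        memory[goal] = workflow  # replace with the shorter valid workflow
    return memory
```

Keeping one compact entry per goal bounds the prompt-context cost of memory retrieval while preserving the highest-quality exemplar for each goal.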
6. Implications and Scalability Considerations
Although XMiniGrid-Stateful experiments are conducted on 4-room grid layouts, the reliance on pretrained LM “world modeling” and natural-language trajectory editing suggests inherent scaling potential to larger grids and more complex procedural environments, provided that subgoal identification remains feasible. A plausible implication is that ECHO’s framework may generalize to other domains where hindsight trajectory rewriting and compact memory bank construction can amortize sparse rewards and accelerate adaptation.
7. Significance for Language Agent Research
The XMiniGrid-Stateful benchmark concretely demonstrates the transformative effect of stateful memory and hindsight trajectory rewriting in converting language-model agents from one-shot solvers into sample-efficient learners. By leveraging natural-language “world modeling” and generalizing off-policy RL paradigms such as HER into the linguistic domain, ECHO-driven agents exhibit markedly superior performance in both reward attainment and adaptability, as evidenced by both average and cumulative reward metrics (Hu et al., 11 Oct 2025).
The rigorous comparison against existing baselines highlights the importance of cross-episode memory management, sophisticated trajectory rewriting, and memory compression for advancing sample efficiency in sequential text-based agent environments.