Relative performance of models under MineNPC-Task

Determine the relative performance of different large language models and agent configurations when evaluated within the MineNPC-Task framework, which constrains perception and action to Mineflayer APIs under a bounded-knowledge policy. Answering this requires controlled ablations and cross-model comparisons run on the same user-authored tasks and judged against the same validator-backed success criteria.

Background

The paper reports only a single-model snapshot using GPT-4o and explicitly notes the absence of ablations and cross-model comparisons. Because the evaluation framework enforces a bounded-knowledge policy and uses validator-backed judging on user-authored tasks, determining how different models compare requires controlled experiments under identical conditions.

This open question is central to establishing comparative benchmarks and reproducible baselines for memory-aware, mixed-initiative LLM agents in open-world Minecraft, and motivates future multi-model evaluations using the released harness, templates, and validators.
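A minimal sketch of the kind of cross-model comparison the question calls for: every model runs the identical user-authored task suite, and success is judged by the same validators. The model names, tasks, validators, and the `run_agent` stub below are illustrative assumptions, not the paper's harness; a real evaluation would drive Mineflayer-backed agents under the bounded-knowledge policy and apply the released MineNPC-Task validators.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Task:
    name: str
    # Validator judging the agent's final outcome; stands in for the
    # framework's validator-backed success criteria.
    validator: Callable[[str], bool]

# Identical user-authored tasks for every model under comparison (hypothetical).
TASKS = [
    Task("craft_pickaxe", lambda outcome: "pickaxe" in outcome),
    Task("reach_village", lambda outcome: "village" in outcome),
]

def run_agent(model: str, task: Task) -> str:
    """Stand-in for a model-driven agent episode. A real run would act
    through Mineflayer APIs and return the agent's end state; here we
    return canned outcomes purely to illustrate the comparison loop."""
    canned = {
        ("model-a", "craft_pickaxe"): "crafted a stone pickaxe",
        ("model-a", "reach_village"): "lost in a cave",
        ("model-b", "craft_pickaxe"): "crafted a stone pickaxe",
        ("model-b", "reach_village"): "arrived at the village",
    }
    return canned[(model, task.name)]

def success_rate(model: str) -> float:
    """Fraction of tasks the model passes under the shared validators."""
    results = [task.validator(run_agent(model, task)) for task in TASKS]
    return sum(results) / len(results)

scores = {m: success_rate(m) for m in ("model-a", "model-b")}
print(scores)  # → {'model-a': 0.5, 'model-b': 1.0}
```

Because tasks and validators are held fixed across models, differences in `success_rate` are attributable to the model/agent configuration rather than to the judging procedure, which is the controlled-comparison property the open question requires.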

References

Model coverage. We report a single-model snapshot (GPT-4o). There are no ablations or cross-model comparisons, so relative performance remains an open question.

MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents (2601.05215 - Doss et al., 8 Jan 2026), in Limitations, Model coverage