Relative performance of models under MineNPC-Task
Determine the relative performance of different large language models and agent configurations within the MineNPC-Task framework, which restricts perception and action to Mineflayer APIs under a bounded-knowledge policy. Answering this requires controlled ablations and cross-model comparisons in which every configuration runs the same user-authored tasks and is scored against the same validator-backed success criteria.
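As a rough illustration of what such a comparison could look like once per-task validator outcomes exist, the sketch below computes a success rate per configuration and a paired tally over shared tasks. It is not part of MineNPC-Task: the names `success_rate`, `paired_comparison`, `config_a`, `config_b`, and the task IDs are hypothetical, and the boolean outcomes are placeholders, not measured results.

```python
from itertools import combinations


def success_rate(outcomes):
    """Fraction of tasks whose validator reported success."""
    return sum(outcomes.values()) / len(outcomes)


def paired_comparison(a, b):
    """Tally tasks where exactly one of two configurations succeeded."""
    shared = sorted(set(a) & set(b))  # compare only tasks both configurations ran
    only_a = sum(1 for t in shared if a[t] and not b[t])
    only_b = sum(1 for t in shared if b[t] and not a[t])
    return {"tasks": len(shared), "only_a": only_a, "only_b": only_b}


# Placeholder outcomes for illustration only (hypothetical, not real results).
# Each entry maps a task ID to whether the validator marked the run a success.
results = {
    "config_a": {"task_1": True, "task_2": False, "task_3": True, "task_4": True},
    "config_b": {"task_1": True, "task_2": True, "task_3": False, "task_4": False},
}

rates = {name: success_rate(o) for name, o in results.items()}
pairs = {
    (x, y): paired_comparison(results[x], results[y])
    for x, y in combinations(results, 2)
}
```

Pairing on shared tasks keeps the comparison controlled: a configuration gets no credit for having been run on an easier subset. The `only_a` and `only_b` counts are the discordant pairs that a paired test such as McNemar's would take as input.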
References
Model coverage. We report a single-model snapshot (GPT-4o). There are no ablations or cross-model comparisons, so relative performance remains an open question.
— MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents
(2601.05215 - Doss et al., 8 Jan 2026) in Limitations, Model coverage