Uncaptured failure modes under no-feedback settings

Identify and characterize failure modes of vision–language models in visually interactive decision-making that fall outside the paper's four-category taxonomy (restricted action space and action looping, state mismanagement, early termination, and failure to use visual or spatial information). The paradoxical behavior observed when environment feedback is removed indicates that such modes exist.

Background

The authors analyze failure patterns using an automated labeling pipeline and a predefined taxonomy of four common failure types. When textual environment feedback is removed, overall performance decreases, yet the measured rates of some failure types (e.g., action looping and state mismanagement) also decrease. This is unexpected: if the taxonomy covered all failures, a drop in performance would ordinarily come with a rise in at least some labeled failure rates.

This paradox suggests the presence of additional failure modes beyond the current taxonomy, motivating further research to discover, define, and quantify these behaviors, especially in settings without textual feedback.
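
The arithmetic behind this paradox can be made concrete with a small sketch. The Python below is illustrative only: the label names, the `Episode` record, the `summarize` helper, and the toy episodes are all hypothetical and are not taken from the paper or its labeling pipeline, and the paper may normalize its failure rates differently. It shows how the overall failure rate can rise while per-label rates fall, provided the share of failures carrying no taxonomy label grows.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers standing in for the four taxonomy categories.
TAXONOMY = frozenset({
    "action_looping",
    "state_mismanagement",
    "early_termination",
    "ignores_visual_info",
})


@dataclass
class Episode:
    """One labeled trajectory (hypothetical record format)."""
    success: bool
    labels: set = field(default_factory=set)


def summarize(episodes):
    """Return overall, per-label, and uncovered failure rates over all episodes."""
    n = len(episodes)
    failures = [e for e in episodes if not e.success]
    label_rates = {
        name: sum(name in e.labels for e in failures) / n
        for name in sorted(TAXONOMY)
    }
    # Failed episodes that carry none of the taxonomy labels.
    uncovered = sum(1 for e in failures if not (e.labels & TAXONOMY))
    return {
        "failure_rate": len(failures) / n,
        "label_rates": label_rates,
        "uncovered_rate": uncovered / n,
    }


# Toy data, invented purely to illustrate the arithmetic.
with_feedback = (
    [Episode(True)] * 5
    + [Episode(False, {"action_looping"})] * 3
    + [Episode(False, {"state_mismanagement"})] * 2
)
no_feedback = (
    [Episode(True)] * 3
    + [Episode(False, {"action_looping"})] * 2
    + [Episode(False, {"state_mismanagement"})] * 1
    + [Episode(False)] * 4  # failures with no taxonomy label
)

fb = summarize(with_feedback)
nofb = summarize(no_feedback)
print("failure rate:", fb["failure_rate"], "->", nofb["failure_rate"])
print("uncovered rate:", fb["uncovered_rate"], "->", nofb["uncovered_rate"])
```

Under these assumptions, tracking an explicit uncovered rate alongside the taxonomy labels gives a direct measure of how much failure mass remains unexplained in each setting, which is the quantity this open problem asks to characterize.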

References

This suggests the presence of additional failure modes not captured by our taxonomy, which we leave for future work.

VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (2601.16973 - Wang et al., 23 Jan 2026) in Appendix: Analyzing Model Failures, Failure changes per ablation (Feedback vs. no feedback)