Uncaptured failure modes under no-feedback settings
Identify and characterize failure modes of vision–language models in visually interactive decision-making that fall outside the paper’s taxonomy, which covers restricted action space and action looping, state mismanagement, early termination, and failure to use visual or spatial information. The paper infers that such modes exist from the paradoxical behavior models show when environment feedback is removed.
References
This suggests the presence of additional failure modes not captured by our taxonomy, which we leave for future work.
— VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
(2601.16973, Wang et al., 23 Jan 2026), in Appendix "Analyzing Model Failures", under "Failure changes per ablation (Feedback vs. no feedback)"