InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

Published 12 Sep 2025 in cs.AI and cs.LG | (2509.12263v1)

Abstract: Large multimodal models (LMMs) encode universal physical laws observed during training, such as momentum conservation, as parametric knowledge. It allows LMMs to answer physical reasoning queries, such as the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning when the inference scenario violates these physical laws. In contrast, humans possess the skill to adapt their physical reasoning to unseen physical environments from a few visual examples. This ability, which we refer to as inductive physical reasoning, is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks evaluate only the parametric knowledge in LMMs, and not inductive physical reasoning. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs on their ability to predict the outcome of collision events in algorithmically generated synthetic collision videos. By inspecting 13 LMMs, InPhyRe informs us that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when demonstration samples violate universal physical laws, and (3) inductive physical reasoning in LMMs suffers from language bias and largely ignores the visual inputs, questioning the trustworthiness of LMMs regarding visual inputs.

Abstract PDF Upgrade to Chat

Summary

The paper reveals that LMMs struggle with inductive physical reasoning, particularly in collision scenarios deviating from known physical laws.
It introduces the InPhyRe benchmark, using synthetic video scenarios to test model performance under both regular and irregular physical conditions.
Experimental insights show LMMs rely heavily on parametric knowledge and language cues, highlighting the need for improved multimodal integration.

Inductive Physical Reasoning in Large Multimodal Models

The paper "InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning" addresses the gap in current benchmarks by introducing InPhyRe, a visual question answering benchmark designed to assess inductive physical reasoning in Large Multimodal Models (LMMs). The emphasis of this study is on how LMMs handle collision events that deviate from universal physical laws such as momentum conservation. The findings indicate significant challenges in LMMs' reasoning capabilities, especially when faced with unseen physical laws.

Background and Motivation

LMMs encode parametric knowledge of universal physical laws observed during training. However, their parametric knowledge is limited to scenarios encountered during training and is insufficient in situations that violate the laws encoded in their parameters. Humans, in contrast, can swiftly adapt their reasoning to novel physical environments through inductive reasoning. This paper posits that for LMMs to replace human agents in critical application domains like autonomous driving, they need robust inductive physical reasoning capabilities.

Figure 1: LMM is asked to predict the change in vertical velocity of an object colliding with a vertical wall, illustrating difficulties in inductive physical reasoning.

InPhyRe Benchmark Design

InPhyRe is the inaugural benchmark explicitly developed for evaluating inductive physical reasoning in LMMs. It includes synthetic video scenarios where LMMs predict outcomes of collision events either adhering to or deviating from known physical laws.

Scenarios and Categories

The benchmark is structured around scenarios, categorized by the physical laws they violate or adhere to:

Momentum Conservation Violation: Scenarios where linear or angular momentum principles are violated (Figure 2).
Inconsistent Physics: Scenarios where some object properties violate typical physical laws (e.g., specific colored objects behaving differently).
Miscellaneous: Scenarios assessing concepts like visual bias in physical reasoning.
Figure 2: Structure of the InPhyRe benchmark featuring scenarios that violate universal physical laws.

Evaluation Methodology

LMMs are assessed under zero-shot and few-shot settings, distinguished by whether exemplars that follow (regular) or violate (irregular) universal laws are used. The performance gap between these settings is used to measure inductive physical reasoning strength.

Open Challenges

The study identified several challenges:

Limited Parametric Knowledge: LMMs often fail basic reasoning tasks, even in regular scenarios.
Weak Inductive Reasoning: Performance significantly deteriorates in irregular scenarios where inductive reasoning is vital.
Language Bias: LMMs predominantly rely on language cues rather than visual inputs, highlighting trust issues when visual congruity with textual information is absent.

Experimental Insights

The experiments involved a diverse set of LMMs with differing architectures and parameter counts. The models displayed variable performance based on their architectural nuances.

Figure 3: Visual pipeline for generating synthetic videos used in the benchmark.

Implications and Future Directions

The findings underscore critical gaps in LMMs' reasoning capabilities—specifically, the need for models to balance their parametric knowledge with flexible, inductive reasoning capabilities. Future research could explore incorporating feedback mechanisms in training protocols to enhance inductive reasoning in LMMs. Moreover, addressing language bias by integrating more robust visual processing capabilities is another prospective pathway.

Conclusion

"InPhyRe" presents substantial insights into the limitations of current LMMs regarding inductive physical reasoning. The benchmark's value lies in its potential to guide the development of more versatile AI systems capable of understanding and reasoning about complex real-world scenarios beyond their training experiences. Future work should prioritize methods to alleviate these identified deficiencies, potentially involving adaptive learning strategies and enhanced multi-modal integrations to replicate the nuanced understanding inherent in human reasoning.

Markdown Report Issue