
"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Published 17 Jul 2025 in cs.CV and cs.AI | (2507.13428v1)

Abstract: Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles like object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel "Anti-Physics" category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current MLLMs to evaluate physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with a detailed comparison and analysis. Through systematic testing of their outputs across 1,050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We then rigorously examine their performance on diverse physical phenomena with varying prompt types, deriving targeted recommendations for crafting prompts that enhance fidelity to physical principles.

Summary

  • The paper’s main contribution is the introduction of PhyWorldBench, a benchmark that assesses text-to-video models' adherence to physical laws using dual evaluation metrics.
  • It employs a structured three-step formation process with expert-designed prompts to simulate scenarios from basic kinematics to complex multi-object interactions.
  • Results from 12 state-of-the-art models reveal challenges in temporal coherence and dynamics, especially under anti-physics conditions.

Summary of "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Introduction

The paper presents PhyWorldBench, a benchmark for evaluating the realism of text-to-video models in simulating physical phenomena. The focus is on assessing how these models adhere to physical laws across multiple complexity tiers, ranging from basic object movements to intricate interactions in human and animal dynamics. PhyWorldBench further explores "Anti-Physics" scenarios, which intentionally violate real-world physics to test models' capabilities in handling imaginative or impossible scenes. Figure 1

Figure 1: Creation Process of PhyWorldBench, detailing the pipeline of prompt design.

Design and Implementation

PhyWorldBench is built through a structured three-step process involving experts in physics and technology:

  • Category Definition: A detailed taxonomy was designed with ten main categories, such as Kinematics, Fluid Dynamics, and Energy Conservation, each subdivided into increasingly complex scenarios.
  • Prompt Development: For each scenario, prompts are created at three levels of detail: Event Prompt, Physics-Enhanced Prompt, and Detailed Narrative Prompt. This aids in examining how models react to progressively enriched context. Figure 2

    Figure 2: Example of prompts illustrating different descriptions for identical physical scenarios.
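The three-tier prompt structure described above can be sketched as a simple data container. This is an illustrative sketch only: the field names and example texts are assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass

# Hypothetical container for the three prompt tiers PhyWorldBench defines
# per physical scenario (field names are illustrative, not the paper's).
@dataclass
class ScenarioPrompts:
    category: str            # e.g. "Kinematics"
    event: str               # Event Prompt: bare description of what happens
    physics_enhanced: str    # adds explicit physical cues
    detailed_narrative: str  # fully enriched contextual description

    def tiers(self):
        """Return the prompts ordered from least to most detail."""
        return [self.event, self.physics_enhanced, self.detailed_narrative]

# Invented example scenario for illustration.
ball = ScenarioPrompts(
    category="Kinematics",
    event="A ball falls to the ground.",
    physics_enhanced="A ball falls, accelerating under gravity until impact.",
    detailed_narrative=("A red rubber ball is released from a table edge; "
                        "it accelerates downward under gravity, strikes the "
                        "floor, and rebounds to a reduced height."),
)
```

Organizing prompts this way makes it easy to feed the same scenario to a model at each detail level and compare how added context affects physical fidelity.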

Evaluation Methodology

The evaluation employs a dual-standard metric: Basic and Key Standards. Basic Standards ensure elementary event representation, while Key Standards evaluate adherence to actual physical laws. A Yes/No assessment structure provides simplicity and objectivity in evaluating the realism of model-generated videos. Figure 3

Figure 3: Evaluation process highlighting instances of failure to adhere to physical standards.
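The dual-standard Yes/No scheme lends itself to a simple aggregation. The sketch below is a minimal illustration, assuming each video receives two binary judgments and that a video can only pass the Key Standard if it also passes the Basic Standard; that hierarchy is an assumption, not a detail confirmed by the source.

```python
# Minimal sketch of aggregating dual-standard Yes/No judgments into
# per-model success rates (not the authors' evaluation code).
def score_model(judgments):
    """judgments: list of (basic_ok, key_ok) boolean pairs, one per video.

    Assumes the Key Standard is only creditable when the Basic Standard
    (elementary event representation) is also met.
    """
    n = len(judgments)
    basic = sum(1 for b, _ in judgments if b)
    key = sum(1 for b, k in judgments if b and k)
    return {"basic_rate": basic / n, "key_rate": key / n}

# Four hypothetical videos: two pass both standards, one passes only
# the Basic Standard, one passes neither.
rates = score_model([(True, True), (True, False), (False, False), (True, True)])
```

A binary pass/fail per standard keeps both human annotation and automated judging simple and objective, at the cost of not capturing partial physical plausibility.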

Results and Analysis

Twelve state-of-the-art models were subjected to this benchmark, revealing persistent challenges in temporal coherence, accurate motion dynamics, and realistic rendering under complex or "anti-physics" conditions. Variations in success rates highlighted divergent performance among models, with proprietary models like Pika 2.0 leading in overall physical realism. Figure 4


Figure 4: Success rates of different models when assessed against PhyWorldBench.

Automated Evaluation Using MLLM

The study introduces the Context-Aware Prompt (CAP) for using MLLMs in zero-shot evaluation of video realism. CAP enhances evaluation accuracy by providing context about the video's generated nature, resulting in improved objectivity over existing metrics. Figure 5

Figure 5: Context-aware prompts influencing the assessment quality for improved video evaluations.
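A context-aware prompt of this kind might be assembled as shown below. This is a hypothetical sketch: the paper's exact prompt wording is not reproduced here, and the function name and criterion text are invented for illustration.

```python
# Illustrative construction of a context-aware prompt (CAP) for an MLLM
# judge; the exact wording used in the paper is not reproduced here.
def build_cap_prompt(event_description, key_standard):
    """Build a judging prompt that tells the MLLM the video is synthetic,
    so rendering artifacts are not conflated with physics violations."""
    return (
        "You are evaluating an AI-GENERATED video, which may contain "
        "rendering artifacts unrelated to physics.\n"
        f"Intended event: {event_description}\n"
        f"Physics criterion: {key_standard}\n"
        "Answer strictly 'Yes' or 'No': does the video obey this criterion?"
    )

prompt = build_cap_prompt(
    "A ball dropped from a table falls to the floor.",
    "The ball accelerates downward and does not float or reverse direction.",
)
```

Stating the video's generated nature up front is the key idea: it steers the judge toward the physics criterion rather than penalizing visual artifacts common to all generated videos.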

Challenges in Multi-object Scenarios

The paper identifies that added complexity, such as multiple interacting objects, escalates the difficulty for models in maintaining realism and semantic adherence. These failures persist beyond what scaling or simple prompt refinement can address, indicating an area ripe for future exploration.

Conclusion

PhyWorldBench provides a structured foundation for evaluating and enhancing the physical realism of video generation models. As models advance, this benchmark serves as a critical tool, offering insights into their capabilities and supporting progress towards more robust, physics-aware video generation technologies. Future work should address the detailed representation of complex physical environments, leveraging the insights afforded by this benchmark.

The insights and methodologies from this study pave the way for a richer understanding of the limitations and potential of current AI-driven video generation, guiding future research into more physically coherent model architectures.
