Physics-Grounded Video Generation
- Physics-grounded video generation is a family of methods that integrate explicit physical laws into generative video models to ensure realistic dynamics and adherence to real-world invariants.
- It utilizes techniques like relational distillation, reinforcement learning with physics-based rewards, and latent physics conditioning to maintain object permanence and temporal consistency.
- This approach addresses challenges such as simulating rigid-body interactions, managing computational costs, and extending to complex deformable and fluid dynamics.
Physics-grounded video generation refers to data-driven video generation methods that explicitly encode, enforce, or adapt to the physical laws and commonsense dynamics governing the real world. The motivation is to overcome the common deficiency of standard video diffusion or transformer-based models, which, despite producing high-fidelity and realistic imagery, frequently violate causality, Newtonian mechanics, continuity, or invariants such as mass and energy. Physics-grounded methods inject structure through simulation, reward-based optimization, representation learning, or architectural priors to ensure that generated videos are not just plausible by appearance or motion statistics but also adhere to real-world physical principles across domains such as solid mechanics, fluid dynamics, and articulated motion.
1. Motivation and Challenges in Physics-Grounded Video Generation
Modern text-to-video (T2V) and image-to-video (I2V) diffusion models—such as CogVideoX, VideoCrafter2, OpenSora, and derivatives—achieve photorealism and in-distribution motion but frequently violate elementary physical laws. Characteristic failures include interpenetrating rigid bodies, non-inertial motion, time-inconsistent acceleration, and causally inconsistent event sequences. The root causes are twofold:
- Limited physics inductive bias: Pretraining is optimized for visual denoising, not for modeling physical structure or dynamical invariants.
- Model–data mismatch: Scaling video–text corpora improves some motion statistics but does not guarantee adherence to rigid-body, fluid, or energy conservation laws, especially out-of-distribution.
Directly training video diffusion models (VDMs) with loss functions tailored to visual realism or global statistics frequently yields impressive visuals but fails to capture object permanence, temporally coherent trajectories, or plausible interaction mechanics. Empirically, video understanding foundation models (VFMs) trained with self-supervised tasks such as masked frame prediction (e.g., VideoMAEv2, V-JEPA) encode significantly greater physics reasoning than even multi-billion parameter generative models (Zhang et al., 29 May 2025). This physics understanding gap motivates explicitly grounding generative video synthesis in physical priors or constraints.
2. Knowledge Distillation from Physics-Aware Foundation Models
A major line of research aligns video generation models with physics-aware representation spaces distilled from video understanding foundation models. VideoREPA (Zhang et al., 29 May 2025) exemplifies this approach by introducing Token Relation Distillation (TRD), a relational loss that soft-aligns the pairwise similarities of spatio-temporal token embeddings between a frozen teacher VFM (e.g., VideoMAEv2) and a student T2V model (CogVideoX). Rather than direct feature matching—which can destabilize pretrained generative models—TRD matches the cosine similarity matrices among spatially or temporally separated token pairs. The overall loss combines the standard diffusion denoising objective with the TRD term, $\mathcal{L} = \mathcal{L}_{\text{denoise}} + \lambda\,\mathcal{L}_{\text{TRD}}$, where $\lambda$ weights the relational alignment.
This "soft" distillation improves the encoding of physical dynamics in the generative model, evidenced by substantial increases in Physical Consistency (PC) and Semantic Adherence (SA) scores on established benchmarks such as VideoPhy and Physion. TRD-based methods generalize prior “representation alignment” (REPA) schemes by (i) matching relations, not features, (ii) incorporating temporal as well as spatial alignment, and (iii) preserving finetuning stability by clamping over-alignment (Zhang et al., 29 May 2025).
Comparative Evaluation (VideoREPA, CogVideoX)
| Model | VideoPhy PC | VideoPhy2 PC | Physion Acc. |
|---|---|---|---|
| CogVideoX-2B | 26.2 | 67.97 | ≈58% |
| VideoREPA-2B | 29.7 | 72.54 | ≈71% |
| CogVideoX-5B | 31.4 | – | – |
| VideoREPA-5B | 40.1 | – | – |
This highlights that deep physics-aware representations can be “injected” into generative video models via relational distillation, with qualitative improvements in realistic object rolling, collision outcomes, and fluid motion (Zhang et al., 29 May 2025).
3. Reinforcement Learning and Physics-Constraint Reward Optimization
A second strategy enforces physical laws via reward-based or preference-based optimization over model generations. The reinforcement learning (RL) paradigm can directly incorporate physics constraints as explicit rewards, enabling strict enforcement of physical structure.
PhysRVG (Zhang et al., 16 Jan 2026) introduces physics-aware RL for video generation, focusing on Newtonian rigid-body collision dynamics. The main contribution is the Mimicry-Discovery Cycle (MDcycle), which alternates between a supervised, pixel-space flow-matching loss (to anchor the generator in the data manifold) and an RL loss (Group Relative Policy Optimization, GRPO) that maximizes collision-based rewards computed from deviations between generated and reference object centroids, weighted by detected collision frames. This reward enforces trajectory and collision fidelity on a per-frame, per-object basis, evaluated on PhysRVGBench, a curated set of collision-rich rigid-body scenes. MDcycle stabilizes optimization, allowing for large group sizes (640 samples per minibatch) and parameter-efficient adaptation via LoRA (Zhang et al., 16 Jan 2026).
| Method | IoU (collision) | TO (traj offset) |
|---|---|---|
| Kling2.5 | 0.70 | 103.22 |
| Magi-1 | 0.64 | 113.00 |
| Baseline+FT+RL | 0.61 | 17.25 |
| Baseline+FT+MDcycle | 0.64 | 15.03 |
| PhysRVG | 0.64 | 15.03 |
These results show a large reduction in trajectory offset, at collision IoU comparable to strong pretrained systems such as Kling2.5 and Magi-1.
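The collision-weighted reward and GRPO's group-relative baseline can be sketched as follows; the Gaussian reward kernel, collision weights, and array shapes are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def collision_reward(gen_centroids, ref_centroids, collision_frames, sigma=10.0):
    """Reward per-frame, per-object centroid fidelity, upweighting detected
    collision frames. Shapes: (T, O, 2) centroids, boolean (T,) mask."""
    dev = np.linalg.norm(gen_centroids - ref_centroids, axis=-1)  # (T, O)
    weights = np.where(collision_frames, 2.0, 1.0)[:, None]       # emphasize collisions
    return float(np.mean(weights * np.exp(-dev / sigma)))

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward by its
    group's statistics (clipping and KL regularization omitted)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

T, O = 8, 2
ref = np.cumsum(np.ones((T, O, 2)), axis=0)   # straight-line reference tracks
good = ref + 0.5                              # small trajectory offset
bad = ref + 50.0                              # large trajectory offset
hits = np.array([False] * 6 + [True] * 2)     # collisions in the last frames
r_good, r_bad = collision_reward(good, ref, hits), collision_reward(bad, ref, hits)
assert r_good > r_bad
adv = grpo_advantages([r_good, r_bad, r_bad])
assert adv[0] > 0 > adv[1]
```

Normalizing within each sampled group is what makes GRPO baseline-free: samples only need to beat their group mean, not an absolute reward threshold.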
4. Latent Physics Conditioning and Task-Driven Attention Injection
Recent designs explicitly inject compact encodings of physical state or “physics tokens” into the conditional stream of video diffusion models. PhysVideoGenerator (Satish et al., 7 Jan 2026) regresses high-level physical representations (extracted from a powerful video predictive world model, V-JEPA 2) from noisy latent states using a lightweight predictor network (PredictorP). These physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) through cross-attention blocks, $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V$, with $K$ and $V$ derived from the predicted physics embedding and $Q$ from the generator's latent stream. Multi-task optimization minimizes both the denoising loss and a physics regression loss to maintain stable, physics-aware latent trajectories (Satish et al., 7 Jan 2026).
This paradigm demonstrates that sufficient physical structure can be recovered from latent states, and that explicit, differentiable physics signals can stably guide video generation without requiring external simulators.
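A minimal sketch of injecting physics tokens through cross-attention; the projection names and shapes are illustrative, not Latte's actual layer signatures:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def physics_cross_attention(latent_tokens, physics_tokens, Wq, Wk, Wv):
    """Cross-attention: queries come from the generator's latent tokens,
    keys/values from the predicted physics tokens."""
    Q = latent_tokens @ Wq               # (N, d)
    K = physics_tokens @ Wk              # (M, d)
    V = physics_tokens @ Wv              # (M, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (N, M) attention weights
    return attn @ V                      # physics-conditioned update, (N, d)

rng = np.random.default_rng(1)
d = 32
latents = rng.normal(size=(20, d))       # noisy spatio-temporal latent tokens
phys = rng.normal(size=(4, d))           # compact predicted physics tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = physics_cross_attention(latents, phys, Wq, Wk, Wv)
assert out.shape == (20, d)
```

Because the physics tokens enter only as keys/values, the generator's own token stream is untouched when the physics signal is uninformative, which helps preserve pretrained behavior.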
5. Preference Optimization and Automated Physics Reward Models
Direct preference optimization (DPO), previously successful in RLHF for language, has been adapted to guide video generation towards physically plausible outputs by leveraging dual-dimensional reward models.
PhysCorr (Wang et al., 6 Nov 2025) introduces PhysicsRM, a reward model that separately quantifies intra-object temporal stability and inter-object physical interaction (e.g., correct collisions, trajectory continuity), and PhyDPO, a DPO pipeline that integrates these physics-aware preference signals in a model-agnostic way. PhysicsRM checks framewise DINOv2 feature consistency (subject consistency) and uses a distilled LLM to answer physics questions assessing interaction mechanics. These scalar rewards are blended and used to mine the hardest negative video pairs, with reweighting to counterbalance under-sampled, difficult physics regimes. This method achieves measurable improvements in physical realism metrics (e.g., Motion Rationality +29.9%) while maintaining visual and semantic fidelity (Wang et al., 6 Nov 2025).
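A toy sketch of blending the two reward dimensions and mining (chosen, rejected) pairs; the mixing weight `alpha` and the best-vs-worst mining rule are simplifying assumptions, not PhysicsRM's exact scheme:

```python
import numpy as np

def blend_rewards(subject_consistency, interaction_score, alpha=0.5):
    """Blend intra-object stability and inter-object interaction scores
    into one scalar reward (`alpha` is an assumed mixing weight)."""
    return alpha * subject_consistency + (1 - alpha) * interaction_score

def mine_preference_pair(rewards):
    """Pick the highest- and lowest-reward generations for one prompt as a
    (chosen, rejected) DPO pair -- a simple hardest-negative scheme."""
    order = np.argsort(rewards)
    return int(order[-1]), int(order[0])

# Three candidate videos for one prompt: (subject_consistency, interaction_score)
candidates = [(0.9, 0.8), (0.4, 0.3), (0.7, 0.9)]
rewards = np.array([blend_rewards(s, i) for s, i in candidates])
chosen, rejected = mine_preference_pair(rewards)
assert (chosen, rejected) == (0, 1)  # best vs. worst blended reward
```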
Similarly, PhyGDPO (Cai et al., 31 Dec 2025) generalizes DPO to groupwise feedback via Plackett–Luce modeling (GDPO), introducing Physics-Guided Rewarding (PGR) that ties preference weights directly to vision-LLM (VLM) physics scores. PhyAugPipe, a large preference pool of 135K physics-rich pairs, anchors model optimization in categories where physics errors are most challenging, validated by leading scores on VideoPhy2 and PhyGenBench (Cai et al., 31 Dec 2025).
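The Plackett–Luce groupwise likelihood that GDPO-style objectives maximize can be sketched as follows; this is a minimal version over raw scores, without the DPO policy-ratio parameterization:

```python
import numpy as np

def plackett_luce_log_likelihood(scores, ranking):
    """Log-likelihood of observing `ranking` (best-to-worst indices) under a
    Plackett-Luce model: each position is a softmax choice among the items
    not yet ranked."""
    ll = 0.0
    remaining = list(ranking)
    for idx in ranking:
        logits = np.array([scores[j] for j in remaining])
        ll += scores[idx] - np.log(np.exp(logits).sum())  # softmax over remaining
        remaining.remove(idx)
    return ll

scores = np.array([2.0, 1.0, 0.0])   # model scores for 3 sampled videos
good = plackett_luce_log_likelihood(scores, [0, 1, 2])  # ranking agrees with scores
bad = plackett_luce_log_likelihood(scores, [2, 1, 0])   # reversed ranking
assert good > bad
```

Groupwise feedback of this form uses all pairwise orderings within a sampled group at once, rather than one (chosen, rejected) pair, which is the generalization PhyGDPO exploits.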
6. Simulation-Principled and Trajectory-Guided Generation
Physics-grounded video methods also embed physical simulation—often differentiable or learning-augmented—directly into the generation pipeline:
- PhysGen3D (Chen et al., 26 Mar 2025) and PhysMotion (Tan et al., 2024) reconstruct explicit 3D geometry from a single image, infer object/material properties using VLMs, and run MPM-based continuum mechanics simulations to govern subsequent dynamics. Rendered trajectories are composited with photorealistic rendering and refined via diffusion models, producing videos where object motion, collision, and deformation reflect correct material and force cues.
- PhysGen (Liu et al., 2024), MotionCraft (Aira et al., 2024), and Physics-Grounded Motion Forecasting via Equation Discovery (Feng et al., 9 Jul 2025) derive motion trajectories or optical flow from black-box or analytic simulators, then use these as strong, low-level motion priors to steer diffusion models in latent or pixel space. The generator is thus “clamped” to physics-plausible motion while retaining generative flexibility for appearance and detail.
In PhysCtrl (Wang et al., 24 Sep 2025), a generative physics network modeled as a diffusion transformer predicts 3D point trajectories for diverse materials, explicitly conditioned on physical controls (e.g., external force, Young’s modulus). The predicted trajectories serve as control conditions for video diffusion, producing photorealistic yet physically-directed I2V generations.
| Method | Physics Plaus. Pref. (%) | Video Quality Pref. (%) |
|---|---|---|
| PhysCtrl | 81.0 | 66.0 |
| Wan2.1 | 10.2 | 18.3 |
| CogVideoX | 5.5 | 6.2 |
Such hybrid pipelines bridge the gap between simulative faithfulness and generative expressiveness.
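A toy example of the simulator-as-motion-prior idea above: an analytic trajectory converted into frame-to-frame displacements of the kind used to condition a generator. The projectile model stands in for the MPM or learned simulators these pipelines actually use:

```python
import numpy as np

def simulate_trajectory(p0, v0, g=9.8, dt=1/24, frames=24):
    """Analytic projectile trajectory under constant gravity, sampled at the
    video frame rate; returns (frames, 2) object-center positions."""
    t = np.arange(frames) * dt
    x = p0[0] + v0[0] * t
    y = p0[1] + v0[1] * t - 0.5 * g * t**2
    return np.stack([x, y], axis=1)

def trajectory_to_flow(traj):
    """Frame-to-frame displacements -- the low-level motion prior fed to a
    diffusion model's conditioning stream."""
    return np.diff(traj, axis=0)      # (frames - 1, 2)

traj = simulate_trajectory(p0=(0.0, 0.0), v0=(2.0, 4.0))
flow = trajectory_to_flow(traj)
assert flow.shape == (23, 2)
assert flow[0, 1] > flow[-1, 1]       # vertical velocity decreases under gravity
```

Because the prior constrains only coarse motion, the generator remains free to synthesize appearance, texture, and lighting around the prescribed trajectory.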
7. Limitations, Open Challenges, and Future Directions
While recent frameworks produce measurable advances, a number of challenges remain:
- Deformable and multi-body interactions: Most methods excel on rigid objects and Newtonian primitives but struggle with fluids, articulated motion, or dynamic materials, due to the complexity of modeling and reward definition (Zhang et al., 29 May 2025, Wang et al., 6 Nov 2025).
- High memory and computational costs: Some techniques require pre-trained physics encoders with large token dimensions (e.g., V-JEPA 2, 2048×1408; (Satish et al., 7 Jan 2026)), or reinforcement learning with large group sizes and LoRA adapters (Zhang et al., 16 Jan 2026, Cai et al., 31 Dec 2025).
- Automating reward and weighting selection: Scalar weights for TRD losses, reward model temperature, and margin parameters are ad-hoc and dataset-dependent (Zhang et al., 29 May 2025, Cai et al., 31 Dec 2025).
- Evaluation and benchmarking: Most evaluation is performed on recently curated datasets (e.g., VideoPhy, Physion, PhyGenBench, VideoPhy2), and precise, physics-centric large-scale benchmarks are still evolving.
- Model agnosticism and plug-and-play deployment: Preference-optimization and trajectory-guided architectures (PhysCorr (Wang et al., 6 Nov 2025), PhyGDPO (Cai et al., 31 Dec 2025), Physics-Grounded Motion Forecasting (Feng et al., 9 Jul 2025)) are notably architecture-neutral, which is a strength for broad adoption.
Future work will focus on integrating end-to-end differentiable simulators into the denoising loop, developing automatic physics-aware preference evaluators, extending methods to soft/elastic/viscoelastic/fluids, and improving the robustness and efficiency of large physics encoders. The field is converging on hybrid pipelines that couple generative expressiveness with explicit, verifiable physical priors to ensure that videos are not just semantically plausible and visually compelling, but also grounded in the fundamental dynamics of the physical world.