RealCam-I2V: Diffusion Video Synthesis
- The paper introduces RealCam-I2V, a diffusion-based video generation framework that leverages metric monocular depth for robust 3D scene reconstruction and absolute-scale camera trajectory extraction.
- It employs a novel methodology that integrates single-image depth estimation with absolute camera parameter extraction to map relative camera movements onto a metric scale.
- The framework enhances video quality through scene-constrained noise shaping and an interactive 3D trajectory-drawing interface, bridging academic innovation with practical video synthesis applications.
RealCam-I2V is a diffusion-based video generation framework that addresses the challenge of precise, real-world image-to-video (I2V) synthesis with complex, interactive camera control. It is distinguished by its integration of metric monocular depth estimation for single-image 3D scene reconstruction and its ability to map relative camera parameters to an absolute, metric scale. RealCam-I2V offers a 3D interactive trajectory-drawing interface and introduces a targeted noise-shaping strategy to enhance camera controllability and video quality, with demonstrated improvements on both benchmark and out-of-domain imagery (Li et al., 14 Feb 2025).
1. Metric Depth-Based 3D Scene Reconstruction
RealCam-I2V adopts a monocular metric depth predictor, specifically the "Depth Anything V2 (Large Indoor)" model, finetuned for depths up to 20 meters. From an input image $I$, the network yields a per-pixel metric depth map $D(u, v)$. Given camera intrinsics $K$, each pixel $(u, v)$ is projected into the camera's 3D coordinate system:

$$P(u, v) = D(u, v)\, K^{-1} [u, v, 1]^{\top}.$$

This procedure yields a dense, metric-scaled point cloud for the entire scene, establishing real-world 3D geometry from a single image, which is essential for both interactive camera control and absolute-scale consistency across sessions and images.
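The back-projection step above can be sketched as follows (a minimal NumPy illustration of the standard pinhole unprojection $P = D\,K^{-1}[u, v, 1]^{\top}$, not the paper's actual implementation; the function name is hypothetical):

```python
import numpy as np

def backproject_to_pointcloud(depth, K):
    """Unproject a metric depth map into a camera-frame point cloud.

    depth: (H, W) metric depth in meters (e.g. from a monocular
           metric depth model such as Depth Anything V2).
    K:     (3, 3) camera intrinsics.
    Returns an (H*W, 3) array of points P = D(u, v) * K^{-1} [u, v, 1]^T.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates [u, v, 1] for every pixel.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # K^{-1} [u, v, 1]^T per pixel (z = 1)
    return rays * depth.reshape(-1, 1)       # scale each ray by its metric depth
```

Because the rays are normalized to unit depth before scaling, the z-coordinate of each output point equals the predicted metric depth at that pixel.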
2. Absolute-Scale Camera Parameter Extraction
A key advancement of RealCam-I2V is its ability to reconstruct absolute (metric) camera trajectories. Conventional datasets such as RealEstate10K provide only relative-scale world-to-camera transformations $[R_i \mid t_i]$. To achieve metric compatibility, RealCam-I2V:
- Inverts each world-to-camera matrix to camera-to-world form: $[R_i \mid t_i]^{-1} = [R_i^{\top} \mid -R_i^{\top} t_i]$.
- Rebases trajectories so that the first camera is the identity, extracting the relative rotation $R_i^{\mathrm{rel}}$ and translation $t_i^{\mathrm{rel}}$ of each frame with respect to the first.
- Computes a per-clip scale factor $\alpha$ by median-matching distances between corresponding metric and Structure-from-Motion (COLMAP) point clouds: $\alpha = \mathrm{median}_j\big(d_j^{\mathrm{metric}} / d_j^{\mathrm{SfM}}\big)$.
- Scales the translation part: $t_i^{\mathrm{rel}} \leftarrow \alpha\, t_i^{\mathrm{rel}}$.
The resulting sequence $\{(R_i^{\mathrm{rel}}, \alpha\, t_i^{\mathrm{rel}})\}$ encodes camera trajectories at the correct physical scale, ensuring that trajectories specified via data or user input are meaningful and consistent across varying sources.
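The invert–rebase–scale pipeline can be sketched as follows (a minimal NumPy illustration under the conventions described above; the function name and the assumption that metric and SfM points are in one-to-one correspondence are simplifications):

```python
import numpy as np

def rebase_and_scale(w2c_list, metric_pts, sfm_pts):
    """Rebase relative-scale poses and apply a per-clip metric scale.

    w2c_list:   list of (4, 4) world-to-camera matrices [R_i | t_i].
    metric_pts: (N, 3) points from the metric depth back-projection.
    sfm_pts:    (N, 3) corresponding SfM (relative-scale) points.
    Returns a list of (4, 4) camera-to-world poses with the first camera
    at the identity and translations in meters.
    """
    # 1. Invert world-to-camera to camera-to-world.
    c2w = [np.linalg.inv(T) for T in w2c_list]
    # 2. Rebase so the first camera is the identity.
    base_inv = np.linalg.inv(c2w[0])
    rel = [base_inv @ T for T in c2w]
    # 3. Per-clip scale: median ratio of metric to SfM point distances.
    alpha = np.median(np.linalg.norm(metric_pts, axis=1) /
                      np.linalg.norm(sfm_pts, axis=1))
    # 4. Scale only the translation component of each pose.
    out = []
    for T in rel:
        S = T.copy()
        S[:3, 3] *= alpha
        out.append(S)
    return out
```

The median ratio makes the scale estimate robust to outlier correspondences, which is why median-matching rather than mean-matching is the natural choice here.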
3. Diffusion-Based Video Generation and Conditioning
Video generation is formulated as conditional denoising diffusion in latent space. Latent videos $z_0$, of shape $F \times C \times H \times W$ (frames, channels, height, width), are perturbed by a forward noising process $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. Model training targets prediction of the injected noise $\epsilon$, with loss

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t} \big\| \epsilon_\theta(z_t, t, c_{\mathrm{txt}}, c_{\mathrm{img}}, c_{\mathrm{cam}}) - \epsilon \big\|^2 .$$

Conditioning inputs are:
- $c_{\mathrm{txt}}$: optional text prompts
- $c_{\mathrm{img}}$: image embedding of the reference frame
- $c_{\mathrm{cam}}$: absolute-scale camera trajectory
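The training objective can be sketched in the standard ε-prediction form (a minimal NumPy illustration, not the paper's training code; the real model additionally consumes the text, image, and camera conditions listed above):

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_training_step(z0, alpha_bar_t, eps_model):
    """One epsilon-prediction training step (standard DDPM-style objective).

    z0:          clean video latent, shape (F, C, H, W).
    alpha_bar_t: cumulative noise-schedule value in (0, 1) for timestep t.
    eps_model:   callable predicting the injected noise from the noisy latent
                 (the actual model also takes text/image/camera conditions).
    Returns the scalar MSE loss between predicted and true noise.
    """
    eps = rng.standard_normal(z0.shape)                            # noise ~ N(0, I)
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1 - alpha_bar_t) * eps
    return float(np.mean((eps_model(z_t) - eps) ** 2))
```

A trivial zero predictor yields a loss near 1 (the variance of the injected Gaussian noise), which is a quick sanity check on the implementation.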
At inference, the user interacts with the reconstructed 3D point cloud through a lightweight 3D viewer, drawing camera-trajectory keyframes, which are interpolated (in SE(3), via slerp for rotation and linear interpolation for translation) to match the desired video length.
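The keyframe interpolation can be sketched as follows (a self-contained NumPy illustration of slerp-plus-linear-translation via the SO(3) log/exp maps; the function names are hypothetical and this is not the viewer's actual code):

```python
import numpy as np

def log_so3(R):
    """Rotation matrix -> axis-angle vector (SO(3) log map)."""
    cos = np.clip((np.trace(R) - 1) / 2, -1.0, 1.0)
    theta = np.arccos(cos)
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2 * np.sin(theta)) * w

def exp_so3(w):
    """Axis-angle vector -> rotation matrix (Rodrigues formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def interpolate_keyframes(poses, n_frames):
    """Interpolate (R, t) keyframes: slerp on rotations, linear on translations."""
    Rs = [p[0] for p in poses]
    ts = [np.asarray(p[1], dtype=float) for p in poses]
    out = []
    for s in np.linspace(0, len(poses) - 1, n_frames):
        i = min(int(s), len(poses) - 2)   # keyframe segment index
        f = s - i                          # fraction within the segment
        # Slerp: geodesic interpolation of the relative rotation.
        R = Rs[i] @ exp_so3(f * log_so3(Rs[i].T @ Rs[i + 1]))
        out.append((R, (1 - f) * ts[i] + f * ts[i + 1]))
    return out
```

Interpolating the relative rotation along the SO(3) geodesic gives constant angular velocity between keyframes, which is what slerp provides over naive per-entry matrix interpolation.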
4. Scene-Constrained Noise Shaping
RealCam-I2V proposes scene-constrained noise shaping to enforce accurate camera motion and geometry during the initial, high-noise stages of denoising:
- A static preview video is synthesized for all frames under the user-specified trajectory (without introducing temporal dynamics), producing latents $z_t^{\mathrm{preview}}$.
- For each generated frame, a binary visibility and depth-accuracy mask $M$ is created, excluding pixels near depth discontinuities.
- The noise-shaping operation blends the preview and predicted latents: $z_t \leftarrow M \odot z_t^{\mathrm{preview}} + (1 - M) \odot z_t$.
- Applied only in the highest 5–10% of the noise schedule (i.e., when $t/T \gtrsim 0.9$), this step rigidly establishes scene structure before the denoising process freely refines dynamic content at lower noise levels.
This mechanism ensures that early generative steps are clamped to plausible scene geometry and camera motion, mitigating drift and inconsistent camera behaviors.
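The blending step can be sketched as follows (a minimal NumPy illustration of masked latent blending gated on the noise level; the function name and the 0.9 threshold, taken from the stated 5–10% range, are assumptions):

```python
import numpy as np

def shape_noise(z_t, z_preview_t, mask, t, T, threshold=0.9):
    """Scene-constrained noise shaping, applied at high-noise steps only.

    z_t:         current noisy latent for one frame, shape (C, H, W).
    z_preview_t: latent of the static preview video at the same noise level,
                 rendered under the user-specified trajectory.
    mask:        (H, W) binary visibility / depth-accuracy mask
                 (0 near depth discontinuities and occluded regions).
    Blends the preview into the prediction only while t/T exceeds the
    threshold (the top ~10% of the schedule, per the paper's 5-10% range).
    """
    if t / T < threshold:
        return z_t                        # low-noise steps: leave untouched
    m = mask[None, ...]                   # broadcast mask over channels
    return m * z_preview_t + (1 - m) * z_t
```

Gating on the noise level is what lets the scene geometry be clamped early while leaving the later, low-noise steps free to synthesize dynamics.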
5. Experimental Protocol and Comparative Results
Training and evaluation employ the RealEstate10K dataset, filtered for scale consistency, yielding 58K training and 6K testing clips. The system operates on 16-frame video snippets, using a frozen DynamiCrafter backbone and 50K training steps on multi-GPU hardware.
Performance metrics (averaged over five COLMAP or GLOMAP runs per test sample) include:
- RotErr: rotation error between generated and ground-truth camera poses (radians)
- TransErr: translation error (metric)
- CamMC: camera motion consistency error
- FVD: Fréchet Video Distance to ground truth
Table of results excerpt:
| Method | RotErr ↓ | TransErr ↓ | CamMC ↓ | FVD ↓ |
|---|---|---|---|---|
| DynamiCrafter | 3.34 | 14.14 | 15.73 | 106.0 |
| + MotionCtrl | 1.04 | 5.85 | 6.26 | 67.3 |
| + CameraCtrl* | 0.74 | 5.51 | 5.76 | 69.2 |
| + CamI2V* | 0.50 | 3.41 | 3.58 | 63.9 |
| + RealCam-I2V | 0.41 | 2.47 | 2.61 | 55.2 |
(*reproduced baselines)
RealCam-I2V reduces absolute translation error by roughly 28% (3.41 to 2.47) and camera motion error by 27% (3.58 to 2.61), and improves video quality (FVD) by about 14% (63.9 to 55.2), compared to its strongest prior, CamI2V.
Ablation studies isolate the benefits: absolute scale training reduces drift and enhances camera accuracy, while noise shaping improves trajectory adherence but can degrade scene dynamics if over-applied. Their combination yields optimal joint camera–scene fidelity.
6. Interactive Applications and Usability Enhancements
The interactive 3D trajectory interface allows users to specify camera paths by direct manipulation within the reconstructed scene. Generated applications demonstrated include:
- Camera-controlled loop generation: Specifying cyclic trajectories to create seamless video loops.
- Generative frame interpolation: Conditioning the model on two scene-aligned keyframes and interpolating intermediate frames with accurate camera motion.
- Long-video continuation: Compositing multiple trajectory-conditioned snippets to synthesize extended sequences.
The design, especially the metric scene reconstruction and interactive control, enables consistent performance for arbitrary real-world or out-of-domain images and broadens practical usability beyond the limits of relative-scale or text-conditioned I2V pipelines.
7. Impact, Generalization, and Future Directions
RealCam-I2V bridges the disconnect between relative-scale, academic trajectory control and metric-accurate, user-facing workflows in video synthesis. By leveraging single-image depth to obtain true-scale 3D structure and aligning all stages of model operation to this frame, it paves the way for scalable, robust trajectory control without requiring expert knowledge of scene geometry.
Limitations include the dependency on depth estimator quality, challenges in extremely dynamic or unstructured scenes, and the memory/performance trade-offs at higher resolutions or frame counts. Anticipated future work includes public release of annotations, code, and checkpoints, as well as investigation into more advanced constraint mechanisms, multimodal controls, and further reduction of domain gaps (Li et al., 14 Feb 2025).
RealCam-I2V’s modular architecture, with targeted improvements in camera controllability, geometric consistency, and usability, sets a benchmark for real-world, interactive image-to-video generation pipelines.