Cam×Time Dataset for Space-Time Modeling
- The Cam×Time dataset is a synthetic, grid-structured video collection designed to benchmark space–time disentanglement in generative models.
- It employs full-coverage rendering with a 120×120 frame grid and distinct camera trajectories to enable precise control over spatial and temporal dynamics.
- The dataset supports tasks like re-timing, bullet-time effects, and arbitrary camera manipulation, advancing research in controllable video diffusion.
The Cam×Time dataset is a large-scale, synthetic, grid-structured collection of photorealistic videos designed explicitly for disentangling camera (space) and animation (time) control in generative video modeling. Developed as part of the SpaceTimePilot framework, Cam×Time is the first dataset to provide full-coverage supervision over both camera trajectories and temporal progressions, enabling benchmarking and development of video diffusion models that perform independent and explicit control over spatial and temporal dynamics. The dataset addresses a critical gap in prior resources, which lacked dense, paired coverage of dynamic scenes sampled freely across both axes, and is intended for training and evaluating models on tasks such as re-timing, bullet-time, arbitrary camera manipulation, and space–time disentanglement (Huang et al., 31 Dec 2025).
1. Motivation and Conceptual Foundations
Cam×Time was motivated by the absence of datasets supporting dense, paired supervision over dynamic scenes when sampled along both the camera (spatial) and animation (temporal) axes. Prior datasets, including ReCamMaster, SynCamMaster, and Kubric-4D, lacked the capability for arbitrary sampling in joint space and time; they typically provided only monotonic time sequences, or sparse coverage of the pose grid per sequence. Cam×Time was explicitly constructed to enable "full-coverage" rendering—a complete F × F grid, with F = 120 frames—so that any modeling algorithm can be trained and benchmarked on disentangling camera pose (c_f, the extrinsic at frame f) from animation time (t_j, the source frame index within a motion sequence). Typical tasks supported include controllable video manipulation scenarios such as re-timing, slow-motion, freeze-frames, and independent camera path generation (Huang et al., 31 Dec 2025).
2. Scene Composition and Assets
Cam×Time includes 100 distinct photorealistic environments sourced from commercially-licensed 3D asset packs, covering both indoor and outdoor scenarios. Each environment hosts approximately five unique character animations, totaling 500 animations, which are assembled from Mixamo and the HUMOTO motion-capture dataset (Lu et al., 2025). Materials are physically based (PBR), with manual refinement to enhance realism—such as correct reflectance, subsurface scattering, and cloth details. Lighting setups per environment use static HDR maps augmented by fill lights, ensuring the main animated subject is consistently well-lit and shadowed throughout each sequence. Scene complexity ranges from single-character locomotion to multi-actor object interactions and detailed gesture sequences. This variety provides the visual diversity and motion complexity required to benchmark general space–time disentanglement (Huang et al., 31 Dec 2025).
3. Camera Trajectories and Space–Time Grid Structure
For each animation within each scene, three distinct camera trajectories are defined (four in the appendix version), each spanning F = 120 frames:
- Rotational Orbits: 360° horizontal paths encircling the animated subject.
- Dolly Tracks: Linear translation of the camera.
- Bézier-Style Arcs: Smooth, parametrically defined combinations of pan, tilt, and translation.
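These trajectory archetypes are straightforward to express as per-frame extrinsics. The sketch below, assuming a standard look-at construction and the hypothetical helpers `look_at` and `orbit_trajectory` (not part of the dataset's tooling), shows how a 120-frame rotational orbit around a subject might be generated:

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """Build a 4x4 world-to-camera extrinsic looking from `eye` at `target`."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    E = np.eye(4)
    E[:3, :3] = np.stack([right, true_up, -forward])  # rows: camera axes
    E[:3, 3] = -E[:3, :3] @ eye                       # world origin in camera frame
    return E

def orbit_trajectory(center, radius, height, F=120):
    """360-degree horizontal orbit around `center`, one extrinsic per frame."""
    poses = []
    for f in range(F):
        theta = 2 * np.pi * f / F
        eye = center + np.array([radius * np.cos(theta),
                                 radius * np.sin(theta), height])
        poses.append(look_at(eye, center))
    return np.stack(poses)                            # shape (F, 4, 4)

traj = orbit_trajectory(center=np.zeros(3), radius=3.0, height=1.5)
```

Dolly tracks follow the same pattern with a linearly interpolated `eye`, and Bézier-style arcs replace the circle with a parametric curve over the frame index.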
Every camera path is represented as a sequence of extrinsic matrices c_1, …, c_F, forming a trajectory tensor C ∈ ℝ^(F×4×4). For each pair (c_f, t_j), i.e., for every desired camera pose at every animation time, the renderer generates a corresponding image I_{f,j}. This exhaustive cross-product constructs an F × F grid (specifically, 120 × 120 per animation / path), yielding dense, uniform coverage of camera pose and animation time combinations. Arbitrary sampling of this grid during training and evaluation allows for diverse benchmarking protocols, including extraction of "diagonal" trajectories (standard monotonic videos), non-monotonic re-timed sequences, and spatial warps (Huang et al., 31 Dec 2025).
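With the full grid rendered, trajectory extraction reduces to pure array indexing. A minimal illustration, with dummy 8×8 thumbnails standing in for rendered frames (the in-memory layout is an assumption, not the official storage format):

```python
import numpy as np

F = 120  # frames per camera path and per animation

# Hypothetical grid of rendered frames: entry (f, j) is the image taken from
# camera pose c_f while the animation is at source frame t_j.
grid = np.zeros((F, F, 8, 8, 3), dtype=np.uint8)

# A standard monotonic video is the grid's main diagonal: camera pose and
# animation time advance together, one step per output frame.
diagonal_video = grid[np.arange(F), np.arange(F)]    # shape (F, 8, 8, 3)

# A bullet-time sweep freezes animation time while the camera keeps moving:
bullet_time = grid[np.arange(F), 0]                  # t_j fixed at frame 0
```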
4. Temporal Sampling and Warping Mechanisms
Animation time in Cam×Time is parametrized by t_j ∈ {1, …, F} for each animation/trajectory. The dataset supports non-linear temporal manipulations through warping functions φ: {1, …, F} → {1, …, F}, enabling construction of target videos as
V[k] = I_{k, φ(k)}, for k = 1, …, F,
where k indexes the camera pose c_k along the output path and φ(k) selects the (possibly non-monotonic) animation frame t_{φ(k)}. Such augmentations allow generated videos to feature temporal phenomena beyond linear playback, such as:
- Reverse playback: φ(k) = F − k + 1
- Slow motion: segment-wise reduced rate via repeated frame sampling
- Freeze effects: φ(k) = k₀ for a constant k₀
- Zig-zag or accelerated motion
This flexibility teaches diffusion models to fully decouple content originating from animation time from the current camera pose—a core challenge in generative video modeling. Arbitrary (source, target) subsequences can thus be extracted efficiently from the rendered space–time grid (Huang et al., 31 Dec 2025).
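The warping functions above can be sketched in a few lines. The specific functions (`reverse`, `freeze`, `slow_motion`, `zigzag`) and their constants are illustrative choices, not taken from the paper; indices are 0-based here rather than the 1-based notation used above:

```python
import numpy as np

F = 120

def reverse(k):              # reverse playback
    return F - 1 - k

def freeze(k, k0=42):        # freeze on a constant source frame (k0 illustrative)
    return k0

def slow_motion(k, rate=2):  # each source frame repeated `rate` times
    return k // rate

def zigzag(k, period=30):    # oscillate forward and backward through time
    phase = k % (2 * period)
    return phase if phase < period else 2 * period - 1 - phase

def warp_indices(phi):
    """Map every output frame k to a source animation frame phi(k)."""
    idx = np.array([phi(k) for k in range(F)])
    assert ((0 <= idx) & (idx < F)).all()  # must stay inside the grid
    return idx

# e.g. pair a monotonic camera path with reversed animation time:
camera_idx = np.arange(F)
time_idx = warp_indices(reverse)
# video = grid[camera_idx, time_idx]  # sample the (camera, time) grid
```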
5. Dataset Statistics and Structure
Cam×Time achieves substantial scale and coverage as summarized below:
| Statistic | Value/Description | Note |
|---|---|---|
| Scenes | 100 | Photorealistic, diverse |
| Animations | 500 (≈5 per scene) | Mixamo + HUMOTO sourced |
| Camera paths/animation | 3 (main paper), 4 (appendix) | Multiple trajectory archetypes |
| Frames/path | 120 | Uniform grid |
| Total "diagonal" videos | 180,000 | |
| Frames/video | 120 | Each video is a monotonic path through grid |
| Image resolution | 1080 × 1080 px | PNG sequences |
| Data size | ~1 TB | 180,000 videos at 1080p |
| License | Academic/research-only | Commercial use restricted |
| Test split | Reserved from full set | For benchmarking |
Rendered sequences are provided as PNG images (per-frame), with corresponding JSON or NumPy files storing trajectory metadata (the per-frame extrinsics c_f) and temporal index arrays (the animation-time indices t_j). Depth images, segmentation masks, and optical flow are not explicitly announced as part of the official release, but are "trivial to export" at render time, given Blender provenance (Huang et al., 31 Dec 2025).
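A loader for such per-sequence metadata might look as follows. The file name (`trajectory.json`), the JSON keys, and the frame-naming pattern are hypothetical, since the official schema is not documented here:

```python
import json
import numpy as np

def load_sequence(seq_dir):
    """Load hypothetical per-sequence metadata: extrinsics, time indices, frames."""
    with open(f"{seq_dir}/trajectory.json") as fh:
        meta = json.load(fh)
    extrinsics = np.array(meta["extrinsics"])   # (F, 4, 4) camera poses c_f
    time_index = np.array(meta["time_index"])   # (F,) animation frames t_j
    # Assumed naming: one PNG per (camera frame, animation frame) pair.
    frame_paths = [f"{seq_dir}/{f:04d}_{j:04d}.png"
                   for f, j in zip(range(len(time_index)), time_index)]
    return extrinsics, time_index, frame_paths
```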
6. Rendering Pipeline and Validity Criteria
Rendering employs Blender (engine unspecified; plausibly either Cycles or Eevee), operating with high-fidelity, anti-aliased settings. Static HDRI environments and additional fill lighting guarantee subject visibility and shadow coherence. The pipeline incorporates automated validity checks to ensure camera paths neither intersect meshes nor cause subjects to leave the frame. Frames/trajectories violating these constraints are excluded. The dataset was rendered on a proprietary Adobe/Blender compute cluster, with further parameters such as samples-per-pixel and render times not disclosed (Huang et al., 31 Dec 2025).
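A subject-visibility check of the kind described can be sketched as below. The pinhole convention (+z forward, image coordinates normalized to [0, 1]) and both function names are assumptions, and the mesh-intersection test is omitted since it requires access to scene geometry:

```python
import numpy as np

def subject_in_frame(extrinsic, intrinsic, subject_points, margin=0.05):
    """Check that all subject keypoints project inside the image with a margin.

    extrinsic: (4, 4) world-to-camera matrix; intrinsic: (3, 3) pinhole K with
    image coordinates normalized to [0, 1]. Assumes +z is the viewing direction.
    """
    pts_h = np.concatenate([subject_points,
                            np.ones((len(subject_points), 1))], axis=1)
    cam = (extrinsic @ pts_h.T).T[:, :3]
    if (cam[:, 2] <= 0).any():                 # point behind the camera
        return False
    uv = (intrinsic @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                # perspective divide
    return bool(((uv > margin) & (uv < 1 - margin)).all())

def valid_path(trajectory, intrinsic, subject_points):
    """Reject a camera path if any frame loses the subject."""
    return all(subject_in_frame(E, intrinsic, subject_points)
               for E in trajectory)
```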
7. Significance, Use Cases, and Access
Cam×Time enables—for the first time—the training and robust benchmarking of controllable video diffusion models with independent navigation along spatial and temporal axes. The dense, paired structure is critical for tasks including, but not limited to: bullet-time effect synthesis, free-viewpoint re-timing, scene retargeting, and rigorous evaluation of space–time disentanglement. It serves as a designated benchmark split for the SpaceTimePilot model and, by extension, future research on controllable scene rendering. The dataset and associated code are scheduled for public release at https://zheninghuang.github.io/Space-Time-Pilot/, under a research-only license. Approximately 1 TB is required for full download; a portion of the environments and animations is reserved as a standard evaluation benchmark to ensure comparability in future work (Huang et al., 31 Dec 2025).