Cam×Time Dataset for Space-Time Modeling
- The Cam×Time dataset is a synthetic, grid-structured video collection designed to benchmark space–time disentanglement in generative models.
- It employs full-coverage rendering with a 120×120 frame grid and distinct camera trajectories to enable precise control over spatial and temporal dynamics.
- The dataset supports tasks like re-timing, bullet-time effects, and arbitrary camera manipulation, advancing research in controllable video diffusion.
The Cam×Time dataset is a large-scale, synthetic, grid-structured collection of photorealistic videos designed explicitly for disentangling camera (space) and animation (time) control in generative video modeling. Developed as part of the SpaceTimePilot framework, Cam×Time is the first dataset to provide full-coverage supervision over both camera trajectories and temporal progressions, enabling benchmarking and development of video diffusion models that perform independent and explicit control over spatial and temporal dynamics. The dataset addresses a critical gap in prior resources, which lacked dense, paired coverage of dynamic scenes sampled freely across both axes, and is intended for training and evaluating models on tasks such as re-timing, bullet-time, arbitrary camera manipulation, and space–time disentanglement (Huang et al., 31 Dec 2025).
1. Motivation and Conceptual Foundations
Cam×Time was motivated by the absence of datasets supporting dense, paired supervision over dynamic scenes when sampled along both the camera (spatial) and animation (temporal) axes. Prior datasets, including ReCamMaster, SynCamMaster, and Kubric-4D, lacked the capability for arbitrary sampling in joint space and time; they typically provided only monotonic time sequences, or sparse coverage of the pose grid per sequence. Cam×Time was explicitly constructed to enable "full-coverage" rendering—a complete F × F grid, with F = 120 frames—so that any modeling algorithm can be trained and benchmarked on disentangling camera pose (c_f, the extrinsic at frame f) from animation time (t_j, the source frame index within a motion sequence). Typical tasks supported include controllable video manipulation scenarios such as re-timing, slow-motion, freeze-frames, and independent camera path generation (Huang et al., 31 Dec 2025).
2. Scene Composition and Assets
Cam×Time includes 100 distinct photorealistic environments sourced from commercially-licensed 3D asset packs, covering both indoor and outdoor scenarios. Each environment hosts approximately five unique character animations, totaling 500 animations, which are assembled from Mixamo and the HUMOTO motion-capture dataset (Lu et al., 2025). Materials are physically based (PBR), with manual refinement to enhance realism—such as correct reflectance, subsurface scattering, and cloth details. Lighting setups per environment use static HDR maps augmented by fill lights, ensuring the main animated subject is consistently well-lit and shadowed throughout each sequence. Scene complexity ranges from single-character locomotion to multi-actor object interactions and detailed gesture sequences. This variety provides the visual diversity and motion complexity required to benchmark general space–time disentanglement (Huang et al., 31 Dec 2025).
3. Camera Trajectories and Space–Time Grid Structure
For each animation within each scene, three distinct camera trajectories are defined (four in the appendix version), each spanning F = 120 frames:
- Rotational Orbits: 360° horizontal paths encircling the animated subject.
- Dolly Tracks: Linear translation of the camera.
- Bézier-Style Arcs: Smooth, parametrically defined combinations of pan, tilt, and translation.
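These trajectory archetypes are straightforward to express as per-frame extrinsics. The sketch below, assuming a standard look-at construction and the hypothetical helpers `look_at` and `orbit_trajectory` (not part of the dataset's tooling), shows how a 120-frame rotational orbit around a subject might be generated:

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """Build a 4x4 world-to-camera extrinsic looking from `eye` at `target`."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    E = np.eye(4)
    E[:3, :3] = np.stack([right, true_up, -forward])  # rows: camera axes
    E[:3, 3] = -E[:3, :3] @ eye                       # world origin in camera frame
    return E

def orbit_trajectory(center, radius, height, F=120):
    """360-degree horizontal orbit around `center`, one extrinsic per frame."""
    poses = []
    for f in range(F):
        theta = 2 * np.pi * f / F
        eye = center + np.array([radius * np.cos(theta),
                                 radius * np.sin(theta), height])
        poses.append(look_at(eye, center))
    return np.stack(poses)                            # shape (F, 4, 4)

traj = orbit_trajectory(center=np.zeros(3), radius=3.0, height=1.5)
```

Dolly tracks follow the same pattern with a linearly interpolated `eye`, and Bézier-style arcs replace the circle with a parametric curve over the frame index.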
Every camera path is represented as a sequence of extrinsic matrices c_1, …, c_F, forming a trajectory tensor C ∈ ℝ^(F×4×4). For each pair (c_f, t_j), i.e., for every desired camera pose at every animation time, the renderer generates a corresponding image I_{f,j}. This exhaustive cross-product constructs an F × F grid (specifically, 120 × 120 per animation / path), yielding dense, uniform coverage of camera pose and animation time combinations. Arbitrary sampling of this grid during training and evaluation allows for diverse benchmarking protocols, including extraction of "diagonal" trajectories (standard monotonic videos), non-monotonic re-timed sequences, and spatial warps (Huang et al., 31 Dec 2025).
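With the full grid rendered, trajectory extraction reduces to pure array indexing. A minimal illustration, with dummy 8×8 thumbnails standing in for rendered frames (the in-memory layout is an assumption, not the official storage format):

```python
import numpy as np

F = 120  # frames per camera path and per animation

# Hypothetical grid of rendered frames: entry (f, j) is the image taken from
# camera pose c_f while the animation is at source frame t_j.
grid = np.zeros((F, F, 8, 8, 3), dtype=np.uint8)

# A standard monotonic video is the grid's main diagonal: camera pose and
# animation time advance together, one step per output frame.
diagonal_video = grid[np.arange(F), np.arange(F)]    # shape (F, 8, 8, 3)

# A bullet-time sweep freezes animation time while the camera keeps moving:
bullet_time = grid[np.arange(F), 0]                  # t_j fixed at frame 0
```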
4. Temporal Sampling and Warping Mechanisms
Animation time in Cam×Time is parametrized by t_j ∈ {1, …, F} for each animation/trajectory. The dataset supports non-linear temporal manipulations through warping functions φ: {1, …, F} → {1, …, F}, enabling construction of target videos as
V[k] = I_{k, φ(k)}, for k = 1, …, F,
where k indexes the camera pose c_k along the output path and φ(k) selects the (possibly non-monotonic) animation frame t_{φ(k)}. Such augmentations allow generated videos to feature temporal phenomena beyond linear playback, such as:
- Reverse playback: φ(k) = F − k + 1
- Slow motion: segment-wise reduced rate via repeated frame sampling
- Freeze effects: φ(k) = k₀ for a constant k₀
- Zig-zag or accelerated motion
This flexibility teaches diffusion models to fully decouple content originating from animation time from the current camera pose—a core challenge in generative video modeling. Arbitrary (source, target) subsequences can thus be extracted efficiently from the rendered space–time grid (Huang et al., 31 Dec 2025).
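The warping functions above can be sketched in a few lines. The specific functions (`reverse`, `freeze`, `slow_motion`, `zigzag`) and their constants are illustrative choices, not taken from the paper; indices are 0-based here rather than the 1-based notation used above:

```python
import numpy as np

F = 120

def reverse(k):              # reverse playback
    return F - 1 - k

def freeze(k, k0=42):        # freeze on a constant source frame (k0 illustrative)
    return k0

def slow_motion(k, rate=2):  # each source frame repeated `rate` times
    return k // rate

def zigzag(k, period=30):    # oscillate forward and backward through time
    phase = k % (2 * period)
    return phase if phase < period else 2 * period - 1 - phase

def warp_indices(phi):
    """Map every output frame k to a source animation frame phi(k)."""
    idx = np.array([phi(k) for k in range(F)])
    assert ((0 <= idx) & (idx < F)).all()  # must stay inside the grid
    return idx

# e.g. pair a monotonic camera path with reversed animation time:
camera_idx = np.arange(F)
time_idx = warp_indices(reverse)
# video = grid[camera_idx, time_idx]  # sample the (camera, time) grid
```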
5. Dataset Statistics and Structure
Cam×Time achieves substantial scale and coverage as summarized below:
| Statistic | Value/Description | Note |
|---|---|---|
| Scenes | 100 | Photorealistic, diverse |
| Animations | 500 (≈5 per scene) | Mixamo + HUMOTO sourced |
| Camera paths/animation | 3 (main paper), 4 (appendix) | Multiple trajectory archetypes |
| Frames/path | 120 | Uniform grid |
| Total "diagonal" videos | 180,000 | |
| Frames/video | 120 | Each video is a monotonic path through grid |
| Image resolution | 1080 × 1080 px | PNG sequences |
| Data size | ~1 TB | 180,000 videos at 1080p |
| License | Academic/research-only | Commercial use restricted |
| Test split | Reserved from full set | For benchmarking |
Rendered sequences are provided as PNG images (per-frame), with corresponding JSON or NumPy files storing trajectory metadata (the per-frame extrinsics c_f) and temporal index arrays (the animation-time indices t_j). Depth images, segmentation masks, and optical flow are not explicitly announced as part of the official release, but are "trivial to export" at render time, given Blender provenance (Huang et al., 31 Dec 2025).
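A loader for such per-sequence metadata might look as follows. The file name (`trajectory.json`), the JSON keys, and the frame-naming pattern are hypothetical, since the official schema is not documented here:

```python
import json
import numpy as np

def load_sequence(seq_dir):
    """Load hypothetical per-sequence metadata: extrinsics, time indices, frames."""
    with open(f"{seq_dir}/trajectory.json") as fh:
        meta = json.load(fh)
    extrinsics = np.array(meta["extrinsics"])   # (F, 4, 4) camera poses c_f
    time_index = np.array(meta["time_index"])   # (F,) animation frames t_j
    # Assumed naming: one PNG per (camera frame, animation frame) pair.
    frame_paths = [f"{seq_dir}/{f:04d}_{j:04d}.png"
                   for f, j in zip(range(len(time_index)), time_index)]
    return extrinsics, time_index, frame_paths
```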
6. Rendering Pipeline and Validity Criteria
Rendering employs Blender (engine unspecified; plausibly either Cycles or Eevee), operating with high-fidelity, anti-aliased settings. Static HDRI environments and additional fill lighting guarantee subject visibility and shadow coherence. The pipeline incorporates automated validity checks to ensure camera paths neither intersect meshes nor cause subjects to leave the frame. Frames/trajectories violating these constraints are excluded. The dataset was rendered on a proprietary Adobe/Blender compute cluster, with further parameters such as samples-per-pixel and render times not disclosed (Huang et al., 31 Dec 2025).
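A subject-visibility check of the kind described can be sketched as below. The pinhole convention (+z forward, image coordinates normalized to [0, 1]) and both function names are assumptions, and the mesh-intersection test is omitted since it requires access to scene geometry:

```python
import numpy as np

def subject_in_frame(extrinsic, intrinsic, subject_points, margin=0.05):
    """Check that all subject keypoints project inside the image with a margin.

    extrinsic: (4, 4) world-to-camera matrix; intrinsic: (3, 3) pinhole K with
    image coordinates normalized to [0, 1]. Assumes +z is the viewing direction.
    """
    pts_h = np.concatenate([subject_points,
                            np.ones((len(subject_points), 1))], axis=1)
    cam = (extrinsic @ pts_h.T).T[:, :3]
    if (cam[:, 2] <= 0).any():                 # point behind the camera
        return False
    uv = (intrinsic @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                # perspective divide
    return bool(((uv > margin) & (uv < 1 - margin)).all())

def valid_path(trajectory, intrinsic, subject_points):
    """Reject a camera path if any frame loses the subject."""
    return all(subject_in_frame(E, intrinsic, subject_points)
               for E in trajectory)
```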
7. Significance, Use Cases, and Access
Cam×Time enables—for the first time—the training and robust benchmarking of controllable video diffusion models with independent navigation along spatial and temporal axes. The dense, paired structure is critical for tasks including, but not limited to: bullet-time effect synthesis, free-viewpoint re-timing, scene retargeting, and rigorous evaluation of space–time disentanglement. It serves as a designated benchmark split for the SpaceTimePilot model and, by extension, future research on controllable scene rendering. The dataset and associated code are scheduled for public release at https://zheninghuang.github.io/Space-Time-Pilot/, under a research-only license. Approximately 1 TB is required for full download; a portion of the environments and animations is reserved as a standard evaluation benchmark to ensure comparability in future work (Huang et al., 31 Dec 2025).