
SyncTrack4D: 4D Gaussian Splatting Pipeline

Updated 6 December 2025
  • SyncTrack4D is a pipeline for synchronized multi-view 4D Gaussian Splatting that aligns unsynchronized video sets using dense feature tracking and FGW optimal transport.
  • It integrates dense 4D track extraction, dynamic time warping, and continuous spline-based sub-frame synchronization to achieve precise temporal alignment.
  • Empirical evaluations demonstrate sub-frame synchronization accuracy (error <0.26 frames) and high-quality renderings, validated on both synthetic and real-world datasets.

SyncTrack4D is a pipeline for synchronized multi-view 4D Gaussian Splatting (4DGS) from unsynchronized monocular or multi-view video sets. It couples dense 4D track matching, cross-video temporal alignment, and explicit continuous 4D Gaussian scene representations, enabling high-fidelity dynamic scene reconstruction and sub-frame video synchronization without requiring object templates, prior models, or hardware triggers. The framework formalizes cross-video motion alignment using fused Gromov-Wasserstein (FGW) optimal transport and continuous-time trajectory parameterization to achieve robust sub-frame alignment and rendering of dynamic, real-world scenes (Lee et al., 3 Dec 2025).

1. Multi-Video Input and Pipeline Structure

SyncTrack4D operates on $V$ unsynchronized videos,

$$V^1 = \{I^1_1, \ldots, I^1_{T_1}\},\ \ldots,\ V^V = \{I^V_1, \ldots, I^V_{T_V}\},$$

with known camera intrinsics $K^v$ and extrinsics $P^v_t$. The pipeline consists of the following core stages:

  1. Dense 4D Feature Track Extraction: For each video, 2D pixel tracks (e.g., via SpatialTracker) and optical flows are lifted into initial monocular 4DGS models (based on MoSca). Dense per-video 4D tracks

$$\tau^v_i = \{ q^v_i(1), \ldots, q^v_i(T_v) \}$$

($q^v_i(t) \in \mathbb{R}^3$) are extracted, each with a fixed feature vector $f^v_i \in \mathbb{R}^F$ (typically DINOv3 descriptors). Additionally, anchor (scaffold) tracks $\hat{\tau}^v_j$ are identified for motion compression.

  2. Cross-Video Correspondence via FGW: Dense matching across videos is achieved by casting the problem as FGW optimal transport, using both feature similarity and the structure of track geometries.
  3. Coarse Global Frame-Level Temporal Alignment (DTW): Dynamic Time Warping is applied to the matched tracks to globally solve for discrete frame offsets per video.
  4. Sub-Frame Synchronization and Unified 4DGS: Continuous-time spline-based parameterization allows accurate fine alignment (sub-frame offsets) and fuses all video tracks into a single synchronized 4DGS scene, optimized via photometric and geometric objectives.
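The track-lifting in stage 1 relies on standard back-projection of pixel tracks into 3D. The sketch below is a minimal stand-in for the MoSca-based monocular lifting; the function name and array layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lift_track_to_3d(uv, depth, K, R, t):
    """Back-project a 2D pixel track (T, 2) with per-frame depth (T,) into
    world-space 3D points, given intrinsics K and a camera-to-world pose
    (rotation R, translation t). Illustrative stand-in for MoSca-based lifting."""
    uv1 = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)  # homogeneous pixels
    cam = depth[:, None] * (np.linalg.inv(K) @ uv1.T).T        # camera-space points
    return cam @ R.T + t                                       # world-space track
```

With identity pose, a pixel at the principal point back-projects straight along the optical axis to its depth.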

2. Dense 4D Tracks and Fused Gromov-Wasserstein Correspondence

Each per-video 4D track is

$$\tau^v_i = \{ q^v_i(1), \ldots, q^v_i(T_v) \}, \quad q^v_i(t) \in \mathbb{R}^3,$$

with a constant descriptor $f^v_i \in \mathbb{R}^F$ from DINOv3.

Pairwise FGW Matching:

For a pair of videos (reference $a$ with tracks $\{\tau^a_i\}_{i=1}^{N_a}$, query $b$ with tracks $\{\tau^b_j\}_{j=1}^{N_b}$):

  • Feature Cost: $C^{\mathrm{feat}}_{ij} = \| f^a_i - f^b_j \|^2$, the distance between the tracks' DINOv3 descriptors.
  • Intra-track Structure: pairwise geometric distance matrices $D^a_{ik}$ and $D^b_{jl}$ between tracks within each video.

FGW optimal transport solves:

$$\pi^\star = \arg\min_{\pi \in \Pi(\mu,\nu)}\; (1-\alpha) \sum_{i,j} C^{\mathrm{feat}}_{ij}\,\pi_{ij} \;+\; \alpha \sum_{i,j,k,l} \big| D^a_{ik} - D^b_{jl} \big|^2 \,\pi_{ij}\,\pi_{kl},$$

with entropic regularization $\varepsilon$ and uniform marginal weights $\mu_i = 1/N_a$, $\nu_j = 1/N_b$. In practice, Sinkhorn-style iterations (e.g., via the POT library) yield $\pi^\star$, from which top-$k$ correspondences are discretized via the Hungarian algorithm.

This formulation leverages both geometric and semantic information, enforcing both per-track similarity and matching of motion/scene structure.
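A simplified numerical sketch of this matching step (not the paper's implementation): it alternates between linearizing the Gromov term at the current coupling and running entropic Sinkhorn on the fused cost, with uniform marginals. The function names, the block-coordinate scheme, and the mean-distance structure matrices are illustrative assumptions; a production version would call POT's `fused_gromov_wasserstein` instead.

```python
import numpy as np

def fused_cost(C_feat, D_a, D_b, pi, alpha):
    # Linearize the Gromov term at coupling pi:
    # L[i, j] = sum_{k,l} |D_a[i,k] - D_b[j,l]|^2 * pi[k,l], fused with C_feat.
    n_a, n_b = C_feat.shape
    L = np.zeros_like(C_feat)
    for i in range(n_a):
        for j in range(n_b):
            L[i, j] = np.sum((D_a[i][:, None] - D_b[j][None, :]) ** 2 * pi)
    return (1 - alpha) * C_feat + alpha * L

def sinkhorn(C, eps, iters=200):
    # Entropic OT with uniform marginals; cost is scale-normalized for stability.
    n_a, n_b = C.shape
    mu, nu = np.full(n_a, 1.0 / n_a), np.full(n_b, 1.0 / n_b)
    K = np.exp(-C / (eps * C.max() + 1e-12))
    u = np.ones(n_a)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]

def fgw_match(feats_a, feats_b, tracks_a, tracks_b, alpha=0.5, eps=0.05, outer=10):
    # Feature cost: squared Euclidean distance between per-track descriptors.
    C_feat = ((feats_a[:, None, :] - feats_b[None, :, :]) ** 2).sum(-1)
    # Intra-video structure: mean pairwise distance between 3D tracks (N, T, 3).
    def struct(tr):
        return np.linalg.norm(tr[:, None] - tr[None, :], axis=-1).mean(-1)
    D_a, D_b = struct(tracks_a), struct(tracks_b)
    pi = np.full(C_feat.shape, 1.0 / C_feat.size)
    for _ in range(outer):  # block-coordinate: linearize Gromov term, then Sinkhorn
        pi = sinkhorn(fused_cost(C_feat, D_a, D_b, pi, alpha), eps)
    return pi
```

The returned coupling has (approximately) uniform marginals; discretizing its top entries with the Hungarian algorithm then yields hard correspondences.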

3. Temporal Alignment: Frame-Level and Sub-Frame

Global Frame-Level Alignment:

Introduce integer offsets $\Delta^v$ per video ($\Delta^1 = 0$ for the reference). Frame-to-frame geometric costs for each matched frame pair $(t, s)$ are computed as

$$C(t, s) = \frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} \big\| q^1_i(t) - q^v_j(s) \big\|,$$

where $\mathcal{M}$ denotes the set of matched track pairs. Dynamic Time Warping identifies a monotonic mapping between the two frame sequences to deduce the optimal shift $\Delta^v$, usually by selecting the most frequent offset along the DTW path.
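This step can be sketched as a standard DTW over the pairwise geometric cost matrix, recovering the dominant frame offset along the warping path (a minimal sketch; the cost matrix comes from the matched-track distances):

```python
import numpy as np

def dtw_frame_offset(cost):
    """Given cost[t, s] = geometric cost of aligning reference frame t with
    query frame s, run DTW and return the most frequent offset s - t along
    the optimal monotonic path (illustrative sketch)."""
    T, S = cost.shape
    acc = np.full((T + 1, S + 1), np.inf)
    acc[0, 0] = 0.0
    for t in range(1, T + 1):
        for s in range(1, S + 1):
            acc[t, s] = cost[t - 1, s - 1] + min(
                acc[t - 1, s - 1], acc[t - 1, s], acc[t, s - 1])
    # Backtrack the optimal path from the end.
    path, t, s = [], T, S
    while t > 0 and s > 0:
        path.append((t - 1, s - 1))
        step = int(np.argmin([acc[t - 1, s - 1], acc[t - 1, s], acc[t, s - 1]]))
        if step == 0:
            t, s = t - 1, s - 1
        elif step == 1:
            t -= 1
        else:
            s -= 1
    offsets, counts = np.unique([s - t for t, s in path], return_counts=True)
    return int(offsets[np.argmax(counts)])
```

On a cost matrix whose minima lie along $s = t + \Delta$, the mode of the path offsets recovers $\Delta$ even though the path is pinned to the matrix corners.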

Sub-Frame Synchronization and Spline-Based Trajectory Modeling:

To refine synchronization at the sub-frame level, each video's integer offset $\Delta^v$ is extended to a continuous offset $\delta^v \in \mathbb{R}$. Anchor (scaffold) tracks are parameterized as cubic Hermite splines:

$$\hat{q}^v_j(t) = \mathrm{Hermite}\big(t;\, \{p_{j,k}, m_{j,k}\}_k\big),$$

with control points $p_{j,k}$ and tangents $m_{j,k}$. This enables interpolation of trajectories at non-integer timesteps, facilitating gradient-based optimization of $\delta^v$ for sub-frame accuracy.

Leaf Gaussian trajectories $\mu_g(t)$ are modeled as linear blends of nearby spline anchors, yielding a globally consistent, smoothly time-varying 4D scene graph.
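Sub-frame interpolation of an anchor track can be illustrated with a standard cubic Hermite evaluation at a non-integer timestep (a generic sketch; the paper's knot placement and tangent estimation are not reproduced here):

```python
import numpy as np

def hermite_eval(knots, points, tangents, t):
    """Evaluate a cubic Hermite spline at a (possibly non-integer) time t.
    knots: increasing 1D array of control times; points/tangents: (K, 3)."""
    k = int(np.clip(np.searchsorted(knots, t) - 1, 0, len(knots) - 2))
    h = knots[k + 1] - knots[k]
    u = (t - knots[k]) / h  # normalized position within the segment
    h00 = 2 * u**3 - 3 * u**2 + 1   # standard Hermite basis polynomials
    h10 = u**3 - 2 * u**2 + u
    h01 = -2 * u**3 + 3 * u**2
    h11 = u**3 - u**2
    return (h00 * points[k] + h10 * h * tangents[k]
            + h01 * points[k + 1] + h11 * h * tangents[k + 1])
```

Because the evaluation is differentiable in $t$, querying tracks at $t + \delta^v$ admits gradient-based refinement of the sub-frame offset.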

4. 4D Gaussian Splatting Formulation and Joint Optimization

SyncTrack4D’s final stage constructs a unified multi-video 4DGS, with each Gaussian $g$ defined by:

  • Time-varying mean $\mu_g(t)$,
  • Covariance $\Sigma_g$,
  • Color $c_g$ and opacity $o_g$.

Its contribution at point $x$ and time $t$ is:

$$G_g(x, t) = o_g \exp\!\Big(\! -\tfrac{1}{2}\, \big(x - \mu_g(t)\big)^{\top} \Sigma_g^{-1} \big(x - \mu_g(t)\big) \Big).$$

Videos are rendered by splatting all Gaussians at times $t + \delta^v$ into view $v$ and comparing the rendered frames $\hat{I}^v_t$ to ground-truth frames $I^v_t$. The full loss is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{photo}} + \lambda_{\mathrm{ARAP}}\,\mathcal{L}_{\mathrm{ARAP}} + \lambda_{\mathrm{vel}}\,\mathcal{L}_{\mathrm{vel}} + \lambda_{\mathrm{acc}}\,\mathcal{L}_{\mathrm{acc}},$$

where:

  • $\mathcal{L}_{\mathrm{photo}}$: Photometric difference between rendered and ground-truth frames,
  • $\mathcal{L}_{\mathrm{ARAP}}$: As-rigid-as-possible regularization on the spline scaffold,
  • $\mathcal{L}_{\mathrm{vel}}$: Velocity smoothness,
  • $\mathcal{L}_{\mathrm{acc}}$: Acceleration smoothness.

Optimization variables include the Gaussian parameters $\{\mu_g, \Sigma_g, c_g, o_g\}$, spline control points, and the sub-frame offsets $\delta^v$, all jointly optimized with Adam or related optimizers.
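For concreteness, the per-Gaussian contribution and finite-difference smoothness penalties can be sketched as follows. This is a minimal numerical illustration: the actual losses operate on rendered images and the full scaffold, and the exact penalty forms shown are assumptions.

```python
import numpy as np

def gaussian_contribution(x, mu_t, cov, opacity):
    # G_g(x, t) = o_g * exp(-0.5 * (x - mu(t))^T Sigma^{-1} (x - mu(t)))
    d = x - mu_t
    return opacity * np.exp(-0.5 * d @ np.linalg.solve(cov, d))

def smoothness_losses(traj, dt=1.0):
    # Finite-difference velocity/acceleration penalties on a sampled
    # trajectory (T, 3); one common realization of L_vel and L_acc
    # (assumed forms, not the paper's exact definitions).
    vel = np.diff(traj, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    return float((vel ** 2).sum()), float((acc ** 2).sum())
```

At $x = \mu_g(t)$ the contribution equals the opacity, and a constant-velocity trajectory incurs zero acceleration penalty, which is why these terms favor smooth, temporally coherent motion.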

5. Empirical Evaluation

SyncTrack4D’s evaluation is conducted on:

  • CMU Panoptic Studio: Large-camera array, multi-human activities, providing challenging real-world dynamic scenes.
  • SyncNeRF Blender: Synthetic benchmark with 14 views and multiple objects (Box/Fox/Deer).

Test sequences are unsynchronized, with artificial offsets of up to 30 frames.

Results:

  • Synchronization Accuracy: Post-alignment, average error is less than 0.26 frames on Panoptic Studio (improved from over 5 frames before alignment).
  • Novel-View Synthesis:
    • Panoptic Studio: PSNR ≈ 26.3, SSIM ≈ 0.88, LPIPS ≈ 0.14.
    • Outperforms SyncNeRF in both synchronized and unsynchronized conditions.
  • Qualitative Observations: Yields temporally coherent 4D reconstructions with smooth cross-view motion; accurately preserves fine temporal details (e.g., fast object/human motions).
| Dataset | Synchronization Error (frames) | PSNR | SSIM | LPIPS |
|---|---|---|---|---|
| Panoptic Studio | <0.26 | ≈26.3 | ≈0.88 | ≈0.14 |
| SyncNeRF Blender | Not specified | — | — | — |

These results demonstrate sub-frame accuracy without hardware synchronization or predefined templates.

6. Technical Significance and Context

SyncTrack4D is, to date, the first general 4D Gaussian Splatting framework tailored for unsynchronized video sets without reliance on prior object models or explicit scene segmentations. The key innovation is leveraging dense 4D feature tracks and FGW optimal transport to drive both correspondence and alignment, enabling robust synchronization and scene consolidation across diverse, real-world scenarios. The method’s coarse-to-fine alignment cascade (DTW and continuous spline optimization) supports sub-frame precision in both frame association and 4DGS parameter estimation. This configuration yields high-fidelity, temporally coherent reconstructions suitable for dynamic scene renderings and further research into multi-view temporal alignment (Lee et al., 3 Dec 2025).
