
Freetime FeatureGS: Dynamic 4D Scene Reconstruction

Updated 4 January 2026
  • Freetime FeatureGS is a 4D scene representation that decomposes dynamic scenes into Gaussian primitives with learnable instance features.
  • It integrates linear spatio–temporal motion, contrastive instance feature learning, and streaming optimization to ensure robust segmentation.
  • Empirical evaluations on multi-view dynamic video benchmarks demonstrate state-of-the-art mIoU and recall compared to tracker-based methods.

Freetime FeatureGS is a dynamic 4D scene representation for decomposed reconstruction, extending Gaussian Splatting frameworks to fully spatio–temporal modeling with consistent instance feature learning. It unifies per-instance feature propagation, linear spatio–temporal motion, and temporally ordered streaming optimization to achieve robust 4D segmentation without reliance on explicit video or tracker-based segmentation, yielding state-of-the-art performance on dynamic multi-view video benchmarks (Hu et al., 28 Dec 2025).

1. Representation: 4D Gaussian Primitives with Instance Features

Freetime FeatureGS models a dynamic scene as a set of $N$ four-dimensional Gaussian primitives $\{g_i \mid i = 1, \ldots, N\}$, where each primitive $g_i$ is parameterized by

$$g_i = \bigl(\mu_i,\ s_i,\ \alpha_i,\ q_{l,i},\ q_{r,i},\ h_i,\ v_i,\ f_i\bigr)$$

  • $\mu_i \in \mathbb{R}^4$: spatio–temporal center $(x, y, z, t)$
  • $s_i \in \mathbb{R}^4$: 4D scale
  • $\alpha_i \in \mathbb{R}$: peak opacity
  • $q_{l,i}, q_{r,i} \in \mathbb{R}^4$: view-dependent appearance quaternions
  • $h_i \in \mathbb{R}^m$: spherical-harmonic color coefficients
  • $v_i \in \mathbb{R}^3$: constant spatial velocity
  • $f_i \in \mathbb{R}^d$: learnable instance feature vector

The linear motion law that places each primitive in 4D and defines its trajectory is

$$\mu_{x,i}(t) = \mu_{x,i} + v_i\,(t - \mu_{t,i})$$

At each observation, features are rendered into a 2D feature map via differentiable splatting with depth-sorted compositing:

$$F_s(u) = \sum_{i \in \mathcal{N}(u)} f_i\, \alpha_i\, T_i, \qquad T_i = \prod_{j \in \mathcal{N}(u),\ j < i} (1 - \alpha_j)$$

This mechanism extends directly to prediction of density or radiance by substituting the relevant attribute for $f_i$.
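As a concrete illustration, the motion law and the compositing rule above can be sketched in a few lines of NumPy (a minimal sketch; names such as `position_at` and `composite_features` are illustrative, not from the paper):

```python
import numpy as np

def position_at(mu_x, v, mu_t, t):
    """Linear motion law: mu_x,i(t) = mu_x,i + v_i * (t - mu_t,i)."""
    return mu_x + v * (t - mu_t)

def composite_features(alphas, feats):
    """Front-to-back compositing F_s(u) = sum_i f_i * alpha_i * T_i,
    with transmittance T_i = prod_{j<i} (1 - alpha_j) over
    depth-sorted primitives covering one pixel."""
    F = np.zeros(feats.shape[1])
    T = 1.0  # transmittance accumulated so far
    for a, f in zip(alphas, feats):
        F += f * a * T
        T *= 1.0 - a
    return F

# two depth-sorted primitives covering one pixel, 2-D instance features
alphas = np.array([0.5, 0.5])
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
print(composite_features(alphas, feats))  # [0.5  0.25]
```

The second primitive's contribution is attenuated by the first one's opacity, which is exactly what makes the rendered feature map depth-aware.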

2. Instance Feature Learning and Contrastive Loss

For each image $(s, t)$ with segmentation mask $M_{s,t}$ containing $N_\mathrm{inst}$ masks:

  • $N_s$ pixels are sampled per mask.
  • The rendered primitive features $\{f_i^j\}_{j=1}^{N_s}$ define the instance feature center:

$$\bar{f}_i = \frac{1}{N_s} \sum_{j=1}^{N_s} f_i^j$$

With a per-instance temperature

$$\phi_i = \frac{1}{N_s \log (N_s + 10)} \sum_{j=1}^{N_s} \| f_i^j - \bar{f}_i \|^2$$

the contrastive loss is

$$\mathcal{L}_\mathrm{CC} = -\frac{1}{N_\mathrm{inst} N_s} \sum_{i=1}^{N_\mathrm{inst}} \sum_{j=1}^{N_s} \log \frac{\exp\bigl( f_i^j \cdot \bar{f}_i / \phi_i \bigr)}{\sum_{k=1}^{N_\mathrm{inst}} \exp\bigl( f_i^j \cdot \bar{f}_k / \phi_k \bigr)}$$

This InfoNCE-based loss structure enforces that features within the same 2D mask are pulled together, while features across different masks are pushed apart, promoting instance-level feature clustering that is temporally stable.
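To make the loss concrete, here is a minimal NumPy sketch following the definitions above (the function name `contrastive_loss` and the small-temperature guard are illustrative additions, not from the paper):

```python
import numpy as np

def contrastive_loss(samples):
    """Mask-level InfoNCE with adaptive temperature phi_i.
    samples: one (N_s, d) array of rendered features per instance mask."""
    centers, phis = [], []
    for F in samples:
        c = F.mean(axis=0)  # instance feature center f_bar_i
        n = len(F)
        phi = np.sum(np.linalg.norm(F - c, axis=1) ** 2) / (n * np.log(n + 10))
        centers.append(c)
        phis.append(max(phi, 1e-8))  # guard against zero temperature
    centers, phis = np.stack(centers), np.array(phis)

    loss, count = 0.0, 0
    for i, F in enumerate(samples):
        logits = (F @ centers.T) / phis              # (N_s, N_inst) scaled similarities
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        log_prob = logits[:, i] - np.log(np.exp(logits).sum(axis=1))
        loss -= log_prob.sum()
        count += len(F)
    return loss / count

# well-separated instance features give near-zero loss; overlapping ones do not
well_sep = [np.tile([1.0, 0.0], (4, 1)), np.tile([0.0, 1.0], (4, 1))]
overlap = [np.array([[1.0, 0.0], [0.0, 1.0]])] * 2
print(contrastive_loss(well_sep) < contrastive_loss(overlap))  # True
```

The adaptive temperature $\phi_i$ sharpens the softmax for compact masks and relaxes it for masks whose sampled features are spread out.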

3. Streaming Optimization Over Temporally Ordered Observations

Freetime FeatureGS employs a streaming feature learning strategy:

  • All training observations $\{(s, t)\}$ are sorted into a temporally ordered stream $\mathcal{S}$.
  • Instead of uniform random batches, consecutive temporal minibatches are drawn from $\mathcal{S}$. Because motion between adjacent times is linear, this sequential sampling propagates primitive features smoothly over time and prevents temporal fragmentation, in which the same physical instance collapses into distinct clusters at different timepoints.
  • During streaming updates the underlying geometry remains fixed; only the instance features $\{f_i\}$ are updated, via Adam on the sum of the contrastive and regularization losses over the primitives present in the current batch:

$$f_i \leftarrow f_i - \eta\, \frac{\partial (\mathcal{L}_\mathrm{CC} + \cdots)}{\partial f_i}$$

This approach strongly disincentivizes local minima that would otherwise hamper the learning of temporally consistent features, even in the presence of unstable 2D segmentations.
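The sampling schedule itself reduces to sorting and slicing; a minimal sketch (optimizer state and loss terms omitted; `streaming_schedule` is an illustrative name):

```python
def streaming_schedule(observations, batch_size):
    """Yield consecutive temporal minibatches from observations (s, t),
    sorted by time t, instead of uniformly random batches."""
    stream = sorted(observations, key=lambda obs: obs[1])
    for start in range(0, len(stream), batch_size):
        yield stream[start:start + batch_size]

obs = [("cam0", 3), ("cam1", 1), ("cam0", 2), ("cam1", 4)]
batches = list(streaming_schedule(obs, batch_size=2))
print(batches[0])  # [('cam1', 1), ('cam0', 2)] -- earliest times first
```

Each minibatch then drives one feature-only Adam step, so label information flows forward in time batch by batch.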

4. 4D Segmentation Inference: Initialization and Temporal Label Propagation

After training:

  • In the initial frame $(s, t = 1)$, 2–10% of the participating Gaussians are randomly subsampled and clustered (e.g., with HDBSCAN) in the feature space $\{f_i\}$ to produce $K$ cluster centers as instance labels.
  • For later times $t > 1$, each primitive $g_i$ inherits the label of the closest cluster in feature space.
  • An optional filtering step rejects outliers whose velocity and position deviate strongly and whose feature similarity to the assigned cluster is low:

$$\mathrm{Filter}(g_i) = \mathbb{I}\bigl( \Delta v_i > 3\sigma_v \;\wedge\; \Delta p_i > 3\sigma_p \;\wedge\; \mathrm{sim}(f_i, \bar{f}_i) < \tau_\mathrm{sim} \bigr)$$

This procedure ensures robust, temporally coherent instance labeling that closely aligns with physical object tracks.
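A minimal sketch of the propagation step, assuming cluster centers have already been obtained from the initial-frame clustering (Euclidean nearest-center assignment here; the exact metric and the name `propagate_labels` are illustrative, not from the paper):

```python
import numpy as np

def propagate_labels(features, centers):
    """Assign each primitive the label of the nearest cluster center
    in instance-feature space."""
    # pairwise distances between N features and K centers -> (N, K)
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

feats = np.array([[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])  # K = 2 cluster centers
print(propagate_labels(feats, centers))  # [0 1 0]
```

Because the features themselves are temporally consistent by construction, this nearest-center rule suffices to carry labels across frames without an explicit tracker.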

5. Empirical Results and Comparative Analysis

Evaluations span Neural3DV, Multi-Human, and SelfCap datasets, each containing 19–24 views and 60–300 frames with significant motion and interaction (Hu et al., 28 Dec 2025). Principal baselines include SA4D (video-tracker-based), SADG (deformable 4DGS), OmniSeg3D (static contrastive lift), and DGD (DINOv2-distilled).

Key quantitative outcomes (mIoU / Recall$_\mathrm{dyn}$):

| Dataset | Freetime FeatureGS | SA4D | SADG |
|---|---|---|---|
| Neural3DV | 0.801 / 0.833 | 0.668 / 0.812 | 0.649 / 0.813 |
| Multi-Human | 0.893 / 0.942 | 0.592 / 0.856 | 0.696 / 0.727 |
| SelfCap | 0.882 / 0.926 | 0.688 / 0.819 | 0.723 / 0.860 |

Ablation studies on the Juggle sequence highlight the importance of each design choice:

  • Removing the motion model drops mIoU to 0.644.
  • Removing streaming sampling yields mIoU of 0.552.
  • Removing regularization terms reduces but does not eliminate performance gains (mIoU 0.858–0.866).
  • The full model attains mIoU 0.895.

Qualitative results emphasize sharper contact boundaries (e.g., in hand–object interactions), temporal coherence despite fast motion and occlusion, and robustness to 2D mask noise or unreliable tracker supervision.

6. Broader Context: Relation to FreeTimeGS and Prior Gaussian Splatting Advances

Freetime FeatureGS extends and specializes the FreeTimeGS framework, which itself generalizes 3D Gaussian Splatting (3DGS) to unconstrained spatio–temporal support with linearly parameterized primitive motion (Wang et al., 5 Jun 2025). Unlike canonical deformation-field-based approaches, FreeTimeGS and by extension Freetime FeatureGS assign each primitive not only spatial but also temporal centers, velocity, and temporal window (duration), allowing adaptation to arbitrary object trajectories and minimizing temporal redundancy. FeatureGS, in static settings, demonstrated the effectiveness of per-primitive learnable features for geometric accuracy (Jäger et al., 29 Jan 2025); Freetime FeatureGS harnesses an analogous idea, but in a dynamic, temporally extended domain, focusing on instance semantics and 4D segmentation.

A plausible implication is that the Freetime FeatureGS paradigm enables a unified instance segmentation, tracking, and reconstruction pipeline at 4D spatio–temporal granularity, with competitive or superior accuracy and robustness compared to explicitly tracker-based or video-segmentation-dependent methods.

7. Significance and Practical Considerations

Freetime FeatureGS's ability to learn temporally consistent, low-dimensional instance features without reliance on video segmentation marks a shift in dynamic 3D/4D scene analysis:

  • It advances the state-of-the-art for decomposed 4D reconstruction in terms of both mIoU and recall.
  • The approach exhibits resilience to noisy and unstable low-level segmentations, instead leveraging contrastive learning and physically motivated motion priors.
  • Detailed ablation confirms that streaming optimization and explicit linear motion are central for maintaining temporal coherence.
  • The model's robustness is particularly evident in challenging scenarios (rapid motion, object contact, occlusion).

These characteristics position Freetime FeatureGS as a foundational technique for future research in dynamic scene understanding and instance-level 4D reconstruction (Hu et al., 28 Dec 2025).
