OnlineSplatter: Real-Time Pose-Free 3D Reconstruction
- OnlineSplatter is a feed-forward framework that reconstructs high-fidelity 3D scenes using Gaussian primitives without requiring camera pose information.
- It employs a dense Gaussian primitive field representation with constant runtime and memory per frame, ensuring scalability for long video sequences.
- The approach integrates a dual-key memory module for temporal aggregation, enabling robust, pose-free 3D reconstruction even in dynamic environments.
OnlineSplatter refers to a class of online, feed-forward frameworks for real-time 3D reconstruction from unconstrained video, with particular focus on representing objects or scenes as fields of 3D Gaussian primitives. “OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects” establishes OnlineSplatter as the first approach to produce high-fidelity, object-centric, pose-free 3D Gaussian fields directly from RGB streams, without any requirement for camera poses, depths, or bundle adjustment, and with constant-cost online operation throughout an input video (Huang et al., 23 Oct 2025). The paradigm has been extended by frameworks such as LongSplat (Huang et al., 22 Jul 2025) and StreamSplat (Wu et al., 10 Jun 2025), each addressing distinct reconstruction challenges in long video, static scenes, or dynamic environments.
1. Problem Definition and Objectives
OnlineSplatter targets the reconstruction of freely moving rigid objects or entire dynamic scenes from monocular video streams. The critical regime addressed is pose-free: no ground-truth or estimated camera pose is required, nor are explicit depth maps provided. Instead, an off-the-shelf video segmentation (OVS) module yields per-frame object masks. At every time step $t$, the output is a canonical, object-centric representation, encoded as a structured set of 3D Gaussian primitives, capable of supporting high-fidelity novel-view rendering.
The “online” property denotes that every new frame is processed in a single feed-forward pass, with no iterative or global optimization, and with constant runtime and memory cost per frame irrespective of video length. The principal training objective is a sum of a photometric loss in the masked object region, a geometric penalty for alignment with camera rays and depth consistency, and a regularization term for stray Gaussians:

$$\mathcal{L} = \mathcal{L}_{\text{photo}} + \lambda_{\text{geo}}\,\mathcal{L}_{\text{geo}} + \lambda_{\text{reg}}\,\mathcal{L}_{\text{reg}},$$

where rendering is performed by differentiable splatting, $\mathcal{L}_{\text{photo}}$ is the masked MSE with background penalty, and $\mathcal{L}_{\text{geo}}$ enforces Gaussian-ray alignment and depth consistency (Huang et al., 23 Oct 2025).
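As a concrete illustration of the photometric term, the following minimal NumPy sketch computes a masked MSE over the object region with a penalty on rendering outside the mask. The function name, the `bg_penalty` weight, and the exact background term are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def masked_photometric_loss(rendered, target, mask, bg_penalty=0.1):
    """Masked MSE over the object region plus a penalty on intensity
    rendered outside the mask (a sketch of the photometric term).

    rendered, target: float arrays of the same shape.
    mask: boolean (or 0/1) array marking the object region.
    """
    mask = mask.astype(bool)
    # Photometric error only where the object mask is active.
    fg = np.mean((rendered[mask] - target[mask]) ** 2) if mask.any() else 0.0
    # Background penalty: discourage stray Gaussians splatting outside the mask.
    bg = np.mean(rendered[~mask] ** 2) if (~mask).any() else 0.0
    return fg + bg_penalty * bg
```

A perfect reconstruction with an empty background incurs zero loss; any intensity leaking outside the mask is penalized even if the masked region matches.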
2. Representation: Dense Gaussian Primitive Fields
OnlineSplatter represents the 3D scene or object as a fixed-size union of Gaussian splat groups $\mathcal{G}_{\text{obj}} = \mathcal{G}_{\text{mem}} \cup \mathcal{G}_{\text{ref}} \cup \mathcal{G}_{\text{src}}$ (with $2N$ memory, $N$ reference, and $N$ source primitives), each primitive parameterized by a center $\mu_i \in \mathbb{R}^3$, an anisotropic covariance $\Sigma_i$, and a weight $w_i$ controlling color/opacity.
The continuous field is

$$F(\mathbf{x}) = \sum_{i} w_i \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mu_i)^{\top}\Sigma_i^{-1}(\mathbf{x}-\mu_i)\right).$$

At $t=0$, the initial set is decoded from the first RGB frame; every subsequent frame updates and refines this field, maintaining exactly $4N$ primitives to avoid unconstrained memory/computational growth. This design delivers constant update and rendering cost, while the explicit Gaussian parametrization supports efficient differentiable rendering and memory-friendly encoding (Huang et al., 23 Oct 2025).
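The field $F(\mathbf{x})$ can be evaluated directly from the per-primitive parameters. The sketch below does so at arbitrary query points; it is illustrative only, since the actual system renders via projective splatting rather than volumetric evaluation.

```python
import numpy as np

def eval_gaussian_field(x, centers, covs, weights):
    """Evaluate F(x) = sum_i w_i * exp(-0.5 (x-mu_i)^T Sigma_i^{-1} (x-mu_i)).

    x: (Q, 3) query points; centers: (N, 3); covs: (N, 3, 3); weights: (N,).
    Returns a (Q,) array of field values.
    """
    vals = np.zeros(len(x))
    for mu, Sigma, w in zip(centers, covs, weights):
        d = x - mu                            # (Q, 3) offsets from the center
        sol = np.linalg.solve(Sigma, d.T).T   # Sigma^{-1} d, per query point
        vals += w * np.exp(-0.5 * np.sum(d * sol, axis=1))
    return vals
```

At a primitive's center the Mahalanobis term vanishes, so the primitive contributes exactly its weight $w_i$.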
3. Dual-Key Memory Module and Temporal Aggregation
A critical challenge is robust fusion of current-frame features and long-term object state when explicit pose estimation is not performed. OnlineSplatter introduces a dual-key memory bank, storing up to a fixed cap of entries of the form $(k^{L}, k^{D}, v^{L})$. Here, the latent key $k^{L}$ encodes appearance-geometry cues, while the directional key $k^{D}$ captures the viewing direction. Values $v^{L}$ are embedding tokens. Memory is kept compact by a sparsification protocol: when the cap is reached, 20% of tokens are pruned via spatial coverage and usage criteria.
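The capped-and-pruned memory can be sketched as a small container class. The class name is hypothetical, and the pruning score here uses only a usage counter as a stand-in for the paper's combined spatial-coverage and usage criteria.

```python
import numpy as np

class DualKeyMemory:
    """Fixed-cap bank of (k_L, k_D, v) entries with 20% pruning on overflow.

    Sketch only: real sparsification also accounts for spatial coverage,
    not just usage counts.
    """
    def __init__(self, cap):
        self.cap = cap
        self.entries = []   # each entry: {"k_L": ..., "k_D": ..., "v": ...}
        self.usage = []     # read counts, incremented by the readout step

    def append(self, k_L, k_D, v):
        self.entries.append({"k_L": k_L, "k_D": k_D, "v": v})
        self.usage.append(0)
        if len(self.entries) > self.cap:
            self._prune()

    def _prune(self):
        # Drop the 20% least-used tokens (at least one) when over the cap.
        n_drop = max(1, int(0.2 * len(self.entries)))
        order = np.argsort(self.usage)            # least-used first
        keep = sorted(order[n_drop:])
        self.entries = [self.entries[i] for i in keep]
        self.usage = [self.usage[i] for i in keep]
```

Because pruning fires on every overflow, the bank's size, and hence the cost of reading it, stays bounded regardless of how many frames have been ingested.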
Spatial-guided memory readout enables the system to combine orientation-aligned ($f_{\text{mem,align}}$) and complementary ($f_{\text{mem,comp}}$) features with softmax-weighted attention over dual-key queries. The resulting features are used for the current Gaussian refinement; this dual-key design anchors the canonical object coordinate system and allows consistent online update, even as the object undergoes free motion (Huang et al., 23 Oct 2025).
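A minimal form of such a readout blends similarity against both keys into one softmax distribution over memory slots. The blending weight `alpha` and the single-query formulation are simplifying assumptions for illustration.

```python
import numpy as np

def dual_key_readout(q_L, q_D, K_L, K_D, V, alpha=0.5):
    """Softmax attention whose scores blend latent-key and directional-key
    similarity (an illustrative stand-in for the dual-key memory readout).

    q_L, q_D: (d,) query vectors; K_L, K_D: (M, d) key banks; V: (M, dv).
    Returns a (dv,) aggregated memory feature.
    """
    d = K_L.shape[1]
    scores = (alpha * q_L @ K_L.T + (1 - alpha) * q_D @ K_D.T) / np.sqrt(d)
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ V
```

Since the attention weights form a convex combination, the readout always stays within the span of the stored value tokens.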
4. Online Feed-Forward Pipeline and Workflow
The complete OnlineSplatter pipeline maintains a memory state and dense Gaussian parameters that are updated per frame via a single feed-forward transformer. The pipeline can be summarized as:
```
Initialize memory M ← ∅
Encode V_0 → f_v0 → T_ref^in → transformer → G_ref,0^N
Build initial memory entry from t = 0
for t = 1 to T:
    V_t → mask M_t → encode → f_vt
    encode keys k_t^L, k_t^D
    read memory → T_mem,t^in = {f_mem,align, f_mem,comp}
    collect inputs T_ref^in, T_src,t^in, T_mem,t^in
    transformer → outputs → Unpatchify → G_mem,t^{2N}, G_ref,t^N, G_src,t^N
    use full G_obj,t^{4N} for rendering/training
    encode new value v_t^L, append (k_t^L, k_t^D, v_t^L) to M
    if |M| > cap: prune 20% by sparsification
```
Both runtime and memory remain constant for each frame, ensured by serving all computation through the fixed-size dual-key memory and Gaussian set, regardless of sequence length (Huang et al., 23 Oct 2025).
5. Computational Complexity and Scalability
The transformer attends to a fixed-size set of patch tokens and the memory bank (with size determined by image resolution and the memory budget). Gaussian decode and render always operate on $4N$ primitives. The overall per-frame computational complexity is therefore constant with respect to sequence length: neither runtime nor memory demand increases with video length, a property not shared by conventional per-frame accumulation or global bundle adjustment schemes (Huang et al., 23 Oct 2025).
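The constant-cost claim reduces to a fixed token budget per frame. The toy accounting below (all numbers illustrative, not from the paper) makes the point that the bound does not depend on the frame index.

```python
def per_frame_cost(num_patch_tokens, mem_cap, n):
    """Upper bound on tokens processed per frame: patch tokens, the capped
    memory bank, and 4N Gaussian tokens. Attention cost is quadratic in this
    fixed budget and independent of how many frames came before."""
    tokens = num_patch_tokens + mem_cap + 4 * n
    return tokens * tokens

# The bound is identical for every frame of an arbitrarily long stream.
costs = [per_frame_cost(196, 512, 1024) for _ in range(1000)]
```

By contrast, a scheme that accumulates primitives per frame would make `tokens` grow linearly in the frame index, and the attention cost quadratically.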
6. Experimental Evaluation and Quantitative Comparison
OnlineSplatter was evaluated on both synthetic (GSO: Google Scanned Objects) and challenging real (HO3D: hand-object monocular interaction) datasets. Comparative metrics (PSNR↑ / SSIM↑ / LPIPS↓, reported per cell in that order) are summarized:
| Method | GSO Early | GSO Mid | GSO Late | HO3D Early | HO3D Late |
|---|---|---|---|---|---|
| FSO_rand4 | 21.36/0.861/0.177 | 21.92/0.877/0.181 | 21.74/0.855/0.181 | 18.49/0.820/0.187 | ... |
| FSO_dist4 | 22.37/0.874/0.119 | 23.76/0.862/0.117 | 23.75/0.873/0.120 | 18.59/0.837/0.177 | ... |
| NPS_dist2 | 22.99/0.859/0.155 | 23.05/0.863/0.162 | 22.95/0.878/0.156 | 21.06/0.855/0.160 | ... |
| NPS_dist3 | 23.33/0.862/0.149 | 23.21/0.861/0.138 | 24.14/0.863/0.125 | 21.13/0.853/0.162 | ... |
| OnlineSplatter | 26.33/0.921/0.084 | 27.55/0.933/0.066 | 31.74/0.969/0.075 | 23.63/0.910/0.152 | 27.93/0.952/0.099 |
OnlineSplatter consistently outperforms pose-free and online-adapted baselines, with improvement growing as more observations are incorporated. Qualitative results demonstrate progressive emergence of crisp object-centric geometry through the sequence (Huang et al., 23 Oct 2025).
7. Limitations and Future Directions
OnlineSplatter is currently restricted to rigid objects, outputting a 3D Gaussian Splatting (3DGS) field, which is not directly convertible to a mesh representation—robust GS→mesh conversion remains an open challenge. The system relies on the quality of the first frame to establish the anchor coordinate system, with significant occlusion or image degradation in the initial frame resulting in slower convergence. While efficient for real-time rendering, the reliance on rigid geometry limits applicability to articulated or deformable objects. Potential extensions include hybrid representations (combining GS and explicit surfaces), improved GS-to-mesh algorithms, and integration with robotics or AR task pipelines (Huang et al., 23 Oct 2025).
8. Relation to Broader Online 3DGS Reconstruction Paradigms
OnlineSplatter is differentiated from prior work by its exclusive reliance on feed-forward, pose-free inference and strict memory/runtime bounding. In comparison, frameworks such as LongSplat (Huang et al., 22 Jul 2025) process static scenes with explicit pose supervision and achieve online memory-bounded operation by integrating a Gaussian-Image Representation (GIR) for per-view fusion and compression, with up to 44% reduction in the number of Gaussians while sacrificing less than 0.2 dB PSNR. However, LongSplat requires known camera intrinsics/extrinsics and is limited to static scenes.
For dynamic scenes, StreamSplat (Wu et al., 10 Jun 2025) introduces fully feed-forward, probabilistic 3DGS encoding and bidirectional deformation for dynamic scene modeling and supports uncalibrated input. StreamSplat achieves temporally coherent dynamic reconstructions and interpolation by leveraging a probabilistic static encoder and opacity-matched adaptive fusion, demonstrating competitive quality and runtime.
A plausible implication is that OnlineSplatter establishes a scalable foundation for pose-free, real-time, memory-constant object-centric 3D reconstruction, which could be further specialized or generalized by integrating adaptive deformation, hybrid canonical representations, or explicit dynamic modeling as seen in other contemporary work.