OnlineSplatter: Real-Time Pose-Free 3D Reconstruction
- OnlineSplatter is a feed-forward framework that reconstructs high-fidelity 3D scenes using Gaussian primitives without requiring camera pose information.
- It employs a dense Gaussian primitive field representation with constant runtime and memory per frame, ensuring scalability for long video sequences.
- The approach integrates a dual-key memory module for temporal aggregation, enabling robust, pose-free 3D reconstruction even in dynamic environments.
OnlineSplatter refers to a class of online, feed-forward frameworks for real-time 3D reconstruction from unconstrained video, with particular focus on representing objects or scenes as fields of 3D Gaussian primitives. “OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects” establishes OnlineSplatter as the first approach to produce high-fidelity, object-centric, pose-free 3D Gaussian fields directly from RGB streams, without any requirement for camera poses, depths, or bundle adjustment, and with constant-cost online operation throughout an input video (Huang et al., 23 Oct 2025). The paradigm has been extended by frameworks such as LongSplat (Huang et al., 22 Jul 2025) and StreamSplat (Wu et al., 10 Jun 2025), each addressing distinct reconstruction challenges in long video, static scenes, or dynamic environments.
1. Problem Definition and Objectives
OnlineSplatter targets the reconstruction of freely moving rigid objects or entire dynamic scenes from monocular video streams. The critical regime addressed is pose-free: no ground-truth or estimated camera pose is required, nor are explicit depth maps provided. Instead, an off-the-shelf video segmentation (OVS) module yields per-frame object masks. At every time step $t$, the output is a canonical, object-centric representation, encoded as a structured set of 3D Gaussian primitives, capable of supporting high-fidelity novel-view rendering.
The “online” property denotes that every new frame is processed in a single feed-forward pass, with no iterative or global optimization, and with constant runtime and memory cost per frame irrespective of video length. The principal training objective is a sum of a photometric loss in the masked object region, a geometric penalty for alignment with camera rays and depth consistency, and a regularization term for stray Gaussians:

$$\mathcal{L} = \mathcal{L}_{\text{photo}} + \lambda_{\text{geo}}\,\mathcal{L}_{\text{geo}} + \lambda_{\text{reg}}\,\mathcal{L}_{\text{reg}},$$

where rendering is performed by differentiable splatting, $\mathcal{L}_{\text{photo}}$ is the masked MSE with background penalty, and $\mathcal{L}_{\text{geo}}$ enforces Gaussian-ray alignment and depth consistency (Huang et al., 23 Oct 2025).
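As a concrete illustration of the photometric term, the following minimal NumPy sketch computes a masked MSE over the object region with a penalty on rendering outside the mask. The function name, the `bg_penalty` weight, and the exact background term are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def masked_photometric_loss(rendered, target, mask, bg_penalty=0.1):
    """Masked MSE over the object region plus a penalty on intensity
    rendered outside the mask (a sketch of the photometric term).

    rendered, target: float arrays of the same shape.
    mask: boolean (or 0/1) array marking the object region.
    """
    mask = mask.astype(bool)
    # Photometric error only where the object mask is active.
    fg = np.mean((rendered[mask] - target[mask]) ** 2) if mask.any() else 0.0
    # Background penalty: discourage stray Gaussians splatting outside the mask.
    bg = np.mean(rendered[~mask] ** 2) if (~mask).any() else 0.0
    return fg + bg_penalty * bg
```

A perfect reconstruction with an empty background incurs zero loss; any intensity leaking outside the mask is penalized even if the masked region matches.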
2. Representation: Dense Gaussian Primitive Fields
OnlineSplatter represents the 3D scene or object as a fixed-size union of Gaussian splat groups $\mathcal{G}_{\text{obj}} = \mathcal{G}_{\text{mem}} \cup \mathcal{G}_{\text{ref}} \cup \mathcal{G}_{\text{src}}$ (with $2N$ memory, $N$ reference, and $N$ source primitives), each primitive parameterized by a center $\mu_i \in \mathbb{R}^3$, an anisotropic covariance $\Sigma_i$, and a weight $w_i$ controlling color/opacity.
The continuous field is

$$F(\mathbf{x}) = \sum_{i} w_i \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mu_i)^{\top}\Sigma_i^{-1}(\mathbf{x}-\mu_i)\right).$$

At $t=0$, the initial set is decoded from the first RGB frame; every subsequent frame updates and refines this field, maintaining exactly $4N$ primitives to avoid unconstrained memory/computational growth. This design delivers constant update and rendering cost, while the explicit Gaussian parametrization supports efficient differentiable rendering and memory-friendly encoding (Huang et al., 23 Oct 2025).
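The field $F(\mathbf{x})$ can be evaluated directly from the per-primitive parameters. The sketch below does so at arbitrary query points; it is illustrative only, since the actual system renders via projective splatting rather than volumetric evaluation.

```python
import numpy as np

def eval_gaussian_field(x, centers, covs, weights):
    """Evaluate F(x) = sum_i w_i * exp(-0.5 (x-mu_i)^T Sigma_i^{-1} (x-mu_i)).

    x: (Q, 3) query points; centers: (N, 3); covs: (N, 3, 3); weights: (N,).
    Returns a (Q,) array of field values.
    """
    vals = np.zeros(len(x))
    for mu, Sigma, w in zip(centers, covs, weights):
        d = x - mu                            # (Q, 3) offsets from the center
        sol = np.linalg.solve(Sigma, d.T).T   # Sigma^{-1} d, per query point
        vals += w * np.exp(-0.5 * np.sum(d * sol, axis=1))
    return vals
```

At a primitive's center the Mahalanobis term vanishes, so the primitive contributes exactly its weight $w_i$.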
3. Dual-Key Memory Module and Temporal Aggregation
A critical challenge is robust fusion of current-frame features and long-term object state when explicit pose estimation is not performed. OnlineSplatter introduces a dual-key memory bank, storing up to a fixed cap of entries of the form $(k^{L}, k^{D}, v^{L})$. Here, the latent key $k^{L}$ encodes appearance-geometry cues, while the directional key $k^{D}$ captures the viewing direction. Values $v^{L}$ are embedding tokens. Memory is kept compact by a sparsification protocol: when the cap is reached, 20% of tokens are pruned via spatial coverage and usage criteria.
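The capped-and-pruned memory can be sketched as a small container class. The class name is hypothetical, and the pruning score here uses only a usage counter as a stand-in for the paper's combined spatial-coverage and usage criteria.

```python
import numpy as np

class DualKeyMemory:
    """Fixed-cap bank of (k_L, k_D, v) entries with 20% pruning on overflow.

    Sketch only: real sparsification also accounts for spatial coverage,
    not just usage counts.
    """
    def __init__(self, cap):
        self.cap = cap
        self.entries = []   # each entry: {"k_L": ..., "k_D": ..., "v": ...}
        self.usage = []     # read counts, incremented by the readout step

    def append(self, k_L, k_D, v):
        self.entries.append({"k_L": k_L, "k_D": k_D, "v": v})
        self.usage.append(0)
        if len(self.entries) > self.cap:
            self._prune()

    def _prune(self):
        # Drop the 20% least-used tokens (at least one) when over the cap.
        n_drop = max(1, int(0.2 * len(self.entries)))
        order = np.argsort(self.usage)            # least-used first
        keep = sorted(order[n_drop:])
        self.entries = [self.entries[i] for i in keep]
        self.usage = [self.usage[i] for i in keep]
```

Because pruning fires on every overflow, the bank's size, and hence the cost of reading it, stays bounded regardless of how many frames have been ingested.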
Spatial-guided memory readout enables the system to combine orientation-aligned ($f_{\text{mem,align}}$) and complementary ($f_{\text{mem,comp}}$) features with softmax-weighted attention over dual-key queries. The resulting features are used for the current Gaussian refinement; this dual-key design anchors the canonical object coordinate system and allows consistent online update, even as the object undergoes free motion (Huang et al., 23 Oct 2025).
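A minimal form of such a readout blends similarity against both keys into one softmax distribution over memory slots. The blending weight `alpha` and the single-query formulation are simplifying assumptions for illustration.

```python
import numpy as np

def dual_key_readout(q_L, q_D, K_L, K_D, V, alpha=0.5):
    """Softmax attention whose scores blend latent-key and directional-key
    similarity (an illustrative stand-in for the dual-key memory readout).

    q_L, q_D: (d,) query vectors; K_L, K_D: (M, d) key banks; V: (M, dv).
    Returns a (dv,) aggregated memory feature.
    """
    d = K_L.shape[1]
    scores = (alpha * q_L @ K_L.T + (1 - alpha) * q_D @ K_D.T) / np.sqrt(d)
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ V
```

Since the attention weights form a convex combination, the readout always stays within the span of the stored value tokens.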
4. Online Feed-Forward Pipeline and Workflow
The complete OnlineSplatter pipeline maintains a memory state and dense Gaussian parameters that are updated per frame via a single feed-forward transformer. The pipeline can be summarized as:
```
Initialize memory M ← ∅
Encode V_0 → f_v0 → T_ref^in → transformer → G_ref,0^N
Build initial memory entry from t = 0
for t = 1 to T:
    V_t → mask M_t → encode → f_vt
    encode keys k_t^L, k_t^D
    read memory → T_mem,t^in = {f_mem,align, f_mem,comp}
    collect inputs T_ref^in, T_src,t^in, T_mem,t^in
    transformer → outputs → Unpatchify → G_mem,t^{2N}, G_ref,t^N, G_src,t^N
    use full G_obj,t^{4N} for rendering/training
    encode new value v_t^L, append (k_t^L, k_t^D, v_t^L) to M
    if |M| > cap: prune 20% by sparsification
```
Both runtime and memory remain constant for each frame, ensured by serving all computation through the fixed-size dual-key memory and Gaussian set, regardless of sequence length (Huang et al., 23 Oct 2025).
5. Computational Complexity and Scalability
The transformer attends to a fixed-size set of patch tokens and the memory bank (with size determined by image resolution and the memory budget). Gaussian decode and render always operate on $4N$ primitives. The overall per-frame computational complexity is therefore constant with respect to sequence length: neither runtime nor memory demand increases with video length, a property not shared by conventional per-frame accumulation or global bundle adjustment schemes (Huang et al., 23 Oct 2025).
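The constant-cost claim reduces to a fixed token budget per frame. The toy accounting below (all numbers illustrative, not from the paper) makes the point that the bound does not depend on the frame index.

```python
def per_frame_cost(num_patch_tokens, mem_cap, n):
    """Upper bound on tokens processed per frame: patch tokens, the capped
    memory bank, and 4N Gaussian tokens. Attention cost is quadratic in this
    fixed budget and independent of how many frames came before."""
    tokens = num_patch_tokens + mem_cap + 4 * n
    return tokens * tokens

# The bound is identical for every frame of an arbitrarily long stream.
costs = [per_frame_cost(196, 512, 1024) for _ in range(1000)]
```

By contrast, a scheme that accumulates primitives per frame would make `tokens` grow linearly in the frame index, and the attention cost quadratically.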
6. Experimental Evaluation and Quantitative Comparison
OnlineSplatter was evaluated on both synthetic (GSO: Google Scanned Objects) and challenging real (HO3D: hand-object monocular interaction) datasets. Comparative metrics (PSNR↑ / SSIM↑ / LPIPS↓, reported per cell in that order) are summarized:
| Method | GSO Early | GSO Mid | GSO Late | HO3D Early | HO3D Late |
|---|---|---|---|---|---|
| FSO_rand4 | 21.36/0.861/0.177 | 21.92/0.877/0.181 | 21.74/0.855/0.181 | 18.49/0.820/0.187 | ... |
| FSO_dist4 | 22.37/0.874/0.119 | 23.76/0.862/0.117 | 23.75/0.873/0.120 | 18.59/0.837/0.177 | ... |
| NPS_dist2 | 22.99/0.859/0.155 | 23.05/0.863/0.162 | 22.95/0.878/0.156 | 21.06/0.855/0.160 | ... |
| NPS_dist3 | 23.33/0.862/0.149 | 23.21/0.861/0.138 | 24.14/0.863/0.125 | 21.13/0.853/0.162 | ... |
| OnlineSplatter | 26.33/0.921/0.084 | 27.55/0.933/0.066 | 31.74/0.969/0.075 | 23.63/0.910/0.152 | 27.93/0.952/0.099 |
OnlineSplatter consistently outperforms pose-free and online-adapted baselines, with improvement growing as more observations are incorporated. Qualitative results demonstrate progressive emergence of crisp object-centric geometry through the sequence (Huang et al., 23 Oct 2025).
7. Limitations and Future Directions
OnlineSplatter is currently restricted to rigid objects, outputting a 3D Gaussian Splatting (3DGS) field, which is not directly convertible to a mesh representation—robust GS→mesh conversion remains an open challenge. The system relies on the quality of the first frame to establish the anchor coordinate system, with significant occlusion or image degradation in the initial frame resulting in slower convergence. While efficient for real-time rendering, the reliance on rigid geometry limits applicability to articulated or deformable objects. Potential extensions include hybrid representations (combining GS and explicit surfaces), improved GS-to-mesh algorithms, and integration with robotics or AR task pipelines (Huang et al., 23 Oct 2025).
8. Relation to Broader Online 3DGS Reconstruction Paradigms
OnlineSplatter is differentiated from prior work by its exclusive reliance on feed-forward, pose-free inference and strict memory/runtime bounding. In comparison, frameworks such as LongSplat (Huang et al., 22 Jul 2025) process static scenes with explicit pose supervision and achieve online memory-bounded operation by integrating a Gaussian-Image Representation (GIR) for per-view fusion and compression, with up to 44% reduction in the number of Gaussians while sacrificing less than 0.2 dB PSNR. However, LongSplat requires known camera intrinsics/extrinsics and is limited to static scenes.
For dynamic scenes, StreamSplat (Wu et al., 10 Jun 2025) introduces fully feed-forward, probabilistic 3DGS encoding and bidirectional deformation for dynamic scene modeling and supports uncalibrated input. StreamSplat achieves temporally coherent dynamic reconstructions and interpolation by leveraging a probabilistic static encoder and opacity-matched adaptive fusion, demonstrating competitive quality and runtime.
A plausible implication is that OnlineSplatter establishes a scalable foundation for pose-free, real-time, memory-constant object-centric 3D reconstruction, which could be further specialized or generalized by integrating adaptive deformation, hybrid canonical representations, or explicit dynamic modeling as seen in other contemporary work.