- The paper introduces an optical flow-guided dynamic SLAM system that uses Camera-Induced Motion Decomposition to segment and track moving objects without relying on predetermined categories.
- It proposes a hybrid 4D Gaussian representation that fuses static and dynamic elements, ensuring temporal coherence and efficient, real-time pose optimization.
- Empirical results on TUM RGB-D and BONN datasets show improved tracking accuracy and enhanced rendering fidelity, setting a new benchmark for dynamic SLAM.
Optical Flow-Guided 4D Gaussian Splatting for Dynamic SLAM
Introduction
The integration of 3D Gaussian Splatting (3DGS) with visual SLAM has shown promise for real-time, photorealistic mapping, but existing methods are limited to static scenes or to dynamic objects from predefined categories. The paper "Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM" (2604.22339) proposes a framework that uses optical flow to achieve efficient, category-agnostic dynamic SLAM with an explicit 4D Gaussian scene representation. The method targets three problems in online SLAM for complex dynamic environments: robust segmentation of dynamic elements, modeling of the dynamic scene, and accelerated training.
Framework Overview
Flow4DGS-SLAM comprises a Camera-Induced Motion Decomposition (CIMD) module for category-agnostic motion segmentation, a hybrid 4D Gaussian representation for modeling both static and dynamic regions, and flow-guided modules for Gaussian propagation and insertion. Together these enable robust camera tracking and efficient mapping even under severe dynamics, separating dynamic from static elements without relying on predefined categories.
Figure 1: The Flow4DGS-SLAM pipeline, showing RGB-D input processing, camera-induced motion decomposition, dynamic/static Gaussian branch, and online training loop.
Camera-Induced Motion Decomposition
CIMD leverages the spatial correspondence between depth maps and dense optical flow fields. Given an estimate of camera ego-motion, the rigid flow that a static scene would produce is predicted from depth, and the per-pixel residual between this predicted flow and the observed flow separates dynamic pixels from static ones. The process involves:
- Fitting a 6-DoF camera motion to align predicted and observed flow on static regions.
- Defining a motion mask by thresholding the flow residuals, producing category-agnostic segmentation.
- Fusing the mask with semantic segmentation to avoid residual dynamic distractors.
- Initializing the pose for subsequent tracking and optimizing the camera using inlier pixels.
This enables consistent, robust pose estimation in the presence of arbitrary dynamic objects and alleviates failure modes faced by category-constrained models.
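The residual test at the core of this decomposition can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes a pinhole camera with intrinsics `K`, a known relative pose `(R, t)` from frame 1 to frame 2, valid positive depth everywhere, and a hypothetical pixel threshold `thresh_px`. The function names are invented for the example, and the pose fitting and semantic fusion steps are omitted.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Flow induced purely by camera motion, assuming a static scene.

    depth: (H, W) metric depth in frame 1 (assumed > 0 everywhere).
    K: (3, 3) pinhole intrinsics.
    R, t: rotation (3, 3) and translation (3,) mapping frame-1 points to frame 2.
    Returns an (H, W, 2) array of pixel displacements.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # back-projected rays
    pts1 = rays * depth[..., None]                     # 3D points in frame 1
    pts2 = pts1 @ R.T + t                              # same points in frame 2
    proj = pts2 @ K.T                                  # re-project
    uv2 = proj[..., :2] / proj[..., 2:3]
    return uv2 - pix[..., :2]

def motion_mask(observed_flow, depth, K, R, t, thresh_px=1.5):
    """Mark pixels whose observed flow deviates from the rigid prediction.

    thresh_px is a hypothetical threshold chosen for illustration only.
    Returns (boolean mask, per-pixel residual magnitude).
    """
    residual = np.linalg.norm(observed_flow - rigid_flow(depth, K, R, t), axis=-1)
    return residual > thresh_px, residual
```

On a synthetic scene where the observed flow equals the rigid flow except inside a small patch, `motion_mask` flags only that patch. In a full system the pose would itself be fitted on the pixels the mask keeps, typically alternating between the two steps.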
Hybrid 4D Gaussian Splatting
Dynamic Gaussians are represented by explicit keyframe-based 3D centers together with continuous temporal attributes (opacity, rotation) modeled by a Gaussian mixture model (GMM). This hybridization provides several capabilities:
- Explicitly-learned Gaussian trajectories ensure temporal coherence and tractable updates at keyframes.
- Scene flow is calculated via optical flow and depth prediction, regularized with KNN smoothing for spatial rigidity.
- Temporal opacity and rotation are parameterized via a GMM, capturing appearance evolution without timestamp-specific storage and allowing smooth interpolation.
The static part of the map continues to be modeled by conventional 3DGS, allowing the fusion of static constraints for pose optimization.
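To make the split between keyframe-anchored geometry and continuous temporal attributes concrete, the sketch below evaluates one dynamic Gaussian at an arbitrary time. It rests on assumptions the summary does not confirm: centers are interpolated linearly between keyframes, and opacity is a clipped weighted sum of 1D Gaussian bumps over time. The function and parameter names are hypothetical.

```python
import numpy as np

def interp_center(key_times, key_centers, t):
    """Linearly interpolate a 3D center between stored keyframes.

    key_times: (K,) increasing timestamps; key_centers: (K, 3).
    Linear interpolation is an assumption made for this sketch.
    """
    return np.stack(
        [np.interp(t, key_times, key_centers[:, d]) for d in range(3)],
        axis=-1,
    )

def gmm_opacity(t, weights, means, sigmas, base=1.0):
    """Temporal opacity as a weighted sum of 1D Gaussian bumps, clipped to [0, 1].

    weights, means, sigmas: (M,) arrays for M hypothetical mixture components.
    """
    t = np.asarray(t, dtype=np.float64)[..., None]
    bumps = weights * np.exp(-0.5 * ((t - means) / sigmas) ** 2)
    return np.clip(base * bumps.sum(axis=-1), 0.0, 1.0)
```

Storing a handful of mixture parameters instead of one opacity value per timestamp is what allows smooth evaluation at non-keyframe times. Rotation could be handled analogously, but that is left out here.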
Optical Flow-Guided 4D Mapping
For efficient dynamic mapping, two modules are introduced:
- Scene Flow Propagation: Previously-estimated dynamic Gaussian centers are propagated using projected flow and depth to match their expected positions at the new keyframe, regularized with KNN smoothing for stability and local coherence.
- Adaptive Gaussian Insertion: New dynamic regions, detected via flow-based mask differencing and back-projection, trigger the initialization of new dynamic Gaussians, ensuring coverage of newly appearing or disoccluded dynamics.
This approach accelerates convergence and reduces the need for the extensive optimization cycles that prior dynamic SLAM methods require.
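The propagation step described above can be illustrated as follows. The sketch assumes that camera motion has already been compensated, so both depth samples live in the same camera frame, and that each Gaussian center comes with its pixel location `uv`, its depth at the previous keyframe, the optical flow at that pixel, and the depth sampled at the flowed location. The brute-force neighbor search and all names are illustrative choices, not the paper's.

```python
import numpy as np

def backproject(uv, z, K):
    """Lift pixels (N, 2) with depths (N,) to 3D camera-frame points (N, 3)."""
    x = (uv[:, 0] - K[0, 2]) / K[0, 0] * z
    y = (uv[:, 1] - K[1, 2]) / K[1, 1] * z
    return np.stack([x, y, z], axis=-1)

def scene_flow(uv, z_prev, flow, z_next, K):
    """3D displacement: lift each pixel before and after the flow, then subtract."""
    return backproject(uv + flow, z_next, K) - backproject(uv, z_prev, K)

def knn_smooth(points, vectors, k=4):
    """Average each vector over its k nearest points (the point itself included).

    Brute-force O(N^2) distances; fine for a sketch, too slow for large maps.
    """
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d2, axis=1)[:, :k]
    return vectors[idx].mean(axis=1)

def propagate(centers, uv, z_prev, flow, z_next, K, k=4):
    """Move dynamic Gaussian centers by KNN-smoothed scene flow."""
    sf = scene_flow(uv, z_prev, flow, z_next, K)
    return centers + knn_smooth(centers, sf, k)
```

Averaging over neighbors damps isolated flow or depth errors, which is one simple way to encourage locally rigid motion. A spatial index such as a k-d tree would replace the dense distance matrix in practice.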
Tracking and Mapping Optimization
Pose tracking iteratively aligns color and depth rendered from the static map with the observations, masking out the dynamic regions defined by the fused motion mask. The mapping step minimizes a composite objective that combines color, depth, flow-correspondence, and mask-consistency terms with an isotropy regularizer. Together, these terms stabilize pose estimation and yield accurate photometric reconstruction.
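As a rough illustration of the masked tracking residual, the sketch below evaluates an L1 color-plus-depth error over static pixels only. The L1 form, the weight `w_depth`, and the function name are assumptions made for the example. A real tracker would differentiate this quantity with respect to the camera pose through a Gaussian rasterizer, which is outside the scope of this snippet.

```python
import numpy as np

def tracking_loss(rend_rgb, rend_depth, obs_rgb, obs_depth, dyn_mask, w_depth=0.5):
    """Masked L1 photometric + depth error over static, valid-depth pixels.

    rend_rgb, obs_rgb: (H, W, 3); rend_depth, obs_depth: (H, W).
    dyn_mask: (H, W) bool, True where the fused motion mask marks dynamics.
    w_depth is a hypothetical weight, not a value from the paper.
    """
    static = ~dyn_mask & (obs_depth > 0)       # ignore dynamics and missing depth
    n = max(int(static.sum()), 1)              # avoid division by zero
    l_color = np.abs(rend_rgb - obs_rgb)[static].sum() / (3 * n)
    l_depth = np.abs(rend_depth - obs_depth)[static].sum() / n
    return l_color + w_depth * l_depth
```

Because masked pixels contribute nothing, a moving object that the static map cannot explain does not pull the pose estimate toward it.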
Empirical Evaluation
Extensive experiments are conducted on TUM RGB-D and BONN datasets, benchmarking against both static and dynamic SLAM baselines. The key empirical findings include:
- Superior tracking accuracy on challenging dynamic sequences: e.g., average ATE RMSE of 1.9 cm (TUM) and 3.5 cm (BONN), outperforming 4DGS-SLAM and classical/static SLAM variants.
- Substantial improvements in rendering fidelity: PSNR exceeds prior work by 2–4 dB on average, with accompanying gains in SSIM and LPIPS and reduced distortion in dynamic regions.
- Qualitative renderings demonstrate robust spatio-temporal coherence and superior reconstruction detail in highly dynamic scenarios, including occlusion/disocclusion and category-agnostic motion.
Figure 2: Non-keyframe renderings illustrating improved appearance consistency and dynamic modeling in TUM RGB-D and BONN datasets.
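For readers unfamiliar with the tracking metric quoted in the results, ATE RMSE is commonly computed by rigidly aligning the estimated trajectory to the ground truth and taking the root-mean-square of the remaining position errors. The sketch below uses the standard Kabsch (SVD-based) alignment and assumes the two trajectories are already associated by timestamp. Whether the paper follows exactly this protocol is an assumption.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares rotation R and translation t such that R @ est_i + t fits gt_i.

    est, gt: (N, 3) camera positions with matching rows (Kabsch algorithm).
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T           # guard against reflections
    return R, mu_g - R @ mu_e

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE of positions) after rigid alignment."""
    R, t = align_rigid(est, gt)
    err = est @ R.T + t - gt
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))
```

A trajectory that differs from the ground truth only by a global rotation and translation scores zero under this metric, so the reported centimeter-level values reflect drift and jitter rather than the choice of world frame.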
The ablation study further shows that each module (motion decomposition, flow propagation, adaptive insertion, GMM modeling) yields measurable improvements in both pose and appearance error. The flow-based segmentation and initialization are pivotal for handling unseen or fast motion.
Figure 3: Tracking visualizations on fast-moving BONN scenes, highlighting the improved robustness due to category-agnostic motion decomposition and pose initialization.
Theoretical and Practical Implications
The proposed decomposition-based segmentation enables robust, generalizable dynamic SLAM without reliance on fixed semantic categories or prohibitively expensive MLP-based deformation fields. The hybrid Gaussian representation supports efficient, real-time operation, favoring explicit geometric reasoning and fast incremental learning. This model provides a template for scaling neural scene representations to unconstrained, online, dense mapping scenarios.
Practically, Flow4DGS-SLAM opens SLAM-based scene understanding and relocalization to highly dynamic and unconstrained environments, applicable to robotics, augmented reality, and autonomous navigation where dynamic occlusion and motion are prevalent.
Future Directions
Several immediate research extensions are implied:
- Integration of online loop closure in the dynamic SLAM pipeline.
- Extension to purely monocular, sensor-invariant, or multi-camera SLAM settings.
- Adaptive GMM complexity control for balancing fidelity and efficiency.
- Inductive transfer to unconstrained outdoor or large-scale environments.
- Fusion with self-supervised sparse correspondences or event-based sensing modalities.
Conclusion
Flow4DGS-SLAM advances dynamic SLAM by introducing a category-agnostic, optical flow-guided framework for 4D Gaussian Splatting. It achieves robust camera tracking, high-fidelity dynamic scene modeling, and online mapping that is reported to be orders of magnitude faster than prior dynamic SLAM methods, without category or scenario constraints. The explicit, hybrid scene representation and flow-driven optimization paradigm establish new baselines for dynamic neural SLAM and signal promising avenues for robust scene understanding in challenging environments (2604.22339).