MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

Published 12 Apr 2026 in cs.RO | (2604.10593v1)

Abstract: Feed-forward geometric foundation models can infer dense point clouds and camera motion directly from RGB streams, providing priors for monocular SLAM. However, their predictions are often view-dependent and noisy: geometry can vary across viewpoints and under image transformations, and local metric properties may drift between frames. We present MonoEM-GS, a monocular mapping pipeline that integrates such geometric predictions into a global Gaussian Splatting representation while explicitly addressing these inconsistencies. MonoEM-GS couples Gaussian Splatting with an Expectation-Maximization formulation to stabilize geometry, and employs ICP-based alignment for monocular pose estimation. Beyond geometry, MonoEM-GS parameterizes Gaussians with multi-modal features, enabling in-place open-set segmentation and other downstream queries directly on the reconstructed map. We evaluate MonoEM-GS on 7-Scenes, TUM RGB-D and Replica, and compare against recent baselines.


Authors (2)

Summary

The paper introduces the first integration of Expectation-Maximization with 3D Gaussian splatting to fuse monocular RGB data into a consistent global SLAM map.
It employs a probabilistic GMM representation augmented with DINOv3 multi-modal features, enabling in-place semantic segmentation and robust pose estimation.
Experiments on standard SLAM datasets demonstrate state-of-the-art mapping accuracy, superior normal consistency, and improved scale stability for room-scale scenes.

MonoEM-GS: Monocular EM-Based Gaussian Splatting for Geometrically Consistent SLAM

Introduction

The paper "MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM" (2604.10593) introduces a novel monocular SLAM pipeline that systematically addresses the limitations of recent feed-forward geometric foundation models in generating temporally consistent and metrically accurate scene reconstructions. The method, MonoEM-GS, fuses predicted dense point clouds from monocular RGB streams into a global scene model through an incremental Expectation-Maximization (EM) procedure, leveraging a sparse-to-dense 3D Gaussian Splatting (GS) representation. The core contributions include: (i) the first integration of EM and GS for monocular SLAM; (ii) the use of multi-modal features within each Gaussian to enable in-place open-set segmentation; and (iii) an ICP-based alignment strategy for robust monocular pose estimation.

Background and Motivation

Traditional monocular SLAM systems rely heavily on geometric optimization involving visual features, multiview pose graph construction, and local bundle adjustment. While advances in feed-forward geometric models such as MapAnything, VGGT, and MASt3R have shifted the field towards direct inference of dense geometry and camera poses, these approaches frequently exhibit artifacts: (i) predicted geometry is view-dependent and unstable across simple image or pose transformations, (ii) local metric properties drift, and (iii) global scene consistency is weak. These prediction inconsistencies rapidly accumulate, resulting in ambiguous 3D structures and scale drift, fundamentally limiting mapping and downstream tasks.

MonoEM-GS addresses these limitations directly. It enforces cross-frame geometric consistency by merging predictions through EM updates under a probabilistic, sparse GMM scene model. Because the map is a 3D Gaussian mixture in which each Gaussian carries high-dimensional features in addition to geometry, the same representation supports both robust reconstruction and open-vocabulary semantic queries.

Methodology

Scene Representation

The proposed framework represents the evolving map as a set of parameterized 3D Gaussians, $\mathcal{M} = \{(\mu_k, \Sigma_k, c_k, \bar{n}_k, f_k)\}_{k=1}^K$ , where each Gaussian captures its center, covariance, color, surface normal, and a DINOv3 feature embedding. This sparse mixture supports efficient storage of multi-view, multi-modal measurements and is inherently suitable for both GS rendering and probabilistic measurement fusion.
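The map definition above can be pictured as a struct-of-arrays container. The following is a minimal sketch, not the authors' implementation: the class name `GaussianMap`, its field names, and the `append` helper are hypothetical, and the feature dimension `D` is left open because the summary does not state it.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GaussianMap:
    """Hypothetical container for M = {(mu_k, Sigma_k, c_k, n_k, f_k)}, k = 1..K."""

    mu: np.ndarray      # (K, 3) Gaussian centers
    sigma: np.ndarray   # (K, 3, 3) covariance matrices
    color: np.ndarray   # (K, 3) RGB colors
    normal: np.ndarray  # (K, 3) surface normals
    feat: np.ndarray    # (K, D) feature embeddings; D is an assumption

    def __post_init__(self):
        # Check that all per-Gaussian arrays agree on K.
        k = self.mu.shape[0]
        assert self.mu.shape == (k, 3)
        assert self.sigma.shape == (k, 3, 3)
        assert self.color.shape == (k, 3)
        assert self.normal.shape == (k, 3)
        assert self.feat.ndim == 2 and self.feat.shape[0] == k

    def __len__(self):
        return self.mu.shape[0]

    def append(self, other):
        """Return a new map containing the Gaussians of both maps."""
        return GaussianMap(
            mu=np.concatenate([self.mu, other.mu]),
            sigma=np.concatenate([self.sigma, other.sigma]),
            color=np.concatenate([self.color, other.color]),
            normal=np.concatenate([self.normal, other.normal]),
            feat=np.concatenate([self.feat, other.feat]),
        )
```

Storing each attribute as one array keeps per-Gaussian lookups vectorized, which is convenient for the fusion and query steps described later.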

Pipeline Overview

Model Inference: For each incoming monocular RGB frame, MapAnything (or compatible models) predicts a dense point cloud and associated per-point features and colors. Predictions are obtained from a FIFO buffer of recent images, ensuring up-to-date local geometry and leveraging new frames for temporal refinement.
Map Initialization: Initial frames are clustered using FAISS to generate Gaussian means and covariances; corresponding features, colors, and normals are computed as weighted averages, resulting in a consistent multi-modal initialization.
Localization: Pose estimation is achieved through ICP-based coarse-to-fine alignment. Coherent associations are established by exploiting repeated predictions for buffered frames, followed by colored ICP alignment to a local submap.
Expectation-Maximization Map Update: Following pose alignment, new observations update the map via an EM procedure. For each measured point, the algorithm computes responsibilities over local Gaussian neighbors based on Mahalanobis distance, normal consistency, and feature cosine similarity. Only points well-explained by existing Gaussians are merged; outliers or novel-region points are deferred to a rasterization-based refinement.
Rasterization-based Refinement: Unmatched predictions are rendered and compared with the map using GS; unmapped regions are appended as new Gaussians. This ensures map growth while minimizing spurious integrations due to prediction noise.
In-Place Semantic Queries: Since each Gaussian is tagged with a high-dimensional DINOv3 feature, open-set segmentation and other queries are performed directly on the reconstruction, eliminating the need for external post-processing.
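The EM map update in the steps above can be sketched as follows. This is an illustrative reading under stated assumptions, not the paper's formulation: the way the three cues are combined (a weighted sum in log space), the weights `w_normal` and `w_feat`, the gating threshold `gate`, and the running-weight mean update are all hypothetical, and responsibilities are computed over all Gaussians rather than a local neighborhood for brevity.

```python
import numpy as np


def e_step(points, normals, feats, mu, sigma, map_normals, map_feats,
           w_normal=1.0, w_feat=1.0, gate=9.0):
    """Soft-assign N measured points to K Gaussians (assumed scoring rule)."""
    diff = points[:, None, :] - mu[None, :, :]               # (N, K, 3)
    sigma_inv = np.linalg.inv(sigma)                         # (K, 3, 3)
    # Squared Mahalanobis distance of every point to every Gaussian.
    maha2 = np.einsum("nki,kij,nkj->nk", diff, sigma_inv, diff)
    # Normal consistency: dot product of (assumed unit-length) normals.
    normal_dot = normals @ map_normals.T                     # (N, K)
    # Feature cosine similarity.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    g = map_feats / np.linalg.norm(map_feats, axis=1, keepdims=True)
    cos = f @ g.T                                            # (N, K)
    log_score = -0.5 * maha2 + w_normal * normal_dot + w_feat * cos
    log_score -= log_score.max(axis=1, keepdims=True)        # numerical stability
    resp = np.exp(log_score)
    resp /= resp.sum(axis=1, keepdims=True)
    # Points far from every Gaussian are treated as outliers or novel regions.
    inlier = maha2.min(axis=1) < gate
    return resp, inlier


def m_step_means(points, resp, inlier, mu, weight):
    """Update Gaussian means with inlier points; weight is the accumulated mass (> 0)."""
    r = resp[inlier]                                         # (M, K)
    p = points[inlier]                                       # (M, 3)
    new_weight = weight + r.sum(axis=0)                      # (K,)
    new_mu = (weight[:, None] * mu + r.T @ p) / new_weight[:, None]
    return new_mu, new_weight
```

Points flagged as non-inliers would be handed to the rasterization-based refinement rather than merged; covariances, colors, normals, and features could be updated with the same responsibility weights.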

Key Claims

Geometric Consistency: MonoEM-GS substantially mitigates the temporal inconsistencies of feed-forward 3D prediction models by integrating cross-frame information probabilistically.
Sparse Representation with Multi-Modal Features: The GMM representation, in contrast to denser point cloud-based approaches, supports in-map storage of high-dimensional features, enabling efficient semantic segmentation and feature queries.
Extensible, Modular Pipeline: The inference stage is decoupled, permitting the use of alternative geometric foundation models as future advances arrive.
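As a rough illustration of an in-map feature query, the sketch below selects Gaussians whose stored feature is close to a query embedding by cosine similarity. The function name `query_gaussians` and the `threshold` value are hypothetical, and how the query embedding is produced (for example from a reference image patch) is outside this sketch.

```python
import numpy as np


def query_gaussians(map_feats, query, threshold=0.5):
    """Return indices of Gaussians whose feature matches the query, plus all similarities."""
    f = map_feats / np.linalg.norm(map_feats, axis=1, keepdims=True)  # (K, D)
    q = query / np.linalg.norm(query)                                 # (D,)
    sim = f @ q                                                       # (K,) cosine similarity
    return np.flatnonzero(sim >= threshold), sim


# Three Gaussians with 2-D toy features; the query points along the first axis.
idx, sim = query_gaussians(
    np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    np.array([2.0, 0.0]),
)
```

Because the map is sparse, the query is a single matrix-vector product over K Gaussians rather than over every reconstructed point.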

Experimental Evaluation

Datasets and Baselines

Experiments are conducted on established SLAM datasets: TUM RGB-D, 7-Scenes, and Replica. Baselines include VGGT-SLAM, VGGT-SLAM2, MASt3R-SLAM, and MapAnything (using its native sequential pose estimation). All methods are evaluated for mapping accuracy, completeness, normal consistency, F1 score, and absolute trajectory error (ATE).
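For readers unfamiliar with these metrics, the sketch below shows commonly used definitions: accuracy as the mean distance from reconstructed points to the nearest ground-truth point, completeness as the reverse, F1 as the harmonic mean of precision and recall at a distance threshold, and ATE as the RMSE of translational error. The threshold `tau`, the brute-force nearest-neighbor search, and the assumption that trajectories are already aligned are choices made here for illustration; the paper's exact evaluation protocol may differ.

```python
import numpy as np


def nn_dist(a, b):
    """Distance from each point in a (N, 3) to its nearest neighbor in b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # (N, M), brute force
    return d.min(axis=1)


def mapping_metrics(pred, gt, tau=0.05):
    """Accuracy, completeness, and F1 between a predicted and a ground-truth point set."""
    d_pred = nn_dist(pred, gt)   # reconstruction -> ground truth
    d_gt = nn_dist(gt, pred)     # ground truth -> reconstruction
    precision = float((d_pred < tau).mean())
    recall = float((d_gt < tau).mean())
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": float(d_pred.mean()),
        "completeness": float(d_gt.mean()),
        "f1": f1,
    }


def ate_rmse(est, gt):
    """RMSE of position error between aligned (T, 3) trajectories."""
    return float(np.sqrt(((est - gt) ** 2).sum(axis=1).mean()))
```

For real point clouds, the brute-force distance matrix would normally be replaced by a spatial index such as a k-d tree.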

Quantitative Results

MonoEM-GS achieves the highest or second-highest performance across nearly all mapping and localization metrics. Notably:

On 7-Scenes, MonoEM-GS attains superior F1 score and surface normal consistency over all baselines, reflecting the efficacy of EM-based fusion for geometrically coherent reconstructions.
For scale consistency, MonoEM-GS, along with MASt3R-SLAM, exhibits highly stable metric estimation, a critical property for downstream robotics tasks.
In Replica 3D open-set segmentation, MonoEM-GS delivers the highest frequency-weighted mIoU (f-mIoU), which the summary attributes to better performance on dominant classes enabled by in-place DINOv3 feature assignments. OMCL achieves higher overall accuracy, but MonoEM-GS surpasses open-vocabulary methods in matched-class performance.
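The difference between plain mIoU and frequency-weighted mIoU can be made concrete with a small example. The sketch below uses the standard definition, in which each class IoU is weighted by that class's share of ground-truth labels; the function name `fw_miou` is hypothetical and the paper's exact evaluation protocol is not reproduced here.

```python
import numpy as np


def fw_miou(pred, gt, num_classes):
    """Return (mIoU, frequency-weighted mIoU) over classes present in the ground truth."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)          # rows: ground truth, columns: prediction
    tp = np.diag(conf)
    gt_count = conf.sum(axis=1)
    pred_count = conf.sum(axis=0)
    union = gt_count + pred_count - tp
    present = gt_count > 0                  # ignore classes absent from the ground truth
    iou = tp[present] / union[present]
    freq = gt_count[present] / gt_count.sum()
    return float(iou.mean()), float((freq * iou).sum())


# One dominant class (3 of 4 labels) predicted everywhere.
miou, f_miou = fw_miou(np.array([0, 0, 0, 0]), np.array([0, 0, 0, 1]), num_classes=2)
# miou == 0.375, f_miou == 0.5625
```

In this toy case the rare class is missed entirely, which halves the unweighted mIoU but costs far less under frequency weighting. This is why strong results on dominant classes raise f-mIoU more than mIoU.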

Qualitative Analysis

The sparser MonoEM-GS map produces visually cleaner and less ambiguous reconstructions than the denser VGGT-based pipelines, as evidenced by reduced shape artifacts and higher-quality mesh extraction.

Limitations

The system is currently restricted to small (room-scale) environments due to the absence of loop closure mechanisms. It also remains susceptible to front-surface noise if prediction failures are not adequately filtered by the EM and rasterization steps.

Implications and Future Directions

The presented methodology formalizes a probabilistic strategy for integrating noisy, view-dependent geometric predictions into a stable, queryable map, bridging the gap between feed-forward foundation models and consistent, extensible SLAM frameworks. Storing multi-modal features in the map directly enables open-vocabulary semantic queries, which makes the approach relevant for robotics and embodied AI systems that need rich scene understanding.

Potential future directions include:

Extension to large-scale and multi-room environments via loop closure and submap graph optimization.
Integration of additional modalities (e.g., text, affordance, objectness) through expanded per-Gaussian descriptors.
Application to real-world robotics scenarios where scalability, update speed, and semantic richness are critical.

Conclusion

MonoEM-GS unites EM-based probabilistic data association and 3D Gaussian Splatting with learned geometric priors from feed-forward models, delivering a monocular SLAM pipeline with enhanced geometric consistency, extensible semantic capabilities, and practical modularity. The system achieves state-of-the-art sparse-to-dense mapping quality and in-place semantic understanding in monocular RGB settings, although loop closure and robust operation beyond room-scale scenes remain future work (2604.10593).
