CLIP-GS: CLIP-Informed Gaussian Splatting for View-Consistent 3D Indoor Semantic Understanding
Abstract: Exploiting 3D Gaussian Splatting (3DGS) with Contrastive Language-Image Pre-Training (CLIP) models for open-vocabulary 3D semantic understanding of indoor scenes has emerged as an attractive research focus. Existing methods typically attach high-dimensional CLIP semantic embeddings to 3D Gaussians and leverage view-inconsistent 2D CLIP semantics as Gaussian supervision, resulting in efficiency bottlenecks and deficient 3D semantic consistency. To address these challenges, we present CLIP-GS, which efficiently achieves a coherent semantic understanding of 3D indoor scenes via the proposed Semantic Attribute Compactness (SAC) and 3D Coherent Regularization (3DCR). SAC exploits the naturally unified semantics within objects to learn compact yet effective semantic Gaussian representations, enabling highly efficient rendering (>100 FPS). 3DCR enforces semantic consistency in both the 2D and 3D domains: in 2D, 3DCR utilizes refined view-consistent semantic outcomes derived from 3DGS to establish cross-view coherence constraints; in 3D, 3DCR encourages similar features among 3D Gaussian primitives associated with the same object, leading to more precise and coherent segmentation results. Extensive experimental results demonstrate that our method significantly surpasses existing state-of-the-art approaches, achieving mIoU improvements of 21.20% and 13.05% on the ScanNet and Replica datasets, respectively, while maintaining real-time rendering speed. Furthermore, our approach exhibits superior performance even with sparse input data, substantiating its robustness.
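The 3D component of 3DCR, as described above, encourages Gaussian primitives belonging to the same object to carry similar semantic features. A minimal sketch of such a regularizer is shown below; the function name, the use of per-Gaussian object IDs as supervision, and the cosine-distance-to-centroid formulation are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def coherence_loss(features, object_ids):
    """Hypothetical 3D coherence regularizer (illustrative, not the
    paper's exact formulation): pull each Gaussian's L2-normalized
    semantic feature toward the mean feature of the object it belongs
    to, measured by cosine distance."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    loss, count = 0.0, 0
    for oid in np.unique(object_ids):
        group = feats[object_ids == oid]
        if len(group) < 2:
            continue  # a single Gaussian imposes no coherence constraint
        centroid = group.mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        # accumulate cosine distance of each member to its object centroid
        loss += np.sum(1.0 - group @ centroid)
        count += len(group)
    return loss / max(count, 1)
```

With features identical within each object the loss is zero, and it grows as features within an object diverge, which is the qualitative behavior a same-object coherence term should exhibit.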