Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
Abstract: 3D visual grounding is a challenging task that typically requires direct and dense supervision, notably a semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting, which learns from only 3D scenes and QA pairs and in which prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach rests on two core insights: first, language constraints (e.g., a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; second, we can query large language models to distill such constraints from language properties. We show that LARC improves the performance of prior works on naturally supervised 3D visual grounding and demonstrates a wide range of 3D visual reasoning capabilities, from zero-shot composition to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors, enabling learning in settings without dense supervision.
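As a minimal sketch of the idea of language-derived constraints acting as regularization, consider a converse-relation constraint such as "left of(a, b) implies right of(b, a)", which an LLM could distill from word semantics. The function below penalizes a concept learner's relation scores for violating such a constraint; all names here (`symmetry_violation`, `left_of`, `right_of`, `lambda_reg`) are hypothetical illustrations, not LARC's actual implementation.

```python
import numpy as np

def symmetry_violation(rel_scores: np.ndarray, inv_scores: np.ndarray) -> float:
    """Penalty for violating a converse-relation constraint distilled from
    language, e.g. "left of(a, b) should imply right of(b, a)".

    rel_scores[i, j]: model's score that relation(i, j) holds.
    inv_scores[i, j]: same for the converse relation.
    The constraint says rel(i, j) and inv(j, i) should agree.
    """
    return float(np.mean((rel_scores - inv_scores.T) ** 2))

# Toy example: two objects with consistent converse-relation score maps.
left_of = np.array([[0.0, 0.9],
                    [0.1, 0.0]])
right_of = np.array([[0.0, 0.1],
                     [0.9, 0.0]])
penalty = symmetry_violation(left_of, right_of)  # 0.0: constraint satisfied

# A training loss would then add this as a regularizer, e.g.
# loss = grounding_loss + lambda_reg * penalty
```

Such a penalty requires no object labels: it is computed purely from the model's own predictions, which is what makes language constraints usable in the naturally supervised setting.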