Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations
Abstract: Several accounts of human cognition posit that our intelligence is rooted in our ability to form abstract composable concepts, ground them in our environment, and reason over these grounded entities. This trifecta of human thought has remained elusive in modern intelligent machines. In this work, we investigate whether slot representations extracted from visual scenes serve as appropriate compositional abstractions for grounding and reasoning. We present the Neural Slot Interpreter (NSI), which learns to ground object semantics in slots. At the core of NSI is a nested schema that uses simple syntax rules to organize the object semantics of a scene into object-centric schema primitives. Then, the NSI metric learns to ground primitives into slots through a structured contrastive learning objective that reasons over the intermodal alignment. Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of correspondences learned by NSI. From a scene representation standpoint, we find that emergent NSI slots that move beyond the image grid by binding to spatial objects facilitate improved visual grounding compared to conventional bounding-box-based approaches. From a data efficiency standpoint, we empirically validate that NSI learns more generalizable representations from a fixed amount of annotation data than the traditional approach. We also show that the grounded slots surpass unsupervised slots in real-world object discovery and scale with scene complexity. Finally, we investigate the downstream efficacy of the grounded slots. Vision Transformers trained on grounding-aware NSI tokenizers using as few as ten tokens outperform patch-based tokens on challenging few-shot classification tasks.
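The abstract describes the NSI metric only at a high level, so the following is a minimal sketch of what a contrastive grounding objective between schema primitives and slots could look like, not the authors' implementation. It assumes that primitives and slots are already embedded in a shared vector space, that a scene-level score is obtained by letting each primitive pick its best-matching slot and averaging (an assumed aggregation rule), and that the loss is a one-directional InfoNCE-style term over a batch. The names `scene_score`, `contrastive_loss`, and `temperature` are hypothetical.

```python
import math


def dot(u, v):
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def scene_score(primitive_vecs, slot_vecs):
    """Alignment score between one scene's primitives and one image's slots.

    Assumption (not from the paper): each primitive is matched to its
    highest-similarity slot, and the per-primitive maxima are averaged.
    """
    best = [max(dot(p, s) for s in slot_vecs) for p in primitive_vecs]
    return sum(best) / len(best)


def contrastive_loss(batch_primitives, batch_slots, temperature=0.1):
    """InfoNCE-style loss, schema-to-image direction only.

    batch_primitives[i] and batch_slots[i] come from the same scene and form
    the positive pair; all other pairings in the batch act as negatives.
    """
    n = len(batch_primitives)
    total = 0.0
    for i in range(n):
        logits = [
            scene_score(batch_primitives[i], batch_slots[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / n


# Toy batch: two scenes, one primitive each, two slots per image.
primitives = [[[1.0, 0.0]], [[0.0, 1.0]]]
slots = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]

aligned = contrastive_loss(primitives, slots)
swapped = contrastive_loss(primitives, slots[::-1])
print(aligned < swapped)
```

In this toy batch the correctly paired slots yield a much lower loss than the swapped pairing, which is the behavior a grounding objective of this kind is meant to reward. A symmetric image-to-schema term could be added in the same way.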