Neural Language of Thought Models

Published 2 Feb 2024 in cs.LG and cs.CV | (2402.01203v2)

Abstract: The Language of Thought Hypothesis suggests that human cognition operates on a structured, language-like system of mental representations. While neural language models can naturally benefit from the compositional structure inherently and explicitly expressed in language data, learning such representations from non-linguistic general observations, like images, remains a challenge. In this work, we introduce the Neural Language of Thought Model (NLoTM), a novel approach for unsupervised learning of LoTH-inspired representation and generation. NLoTM comprises two key components: (1) the Semantic Vector-Quantized Variational Autoencoder, which learns hierarchical, composable discrete representations aligned with objects and their properties, and (2) the Autoregressive LoT Prior, an autoregressive transformer that learns to generate semantic concept tokens compositionally, capturing the underlying data distribution. We evaluate NLoTM on several 2D and 3D image datasets, demonstrating superior performance in downstream tasks, out-of-distribution generalization, and image generation quality compared to patch-based VQ-VAE and continuous object-centric representations. Our work presents a significant step towards creating neural networks exhibiting more human-like understanding by developing LoT-like representations and offers insights into the intersection of cognitive science and machine learning.


Summary

  • The paper presents a new Neural Language of Thought Model (NLoTM) that uses block-level vector quantization and an autoregressive prior to achieve compositional scene decomposition.
  • Empirical results show improved FID scores and up to 99.1% OOD accuracy, validating the importance of factor-level representations in complex object-centric tasks.
  • The study bridges neural scene representation with symbolic reasoning, paving the way for advanced generative models with enhanced interpretability and generalization.

Neural Language of Thought Models: Structured Discrete Representation and Generation

Overview

The paper "Neural Language of Thought Models" (2402.01203) advances unsupervised compositional representation learning from non-linguistic data. It formalizes desiderata for neural systems emulating human-like mentalese: compositional scene decomposition, discrete symbolic concept abstraction, and efficient probabilistic compositional generation. The proposed Neural Language of Thought Model (NLoTM) combines an object-centric discrete encoder—Semantic Vector-Quantized VAE (SVQ)—with an object-property-level autoregressive prior—Autoregressive LoT Prior (ALP). This architecture demonstrates competitive results in downstream object-centric tasks and generative modeling, particularly addressing out-of-distribution generalization failures found in patch-based and continuous models.

Theoretical Motivation

Human cognition is theorized to rely on compositional, symbol-like mental representations ("Language of Thought"). Artificial neural networks trained on language naturally internalize such structure, but learning it directly from scene observations (e.g., images) has remained elusive. Previous advances in object-centric learning, e.g., slot attention, have enabled semantic decomposition but generally rely on continuous representations and do not facilitate density-based compositional sampling. Mainstream discrete models (VQ-VAE, dVAE, VQ-GAN) quantize at patch level, failing to capture global semantics and suffering combinatorial inefficiency in representing object variations.

NLoTM explicitly addresses these gaps via block-level discrete factorization, enabling combinatorial generalization with tractable codebooks and supporting autoregressive generative modeling over objects and their properties.
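The codebook-size argument above can be made concrete with a little arithmetic. The sketch below is our illustration, not from the paper: it compares a single monolithic codebook that needs one entry per full combination of property values against factor-level codebooks that store each property's values separately; the example numbers (8 values, 4 properties) are hypothetical.

```python
def monolithic_codes(values_per_factor: int, num_factors: int) -> int:
    """One code per full combination of property values (exponential growth)."""
    return values_per_factor ** num_factors

def factored_codes(values_per_factor: int, num_factors: int) -> int:
    """One small codebook per semantic factor, reused across objects (linear growth)."""
    return values_per_factor * num_factors

# e.g. 8 possible values for each of 4 properties (color, shape, size, position)
print(monolithic_codes(8, 4))  # 4096 entries in a single joint codebook
print(factored_codes(8, 4))    # 32 entries across 4 factor-level codebooks
```

The gap widens rapidly: with one more property, the joint codebook grows eightfold while the factored one adds only eight entries.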

Semantic Vector-Quantized VAE (SVQ): Architecture and Factorization

SVQ extends slot attention-based object decomposition by introducing block-level vector quantization within each slot. Each slot (object representation) is partitioned into M blocks, each describing a distinct property (e.g., color, shape, position) and mapped to a shared codebook specific to that semantic factor. This approach avoids exponential codebook growth with combinatorial properties, with block granularity preventing entanglement and promoting reuse of discrete codes.

Figure 1: Comparison between VQ-VAE, Quantized Slots, and SVQ; SVQ achieves semantic factor-level quantization, drastically reducing codebook complexity for combinatorial object configurations.

SVQ replaces slot-level recurrent and residual blocks with block-level equivalents, supporting independent updating and quantization. EMA codebook updates and random embedding restarts stabilize training and mitigate codebook collapse. The resulting discrete latent $z_q \in \mathbb{R}^{N \times M \times d_c}$ is interpreted as a set of symbolic tokens, analogous to words in a sentence, with clean separation of semantic factors.
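The core quantization step can be sketched in a few lines. This is a minimal NumPy illustration of block-level vector quantization under our own assumptions (the paper's implementation also involves slot attention, a straight-through gradient estimator, and EMA updates, all omitted here): each slot is split into M blocks, and block m is snapped to its nearest entry in the codebook shared across slots for factor m.

```python
import numpy as np

def svq_quantize(slots: np.ndarray, codebooks: np.ndarray):
    """slots: (N, M, d) continuous block features for N slots x M factors.
    codebooks: (M, K, d) one K-entry codebook per semantic factor.
    Returns the quantized blocks (N, M, d) and the chosen indices (N, M)."""
    N, M, d = slots.shape
    indices = np.empty((N, M), dtype=np.int64)
    quantized = np.empty_like(slots)
    for m in range(M):
        # squared distances between every slot's block m and factor m's codes
        dists = ((slots[:, m, None, :] - codebooks[m][None]) ** 2).sum(-1)  # (N, K)
        idx = dists.argmin(axis=1)
        indices[:, m] = idx
        quantized[:, m] = codebooks[m][idx]
    return quantized, indices

rng = np.random.default_rng(0)
z_q, ids = svq_quantize(rng.normal(size=(4, 3, 8)), rng.normal(size=(3, 16, 8)))
print(z_q.shape, ids.shape)  # (4, 3, 8) (4, 3)
```

Note that codebook m is indexed only by blocks in position m, which is what ties each codebook to a single semantic factor.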

Autoregressive LoT Prior (ALP): Object-Property Level Generation

ALP models the joint distribution over SVQ codes using a transformer decoder, flattening the slots and blocks into a single token sequence. Unlike patch-based priors, ALP samples objects and their semantic factors autoregressively, with scene order encoded positionally. This enables object-wise compositional synthesis and superior generative efficiency: the number of tokens required is $O(NM)$, decoupled from image size and tied directly to the scene's semantic complexity.

Generative sampling proceeds by drawing factor-level codes for each object one by one, subsequently decoding them into scenes via the SVQ decoder.
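The sampling procedure can be sketched as a plain autoregressive loop. This is a toy illustration, not the paper's code: the real prior is a transformer conditioned on all previously drawn codes, whereas the stand-in below falls back to uniform sampling when no prior is supplied; the function name and signature are our own.

```python
import random

def sample_scene_tokens(num_slots: int, num_blocks: int, codebook_size: int,
                        prior=None, seed: int = 0) -> list[list[int]]:
    """Autoregressively draw one code index per (slot, block) position.
    `prior(context)` should return per-code probabilities; a uniform
    stand-in is used when none is given."""
    rng = random.Random(seed)
    context: list[int] = []
    for _ in range(num_slots * num_blocks):  # O(NM) steps, independent of image size
        if prior is None:
            tok = rng.randrange(codebook_size)  # uniform stand-in for the transformer
        else:
            probs = prior(context)
            tok = rng.choices(range(codebook_size), weights=probs)[0]
        context.append(tok)
    # regroup the flat token stream into per-slot factor codes
    return [context[i * num_blocks:(i + 1) * num_blocks] for i in range(num_slots)]

scene = sample_scene_tokens(num_slots=4, num_blocks=3, codebook_size=16)
print(len(scene), len(scene[0]))  # 4 3
```

The returned per-slot code lists would then be embedded via the factor codebooks and passed to the SVQ decoder to render the scene.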

Empirical Findings

NLoTM was benchmarked on 2D Sprites and 3D CLEVR variants, including texture-rich scenes. It exhibited the following properties:

  • Improved FID scores (≈40–85 for CLEVR variants) and higher Generation Accuracy in multi-object scene synthesis over VQ-VAE, dVAE, and GENESIS-v2.
  • Superior out-of-distribution generalization in downstream odd-one-out and property comparison tasks; up to 99.1% OOD accuracy with SVQ codebook latent representations.
  • SVQ block-level quantization empirically outperformed naive slot-level quantization, confirming the theoretical hypothesis that factor-level representation is critical for complex scenes.
  • Segmentation (FG-ARI) competitive with SysBinder; significantly above vanilla slot attention.
  • Generative scaling demonstrated on Google Scanned Objects, with qualitative results and FID improving as model size increases.

The model performed robustly even with discrete bottlenecks in challenging datasets requiring recognition and compositional generation of complex object attributes.

Comparative Analysis

  • Patch-level VQ-VAEs fail to model global semantics and manifest blurry or malformed object syntheses as scene/textural complexity increases.
  • Increasing patch-based transformer prior capacity in dVAE marginally improves generative scores but cannot rival block-level SVQ performance.
  • Downstream OOD generalization requires latent representations that encode factor-level invariances without relying solely on discrete code indices—prototype vectors in SVQ satisfy this, yielding near-perfect OOD identification in relational tasks.
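The last point, replacing raw integer indices with their prototype vectors, amounts to a simple embedding lookup. The sketch below is our illustration of that idea under assumed shapes: each discrete code is mapped back to its codebook entry and the per-factor embeddings are concatenated into a downstream feature vector.

```python
import numpy as np

def indices_to_features(indices: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """indices: (N, M) chosen code per slot/factor; codebooks: (M, K, d).
    Returns (N, M*d) features built from the prototype embeddings."""
    N, M = indices.shape
    # look up factor m's prototype for every slot, then concatenate factors
    feats = np.stack([codebooks[m][indices[:, m]] for m in range(M)], axis=1)  # (N, M, d)
    return feats.reshape(N, -1)

cb = np.random.default_rng(1).normal(size=(3, 16, 8))
idx = np.array([[0, 5, 2],
                [1, 5, 2]])  # two slots differing only in their first factor
f = indices_to_features(idx, cb)
print(f.shape)  # (2, 24)
```

Because slots sharing a code reuse the same prototype vector, distances between these features reflect factor-level similarity, which is what a relational OOD task can exploit.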

Implications, Limitations, and Future Directions

The NLoTM paradigm effectively bridges object-centric vision and symbolic modeling, aligning neural scene representations with Fodor-style mentalese abstractions. Factor-level discrete sampling and modular codebooks can be leveraged for new classes of generative models in planning, simulation, and concept manipulation. The model addresses the long-standing combinatorial explosion in object-centric discrete modeling and offers tractable density-based scene generation.

Limitations include current evaluation on synthetic scenes and absence of explicit continuous factor integration (position, pose). Future extensions should address realistic, high-resolution, naturalistic environments and hybridize discrete and continuous latent variables to model natural scenes more accurately. Incorporation of more sophisticated priors or hierarchical grammars could further improve generalization and abstraction capacities.

Ethical considerations include potential misuse for generating realistic fake images, necessitating the development of control mechanisms and responsible deployment protocols.

Conclusion

Neural Language of Thought Models provide an operational framework for unsupervised, object-centric, compositional discrete representation learning and scene generation, realizing LoT desiderata through slot-factor vector quantization and autoregressive object-property priors. This approach demonstrates measurable advances in compositional generalization, interpretability, and generative modeling efficiency, motivating future developments in structured neural reasoning and neuro-symbolic integration.
