Neural Language of Thought Models

Published 2 Feb 2024 in cs.LG and cs.CV | (2402.01203v2)

Abstract: The Language of Thought Hypothesis suggests that human cognition operates on a structured, language-like system of mental representations. While neural language models can naturally benefit from the compositional structure inherently and explicitly expressed in language data, learning such representations from non-linguistic general observations, like images, remains a challenge. In this work, we introduce the Neural Language of Thought Model (NLoTM), a novel approach for unsupervised learning of LoTH-inspired representation and generation. NLoTM comprises two key components: (1) the Semantic Vector-Quantized Variational Autoencoder, which learns hierarchical, composable discrete representations aligned with objects and their properties, and (2) the Autoregressive LoT Prior, an autoregressive transformer that learns to generate semantic concept tokens compositionally, capturing the underlying data distribution. We evaluate NLoTM on several 2D and 3D image datasets, demonstrating superior performance in downstream tasks, out-of-distribution generalization, and image generation quality compared to patch-based VQ-VAE and continuous object-centric representations. Our work presents a significant step towards creating neural networks exhibiting more human-like understanding by developing LoT-like representations and offers insights into the intersection of cognitive science and machine learning.


Summary

  • The paper presents a new Neural Language of Thought Model (NLoTM) that uses block-level vector quantization and an autoregressive prior to achieve compositional scene decomposition.
  • Empirical results show improved FID scores and up to 99.1% OOD accuracy, validating the importance of factor-level representations in complex object-centric tasks.
  • The study bridges neural scene representation with symbolic reasoning, paving the way for advanced generative models with enhanced interpretability and generalization.

Neural Language of Thought Models: Structured Discrete Representation and Generation

Overview

The paper "Neural Language of Thought Models" (2402.01203) advances unsupervised compositional representation learning from non-linguistic data. It formalizes desiderata for neural systems emulating human-like mentalese: compositional scene decomposition, discrete symbolic concept abstraction, and efficient probabilistic compositional generation. The proposed Neural Language of Thought Model (NLoTM) combines an object-centric discrete encoder—Semantic Vector-Quantized VAE (SVQ)—with an object-property-level autoregressive prior—Autoregressive LoT Prior (ALP). This architecture demonstrates competitive results in downstream object-centric tasks and generative modeling, particularly addressing out-of-distribution generalization failures found in patch-based and continuous models.

Theoretical Motivation

Human cognition is theorized to rely on compositional, symbol-like mental representations ("Language of Thought"). Artificial neural networks trained on language naturally internalize such structure, but learning it directly from scene observations (e.g., images) has remained elusive. Previous advances in object-centric learning, e.g., slot attention, have enabled semantic decomposition but generally rely on continuous representations and do not facilitate density-based compositional sampling. Mainstream discrete models (VQ-VAE, dVAE, VQ-GAN) quantize at patch level, failing to capture global semantics and suffering combinatorial inefficiency in representing object variations.

NLoTM explicitly addresses these gaps via block-level discrete factorization, enabling combinatorial generalization with tractable codebooks and supporting autoregressive generative modeling over objects and their properties.
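The codebook-size argument above can be made concrete with a little arithmetic. The sketch below is our illustration, not from the paper: it compares a single monolithic codebook that needs one entry per full combination of property values against factor-level codebooks that store each property's values separately; the example numbers (8 values, 4 properties) are hypothetical.

```python
def monolithic_codes(values_per_factor: int, num_factors: int) -> int:
    """One code per full combination of property values (exponential growth)."""
    return values_per_factor ** num_factors

def factored_codes(values_per_factor: int, num_factors: int) -> int:
    """One small codebook per semantic factor, reused across objects (linear growth)."""
    return values_per_factor * num_factors

# e.g. 8 possible values for each of 4 properties (color, shape, size, position)
print(monolithic_codes(8, 4))  # 4096 entries in a single joint codebook
print(factored_codes(8, 4))    # 32 entries across 4 factor-level codebooks
```

The gap widens rapidly: with one more property, the joint codebook grows eightfold while the factored one adds only eight entries.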

Semantic Vector-Quantized VAE (SVQ): Architecture and Factorization

SVQ extends slot attention-based object decomposition by introducing block-level vector quantization within each slot. Each slot (object representation) is partitioned into M blocks, each describing a distinct property (e.g., color, shape, position) and mapped to a shared codebook specific to that semantic factor. This approach avoids exponential codebook growth with combinatorial properties, with block granularity preventing entanglement and promoting reuse of discrete codes.

Figure 1: Comparison between VQ-VAE, Quantized Slots, and SVQ; SVQ achieves semantic factor-level quantization, drastically reducing codebook complexity for combinatorial object configurations.

SVQ replaces slot-level recurrent and residual blocks with block-level equivalents, supporting independent updating and quantization. EMA codebook updates and random embedding restarts stabilize training and mitigate codebook collapse. The resulting discrete latent $z_q \in \mathbb{R}^{N \times M \times d_c}$ is interpreted as a set of symbolic tokens, analogous to words in a sentence, with clean separation of semantic factors.
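The core quantization step can be sketched in a few lines. This is a minimal NumPy illustration of block-level vector quantization under our own assumptions (the paper's implementation also involves slot attention, a straight-through gradient estimator, and EMA updates, all omitted here): each slot is split into M blocks, and block m is snapped to its nearest entry in the codebook shared across slots for factor m.

```python
import numpy as np

def svq_quantize(slots: np.ndarray, codebooks: np.ndarray):
    """slots: (N, M, d) continuous block features for N slots x M factors.
    codebooks: (M, K, d) one K-entry codebook per semantic factor.
    Returns the quantized blocks (N, M, d) and the chosen indices (N, M)."""
    N, M, d = slots.shape
    indices = np.empty((N, M), dtype=np.int64)
    quantized = np.empty_like(slots)
    for m in range(M):
        # squared distances between every slot's block m and factor m's codes
        dists = ((slots[:, m, None, :] - codebooks[m][None]) ** 2).sum(-1)  # (N, K)
        idx = dists.argmin(axis=1)
        indices[:, m] = idx
        quantized[:, m] = codebooks[m][idx]
    return quantized, indices

rng = np.random.default_rng(0)
z_q, ids = svq_quantize(rng.normal(size=(4, 3, 8)), rng.normal(size=(3, 16, 8)))
print(z_q.shape, ids.shape)  # (4, 3, 8) (4, 3)
```

Note that codebook m is indexed only by blocks in position m, which is what ties each codebook to a single semantic factor.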

Autoregressive LoT Prior (ALP): Object-Property Level Generation

ALP models the joint distribution over SVQ codes using a transformer decoder, flattening the slots and blocks into a single token sequence. Unlike patch-based priors, ALP samples objects and their semantic factors autoregressively, with scene order encoded positionally. This enables object-wise compositional synthesis and superior generative efficiency: the number of tokens required is $O(NM)$, decoupled from image size and tied directly to the scene's semantic complexity.

Generative sampling proceeds by drawing factor-level codes for each object one by one, subsequently decoding them into scenes via the SVQ decoder.
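The sampling procedure can be sketched as a plain autoregressive loop. This is a toy illustration, not the paper's code: the real prior is a transformer conditioned on all previously drawn codes, whereas the stand-in below falls back to uniform sampling when no prior is supplied; the function name and signature are our own.

```python
import random

def sample_scene_tokens(num_slots: int, num_blocks: int, codebook_size: int,
                        prior=None, seed: int = 0) -> list[list[int]]:
    """Autoregressively draw one code index per (slot, block) position.
    `prior(context)` should return per-code probabilities; a uniform
    stand-in is used when none is given."""
    rng = random.Random(seed)
    context: list[int] = []
    for _ in range(num_slots * num_blocks):  # O(NM) steps, independent of image size
        if prior is None:
            tok = rng.randrange(codebook_size)  # uniform stand-in for the transformer
        else:
            probs = prior(context)
            tok = rng.choices(range(codebook_size), weights=probs)[0]
        context.append(tok)
    # regroup the flat token stream into per-slot factor codes
    return [context[i * num_blocks:(i + 1) * num_blocks] for i in range(num_slots)]

scene = sample_scene_tokens(num_slots=4, num_blocks=3, codebook_size=16)
print(len(scene), len(scene[0]))  # 4 3
```

The returned per-slot code lists would then be embedded via the factor codebooks and passed to the SVQ decoder to render the scene.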

Empirical Findings

NLoTM was benchmarked on 2D Sprites and 3D CLEVR variants, including texture-rich scenes. It exhibited the following properties:

  • Improved FID scores (≈40–85 for CLEVR variants) and higher Generation Accuracy in multi-object scene synthesis over VQ-VAE, dVAE, and GENESIS-v2.
  • Superior out-of-distribution generalization in downstream odd-one-out and property comparison tasks; up to 99.1% OOD accuracy with SVQ codebook latent representations.
  • SVQ block-level quantization empirically outperformed naive slot-level quantization, confirming the theoretical hypothesis that factor-level representation is critical for complex scenes.
  • Segmentation (FG-ARI) competitive with SysBinder; significantly above vanilla slot attention.
  • Generative scaling demonstrated on Google Scanned Objects, with qualitative results and FID improving as model size increases.

The model performed robustly even with discrete bottlenecks in challenging datasets requiring recognition and compositional generation of complex object attributes.

Comparative Analysis

  • Patch-level VQ-VAEs fail to model global semantics and manifest blurry or malformed object syntheses as scene/textural complexity increases.
  • Increasing patch-based transformer prior capacity in dVAE marginally improves generative scores but cannot rival block-level SVQ performance.
  • Downstream OOD generalization requires latent representations that encode factor-level invariances without relying solely on discrete code indices—prototype vectors in SVQ satisfy this, yielding near-perfect OOD identification in relational tasks.
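The last point, replacing raw integer indices with their prototype vectors, amounts to a simple embedding lookup. The sketch below is our illustration of that idea under assumed shapes: each discrete code is mapped back to its codebook entry and the per-factor embeddings are concatenated into a downstream feature vector.

```python
import numpy as np

def indices_to_features(indices: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """indices: (N, M) chosen code per slot/factor; codebooks: (M, K, d).
    Returns (N, M*d) features built from the prototype embeddings."""
    N, M = indices.shape
    # look up factor m's prototype for every slot, then concatenate factors
    feats = np.stack([codebooks[m][indices[:, m]] for m in range(M)], axis=1)  # (N, M, d)
    return feats.reshape(N, -1)

cb = np.random.default_rng(1).normal(size=(3, 16, 8))
idx = np.array([[0, 5, 2],
                [1, 5, 2]])  # two slots differing only in their first factor
f = indices_to_features(idx, cb)
print(f.shape)  # (2, 24)
```

Because slots sharing a code reuse the same prototype vector, distances between these features reflect factor-level similarity, which is what a relational OOD task can exploit.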

Implications, Limitations, and Future Directions

The NLoTM paradigm effectively bridges object-centric vision and symbolic modeling, aligning neural scene representations with Fodor-style mentalese abstractions. Factor-level discrete sampling and modular codebooks can be leveraged for new classes of generative models in planning, simulation, and concept manipulation. The model addresses the long-standing combinatorial explosion in object-centric discrete modeling and offers tractable density-based scene generation.

Limitations include current evaluation on synthetic scenes and absence of explicit continuous factor integration (position, pose). Future extensions should address realistic, high-resolution, naturalistic environments and hybridize discrete and continuous latent variables to model natural scenes more accurately. Incorporation of more sophisticated priors or hierarchical grammars could further improve generalization and abstraction capacities.

Ethical considerations include potential misuse for generating realistic fake images, necessitating the development of control mechanisms and responsible deployment protocols.

Conclusion

Neural Language of Thought Models provide an operational framework for unsupervised, object-centric, compositional discrete representation learning and scene generation, realizing LoT desiderata through slot-factor vector quantization and autoregressive object-property priors. This approach demonstrates measurable advances in compositional generalization, interpretability, and generative modeling efficiency, motivating future developments in structured neural reasoning and neuro-symbolic integration.
