
Probabilistic Slot Attention Models

Updated 4 February 2026
  • Probabilistic Slot Attention is an object-centric framework that integrates EM-inspired Gaussian mixture updates to enable uncertainty-aware object representations.
  • It employs differentiable E- and M-steps to update slot means, variances, and mixture weights, improving identifiability and clustering over traditional Slot Attention.
  • Empirical evaluations show that PSA outperforms deterministic methods on benchmarks like CLEVR by enhancing both generative quality and slot coherence.

Probabilistic Slot Attention (PSA) refers to a family of object-centric representation learning frameworks that generalize standard Slot Attention by embedding it within an explicit probabilistic (typically Gaussian mixture-based) inference and generative paradigm. Probabilistic Slot Attention encompasses models that augment or replace the deterministic k-means-like clustering of Slot Attention with EM-inspired updates, Gaussian mixture model (GMM) uncertainty, and aggregate probabilistic priors, thereby yielding more expressive, identifiable, and theoretically grounded slot-based object-factorizations of complex visual scenes (Kori et al., 2024, Wang et al., 2023, Kirilenko et al., 2023).

1. Probabilistic Slot Attention Formulation

Probabilistic Slot Attention models encode an image $x \in \mathbb{R}^{H \times W \times C}$ into a set of $N$ feature vectors $z = \{z_n\}_{n=1}^N$, which are then clustered into $K$ object-centric slot representations $\{s_k\}_{k=1}^K$ via a learnable, differentiable variant of Expectation-Maximization for GMMs. The core probabilistic structure posits

$$p(z \mid \Theta) = \prod_{n=1}^N \sum_{k=1}^K \pi_k\, \mathcal{N}(z_n; \mu_k, \Sigma_k)$$

with $\Theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$ denoting mixture weights, means, and covariances. PSA generalizes vanilla Slot Attention by maintaining not only slot means, but also variances and mixture weights, updating all via a differentiable EM-like procedure over $T$ iterations:

  • E-step: Compute responsibilities (soft-assignments, interpreted as attentions) using current slots.
  • M-step: Re-estimate $\pi_k$, $\mu_k$, and $\Sigma_k$ based on attentions and feature values.
  • Differentiability: All operations (softmax, summation, GRU or MLP updates) are differentiable, allowing gradient-based learning from downstream objectives.

This probabilistic loop enables slot representations to encode both object prototypes and statistical uncertainty, with automatic relevance determination via learned mixture weights that naturally prune unused slots (Kori et al., 2024, Kirilenko et al., 2023).
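The E- and M-steps above can be sketched in NumPy. This is a minimal illustrative implementation under simplifying assumptions (diagonal covariances, no learned query/key projections or GRU refinement, and the function name `psa_em_step` is ours), not the exact published method:

```python
import numpy as np

def psa_em_step(z, mu, var, pi, eps=1e-8):
    """One EM-style Probabilistic Slot Attention update (illustrative sketch).

    z:   (N, D) feature vectors
    mu:  (K, D) slot means
    var: (K, D) diagonal slot variances
    pi:  (K,)   mixture weights
    """
    # E-step: responsibilities A[n, k] ∝ pi_k * N(z_n; mu_k, diag(var_k))
    diff = z[:, None, :] - mu[None, :, :]                        # (N, K, D)
    log_lik = -0.5 * np.sum(diff**2 / var[None] + np.log(2 * np.pi * var[None]), axis=-1)
    log_post = np.log(pi[None] + eps) + log_lik                  # (N, K)
    log_post -= log_post.max(axis=1, keepdims=True)              # numerically stable softmax
    A = np.exp(log_post)
    A /= A.sum(axis=1, keepdims=True)                            # normalize over slots

    # M-step: responsibility-weighted re-estimation of slot parameters
    Nk = A.sum(axis=0) + eps                                     # effective points per slot
    mu_new = (A.T @ z) / Nk[:, None]
    diff_new = z[:, None, :] - mu_new[None, :, :]
    var_new = np.einsum('nk,nkd->kd', A, diff_new**2) / Nk[:, None] + eps
    pi_new = Nk / Nk.sum()                                       # updated mixture weights
    return mu_new, var_new, pi_new, A
```

In the actual models, such a step is iterated $T$ times with learned projections and neural refinement, and every operation is differentiable so gradients from a downstream loss flow back through all iterations.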

2. Generative Modeling and ELBO Training

PSA can serve as a building block for hierarchical probabilistic generative models. In Slot-VAE (Wang et al., 2023), slot attention is embedded within a VAE framework to generate novel scenes. The generative process is defined hierarchically:

  • Global latent $z_g \sim \mathcal{N}(0, I)$ encodes high-level scene features.
  • Per-object slots $z_k \sim \mathcal{N}(\mu_k(z_g), \Sigma_k(z_g))$ model object appearance and attributes, with parameters produced by a shared MLP from $z_g$.
  • The image $x$ is generated conditionally on all object slots, mixing object decodings via spatial masks:

$$p(x \mid z_{1:K}) = \prod_{i,j} \mathcal{N}\!\left( x_{i,j};\; \sum_k \alpha_{i,j}^k\, \rho_k(z_k),\; \sigma^2 I \right)$$

The learning objective maximizes the ELBO:

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(z_g, z_{1:K} \mid x)}\big[\log p(x \mid z_{1:K})\big] - \mathrm{KL}\big(q(z_g \mid x) \,\|\, p(z_g)\big) - \sum_{k=1}^K \mathrm{KL}\big(q(z_k \mid x) \,\|\, p(z_k \mid z_g)\big)$$

This structure enforces scene-level coherence and permutational alignment between inference and generation, resolving standard slot permutation ambiguities (Wang et al., 2023).
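For diagonal Gaussian posteriors and priors, the two KL terms in the ELBO have closed forms. The helper below is an illustrative single-example sketch (the function names and argument layout are our assumptions, not Slot-VAE's actual code):

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )

def slot_vae_elbo(recon_log_lik, mu_g, var_g, mu_k, var_k, mu_pk, var_pk):
    """Slot-VAE-style ELBO sketch for one example.

    recon_log_lik:  scalar log p(x | z_{1:K}) from the decoder
    mu_g, var_g:    (Dg,)  posterior over the global latent z_g
    mu_k, var_k:    (K, D) posteriors over the K object slots
    mu_pk, var_pk:  (K, D) slot priors p(z_k | z_g) produced from z_g
    """
    # KL between the global posterior and the standard-normal prior p(z_g)
    kl_global = gaussian_kl(mu_g, var_g, np.zeros_like(mu_g), np.ones_like(var_g))
    # Per-slot KLs against the z_g-conditioned priors, summed over the K slots
    kl_slots = gaussian_kl(mu_k, var_k, mu_pk, var_pk).sum()
    return recon_log_lik - kl_global - kl_slots
```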

3. Inference, EM-style Updates, and Attention Mechanisms

The inference process in PSA follows an attention-augmented probabilistic EM paradigm. For each image, EM iterations proceed as:

  • Initialization: Slots initialized from standard Gaussian or learned states, mixture weights uniform.
  • E-step (Attention Update): Compute slot responsibilities as normalized GMM likelihoods or their neural parameterizations:

$$A_{nk} = \frac{\pi_k^{(t)}\, \mathcal{N}\big(k_n;\, W_q \mu_k^{(t)}, \Sigma_k^{(t)}\big)}{\sum_j \pi_j^{(t)}\, \mathcal{N}\big(k_n;\, W_q \mu_j^{(t)}, \Sigma_j^{(t)}\big)}$$

  • M-step: Slot means and variances are updated as responsibility-weighted averages of feature values; mixture weights updated by average responsibilities.
  • Slot Representation: Final slot parameters (means and variances, optionally concatenated and projected) are used as object-centric codes for downstream tasks or generative decoding.

The slot mixture module (SMM) (Kirilenko et al., 2023) provides a related GMM-based update with additional GRU/MLP refinement, allowing the propagation of slot uncertainties and enabling richer slot representations compared to deterministic Slot Attention.
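The GRU-style refinement used after the M-step can be sketched as a gated update that mixes the previous slot state with the EM-derived update. The parameterization below is hypothetical (plain NumPy, single weight matrices per gate) and stands in for the learned recurrent cell:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_refine(slots, updates, W, U, b):
    """GRU-style refinement of slot states (illustrative sketch).

    slots, updates: (K, D) previous slot states and EM-derived updates
    W, U:           dicts of (D, D) input/recurrent weights for gates 'r', 'u', 'h'
    b:              dict of (D,) biases
    """
    r = sigmoid(updates @ W['r'] + slots @ U['r'] + b['r'])       # reset gate
    u = sigmoid(updates @ W['u'] + slots @ U['u'] + b['u'])       # update gate
    h = np.tanh(updates @ W['h'] + (r * slots) @ U['h'] + b['h']) # candidate state
    return (1 - u) * slots + u * h                                # gated interpolation
```

The gating lets the model interpolate between keeping the previous slot and adopting the new mixture-based estimate, which is how uncertainty-aware updates are smoothed across iterations.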

4. Identifiability Guarantees and Theoretical Properties

A key advance of recent PSA frameworks is their theoretical analysis of slot identifiability. By imposing an aggregate GMM prior over slot representations, PSA establishes that—under mild assumptions of non-degeneracy and weak injectivity of the decoder—the object-centric slot parameters are provably identifiable up to an equivalence class (slot permutation, affine transforms, and translation):

  • The aggregate posterior $q(z)$ over slots is a global GMM, formed by averaging local GMMs from each dataset example.
  • PSA's marginal model $p_\theta(x) = \int p(x \mid s)\, q(s)\, ds$ is identified up to permutation and slot-wise affine transformation, formalizing when discovered slot representations are unique and reliable (Kori et al., 2024).
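The aggregate-posterior construction in the first bullet is mechanically simple: averaging $M$ per-example local mixtures yields one global GMM with $M K$ components. A sketch (shapes and function name are illustrative):

```python
import numpy as np

def aggregate_posterior(local_pis, local_mus, local_vars):
    """Form the aggregate GMM q(z) by averaging per-example mixtures (sketch).

    local_pis:             (M, K)    mixture weights per dataset example
    local_mus, local_vars: (M, K, D) slot means and diagonal variances
    Returns a flat GMM with M*K components, component (m, k) weighted pi_{mk} / M.
    """
    M, K = local_pis.shape
    weights = (local_pis / M).reshape(M * K)   # average of the M local mixtures
    mus = local_mus.reshape(M * K, -1)
    vars_ = local_vars.reshape(M * K, -1)
    return weights, mus, vars_
```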

This result distinguishes PSA from earlier methods, which lacked such theoretical identifiability guarantees.

5. Empirical Evaluation and Benchmark Results

Empirical studies on canonical object-centric datasets such as CLEVR, ObjectsRoom, ClevrTex, and SpriteWorld demonstrate that probabilistic slot attention variants outperform deterministic Slot Attention and alternative baselines in both identifiability and object-centric metrics:

| Method   | CLEVR SMCC↑ | CLEVR FG-ARI↑ | CLEVR FID↓ | ObjRoom SMCC↑ |
|----------|-------------|---------------|------------|---------------|
| SlotAttn | 0.56±0.02   | 0.96±0.01     | 41.8±2.8   | 0.46±0.01     |
| PSA      | 0.58±0.06   | 0.85±0.02     | 52.7±1.7   | 0.59±0.01     |
| PSA-Proj | 0.61±0.06   | 0.95±0.00     | 39.4±8.3   | 0.59±0.00     |

  • Slot identifiability: Mean Pearson correlation (SMCC) and Slot Identifiability Score (SIS) favor PSA, especially on synthetic and real-world composite scenes.
  • Generation quality: Scene-structure accuracy, FID, and FG-ARI indicate that PSA maintains or exceeds the reconstruction and compositional scene fidelity of prior slot-based models, supporting the use of more expressive, probabilistic slots (Kori et al., 2024, Kirilenko et al., 2023, Wang et al., 2023).

6. Variants, Extensions, and Comparative Analysis

Multiple probabilistic slot modules have been explored:

  • Slot-VAE (Wang et al., 2023): Integrates (deterministic) Slot Attention within a VAE, turning slots into random variables with priors conditioned on a global latent, enabling generative scene modeling.
  • Slot Mixture Module (SMM) (Kirilenko et al., 2023): Adopts a neuralized GMM clustering procedure, updating both slot means and variances, and propagates uncertainty into downstream tasks with improved segmentation and set property prediction.
  • Identifiable PSA (Kori et al., 2024): Formalizes the EM/GMM update at the heart of Slot Attention, introduces a well-defined aggregate mixture prior, and provides identifiability guarantees with empirical validation.

Distinct from deterministic Slot Attention—which approximates a soft $k$-means without slot variances or probabilistic priors—these probabilistic extensions enrich slot representations with uncertainty and structure, improve robustness to varying object counts, and enable principled comparison and matching of object-centric features across samples and models.
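The automatic-relevance-determination effect of learned mixture weights, which lets these models discard unused slots, can be illustrated with a simple threshold rule (the helper and threshold value are hypothetical, shown only to make the pruning idea concrete):

```python
import numpy as np

def prune_slots(pi, mu, var, threshold=0.01):
    """Drop slots whose learned mixture weight has collapsed below a threshold.

    pi:      (K,)   mixture weights
    mu, var: (K, D) slot means and diagonal variances
    """
    keep = pi >= threshold                    # slots the mixture still uses
    pi_kept = pi[keep] / pi[keep].sum()       # renormalize remaining weights
    return pi_kept, mu[keep], var[keep]
```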

7. Practical Implications and Future Directions

The adoption of probabilistic slot attention mechanisms accelerates progress toward modular, interpretable, and identifiable object-centric scene decompositions. The identifiability theory underlying PSA (Kori et al., 2024) suggests a principled foundation for scaling slot-based models to complex, high-dimensional data. A plausible implication is that future research will refine mixture prior structures, explore more expressive slot-object interactions, and further align probabilistic clustering objectives with the requirements of downstream generative modeling and reasoning tasks. Empirical results show persistent gains in both slot identifiability and compositional generation fidelity across challenging visual benchmarks, indicating the practical significance of the probabilistic slot paradigm.
