Symmetry-Conditioned VAE
- Symmetry-Conditioned VAE is an unsupervised model that leverages learnable group symmetries to align latent factors without requiring factor labels.
- It utilizes an extended ELBO combined with algebraic and equivariance regularizations to enforce disentangled, axis-aligned latent representations.
- The approach employs a learnable symmetry codebook and composite group operators to achieve multi-factor disentanglement, evaluated via the m-FVMₖ metric.
A Symmetry-Conditioned Variational Autoencoder (CVAE), as defined in the Composite Factor-Aligned Symmetry Learning (CFASL) framework, is a class of unsupervised generative latent variable models in which disentanglement emerges from explicit, learnable group symmetries acting on the latent space. Unlike prior approaches that require factor labels or known generative factors, the CFASL methodology enables a VAE to discover and align latent dimensions with independent generative symmetries by integrating a suite of algebraic regularization and equivariance conditions, operationalized entirely through a learnable symmetry codebook and composite group actions inferred from pairs of data samples (Jung et al., 2024).
1. Loss Augmentation: ELBO with Symmetry and Equivariance Constraints
The foundational loss function of the CVAE is an extended evidence lower bound (ELBO), augmented with several families of regularization and alignment losses. The total loss is

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{ELBO}} + \lambda_{\text{com}}\mathcal{L}_{\text{com}} + \lambda_{\text{par}}\mathcal{L}_{\text{par}} + \lambda_{\text{orth}}\mathcal{L}_{\text{orth}} + \lambda_{\text{sp}}\mathcal{L}_{\text{sp}} + \lambda_{\text{fac}}\mathcal{L}_{\text{fac}} + \lambda_{\text{enc}}\mathcal{L}_{\text{eq}}^{\text{enc}} + \lambda_{\text{dec}}\mathcal{L}_{\text{eq}}^{\text{dec}},$$

where each $\lambda$ is a hyper-parameter and the terms encode algebraic and statistical constraints aligned with symmetry discovery; $\mathcal{L}_{\text{eq}}^{\text{enc}}$ and $\mathcal{L}_{\text{eq}}^{\text{dec}}$ enforce group equivariance in the encoder and decoder, respectively. This composite loss systematically injects inductive bias for the emergence of factor-aligned, axis-aligned, and group-structured latent representations.
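Structurally, the augmented objective is a weighted sum of the ELBO and the regularizers. A minimal sketch (the term names are placeholders, not the paper's exact notation):

```python
def total_loss(elbo, reg_terms, lambdas):
    """Augmented objective as a weighted sum (sketch).

    reg_terms: dict mapping regularizer names to scalar loss values
    lambdas:   dict mapping the same names to hyper-parameter weights
    """
    return elbo + sum(lambdas[name] * val for name, val in reg_terms.items())
```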
2. Learnable Symmetry Codebook
The central algebraic structure is the symmetry codebook $\mathcal{C}$. For latent dimension $d$, a library of $d \times d$ matrices forms a basis for the relevant Lie algebra $\mathfrak{g} \subseteq \mathfrak{gl}(d)$. The codebook is partitioned into $k$ sections, one per postulated generative factor: $\mathcal{C} = \{\mathcal{C}_1, \dots, \mathcal{C}_k\}$, with each section $\mathcal{C}_i = \{A_{i,1}, \dots, A_{i,n}\}$ and each generator $A_{i,j} \in \mathbb{R}^{d \times d}$ parameterized as a trainable combination of basis elements. Group elements are recovered via the matrix exponential $g = \exp(A)$. Each section mediates transformations corresponding to a candidate latent factor, enabling unsupervised discovery and alignment of symmetries with latent dimensions through explicit parametrization and subsequent algebraic regularization.
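A minimal sketch of such a codebook, assuming it is stored as a `(k, n, d, d)` tensor of section generators; the matrix exponential is approximated here with a truncated Taylor series (a real implementation would use a numerically robust `expm`, e.g. scaling-and-squaring):

```python
import numpy as np

def expm(A, terms=30):
    """Truncated Taylor-series matrix exponential (adequate for small generators)."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for i in range(1, terms):
        term = term @ A / i
        out = out + term
    return out

# Hypothetical codebook: k factor sections, each holding n trainable
# generators acting on a d-dimensional latent space.
d, k, n = 4, 2, 3
rng = np.random.default_rng(0)
codebook = rng.normal(scale=0.1, size=(k, n, d, d))  # trainable in practice

# A group element recovered from one generator of section 0:
g = expm(codebook[0, 0])
```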
3. Composite Symmetry Operators and Factor Selection
To capture multi-factor generative changes, CFASL infers composite group elements from paired data. Given images $(x^1, x^2)$, their posterior parameters are concatenated to form a comparison vector $[\mu^1, \sigma^1, \mu^2, \sigma^2]$. An attention mechanism operates within each section $\mathcal{C}_i$ to interpolate between the available generators, producing a composite direction $\tilde{A}_i = \sum_j \alpha_{i,j} A_{i,j}$. Sectionwise on/off selection is implemented with Gumbel-Softmax switches $s_i \in \{0, 1\}$, informed by pseudo-labels derived from latent coordinate differences, so that only factors undergoing change are activated. The resulting group element is

$$g = \exp\!\Big(\sum_{i=1}^{k} s_i \tilde{A}_i\Big)$$

and maps $z^1$ to an estimate of $z^2$. This construction enables compositionality, equivariance, and data-driven selection of which factors contribute to a given sample transformation, without explicit knowledge of ground-truth factors.
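The attention-plus-switch aggregation can be sketched as follows; `attn_logits` and `switch` stand in for the outputs of the (hypothetical) attention network and Gumbel-Softmax selector, which in the actual model are computed from the comparison vector:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def composite_generator(codebook, attn_logits, switch):
    """Combine generators within each section by attention, then gate sections.

    codebook:    (k, n, d, d) section generators
    attn_logits: (k, n) per-section attention scores
    switch:      (k,) 0/1 section on/off decisions (Gumbel-Softmax at train time)
    """
    k, n, d, _ = codebook.shape
    A = np.zeros((d, d))
    for i in range(k):
        alpha = softmax(attn_logits[i])                # interpolate generators
        A_i = np.tensordot(alpha, codebook[i], axes=1)  # composite direction
        A = A + switch[i] * A_i                         # only changed factors act
    return A
```

The composite group element is then the matrix exponential of the returned generator.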
4. Encoder and Decoder Equivariance Integration
The CVAE architecture is structurally conventional, employing standard convolutional encoders and decoders (or optionally, a Spatial-Broadcast decoder). Uniquely, group equivariance conditions are imposed:
$$\mathrm{Enc}(T_g\, x) \approx g \cdot \mathrm{Enc}(x), \qquad \mathrm{Dec}(g \cdot z) \approx T_g\, \mathrm{Dec}(z),$$

where $T_g$ is the (unknown) input-space symmetry corresponding to the latent group element $g$. These constraints are enforced softly via mean-square penalties ($\mathcal{L}_{\text{eq}}^{\text{enc}}$ on the latent and $\mathcal{L}_{\text{eq}}^{\text{dec}}$ on the decoded reconstructions). Data pairs are supplied to the same encoder and decoder parameterization, eschewing the need for auxiliary towers or siamese computation streams. This design induces equivariance between input-space and latent/decoded transformations.
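With paired samples, both soft penalties reduce to mean-square errors: the latent transported by $g$ should match the second sample's code, and its decoding should match the second image. A sketch with a hypothetical `decoder` callable:

```python
import numpy as np

def equivariance_losses(z1, z2, g, decoder, x2):
    """Soft equivariance penalties for a pair (x1, x2) with inferred group element g.

    Encoder side: g should carry z1 onto z2 in latent space.
    Decoder side: decoding g @ z1 should reproduce x2.
    """
    z_hat = g @ z1
    l_enc = np.mean((z_hat - z2) ** 2)
    l_dec = np.mean((decoder(z_hat) - x2) ** 2)
    return l_enc, l_dec
```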
5. Algebraic Regularization and Disentanglement Losses
The symmetry codebook and factor organization are enforced with several loss terms:
- Commutativity ($\mathcal{L}_{\text{com}}$): All learned Lie algebra elements are constrained to commute, ensuring group compositionality: $[A_{i,j}, A_{p,q}] = 0$ for all pairs of generators.
- Parallelism within factors ($\mathcal{L}_{\text{par}}$): Generators within the same factor section are driven to be parallel.
- Orthogonality across factors ($\mathcal{L}_{\text{orth}}$): Sections corresponding to different factors are incentivized to be mutually orthogonal.
- Sparsity / axis-alignment ($\mathcal{L}_{\text{sp}}$): Each one-parameter subgroup is regularized to move only one latent coordinate, establishing axis-alignment.
- Factor prediction ($\mathcal{L}_{\text{fac}}$): The section on/off mechanism is trained with cross-entropy against pseudo-labels from latent means.
- Encoder and decoder equivariance ($\mathcal{L}_{\text{eq}}^{\text{enc}}$, $\mathcal{L}_{\text{eq}}^{\text{dec}}$): Penalize deviations from group-action equivariance in both encoding and decoding.
Collectively, these losses effect a disentangled, group-structured latent representation aligned with independent generative axes, without recourse to supervision or labeled factor information.
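Three of the algebraic terms (commutativity, parallelism, orthogonality) depend only on the codebook itself. A sketch using Frobenius inner products on a hypothetical `(k, n, d, d)` codebook; the exact loss forms in the paper may differ:

```python
import numpy as np

def algebraic_losses(codebook):
    """Commutativity, intra-section parallelism, cross-section orthogonality.

    codebook: (k, n, d, d) array of section generators.
    """
    k, n, d, _ = codebook.shape
    flat = codebook.reshape(k * n, d * d)
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    gram = unit @ unit.T                       # pairwise cosine similarities
    sec = np.repeat(np.arange(k), n)
    same = sec[:, None] == sec[None, :]
    off = ~np.eye(k * n, dtype=bool)

    l_par = np.mean((1.0 - np.abs(gram))[same & off])  # parallel within a section
    l_orth = np.mean(np.abs(gram)[~same])              # orthogonal across sections

    mats = codebook.reshape(k * n, d, d)
    l_com = 0.0                                        # squared commutator norms
    for i in range(k * n):
        for j in range(i + 1, k * n):
            comm = mats[i] @ mats[j] - mats[j] @ mats[i]
            l_com += np.sum(comm ** 2)
    return l_com, l_par, l_orth
```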
6. Training Procedure
The CFASL training protocol for the CVAE is as follows:
- Sample a minibatch of data.
- Pair samples into tuples $(x^1, x^2)$.
- For each pair:
  - Encode both samples to obtain posteriors $q_\phi(z \mid x^1)$ and $q_\phi(z \mid x^2)$.
  - Draw latent codes $z^1, z^2$ from the respective posteriors.
  - Compute attention, switches, and aggregate the composite symmetry $g$.
  - Decode reconstructions and apply the symmetry transformation $g z^1$ in latent space.
- Compute all terms in the augmented loss as above.
- Update all model and codebook parameters by backpropagation.
- Iterate until convergence.
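The loop above can be condensed into a schematic step; `enc`, `dec`, and `infer_symmetry` are assumed callables standing in for the encoder, decoder, and the attention/switch/exponential pipeline, and the KL and algebraic codebook terms are elided for brevity:

```python
import numpy as np

def training_step(x1, x2, enc, dec, infer_symmetry, weights):
    """One CFASL-style step on a single pair (schematic, not the full objective).

    enc/dec:         encoder and decoder callables (posterior variances elided)
    infer_symmetry:  returns the composite group element g for the pair
    weights:         hyper-parameter weights for the equivariance penalties
    """
    mu1, mu2 = enc(x1), enc(x2)          # posterior means
    g = infer_symmetry(mu1, mu2)         # attention + switches + matrix exp
    z_hat = g @ mu1                      # transport first latent toward second
    recon = np.mean((dec(mu1) - x1) ** 2) + np.mean((dec(mu2) - x2) ** 2)
    l_enc_eq = np.mean((z_hat - mu2) ** 2)
    l_dec_eq = np.mean((dec(z_hat) - x2) ** 2)
    return recon + weights["enc"] * l_enc_eq + weights["dec"] * l_dec_eq
```

In practice this scalar is backpropagated through the encoder, decoder, codebook, attention, and switch parameters jointly.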
The method is designed to induce symmetry-based disentanglement in fully unsupervised fashion, without ever specifying factor names or identities during training (Jung et al., 2024).
7. Multi-Factor Disentanglement Evaluation: m-FVMₖ
To quantify multi-factor disentanglement, an extended evaluation metric, multi-factor fixed-variance match (m-FVMₖ), generalizes the FactorVAE metric to scenarios where multiple factors are held fixed simultaneously. For each sub-experiment:
- $k$ factors are held constant.
- Minibatches are constructed so that these factors do not vary, the standard deviation of each latent coordinate is computed, and the $k$ smallest-variance coordinates are compared to the known fixed factors.
- The score tallies coincidences with the ground-truth fixed factors over all factor combinations and training epochs.
This metric rigorously measures disentanglement performance when controlling for multiple simultaneous generative axes and confirms that the symmetry-conditioned VAE recovers latent disentanglement under both single- and multi-factor variations (Jung et al., 2024).
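The variance-voting core of one m-FVMₖ sub-experiment can be sketched as follows (a simplification: the full metric also maps latent coordinates to factors and aggregates votes across many minibatches):

```python
import numpy as np

def mfvm_vote(latents, k):
    """Vote for the k least-varying latent coordinates of one minibatch.

    latents: (batch, d) latent codes for samples whose k generative
             factors were held fixed during minibatch construction.
    """
    std = latents.std(axis=0)
    winners = np.argsort(std)[:k]    # coordinates with smallest deviation
    return sorted(winners.tolist())
```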