Tri-Factor Disentanglement in Latent Representations
- Tri-factor disentanglement is the process of decomposing sequential data into two distinct static factors and one dynamic factor, enabling unsupervised learning of independent generative sources.
- The methodology utilizes a Structured Koopman Autoencoder that applies linear latent dynamics and a spectral loss to enforce clear separation between static and dynamic components.
- Quantitative evaluations on benchmarks such as Sprites and MUG demonstrate high accuracy in factor isolation and robust disentanglement performance.
Tri-factor disentanglement refers to the decomposition of observed sequential data, such as video frames, into three or more mutually independent semantic components in a latent representation: typically, two distinct static factors (static₁, static₂) and one dynamic factor. This methodology generalizes beyond the conventional dichotomy of static versus dynamic representations and enables unsupervised and interpretable modeling of complex datasets comprising multiple latent generative sources. The Structured Koopman Autoencoder framework provides the first fully unsupervised approach for such multifactor disentanglement by introducing a strong inductive bias in the form of linearly-evolving latent dynamics, operationalized through the Koopman operator perspective (Berman et al., 2023).
1. Latent Representation Organization
In tri-factor disentanglement using Structured Koopman Autoencoders, a video sequence or time series of length $T$ with frames in $\mathbb{R}^{C \times H \times W}$ is encoded via a deep encoder to yield a latent tensor $Z \in \mathbb{R}^{B \times T \times k}$, where $B$ is batch size and $k$ is the total latent dimensionality. The $k$-dimensional latent code at time $t$, $z_t \in \mathbb{R}^k$, is explicitly partitioned as a concatenation $z_t = [z_t^{s_1}; z_t^{s_2}; z_t^{d}]$ with $z_t^{s_1} \in \mathbb{R}^{k_{s_1}}$, $z_t^{s_2} \in \mathbb{R}^{k_{s_2}}$, and $z_t^{d} \in \mathbb{R}^{k_d}$, where $k_{s_1} + k_{s_2} + k_d = k$. In the Sprites benchmark, a fixed total dimensionality $k$ is used, split into designated static and dynamic dimensions. For explicit tri-factorization, disjoint subsets of the static dimensions are assigned to the two distinct static factors, and the remainder of the latent code is assigned to the dynamics.
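The partitioning above can be sketched in NumPy. The block sizes and the random tensor below are illustrative stand-ins, not the paper's settings or actual encoder outputs:

```python
import numpy as np

# Hypothetical block sizes (illustrative, not the paper's settings).
k_s1, k_s2, k_d = 8, 8, 24
k = k_s1 + k_s2 + k_d

rng = np.random.default_rng(0)
B, T = 4, 8                          # batch size, sequence length
Z = rng.standard_normal((B, T, k))   # stand-in for encoder output

# Partition each per-frame code z_t into [static1; static2; dynamic].
z_s1 = Z[..., :k_s1]
z_s2 = Z[..., k_s1:k_s1 + k_s2]
z_d = Z[..., k_s1 + k_s2:]
```

Concatenating the three blocks along the last axis recovers the original code, so the split is lossless.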
2. Koopman Latent Dynamics and Linearization Assumption
A foundational assumption is that, in the latent embedding, frame-to-frame evolution adheres to linear dynamics as prescribed by Koopman theory. Specifically, there exists a matrix $C \in \mathbb{R}^{k \times k}$ such that

$$z_{t+1} \approx C z_t,$$

where $z_t$ is the latent code at timestep $t$. $C$ is computed for each batch via least-squares minimization. Defining $X = [z_1, \dots, z_{T-1}]$ as the stack of all “past” latents and $Y = [z_2, \dots, z_T]$ as all “future” ones, $C$ is obtained by

$$C = Y X^{\dagger},$$

where $X^{\dagger}$ denotes the Moore–Penrose pseudoinverse. This enables the dynamics and statics in the latent space to be distinguished structurally via the spectral content of $C$.
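The least-squares fit can be illustrated in NumPy. A synthetic linear trajectory stands in for real encoder outputs here, so the fitted operator reproduces the dynamics exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
k, T = 6, 50

# Synthetic ground-truth operator, rescaled so its spectral radius is < 1
# and the trajectory stays well-conditioned.
A = rng.standard_normal((k, k))
C_true = 0.95 * A / np.max(np.abs(np.linalg.eigvals(A)))

# Roll out a linear latent trajectory z_{t+1} = C_true z_t.
Z = np.empty((T, k))
Z[0] = rng.standard_normal(k)
for t in range(T - 1):
    Z[t + 1] = C_true @ Z[t]

# Past/future stacks as columns: X holds z_1..z_{T-1}, Y holds z_2..z_T.
X = Z[:-1].T          # shape (k, T-1)
Y = Z[1:].T           # shape (k, T-1)

# Least-squares Koopman estimate: C = Y X^+ (Moore-Penrose pseudoinverse).
C = Y @ np.linalg.pinv(X)
```

Because the trajectory is exactly linear, `C @ X` matches `Y` up to floating-point error; on real latents the fit instead minimizes the residual in the least-squares sense.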
3. Spectral Loss and Factor Decomposition
To enforce separation between static and dynamic latent factors, a spectral penalty is imposed on the eigenvalues $\lambda_1, \dots, \lambda_k$ of $C$. Eigenvalues with $\lambda_i \approx 1$ correspond to static directions (no change across time), while $|\lambda_j| < 1$ identifies dynamic modes. Letting the first $k_s$ eigenvalues index static latent subspaces and the final $k_d$ index dynamics, the loss has the structure

$$\mathcal{L}_{\text{spec}} = \sum_{i=1}^{k_s} |\lambda_i - 1|^2 \;+\; \sum_{j=k_s+1}^{k} \max\bigl(0,\, |\lambda_j| - (1 - \delta)\bigr)^2,$$

with $\lambda_i$ the eigenvalues of $C$ and $\delta > 0$ controlling the annular gap around $|\lambda| = 1$ to force clear spectral separation. This design creates two disjoint subspaces in the spectrum of $C$, corresponding precisely to static factors (eigenvalues clustered at $1$) and dynamic factors (eigenvalues separated from the unit circle by margin $\delta$).
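A minimal NumPy sketch of such a penalty follows. The exact functional form and the eigenvalue assignment used in the paper may differ; sorting by distance to 1 is an illustrative heuristic here:

```python
import numpy as np

def spectral_loss(C, k_static, delta=0.1):
    """Penalty on the eigenvalues of a Koopman matrix C: the k_static
    eigenvalues closest to 1 are pulled toward 1, while the remaining
    'dynamic' eigenvalues are pushed inside the annulus |lam| <= 1 - delta."""
    lam = np.linalg.eigvals(C)
    lam = lam[np.argsort(np.abs(lam - 1.0))]      # static-like eigenvalues first
    static_term = np.sum(np.abs(lam[:k_static] - 1.0) ** 2)
    dynamic_term = np.sum(
        np.maximum(0.0, np.abs(lam[k_static:]) - (1.0 - delta)) ** 2
    )
    return static_term + dynamic_term

# Block-diagonal example: 2 static modes (eigenvalue 1), 2 dynamic modes
# already inside the annulus, so the penalty vanishes.
C = np.diag([1.0, 1.0, 0.6, 0.5])
loss = spectral_loss(C, k_static=2, delta=0.1)
```

A spectrum that violates either condition (e.g., all eigenvalues at 0.5 with two of them designated static) yields a strictly positive penalty.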
4. Model and Training Procedure
The architecture consists of a deep convolutional encoder, a differentiable Koopman module, and an LSTM-based decoder. The encoder applies five strided convolutional layers (kernel size 4, stride 2, progressively increasing channels), each followed by BatchNorm and LeakyReLU, producing features that are mapped via a unidirectional LSTM to yield temporal codes. The Koopman module computes $C$ using a batchwise SVD-based pseudoinverse of the latent codes. Decoding is performed by an LSTM followed by five transposed convolutional layers, producing pixel outputs with a final sigmoid activation.
Three losses are optimized jointly:
- Reconstruction loss: $\mathcal{L}_{\text{rec}} = \|x - \hat{x}\|_2^2$, comparing decoded frames $\hat{x}$ to inputs $x$
- Prediction (dynamics) loss: $\mathcal{L}_{\text{pred}} = \|Y - C X\|_F^2$, penalizing deviation of the latents from the linear Koopman evolution
- Spectral loss: $\mathcal{L}_{\text{spec}}$, as above

The total objective is

$$\mathcal{L} = \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}} + \lambda_{\text{pred}}\,\mathcal{L}_{\text{pred}} + \lambda_{\text{spec}}\,\mathcal{L}_{\text{spec}}.$$

Empirically, suitable choices of $\lambda_{\text{rec}}$, $\lambda_{\text{pred}}$, $\lambda_{\text{spec}}$, and the spectral gap $\delta$ yield effective disentanglement after 800 epochs with the Adam optimizer. No additional regularization beyond the spectral term is needed, though small Gaussian noise or blur may be added to the inputs to regularize $C$.
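The three terms can be assembled as follows in NumPy. All tensors are random stand-ins for encoder/decoder outputs, and the loss weights are hypothetical placeholders, not the paper's tuned values:

```python
import numpy as np

rng = np.random.default_rng(2)
T, k, k_s, delta = 20, 4, 2, 0.1

# Stand-ins for model outputs (random, for illustration only).
Z = rng.standard_normal((T, k))                   # latent codes z_1..z_T
X, Y = Z[:-1].T, Z[1:].T                          # past / future stacks
C = Y @ np.linalg.pinv(X)                         # batchwise Koopman fit
x = rng.standard_normal((T, 64))                  # "frames", flattened
x_hat = x + 0.01 * rng.standard_normal((T, 64))   # decoded frames

L_rec = np.mean((x - x_hat) ** 2)                 # reconstruction term
L_pred = np.mean((Y - C @ X) ** 2)                # linear-dynamics term
lam = np.linalg.eigvals(C)
lam = lam[np.argsort(np.abs(lam - 1.0))]          # static-like eigenvalues first
L_spec = (np.sum(np.abs(lam[:k_s] - 1.0) ** 2)
          + np.sum(np.maximum(0.0, np.abs(lam[k_s:]) - (1.0 - delta)) ** 2))

# Hypothetical weights; the paper's tuned values are not reproduced here.
w_rec, w_pred, w_spec = 1.0, 1.0, 0.1
total = w_rec * L_rec + w_pred * L_pred + w_spec * L_spec
```

In a real training loop each term would be computed with a differentiable framework so gradients flow through the encoder, the pseudoinverse, and the eigenvalues.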
5. Post-hoc Identification and Swapping of Latent Factors
After training, $C$ is eigendecomposed as $C = V \Lambda V^{-1}$, producing eigenvectors $v_1, \dots, v_k$ (the columns of $V$). Any latent code $z$ can be written as

$$z = \sum_{i=1}^{k} c_i v_i,$$

where the $c_i = \langle z, u_i \rangle$ are projections onto the dual (left) eigenvectors $u_i$, given by the rows of $V^{-1}$. To empirically identify and validate the semantic content of static₁ and static₂, the static subspace ($k_s$ dimensions) is partitioned into index sets $I_{s_1}$ (static₁) and $I_{s_2}$ (static₂) by inspecting which eigensubsets control specific factors (e.g., hair color vs. skin color). This can be automated via classifier-driven subset selection or performed manually for low-dimensional statics.
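The eigenbasis expansion can be verified numerically. The random matrix below is a stand-in for a learned Koopman operator:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 5
C = rng.standard_normal((k, k))      # stand-in for a learned Koopman matrix

# Eigendecomposition C = V diag(lam) V^{-1}: columns of V are the right
# eigenvectors v_i; rows of V^{-1} act as the dual (left) eigenvectors.
lam, V = np.linalg.eig(C)
V_inv = np.linalg.inv(V)

z = rng.standard_normal(k)           # some latent code
c = V_inv @ z                        # coefficients c_i in the eigenbasis
z_rebuilt = V @ c                    # z = sum_i c_i v_i
```

For a real matrix with complex eigenvalue pairs, `V` and the coefficients are complex, but the reconstruction still matches `z` up to numerical precision.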
Under factorial swap, for any two sequences $a$ and $b$, the new code is

$$\tilde{z} = \sum_{i \in I_{s_1}} c_i^{(b)} v_i + \sum_{i \in I_{s_2}} c_i^{(a)} v_i + \sum_{i \in I_d} c_i^{(a)} v_i,$$

which, after decoding, yields a sequence with static₁ taken from $b$ and static₂/dynamics preserved from $a$. The isolation of factors is quantitatively measured using pretrained classifiers (e.g., hair-swap accuracy), yielding high static₁ accuracy (90.59% on Sprites) with other factors at chance, as well as via t-SNE embeddings, which produce sharply separated grids corresponding to all factor combinations.
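The swap itself is a coefficient exchange in the eigenbasis. In this NumPy sketch, the basis `V` and the index sets `I_s1`, `I_s2`, `I_d` are hypothetical placeholders for quantities identified after training:

```python
import numpy as np

rng = np.random.default_rng(4)
k = 6
V = rng.standard_normal((k, k))      # stand-in eigenvector basis of C
V_inv = np.linalg.inv(V)

# Hypothetical index sets in eigen-coordinates (for illustration only):
# which eigendirections carry static1, static2, and the dynamics.
I_s1, I_s2, I_d = [0, 1], [2, 3], [4, 5]

z_a = rng.standard_normal(k)         # latent code from sequence (a)
z_b = rng.standard_normal(k)         # latent code from sequence (b)
c_a, c_b = V_inv @ z_a, V_inv @ z_b  # eigenbasis coefficients

# Factorial swap: take static1 coefficients from (b), keep static2 and
# dynamics from (a), then map back to the latent space for decoding.
c_new = c_a.copy()
c_new[I_s1] = c_b[I_s1]
z_swapped = V @ c_new
```

Projecting `z_swapped` back into the eigenbasis confirms it matches `(b)` on the static₁ coordinates and `(a)` everywhere else.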
6. Quantitative and Qualitative Evaluation
The proposed model’s tri-factor disentanglement is validated on the Sprites dataset through several means:
| Methodology | Metric | Observed Result |
|---|---|---|
| Hair swap accuracy | Top-1 (judge network) | 90.59% |
| Skin/motion after swap | Top-1 (judge network) | 16% (chance) |
| Two-factor benchmark | Static accuracy (Sprites) | 100% |
| Two-factor benchmark | Inception/inter-entropy (Sprites) | Best-in-class |
| Two-factor benchmark | Static accuracy (MUG) | |
| Two-factor benchmark | Intra-entropy (MUG) | Best-in-class |
Latent-space visualization confirms clear combinatorial clustering, indicating that the three factors are indeed disentangled.
7. Significance, Limitations, and Outlook
Tri-factor disentanglement with Structured Koopman Autoencoders operates fully unsupervised, requiring no labels, paired data, or contrastive losses—relying solely on the inductive bias that the sequence's true generative dynamics are linearizable in an appropriate latent space. The spectral penalty on the latent Koopman operator's eigenspectrum enables arbitrary multi-factor decomposition by simply allocating eigenvalue blocks for static versus dynamic factors and partitioning the static block post hoc. This approach accommodates extension to more than three factors and outperforms prior art on both qualitative and quantitative disentanglement benchmarks in unsupervised settings. A plausible implication is that linearly-structured representation learning may obviate the need for supervision in separating even complex, high-arity factor combinations—provided the basic linearization assumption is valid for the target dataset (Berman et al., 2023).