Tri-Factor Disentanglement in Latent Representations
- Tri-factor disentanglement is the process of decomposing sequential data into two distinct static factors and one dynamic factor, enabling unsupervised learning of independent generative sources.
- The methodology utilizes a Structured Koopman Autoencoder that applies linear latent dynamics and a spectral loss to enforce clear separation between static and dynamic components.
- Quantitative evaluations on benchmarks such as Sprites and MUG demonstrate high accuracy in factor isolation and robust disentanglement performance.
Tri-factor disentanglement refers to the decomposition of observed sequential data, such as video frames, into three or more mutually independent semantic components in a latent representation: typically, two distinct static factors (static₁, static₂) and one dynamic factor. This methodology generalizes beyond the conventional dichotomy of static versus dynamic representations and enables unsupervised and interpretable modeling of complex datasets comprising multiple latent generative sources. The Structured Koopman Autoencoder framework provides the first fully unsupervised approach for such multifactor disentanglement by introducing a strong inductive bias in the form of linearly-evolving latent dynamics, operationalized through the Koopman operator perspective (Berman et al., 2023).
1. Latent Representation Organization
In tri-factor disentanglement using Structured Koopman Autoencoders, a video sequence or time series of length $T$ with frames in $\mathbb{R}^{C \times H \times W}$ is encoded via a deep encoder to yield a latent tensor $Z \in \mathbb{R}^{B \times T \times k}$, where $B$ is batch size and $k$ is the total latent dimensionality. The $k$-dimensional latent code at time $t$, $z_t \in \mathbb{R}^k$, is explicitly partitioned as a concatenation $z_t = [z_t^{s_1}; z_t^{s_2}; z_t^{d}]$ with $z_t^{s_1} \in \mathbb{R}^{k_{s_1}}$, $z_t^{s_2} \in \mathbb{R}^{k_{s_2}}$, and $z_t^{d} \in \mathbb{R}^{k_d}$, where $k_{s_1} + k_{s_2} + k_d = k$. In the Sprites benchmark, a fixed total dimensionality $k$ is used, split into designated static and dynamic dimensions. For explicit tri-factorization, disjoint subsets of the static dimensions are assigned to the two distinct static factors, and the remainder of the latent code is assigned to the dynamics.
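The partitioning above can be sketched in NumPy. The block sizes and the random tensor below are illustrative stand-ins, not the paper's settings or actual encoder outputs:

```python
import numpy as np

# Hypothetical block sizes (illustrative, not the paper's settings).
k_s1, k_s2, k_d = 8, 8, 24
k = k_s1 + k_s2 + k_d

rng = np.random.default_rng(0)
B, T = 4, 8                          # batch size, sequence length
Z = rng.standard_normal((B, T, k))   # stand-in for encoder output

# Partition each per-frame code z_t into [static1; static2; dynamic].
z_s1 = Z[..., :k_s1]
z_s2 = Z[..., k_s1:k_s1 + k_s2]
z_d = Z[..., k_s1 + k_s2:]
```

Concatenating the three blocks along the last axis recovers the original code, so the split is lossless.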
2. Koopman Latent Dynamics and Linearization Assumption
A foundational assumption is that, in the latent embedding, frame-to-frame evolution adheres to linear dynamics as prescribed by Koopman theory. Specifically, there exists a matrix $C \in \mathbb{R}^{k \times k}$ such that

$$z_{t+1} \approx C z_t,$$

where $z_t$ is the latent code at timestep $t$. $C$ is computed for each batch via least-squares minimization. Defining $X = [z_1, \dots, z_{T-1}]$ as the stack of all “past” latents and $Y = [z_2, \dots, z_T]$ as all “future” ones, $C$ is obtained by

$$C = Y X^{\dagger},$$

where $X^{\dagger}$ denotes the Moore–Penrose pseudoinverse. This enables the dynamics and statics in the latent space to be distinguished structurally via the spectral content of $C$.
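The least-squares fit can be illustrated in NumPy. A synthetic linear trajectory stands in for real encoder outputs here, so the fitted operator reproduces the dynamics exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
k, T = 6, 50

# Synthetic ground-truth operator, rescaled so its spectral radius is < 1
# and the trajectory stays well-conditioned.
A = rng.standard_normal((k, k))
C_true = 0.95 * A / np.max(np.abs(np.linalg.eigvals(A)))

# Roll out a linear latent trajectory z_{t+1} = C_true z_t.
Z = np.empty((T, k))
Z[0] = rng.standard_normal(k)
for t in range(T - 1):
    Z[t + 1] = C_true @ Z[t]

# Past/future stacks as columns: X holds z_1..z_{T-1}, Y holds z_2..z_T.
X = Z[:-1].T          # shape (k, T-1)
Y = Z[1:].T           # shape (k, T-1)

# Least-squares Koopman estimate: C = Y X^+ (Moore-Penrose pseudoinverse).
C = Y @ np.linalg.pinv(X)
```

Because the trajectory is exactly linear, `C @ X` matches `Y` up to floating-point error; on real latents the fit instead minimizes the residual in the least-squares sense.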
3. Spectral Loss and Factor Decomposition
To enforce separation between static and dynamic latent factors, a spectral penalty is imposed on the eigenvalues $\lambda_1, \dots, \lambda_k$ of $C$. Eigenvalues with $\lambda_i \approx 1$ correspond to static directions (no change across time), while $|\lambda_j| < 1$ identifies dynamic modes. Letting the first $k_s$ eigenvalues index static latent subspaces and the final $k_d$ index dynamics, the loss has the structure

$$\mathcal{L}_{\text{spec}} = \sum_{i=1}^{k_s} |\lambda_i - 1|^2 \;+\; \sum_{j=k_s+1}^{k} \max\bigl(0,\, |\lambda_j| - (1 - \delta)\bigr)^2,$$

with $\lambda_i$ the eigenvalues of $C$ and $\delta > 0$ controlling the annular gap around $|\lambda| = 1$ to force clear spectral separation. This design creates two disjoint subspaces in the spectrum of $C$, corresponding precisely to static factors (eigenvalues clustered at $1$) and dynamic factors (eigenvalues separated from the unit circle by margin $\delta$).
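A minimal NumPy sketch of such a penalty follows. The exact functional form and the eigenvalue assignment used in the paper may differ; sorting by distance to 1 is an illustrative heuristic here:

```python
import numpy as np

def spectral_loss(C, k_static, delta=0.1):
    """Penalty on the eigenvalues of a Koopman matrix C: the k_static
    eigenvalues closest to 1 are pulled toward 1, while the remaining
    'dynamic' eigenvalues are pushed inside the annulus |lam| <= 1 - delta."""
    lam = np.linalg.eigvals(C)
    lam = lam[np.argsort(np.abs(lam - 1.0))]      # static-like eigenvalues first
    static_term = np.sum(np.abs(lam[:k_static] - 1.0) ** 2)
    dynamic_term = np.sum(
        np.maximum(0.0, np.abs(lam[k_static:]) - (1.0 - delta)) ** 2
    )
    return static_term + dynamic_term

# Block-diagonal example: 2 static modes (eigenvalue 1), 2 dynamic modes
# already inside the annulus, so the penalty vanishes.
C = np.diag([1.0, 1.0, 0.6, 0.5])
loss = spectral_loss(C, k_static=2, delta=0.1)
```

A spectrum that violates either condition (e.g., all eigenvalues at 0.5 with two of them designated static) yields a strictly positive penalty.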
4. Model and Training Procedure
The architecture consists of a deep convolutional encoder, a differentiable Koopman module, and an LSTM-based decoder. The encoder applies five strided convolutional layers (kernel size 4, stride 2, progressively increasing channels), each followed by BatchNorm and LeakyReLU, producing features that are mapped via a unidirectional LSTM to yield temporal codes. The Koopman module computes $C$ using a batchwise SVD-based pseudoinverse of the latent codes. Decoding is performed by an LSTM followed by five transposed convolutional layers, producing pixel outputs with a final sigmoid activation.
Three losses are optimized jointly:
- Reconstruction loss: $\mathcal{L}_{\text{rec}} = \|x - \hat{x}\|_2^2$, comparing decoded frames $\hat{x}$ to inputs $x$
- Prediction (dynamics) loss: $\mathcal{L}_{\text{pred}} = \|Y - C X\|_F^2$, penalizing deviation of the latents from the linear Koopman evolution
- Spectral loss: $\mathcal{L}_{\text{spec}}$, as above

The total objective is

$$\mathcal{L} = \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}} + \lambda_{\text{pred}}\,\mathcal{L}_{\text{pred}} + \lambda_{\text{spec}}\,\mathcal{L}_{\text{spec}}.$$

Empirically, suitable choices of $\lambda_{\text{rec}}$, $\lambda_{\text{pred}}$, $\lambda_{\text{spec}}$, and the spectral gap $\delta$ yield effective disentanglement after 800 epochs with the Adam optimizer. No additional regularization beyond the spectral term is needed, though small Gaussian noise or blur may be added to the inputs to regularize $C$.
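The three terms can be assembled as follows in NumPy. All tensors are random stand-ins for encoder/decoder outputs, and the loss weights are hypothetical placeholders, not the paper's tuned values:

```python
import numpy as np

rng = np.random.default_rng(2)
T, k, k_s, delta = 20, 4, 2, 0.1

# Stand-ins for model outputs (random, for illustration only).
Z = rng.standard_normal((T, k))                   # latent codes z_1..z_T
X, Y = Z[:-1].T, Z[1:].T                          # past / future stacks
C = Y @ np.linalg.pinv(X)                         # batchwise Koopman fit
x = rng.standard_normal((T, 64))                  # "frames", flattened
x_hat = x + 0.01 * rng.standard_normal((T, 64))   # decoded frames

L_rec = np.mean((x - x_hat) ** 2)                 # reconstruction term
L_pred = np.mean((Y - C @ X) ** 2)                # linear-dynamics term
lam = np.linalg.eigvals(C)
lam = lam[np.argsort(np.abs(lam - 1.0))]          # static-like eigenvalues first
L_spec = (np.sum(np.abs(lam[:k_s] - 1.0) ** 2)
          + np.sum(np.maximum(0.0, np.abs(lam[k_s:]) - (1.0 - delta)) ** 2))

# Hypothetical weights; the paper's tuned values are not reproduced here.
w_rec, w_pred, w_spec = 1.0, 1.0, 0.1
total = w_rec * L_rec + w_pred * L_pred + w_spec * L_spec
```

In a real training loop each term would be computed with a differentiable framework so gradients flow through the encoder, the pseudoinverse, and the eigenvalues.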
5. Post-hoc Identification and Swapping of Latent Factors
After training, $C$ is eigendecomposed as $C = V \Lambda V^{-1}$, producing eigenvectors $v_1, \dots, v_k$ (the columns of $V$). Any latent code $z$ can be written as

$$z = \sum_{i=1}^{k} c_i v_i,$$

where the $c_i = \langle z, u_i \rangle$ are projections onto the dual (left) eigenvectors $u_i$, given by the rows of $V^{-1}$. To empirically identify and validate the semantic content of static₁ and static₂, the static subspace ($k_s$ dimensions) is partitioned into index sets $I_{s_1}$ (static₁) and $I_{s_2}$ (static₂) by inspecting which eigensubsets control specific factors (e.g., hair color vs. skin color). This can be automated via classifier-driven subset selection or performed manually for low-dimensional statics.
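The eigenbasis expansion can be verified numerically. The random matrix below is a stand-in for a learned Koopman operator:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 5
C = rng.standard_normal((k, k))      # stand-in for a learned Koopman matrix

# Eigendecomposition C = V diag(lam) V^{-1}: columns of V are the right
# eigenvectors v_i; rows of V^{-1} act as the dual (left) eigenvectors.
lam, V = np.linalg.eig(C)
V_inv = np.linalg.inv(V)

z = rng.standard_normal(k)           # some latent code
c = V_inv @ z                        # coefficients c_i in the eigenbasis
z_rebuilt = V @ c                    # z = sum_i c_i v_i
```

For a real matrix with complex eigenvalue pairs, `V` and the coefficients are complex, but the reconstruction still matches `z` up to numerical precision.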
Under factorial swap, for any two sequences $a$ and $b$, the new code is

$$\tilde{z} = \sum_{i \in I_{s_1}} c_i^{(b)} v_i + \sum_{i \in I_{s_2}} c_i^{(a)} v_i + \sum_{i \in I_d} c_i^{(a)} v_i,$$

which, after decoding, yields a sequence with static₁ taken from $b$ and static₂/dynamics preserved from $a$. The isolation of factors is quantitatively measured using pretrained classifiers (e.g., hair-swap accuracy), yielding high static₁ accuracy (90.59% on Sprites) with other factors at chance, as well as via t-SNE embeddings, which produce sharply separated grids corresponding to all factor combinations.
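The swap itself is a coefficient exchange in the eigenbasis. In this NumPy sketch, the basis `V` and the index sets `I_s1`, `I_s2`, `I_d` are hypothetical placeholders for quantities identified after training:

```python
import numpy as np

rng = np.random.default_rng(4)
k = 6
V = rng.standard_normal((k, k))      # stand-in eigenvector basis of C
V_inv = np.linalg.inv(V)

# Hypothetical index sets in eigen-coordinates (for illustration only):
# which eigendirections carry static1, static2, and the dynamics.
I_s1, I_s2, I_d = [0, 1], [2, 3], [4, 5]

z_a = rng.standard_normal(k)         # latent code from sequence (a)
z_b = rng.standard_normal(k)         # latent code from sequence (b)
c_a, c_b = V_inv @ z_a, V_inv @ z_b  # eigenbasis coefficients

# Factorial swap: take static1 coefficients from (b), keep static2 and
# dynamics from (a), then map back to the latent space for decoding.
c_new = c_a.copy()
c_new[I_s1] = c_b[I_s1]
z_swapped = V @ c_new
```

Projecting `z_swapped` back into the eigenbasis confirms it matches `(b)` on the static₁ coordinates and `(a)` everywhere else.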
6. Quantitative and Qualitative Evaluation
The proposed model’s tri-factor disentanglement is validated on the Sprites dataset through several means:
| Methodology | Metric | Observed Result |
|---|---|---|
| Hair swap accuracy | Top-1 (judge network) | 90.59% |
| Skin/motion after swap | Top-1 (judge network) | 16% (chance) |
| Two-factor benchmark | Static accuracy (Sprites) | 100% |
| Two-factor benchmark | Inception/inter-entropy (Sprites) | Best-in-class |
| Two-factor benchmark | Static accuracy (MUG) | |
| Two-factor benchmark | Intra-entropy (MUG) | Best-in-class |
Latent-space visualization confirms clear combinatorial clustering, indicating that the three factors are indeed disentangled.
7. Significance, Limitations, and Outlook
Tri-factor disentanglement with Structured Koopman Autoencoders operates fully unsupervised, requiring no labels, paired data, or contrastive losses—relying solely on the inductive bias that the sequence's true generative dynamics are linearizable in an appropriate latent space. The spectral penalty on the latent Koopman operator's eigenspectrum enables arbitrary multi-factor decomposition by simply allocating eigenvalue blocks for static versus dynamic factors and partitioning the static block post hoc. This approach accommodates extension to more than three factors and outperforms prior art on both qualitative and quantitative disentanglement benchmarks in unsupervised settings. A plausible implication is that linearly-structured representation learning may obviate the need for supervision in separating even complex, high-arity factor combinations—provided the basic linearization assumption is valid for the target dataset (Berman et al., 2023).