
Tri-Factor Disentanglement in Latent Representations

Updated 17 December 2025
  • Tri-factor disentanglement is the process of decomposing sequential data into two distinct static factors and one dynamic factor, enabling unsupervised learning of independent generative sources.
  • The methodology utilizes a Structured Koopman Autoencoder that applies linear latent dynamics and a spectral loss to enforce clear separation between static and dynamic components.
  • Quantitative evaluations on benchmarks such as Sprites and MUG demonstrate high accuracy in factor isolation and robust disentanglement performance.

Tri-factor disentanglement refers to the decomposition of observed sequential data, such as video frames, into three or more mutually independent semantic components in a latent representation: typically, two distinct static factors (static₁, static₂) and one dynamic factor. This methodology generalizes beyond the conventional dichotomy of static versus dynamic representations and enables unsupervised and interpretable modeling of complex datasets comprising multiple latent generative sources. The Structured Koopman Autoencoder framework provides the first fully unsupervised approach for such multifactor disentanglement by introducing a strong inductive bias in the form of linearly-evolving latent dynamics, operationalized through the Koopman operator perspective (Berman et al., 2023).

1. Latent Representation Organization

In tri-factor disentanglement using Structured Koopman Autoencoders, a video sequence or time series of length $T$ with frames in $\mathbb{R}^m$ is encoded via a deep encoder $\chi_{\text{enc}}$ to yield a latent tensor $Z \in \mathbb{R}^{b \times (T+1) \times k}$, where $b$ is the batch size and $k$ is the total latent dimensionality. The $k$-dimensional latent code at time $j$, $z_j$, is explicitly partitioned as a concatenation $z_j = [\, z_j^{(s_1)} \mid z_j^{(s_2)} \mid z_j^{(d)} \,]$ with $z_j^{(s_1)} \in \mathbb{R}^{k_{s_1}}$, $z_j^{(s_2)} \in \mathbb{R}^{k_{s_2}}$, and $z_j^{(d)} \in \mathbb{R}^{k_d}$, where $k = k_{s_1} + k_{s_2} + k_d$. In the Sprites benchmark, $k = 40$ is used with $k_s = k_{s_1} + k_{s_2} = 8$ static dimensions and $k_d = 32$ dynamic dimensions. For explicit tri-factorization, subsets such as $k_{s_1} = 3$ and $k_{s_2} = 5$ are assigned to the distinct static factors, with the remainder reserved for dynamics.
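The partitioning above can be sketched in a few lines of NumPy (shapes follow the Sprites configuration; the array contents are random placeholders, not real encoder output):

```python
import numpy as np

# Hypothetical latent tensor for a batch of b=2 sequences of T+1=9 frames,
# with k=40 latent dimensions split as k_s1=3, k_s2=5, k_d=32.
b, T_plus_1 = 2, 9
k_s1, k_s2, k_d = 3, 5, 32
k = k_s1 + k_s2 + k_d                      # total latent dimensionality
Z = np.random.randn(b, T_plus_1, k)        # stand-in for encoder output

# z_j = [ z^(s1) | z^(s2) | z^(d) ]: slice the last axis into the factors.
z_s1 = Z[..., :k_s1]                       # static factor 1
z_s2 = Z[..., k_s1:k_s1 + k_s2]            # static factor 2
z_d  = Z[..., k_s1 + k_s2:]                # dynamic factor
```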

2. Koopman Latent Dynamics and Linearization Assumption

A foundational assumption is that, in the latent embedding, frame-to-frame evolution adheres to linear dynamics as prescribed by Koopman theory. Specifically, treating latent codes as row vectors, there exists a matrix $C$ such that

$z_{t+1} = z_t C$

where $z_t$ is the latent code at timestep $t$. This $C$ is computed for each batch via least-squares minimization. Defining $Z_p \in \mathbb{R}^{bT \times k}$ as the stack of all "past" latent codes and $Z_f$ as the corresponding "future" ones, $C$ is obtained by

$C = \operatorname{argmin}_{\hat{C}} \| Z_p \hat{C} - Z_f \|_F^2 = Z_p^{+} Z_f$

where $Z_p^{+}$ denotes the Moore-Penrose pseudoinverse. This enables statics and dynamics to be distinguished structurally in the latent space via the spectral content of $C$.
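The batchwise least-squares fit of $C$ can be sketched with NumPy's pseudoinverse (synthetic latent codes stand in for encoder output; in the model this computation sits inside the training graph):

```python
import numpy as np

rng = np.random.default_rng(0)
b, T, k = 4, 8, 6
Z = rng.standard_normal((b, T + 1, k))   # latent codes z_0 ... z_T per sequence

# Stack "past" and "future" latents across the batch: each is (b*T, k).
Z_p = Z[:, :-1, :].reshape(-1, k)
Z_f = Z[:, 1:, :].reshape(-1, k)

# C = argmin ||Z_p C - Z_f||_F^2 = Z_p^+ Z_f  (latent codes as rows).
C = np.linalg.pinv(Z_p) @ Z_f            # (k, k) Koopman operator estimate

# The same solution via lstsq, as a sanity check.
C_lstsq, *_ = np.linalg.lstsq(Z_p, Z_f, rcond=None)
```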

3. Spectral Loss and Factor Decomposition

To enforce separation between static and dynamic latent factors, a spectral penalty is imposed on the eigenvalues $\{ \lambda_i \}$ of $C$. Eigenvalues with $\lambda_i = 1$ correspond to static directions (no change across time), while $|\lambda_i| \neq 1$ identifies dynamic modes. Letting the first $k_s = k_{s_1} + k_{s_2}$ eigenvalues index the static latent subspaces and the final $k_d$ the dynamics, the loss has the structure:

$\mathcal{L}_{\text{stat}} = \frac{1}{k_s} \sum_{i=1}^{k_s} | \lambda_i - 1 |^2$

$\mathcal{L}_{\text{dyn}} = \frac{1}{k_d} \sum_{i=k_s+1}^{k} \xi(|\lambda_i|, \epsilon)$

$\mathcal{L}_{\text{eig}} = \mathcal{L}_{\text{stat}} + \mathcal{L}_{\text{dyn}}$

with $\xi(r, \epsilon) = \max(r - \epsilon, 0)$ and $\epsilon \in (0, 1)$ controlling the width of the annular gap below $|\lambda| = 1$ that forces clear spectral separation. This design creates two disjoint regions in the spectrum of $C$, corresponding precisely to static factors (eigenvalues clustered at $+1$) and dynamic factors (eigenvalue moduli pushed below $\epsilon$).
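A NumPy sketch of the spectral penalty (two stated assumptions: eigenvalues are matched to the static block by sorting on distance to 1, and the computation is shown outside any autodiff framework, whereas training requires a differentiable eigendecomposition):

```python
import numpy as np

def spectral_loss(C, k_s, eps=0.5):
    """L_eig = L_stat + L_dyn over the eigenvalues of C.

    Static eigenvalues are pulled toward 1; dynamic eigenvalue moduli
    are pushed below eps via xi(r, eps) = max(r - eps, 0)."""
    lam = np.linalg.eigvals(C)
    lam = lam[np.argsort(np.abs(lam - 1.0))]   # k_s closest to 1 play static
    L_stat = np.mean(np.abs(lam[:k_s] - 1.0) ** 2)
    L_dyn = np.mean(np.maximum(np.abs(lam[k_s:]) - eps, 0.0))
    return L_stat + L_dyn

# A spectrum with k_s = 2 eigenvalues at exactly 1 and the rest inside
# the eps-disk incurs (numerically) zero penalty.
C = np.diag([1.0, 1.0, 0.3, 0.2, -0.1])
print(spectral_loss(C, k_s=2))  # zero penalty for a well-separated spectrum
```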

4. Model and Training Procedure

The architecture consists of a deep convolutional encoder, differentiable Koopman module, and LSTM-based decoder. The encoder applies five strided convolutional layers (kernel size 4, stride 2, progressively increasing channels) each followed by BatchNorm and LeakyReLU, producing features that are mapped via a unidirectional LSTM to yield temporal codes. The Koopman module computes CC using batchwise SVD-based pseudoinverse of latent codes. Decoding is performed by an LSTM followed by five transposed convolutional layers, producing pixel outputs with a final sigmoid activation.

Three losses are optimized jointly:

  • Reconstruction loss: $\mathcal{L}_{\text{rec}} = \mathrm{MSE}(X_{\text{rec}}, X)$
  • Prediction (dynamics) loss: $\mathcal{L}_{\text{pred}} = \mathrm{MSE}(Z_f, Z_p C) + \mathrm{MSE}(X_f, \chi_{\text{dec}}(Z_p C))$
  • Spectral loss: $\mathcal{L}_{\text{eig}}$ as above

The total objective is

$\mathcal{L} = \lambda_{\text{rec}} \mathcal{L}_{\text{rec}} + \lambda_{\text{pred}} \mathcal{L}_{\text{pred}} + \lambda_{\text{eig}} \mathcal{L}_{\text{eig}}$

Empirically, $\lambda_{\text{rec}} = 15$, $\lambda_{\text{pred}} = 1$, $\lambda_{\text{eig}} = 1$, and $\epsilon = 0.5$ yield effective disentanglement after roughly 800 epochs with Adam optimization (learning rate $10^{-3}$). No additional regularization beyond the spectral term is needed, though small Gaussian noise or blur may be added to $Z$ to regularize $C$.
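With the reported weights, the total objective reduces to a plain weighted sum; a minimal sketch (the per-term loss values passed in below are placeholders, not measured numbers):

```python
# Reported weights: lambda_rec = 15, lambda_pred = 1, lambda_eig = 1.
LAM_REC, LAM_PRED, LAM_EIG = 15.0, 1.0, 1.0

def total_loss(l_rec, l_pred, l_eig):
    """L = lam_rec * L_rec + lam_pred * L_pred + lam_eig * L_eig."""
    return LAM_REC * l_rec + LAM_PRED * l_pred + LAM_EIG * l_eig

# Example with placeholder per-term values.
print(round(total_loss(0.01, 0.2, 0.05), 6))  # → 0.4
```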

5. Post-hoc Identification and Swapping of Latent Factors

After training, $C$ is eigendecomposed as $C = V \Lambda V^{-1}$, producing $k$ eigenvectors $\phi_i$. Any latent code $z_j$ can be written as

$z_j = \sum_{i=1}^{k} \alpha_j^i \phi_i$

where the $\alpha_j^i$ are projections onto the dual eigenvectors. To empirically identify and validate the semantic content of static₁ and static₂, the static subspace ($k_s$ dimensions) is partitioned into index sets $I_1$ (static₁) and $I_2$ (static₂) by inspecting which eigensubsets control specific factors (e.g., hair color vs. skin color). This can be automated via classifier-driven subset selection or performed manually for low-dimensional statics.

Under a factorial swap, for any two sequences $u, v$:

$\hat{z}_j(u) = \sum_{i \in I_1} \alpha_j^i(v) \phi_i + \sum_{i \notin I_1} \alpha_j^i(u) \phi_i$

which, after decoding, yields a sequence with static₁ taken from $v$ and static₂/dynamics preserved from $u$. Factor isolation is quantitatively measured using pretrained classifiers (e.g., hair-swap accuracy), yielding $>90\%$ static₁ accuracy with other factors at chance, and qualitatively via t-SNE embeddings and sharply separated $6 \times 6$ swap grids covering all factor combinations.
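The eigenbasis swap can be sketched with synthetic quantities (assumptions: $C$ is constructed by hand with a single static eigenvalue at 1, so $I_1$ holds one index chosen as the eigendirection closest to 1; in practice $I_1$ comes from the classifier-driven or manual selection described above):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 6
# Synthetic diagonalizable C with one static eigenvalue at 1 (for simplicity).
V0 = rng.standard_normal((k, k))
C = V0 @ np.diag([1.0, 0.8, 0.5, 0.4, 0.3, 0.2]) @ np.linalg.inv(V0)

lam, V = np.linalg.eig(C)                 # C = V Lambda V^{-1}
W = np.linalg.inv(V)                      # coefficients: alpha = W @ z

z_u = rng.standard_normal(k)              # latent code from sequence u
z_v = rng.standard_normal(k)              # latent code from sequence v
alpha_u, alpha_v = W @ z_u, W @ z_v       # eigenbasis coefficients

I1 = np.argsort(np.abs(lam - 1.0))[:1]    # index set for static_1

alpha_swap = alpha_u.copy()
alpha_swap[I1] = alpha_v[I1]              # static_1 coefficients from v
z_swap = V @ alpha_swap                   # static_2 and dynamics kept from u
```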

6. Quantitative and Qualitative Evaluation

The proposed model’s tri-factor disentanglement is validated on the Sprites dataset through several means:

Methodology             | Metric                               | Observed Result
Hair swap accuracy      | Top-1 (judge network)                | 90.59%
Skin/motion after swap  | Top-1 (judge network)                | ~16% (chance)
Two-factor benchmark    | Static accuracy (Sprites)            | 100%
Two-factor benchmark    | Inception / inter-entropy (Sprites)  | Best-in-class
Two-factor benchmark    | Static accuracy (MUG)                | >77%
Two-factor benchmark    | Intra-entropy (MUG)                  | Best-in-class

Latent-space visualization confirms clear combinatorial clustering, indicating that the three factors are indeed disentangled.

7. Significance, Limitations, and Outlook

Tri-factor disentanglement with Structured Koopman Autoencoders operates fully unsupervised, requiring no labels, paired data, or contrastive losses—relying solely on the inductive bias that the sequence's true generative dynamics are linearizable in an appropriate latent space. The spectral penalty on the latent Koopman operator's eigenspectrum enables arbitrary multi-factor decomposition by simply allocating eigenvalue blocks for static versus dynamic factors and partitioning the static block post hoc. This approach accommodates extension to more than three factors and outperforms prior art on both qualitative and quantitative disentanglement benchmarks in unsupervised settings. A plausible implication is that linearly-structured representation learning may obviate the need for supervision in separating even complex, high-arity factor combinations—provided the basic linearization assumption is valid for the target dataset (Berman et al., 2023).
