Label-Conditional GMVAE: Structured Generative Model
- Label-Conditional GMVAE is a variational autoencoder that uses a multi-centered Gaussian mixture prior conditioned on labels to generate class-discriminative latent representations.
- The model employs dual encoding paths for cluster assignment and latent parameter estimation, ensuring that both the generative and inference processes explicitly incorporate label information.
- L-GMVAE supports robust discriminative modeling, controllable sample generation, and applications like counterfactual explanations, open-set recognition, and interpretable embeddings.
A Label-Conditional Gaussian Mixture Variational Autoencoder (L-GMVAE) is a variational inference model that extends the standard VAE by placing a label-conditional, multi-centered Gaussian mixture prior on the latent space, tightly integrating supervised information to induce structured, clusterable, and class-discriminative representations. Each class is associated with one or more Gaussian mixture components in the latent space, and both the generative and inference processes explicitly condition on class label information. L-GMVAEs support robust discriminative modeling, controllable sample generation, and disentangled factors, and serve as a foundation for high-quality counterfactual explanations, open-set recognition, interpretable latent embeddings, and label-driven generative modeling.
1. Probabilistic Foundations and Model Specification
The core L-GMVAE framework extends the standard VAE structure by associating each class label $y$ with a subset of mixture components $\mathcal{C}_y$ in the latent space, with each input $x$ and its corresponding label $y$ generating a hierarchical latent representation $(c, z)$. The generative process is

$$p_\theta(x, z, c \mid y) = p(c \mid y)\, p_\theta(z \mid c)\, p_\theta(x \mid z),$$

where:
- $p(c \mid y) = 1 / |\mathcal{C}_y|$ for $c \in \mathcal{C}_y$ and zero otherwise,
- $p_\theta(z \mid c) = \mathcal{N}\!\left(z;\, \mu_c,\, \operatorname{diag}(\sigma_c^2)\right)$ is the class-conditional Gaussian prior,
- $p_\theta(x \mid z)$ defines a neural decoder.
The inference model is factorized as $q_\phi(c, z \mid x, y) = q_\phi(c \mid x, y)\, q_\phi(z \mid x, c)$, where:
- $q_\phi(c \mid x, y)$ is a categorical distribution over clusters within class $y$,
- $q_\phi(z \mid x, c)$ is a normal $\mathcal{N}\!\left(z;\, \mu_\phi(x, c),\, \operatorname{diag}(\sigma_\phi^2(x, c))\right)$ with parameters output by an encoder network.
The evidence lower bound (ELBO) for a labeled instance $(x, y)$ is

$$\mathcal{L}(x, y) = \mathbb{E}_{q_\phi(c, z \mid x, y)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(c \mid x, y)\,\|\,p(c \mid y)\big) - \mathbb{E}_{q_\phi(c \mid x, y)}\big[\mathrm{KL}\big(q_\phi(z \mid x, c)\,\|\,p_\theta(z \mid c)\big)\big].$$

No auxiliary losses are required for convergence or cluster separation; the regularization effects are implicit in the cluster-assignment and latent prior KL terms (Jiang et al., 6 Oct 2025).
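The ELBO above can be computed in closed form for the two KL terms (diagonal Gaussians and categoricals) with a Monte Carlo estimate of the reconstruction term. Below is a minimal NumPy sketch under those assumptions; all function names and the `decode` callback (standing in for the trained decoder's log-likelihood) are hypothetical, not the reference implementation:

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in closed form."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def elbo(x, q_c, mu_z, var_z, prior_mu, prior_var, p_c_given_y, decode):
    """Single-sample ELBO estimate for one labeled instance (x, y).

    q_c:                 q(c|x,y), shape (K_y,)
    mu_z, var_z:         q(z|x,c) parameters per cluster, shape (K_y, h)
    prior_mu, prior_var: p(z|c) parameters, shape (K_y, h)
    p_c_given_y:         prior over clusters within class y, shape (K_y,)
    decode:              callable (x, z) -> log p(x|z)  [hypothetical stand-in]
    """
    # Cluster-assignment KL: KL(q(c|x,y) || p(c|y))
    kl_c = np.sum(q_c * (np.log(q_c + 1e-12) - np.log(p_c_given_y + 1e-12)))
    # Expected latent KL and reconstruction, weighted by cluster responsibilities
    kl_z, recon = 0.0, 0.0
    for c, w in enumerate(q_c):
        kl_z += w * kl_diag_gauss(mu_z[c], var_z[c], prior_mu[c], prior_var[c])
        z = mu_z[c] + np.sqrt(var_z[c]) * np.random.randn(*mu_z[c].shape)  # reparameterized sample
        recon += w * decode(x, z)
    return recon - kl_c - kl_z
```

When the posterior matches the prior in both terms, the two KL penalties vanish and the ELBO reduces to the expected reconstruction log-likelihood.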
2. Architecture, Optimization, and Implementation
The most common architecture for tabular and moderately-sized data involves:
- Encoder: Two-headed MLP with ReLU activations, one branch for cluster assignment (input: $(x, y)$), one for latent parameters (input: $(x, c)$).
- Latent prior: Parameterized networks outputting $(\mu_c, \sigma_c^2)$ for each cluster $c$.
- Decoder: MLP mapping latent $z$ to output mean $\mu_x(z)$ and variance $\sigma_x^2(z)$.
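The two-headed encoder described above can be sketched as follows. This is a minimal NumPy illustration with a single hidden layer per head and Xavier-uniform initialization; the class name, hidden width, and log-variance parameterization are illustrative assumptions, not the reference architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_layer(d_in, d_out):
    # Xavier-uniform initialization, as used for all layers in the paper's setup
    bound = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-bound, bound, size=(d_in, d_out)), np.zeros(d_out)

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

class TwoHeadedEncoder:
    """Sketch: one head for q(c|x,y), one for q(z|x,c) parameters."""

    def __init__(self, d_x, n_classes, K_y, h, d_hidden=64):
        self.Wc, self.bc = xavier_layer(d_x + n_classes, d_hidden)   # cluster head, input (x, y)
        self.Wc2, self.bc2 = xavier_layer(d_hidden, K_y)
        self.Wz, self.bz = xavier_layer(d_x + K_y, d_hidden)         # latent head, input (x, c)
        self.Wz2, self.bz2 = xavier_layer(d_hidden, 2 * h)           # mean and log-variance
        self.h = h

    def cluster_probs(self, x, y_onehot):
        hid = relu(np.concatenate([x, y_onehot]) @ self.Wc + self.bc)
        return softmax(hid @ self.Wc2 + self.bc2)

    def latent_params(self, x, c_onehot):
        hid = relu(np.concatenate([x, c_onehot]) @ self.Wz + self.bz)
        out = hid @ self.Wz2 + self.bz2
        mu, logvar = out[:self.h], out[self.h:]
        return mu, np.exp(logvar)   # exponentiate to keep the variance positive
```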
Key hyperparameters include the number of clusters per class ($K_y$), latent dimensionality $h$, learning rates, batch sizes, and relative loss weights (see Table 1).
| Dataset | Clusters (K_y) | Latent Dim (h) | LR | ELBO Weights (KL_c, KL_z, Recon) |
|---|---|---|---|---|
| heloc | 5 | 18 / 15 | 1e-3 | (0.1, 0.1, 1) |
| wine | 5 | 8 | 1e-3 / 3e-4 | (0.1, 0.05, 1) / (0.5, 0.3, 1) |
| adult | 5 | 10 | 1e-3 | (0.4, 0.2, 1) |
| compas | 5 | 5/6 | 5e-4 | (0.5, 0.3, 1) |
All layers are initialized via Xavier uniform and optimized with Adam. Early stopping is based on the validation ELBO (Jiang et al., 6 Oct 2025).
3. Theoretical Effects of Label Conditioning
Label conditioning partitions latent space such that each class occupies disjoint subsets associated with separate Gaussian components, leading to improved representation compactness, explicit cluster separation, and a high degree of label–latent mutual information. This structure enables:
- Direct control over generation by selecting mixture centroids,
- Robustness to input and model variation (since each class maps to clearly defined latent prototypes),
- Maximization of the mutual information between the label and representation.
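The first property above, centroid-driven generation, amounts to selecting a mixture component of the desired class, sampling from its Gaussian, and decoding. A minimal sketch, where `decode` is a hypothetical stand-in for the trained decoder network:

```python
import numpy as np

def generate_from_class(prior_mu, prior_var, class_clusters, y, decode, rng):
    """Controllable generation: pick a centroid of class y, sample z, decode.

    prior_mu, prior_var: per-cluster Gaussian prior parameters, shape (K, h)
    class_clusters:      dict mapping label y -> list of cluster indices C_y
    decode:              callable z -> x (hypothetical trained decoder)
    """
    c = rng.choice(class_clusters[y])  # select one mixture component of class y
    z = prior_mu[c] + np.sqrt(prior_var[c]) * rng.standard_normal(prior_mu.shape[1])
    return decode(z)
```

Because every sample is drawn from a component tied to the requested label, class identity is guaranteed by construction rather than by rejection sampling or classifier guidance.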
In multi-factor variants (Zheng et al., 2018), the label-relevant factor follows a Gaussian-mixture distribution while label-irrelevant factors have an independent standard normal prior; an adversarial regularizer discourages leakage of class information into the unlabeled subspace, further enhancing disentanglement.
4. Counterfactual Explanations and the LAPACE Algorithm
L-GMVAE forms the backbone of the LAPACE (“Latent Path Counterfactual Explanations”) method (Jiang et al., 6 Oct 2025), designed for generating robust, plausible, diverse, and actionability-aware counterfactuals. For an input $x$ of class $y$ and target class $y' \neq y$, LAPACE constructs a path in latent space from the current encoding $z_0 = \mu_\phi(x)$ to each class-$y'$ centroid $\mu_c$ (with $c \in \mathcal{C}_{y'}$), linearly interpolating

$$z_t = (1 - t)\, z_0 + t\, \mu_c, \qquad t \in [0, 1].$$

Decoding these points yields a parametric curve of plausible counterfactual points.
Actionability constraints are enforced on each point via lightweight gradient updates in latent space,

$$z_t \leftarrow z_t - \eta\, \nabla_{z_t}\, \ell_{\mathrm{act}}\big(g_\theta(z_t)\big),$$

where $g_\theta$ is the decoder and $\ell_{\mathrm{act}}$ penalizes actionability violations in the decoded point. This framework decouples the counterfactual path construction from the classifier or underlying data, making it model-agnostic and computationally efficient.
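The path construction and actionability updates described above can be sketched as follows. This is an illustrative NumPy version assuming a user-supplied penalty on the decoded point; the finite-difference gradient stands in for autodiff, and `decode` and `penalty` are hypothetical placeholders, not LAPACE's actual losses:

```python
import numpy as np

def latent_path(z0, centroid, n_points=10):
    """Linear interpolation from the input's encoding z0 to a target-class centroid."""
    ts = np.linspace(0.0, 1.0, n_points)
    return np.array([(1.0 - t) * z0 + t * centroid for t in ts])

def enforce_actionability(z, decode, penalty, lr=0.05, steps=30):
    """Nudge a latent point to reduce an actionability penalty on its decoding.

    decode:  callable z -> x (hypothetical trained decoder)
    penalty: callable x -> float, penalizing actionability violations
    """
    z = z.copy()
    eps = 1e-4
    for _ in range(steps):
        base = penalty(decode(z))
        grad = np.zeros_like(z)
        for i in range(z.size):           # finite-difference gradient, one coordinate at a time
            zp = z.copy()
            zp[i] += eps
            grad[i] = (penalty(decode(zp)) - base) / eps
        z -= lr * grad                    # lightweight gradient step in latent space
    return z
```

Decoding each point of `latent_path` after the actionability step yields the curve of candidate counterfactuals; the endpoint at $t = 1$ corresponds to the LAPACE-Last variant reported below.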
Empirical results show that this approach achieves:
- 100% centroid accuracy and perfect robustness for LAPACE-Last (CEs at centroids),
- Best-in-class plausibility (lowest LOF), high validity, and competitive diversity across all tested tabular datasets (Jiang et al., 6 Oct 2025).
5. Applications Beyond Counterfactuals
The L-GMVAE paradigm admits broader utility:
- Physically interpretable representations: Turbulence flow fields are clustered in 2D latent space according to Reynolds number, and latent traversals correspond to physically smooth transitions; spectral analysis on the latent graph quantifies physical smoothness (Fan et al., 31 Jan 2025).
- Open-set recognition: By training L-GMVAE with per-class mixtures and then using latent centroids for thresholding, open-set F1 improves by up to 29.5% over non-mixture baselines. Within-class embeddings are significantly more compact, so simple centroid-based open-set classifiers outperform EVT- or softmax-based methods (Cao et al., 2020).
- Source separation: For audio mixtures, L-GMVAE with label-conditional priors yields up to 2 dB improvement in SDR and similar gains in related metrics, over purely NMF-based or label-unaware models, due to better matching of speaker-specific structure (Seki et al., 2018).
- Disentanglement: In computer vision, splitting the latent variables into label-relevant mixtures and label-irrelevant spaces, enforced via additional supervision and adversarial regularizers, strongly encourages factorized and interpretable representations, as supported by empirical evaluation (Zheng et al., 2018).
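The centroid-thresholding rule behind the open-set results above is simple enough to state directly: assign the label of the nearest latent centroid, or reject as unknown when even the nearest centroid is too far. A minimal sketch (the function name and threshold convention are illustrative):

```python
import numpy as np

def open_set_predict(z, centroids, centroid_labels, threshold):
    """Centroid-based open-set rule on an encoded point z.

    centroids:       latent cluster centers, shape (K, h)
    centroid_labels: class label for each centroid
    threshold:       maximum accepted distance; beyond it, reject as unknown
    """
    d = np.linalg.norm(centroids - z, axis=1)   # Euclidean distance to every centroid
    i = int(np.argmin(d))
    return centroid_labels[i] if d[i] <= threshold else "unknown"
```

The threshold is typically calibrated on held-out known-class data, e.g. from the empirical distribution of within-class distances.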
6. Comparative Analysis and Relationship to Related Models
L-GMVAE differs from:
- Standard VAE: employs a single standard normal prior for all classes and lacks explicit class separation in latent space.
- Conditional VAE (cVAE): cVAE conditions both encoder and decoder on class labels but still uses a unimodal latent prior, resulting in possible mode collapse and entanglement.
- Unsupervised GMVAE: Unsupervised variants lack label-driven partitioning and rely on EM or nonparametric techniques for component assignment; L-GMVAE’s label-driven clusters yield better disentanglement and mutual information (Zheng et al., 2018).
A common misconception is that label conditioning only aids downstream classification; in practice, it enables generation of more diverse and realistic samples, interpretable traversals corresponding to semantic or physical variation, and is essential for robust out-of-distribution recognition (Cao et al., 2020, Zheng et al., 2018).
7. Evaluation Metrics and Empirical Outcomes
Empirical assessment is multi-faceted, covering validity, proximity (L1 distance), plausibility (LOF), diversity, robustness to both input/model perturbations, and computational efficiency (Jiang et al., 6 Oct 2025).
- For counterfactual explanation: L-GMVAE+LAPACE yields ≤3% utility gap in synthetic-to-real transfer on continuous datasets, with robust CEs under model/input changes.
- For open-set recognition: Conditioning the GMVAE prior by label and employing subclusters per class results in compact class embeddings and high F1.
- For physical modeling: The latent manifold captures physically meaningful transitions, with high spectral smoothness scores correlating to global smoothness in physical observables (Fan et al., 31 Jan 2025).
These findings substantiate the L-GMVAE architecture as a unified, efficient framework for high-dimensional, supervised generative modeling and manifold exploration, with demonstrated superiority across multiple domains and evaluation criteria.