CPVQ-VAE for Class-Consistent Point Clouds

Updated 25 January 2026

The paper introduces CPVQ-VAE, which partitions the latent codebook into class-specific bins to ensure geometrically and semantically accurate point cloud generation.
It employs class-aware maintenance and a dual-stage encoding-decoding process to counteract codebook collapse and misclassification errors.
Empirical results reveal significant reductions in Chamfer Distance and Point2Mesh Error compared to standard VAEs and diffusion-based methods.

The Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) is a generative model architecture developed to produce class-consistent point cloud objects from latent features, enabling direct point cloud scene generation without reliance on external object retrieval databases. CPVQ-VAE addresses failure modes observed when conventional autoencoders or diffusion-based latent decoding yield incorrect object geometries with mismatched classes. By explicitly partitioning the latent codebook into class-labeled bins and employing class-aware codebook maintenance, CPVQ-VAE reliably maps generated latents to point cloud shapes matching the intended object category, achieving significant reductions in geometrical and semantic reconstruction errors (Edirimuni et al., 18 Jan 2026).

1. Architecture and Codebook Partitioning

CPVQ-VAE extends the standard VQ-VAE framework through a dual mechanism: partitioning the codebook into class-specific bins and actively maintaining codebook utilization to counteract codebook collapse. The model operates over point cloud inputs $P \in \mathbb{R}^{N_P \times 3}$ , with $N_P=2025$ points. The encoder $\mathfrak{E}$ utilizes a PointNet++-style architecture consisting of three set-abstraction layers, feature MLPs, and global max-pooling, mapping $P$ to a 128-dimensional latent vector $z_e$ . Quantization snaps $z_e$ to the nearest codevector within the designated class partition, yielding $z_q \in \mathbb{R}^{128}$ . The decoder $\mathfrak{D}$ adopts a FoldingNet-inspired strategy: a fixed $45 \times 45$ 2D grid is concatenated with $z_q$ and passed through two “folding” MLP layers to generate the reconstructed point cloud $\tilde{P} \in \mathbb{R}^{N_P \times 3}$ .

Let $N_c$ denote the number of object classes and $N_q$ the number of discrete codevectors per class, so the total codebook size is $N_K = N_c \times N_q$ . Codevectors $e^k$ are assigned to contiguous class-specific blocks. During quantization, for an object of class $c$ , only codevectors in block $c$ are considered.

2. Training Objectives and Codebook Maintenance

CPVQ-VAE’s optimization minimizes three terms comprising the VQ-VAE loss:

Autoencoder reconstruction via Chamfer Distance:

$L_{CD}(P,\tilde{P}) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in \tilde{P}} \| p - q \|^2 + \frac{1}{|\tilde{P}|} \sum_{q \in \tilde{P}} \min_{p \in P} \| q - p \|^2$

Quantization loss: $\| \operatorname{sg}[z_e] - z_q \|_2^2$
Commitment loss: $\beta \| z_e - \operatorname{sg}[z_q] \|_2^2$ , with experimental coefficients $\lambda_{CD}=10$ and $\beta=1$ .

To resolve codebook collapse (large swathes of codevectors unused), CPVQ-VAE uses a class-aware running-average update. For each codevector $e^k$ , a usage statistic $U_s^k$ is maintained:

$U_s^k = \gamma U_{s-1}^k + (1-\gamma)\frac{u_s^k}{B}$

where $u_s^k$ approximates usage in batch of size $B$ , and $\gamma=0.99$ . If $U_s^k$ falls below a threshold, $e^k$ is reinitialized towards the nearest encoding in the batch:

$\alpha_s^k = \exp(-10 \cdot U_s^k \frac{N_q}{1-\gamma} - \epsilon)$

$e_s^k \leftarrow (1-\alpha_s^k) e_{s-1}^k + \alpha_s^k z_e^{i^*_c}$

where $i^*_c$ identifies the closest encoding of class $c$ in the current batch. This procedure ensures each class partition remains populated with active codevectors.

3. Training and Inference Procedures

Training Algorithm

For each mini-batch of labeled point clouds $(P_i, c_i)$ :

Encode $z_e^i = \mathfrak{E}(P_i)$
For each $i$ , determine $k_i = \arg \min_{k \in \text{block}(c_i)} \| z_e^i - e^k \|^2$
Assign $z_q^i = e^{k_i}$ , reconstruct $\tilde{P}_i = \mathfrak{D}(z_q^i)$
Compute $L_{AE}$ over the batch, backpropagate gradients, update codebook entries via the class-aware running average
Update $U_s^k$ , reinitialize dead codevectors

Inference Workflow (with LFMM)

Objects generated by the Latent-space Flow Matching Model (LFMM) provide a class label $\hat{c}^m$ and a 32-dimensional feature vector $\hat{F}^m$ per object. CPVQ-VAE applies a class-aware inverse lookup:

$\hat{F}^m$ is zero-padded to 128 dimensions.
For the class $\hat{c}$ , choose $k^* = \arg \max_{k \in \text{block}(\hat{c})} \text{cosine}(\hat{F}, e^k_{1:32})$
Quantized latent $z_q = e^{k^*}$ is decoded to $\tilde{P} = \mathfrak{D}(z_q)$

This results in direct generation of class-specific point clouds, bypassing retrieval from external databases.

4. Integration with Latent-space Flow Matching Model (LFMM)

LFMM generates holistic scene layouts by producing object labels and features as inputs for CPVQ-VAE. Each object’s attributes $x = (T;R;S;C;F)$ (translation, rotation, size, class, feature) are vectorized. LFMM learns a vector field $v_\theta(x_t, t, \text{floorplan})$ that transports a Gaussian noise sample $x_0 \sim N(0, I)$ along a linear path $x_t = (1-t)x_0 + t x_1$ towards the data sample $x_1$ . The velocity field $v_\theta$ is predicted by a U-Net with cross-attention to floorplan encodings and optimized using

$L_{FM} = E_{t \sim U[0,1]} \sum_{J \in \{T,R,S,C,F\}} \lambda_J \| v_\theta^J(x_t, t) - (x_1^J - x_0^J) \|_2^2$

Sampling is performed via Euler integration with $N_t=100$ steps:

$x_{t+1} = x_t + \frac{1}{N_t} v_\theta(x_t, t)$

LFMM thus yields box parameters, class probabilities, and latent features that drive the CPVQ-VAE decoding process.

5. Evaluation Metrics and Empirical Results

Quantitative evaluation employs:

Chamfer Distance (CD): as defined above, reported as $\times 10^3$ .
Point2Mesh Error (P2M): the average $\ell_2$ distance from each generated point to the ground-truth mesh surface ( $\alpha$ -wrapped mesh), also reported as $\times 10^3$ .

On the 3D-FRONT living-room dataset:

CPVQ-VAE achieves a 70.4% reduction in Chamfer Distance and a 72.3% reduction in Point2Mesh Error relative to the Diffuscene baseline.
Compared to an LFMM+standard VAE, CPVQ-VAE yields a 63.2% reduction in CD and a 64.7% reduction in P2M.

Qualitative observations indicate that standard VAEs often decode latent codes into incorrect class geometries (e.g., chairs as sofas). CPVQ-VAE’s class-aware lookup mechanism consistently produces shapes matching the generated class, mitigating previous semantic inconsistencies.

6. Significance and Implications

CPVQ-VAE’s innovations—a labeled, class-partitioned codebook and class-aware maintenance—enable direct, semantically accurate point cloud generation for multi-object scenes, eliminating the necessity for pre-defined object databases. When paired with LFMM for scene layout generation, the approach constitutes the first system capable of pure point cloud synthesis of multi-class 3D indoor scenes with high geometric fidelity and class consistency, as demonstrated by substantial error reductions in both CD and P2M metrics (Edirimuni et al., 18 Jan 2026).

A plausible implication is that class-partitioned quantization with active code maintenance could generalize to other modalities and tasks suffering from latent-class inconsistency and codebook collapse. The CPVQ-VAE framework provides methodological advances for scene-level generative models in 3D computer vision, particularly where direct correspondence between latent codes and object classes is essential.

Markdown Report Issue Upgrade to Chat

References (1)

Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE).