CPVQ-VAE for Class-Consistent Point Clouds
- The paper introduces CPVQ-VAE, which partitions the latent codebook into class-specific bins to ensure geometrically and semantically accurate point cloud generation.
- It employs class-aware maintenance and a dual-stage encoding-decoding process to counteract codebook collapse and misclassification errors.
- Empirical results reveal significant reductions in Chamfer Distance and Point2Mesh Error compared to standard VAEs and diffusion-based methods.
The Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) is a generative model architecture developed to produce class-consistent point cloud objects from latent features, enabling direct point cloud scene generation without reliance on external object retrieval databases. CPVQ-VAE addresses failure modes observed when conventional autoencoders or diffusion-based latent decoding yield incorrect object geometries with mismatched classes. By explicitly partitioning the latent codebook into class-labeled bins and employing class-aware codebook maintenance, CPVQ-VAE reliably maps generated latents to point cloud shapes matching the intended object category, achieving significant reductions in geometrical and semantic reconstruction errors (Edirimuni et al., 18 Jan 2026).
1. Architecture and Codebook Partitioning
CPVQ-VAE extends the standard VQ-VAE framework through a dual mechanism: partitioning the codebook into class-specific bins and actively maintaining codebook utilization to counteract codebook collapse. The model operates over point cloud inputs $P \in \mathbb{R}^{N \times 3}$ with $N$ points. The encoder utilizes a PointNet++-style architecture consisting of three set-abstraction layers, feature MLPs, and global max-pooling, mapping $P$ to a 128-dimensional latent vector $z$. Quantization snaps $z$ to the nearest codevector within the designated class partition, yielding the quantized latent $\hat{z}$. The decoder adopts a FoldingNet-inspired strategy: a fixed 2D grid is concatenated with $\hat{z}$ and passed through two “folding” MLP layers to generate the reconstructed point cloud $\hat{P}$.
Let $C$ denote the number of object classes and $K$ the number of discrete codevectors per class, so the total codebook size is $C \cdot K$. Codevectors $\{e_k\}$ are assigned to contiguous class-specific blocks. During quantization, for an object of class $c$, only codevectors in block $c$ are considered.
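The block-restricted nearest-neighbor lookup can be sketched as follows; this is a minimal NumPy illustration, with names such as `quantize_in_class` and the values of `C`, `K`, `D` chosen for the example rather than taken from the paper.

```python
import numpy as np

# Minimal sketch of class-partitioned quantization: a codebook of C classes
# with K codevectors each, stored as contiguous blocks in a (C*K, D) array.
C, K, D = 4, 8, 128
rng = np.random.default_rng(0)
codebook = rng.standard_normal((C * K, D))

def quantize_in_class(z, codebook, cls, K):
    """Snap latent z to the nearest codevector inside class block `cls`."""
    block = codebook[cls * K:(cls + 1) * K]       # class-specific bin only
    dists = np.linalg.norm(block - z, axis=1)     # L2 to each codevector
    k_local = int(np.argmin(dists))
    return cls * K + k_local, block[k_local]      # global index, codevector

z = rng.standard_normal(D)
idx, z_q = quantize_in_class(z, codebook, cls=2, K=K)
```

Because the search is confined to the class block, the returned index always lies inside that block, which is what guarantees the decoded shape belongs to the intended category.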
2. Training Objectives and Codebook Maintenance
CPVQ-VAE’s optimization minimizes three terms comprising the VQ-VAE loss:
- Autoencoder reconstruction via Chamfer Distance: $\mathcal{L}_{\mathrm{CD}}(P, \hat{P}) = \frac{1}{|P|} \sum_{p \in P} \min_{\hat{p} \in \hat{P}} \|p - \hat{p}\|_2^2 + \frac{1}{|\hat{P}|} \sum_{\hat{p} \in \hat{P}} \min_{p \in P} \|\hat{p} - p\|_2^2$
- Quantization loss: $\mathcal{L}_{\mathrm{quant}} = \|\mathrm{sg}[z] - \hat{z}\|_2^2$, where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator
- Commitment loss: $\mathcal{L}_{\mathrm{commit}} = \|z - \mathrm{sg}[\hat{z}]\|_2^2$, with experimentally chosen weighting coefficients on the quantization and commitment terms
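A minimal NumPy sketch of these terms on toy arrays is below. The coefficients `beta_q` and `beta_c` are placeholders for the paper's experimentally chosen weights, and the stop-gradient distinction only matters under autodiff (noted in the comments); numerically both VQ terms are the same squared distance.

```python
import numpy as np

def chamfer(P, Q):
    """Symmetric Chamfer Distance between (N,3) and (M,3) point sets."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1) ** 2
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def vq_losses(z, z_q, beta_q=1.0, beta_c=0.25):
    # Under autodiff, sg[.] decides which term updates the codebook
    # (quantization loss) vs. the encoder (commitment loss).
    quant = np.sum((z_q - z) ** 2)    # ||sg[z] - z_hat||^2
    commit = np.sum((z - z_q) ** 2)   # ||z - sg[z_hat]||^2
    return beta_q * quant + beta_c * commit
```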
To resolve codebook collapse (large swathes of codevectors unused), CPVQ-VAE uses a class-aware running-average update. For each codevector $e_k$, a usage statistic $u_k$ is maintained as an exponential moving average,
$$u_k \leftarrow \lambda u_k + (1 - \lambda)\frac{n_k}{B},$$
where $n_k / B$ approximates the usage of $e_k$ in a batch of size $B$, and $\lambda \in (0, 1)$ is the decay factor. If $u_k$ falls below a threshold, $e_k$ is reinitialized towards the nearest encoding in the batch:
$$e_k \leftarrow z_{j^*}, \qquad j^* = \arg\min_{j\,:\,c_j = c(k)} \|z_j - e_k\|_2,$$
where $j^*$ identifies the closest encoding of class $c(k)$ in the current batch. This procedure ensures each class partition remains populated with active codevectors.
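The maintenance step can be sketched as below; the decay `lam`, threshold `tau`, and the exact reset rule are illustrative reconstructions of the scheme described above, not the paper's reported values.

```python
import numpy as np

def update_usage(u, assignments, num_codes, B, lam=0.99):
    """EMA of per-codevector usage from this batch's assignment counts."""
    counts = np.bincount(assignments, minlength=num_codes)
    return lam * u + (1 - lam) * counts / B

def revive_dead(codebook, u, batch_z, batch_cls, K, tau=1e-3):
    """Reinitialize under-used codevectors toward the nearest
    same-class encoding in the current batch."""
    for k in np.where(u < tau)[0]:
        cls = k // K                           # class block owning code k
        same = batch_z[batch_cls == cls]       # encodings of that class
        if len(same) == 0:
            continue                           # no same-class sample: skip
        j = np.argmin(np.linalg.norm(same - codebook[k], axis=1))
        codebook[k] = same[j]
    return codebook
```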
3. Training and Inference Procedures
Training Algorithm
For each mini-batch of labeled point clouds $\{(P_i, c_i)\}_{i=1}^{B}$:
- Encode $z_i = \mathrm{Enc}(P_i)$
- For each $z_i$, determine the nearest codevector $\hat{z}_i$ within class block $c_i$
- Assign $\hat{z}_i$, reconstruct $\hat{P}_i = \mathrm{Dec}(\hat{z}_i)$
- Compute the total loss over the batch, backpropagate gradients, update codebook entries via the class-aware running average
- Update usage statistics $u_k$, reinitialize dead codevectors
Inference Workflow (with LFMM)
Objects generated by the Latent-space Flow Matching Model (LFMM) provide a class label $c$ and a 32-dimensional feature vector $f$ per object. CPVQ-VAE applies a class-aware inverse lookup:
- $f$ is zero-padded to 128 dimensions, giving $\tilde{f}$.
- For the class $c$, choose $\hat{z} = \arg\min_{e_k \in \text{block } c} \|\tilde{f} - e_k\|_2$
- The quantized latent $\hat{z}$ is decoded to the output point cloud $\hat{P} = \mathrm{Dec}(\hat{z})$
This results in direct generation of class-specific point clouds, bypassing retrieval from external databases.
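The inverse lookup above can be sketched in NumPy; `inverse_lookup` and its arguments are illustrative names, with the 32-to-128 zero-padding and block-restricted matching taken from the description.

```python
import numpy as np

def inverse_lookup(f32, codebook, cls, K, latent_dim=128):
    """Map a 32-d LFMM feature to a decodable latent: zero-pad to the
    codebook dimension, then match inside the generated class's block."""
    f = np.zeros(latent_dim)
    f[:f32.shape[0]] = f32                     # zero-pad to 128 dims
    block = codebook[cls * K:(cls + 1) * K]    # class-specific bin only
    k_local = int(np.argmin(np.linalg.norm(block - f, axis=1)))
    return block[k_local]                      # latent passed to the decoder
```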
4. Integration with Latent-space Flow Matching Model (LFMM)
LFMM generates holistic scene layouts by producing object labels and features as inputs for CPVQ-VAE. Each object’s attributes (translation, rotation, size, class, feature) are vectorized. LFMM learns a vector field that transports a Gaussian noise sample $x_0 \sim \mathcal{N}(0, I)$ along a linear path $x_t = (1 - t)x_0 + t x_1$ towards the data sample $x_1$. The velocity field $v_\theta(x_t, t)$ is predicted by a U-Net with cross-attention to floorplan encodings and optimized using the flow-matching objective
$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t, x_0, x_1}\left[ \|v_\theta(x_t, t) - (x_1 - x_0)\|_2^2 \right].$$
Sampling is performed via Euler integration with $T$ steps of size $\Delta t = 1/T$:
$$x_{t + \Delta t} = x_t + \Delta t \, v_\theta(x_t, t).$$
LFMM thus yields box parameters, class probabilities, and latent features that drive the CPVQ-VAE decoding process.
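The Euler sampler reduces to a short loop; here `velocity` stands in for the U-Net $v_\theta(x, t)$, and both the function name and the toy field in the test are illustrative.

```python
import numpy as np

def euler_sample(velocity, x0, T=50):
    """Integrate x along the learned velocity field in T Euler steps."""
    x, dt = x0.copy(), 1.0 / T
    for i in range(T):
        x = x + dt * velocity(x, i * dt)   # x_{t+dt} = x_t + dt * v(x_t, t)
    return x
```

For the linear path, the target velocity $x_1 - x_0$ is constant in $t$, so a perfectly trained field is transported exactly onto the data sample regardless of step count.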
5. Evaluation Metrics and Empirical Results
Quantitative evaluation employs:
- Chamfer Distance (CD): as defined above, averaged over generated objects.
- Point2Mesh Error (P2M): the average distance from each generated point to the ground-truth mesh surface, computed against a wrapped version of the ground-truth mesh.
On the 3D-FRONT living-room dataset:
- CPVQ-VAE achieves a 70.4% reduction in Chamfer Distance and a 72.3% reduction in Point2Mesh Error relative to the Diffuscene baseline.
- Compared to an LFMM+standard VAE, CPVQ-VAE yields a 63.2% reduction in CD and a 64.7% reduction in P2M.
Qualitative observations indicate that standard VAEs often decode latent codes into incorrect class geometries (e.g., chairs as sofas). CPVQ-VAE’s class-aware lookup mechanism consistently produces shapes matching the generated class, mitigating previous semantic inconsistencies.
6. Significance and Implications
CPVQ-VAE’s innovations—a labeled, class-partitioned codebook and class-aware maintenance—enable direct, semantically accurate point cloud generation for multi-object scenes, eliminating the necessity for pre-defined object databases. When paired with LFMM for scene layout generation, the approach constitutes the first system capable of pure point cloud synthesis of multi-class 3D indoor scenes with high geometric fidelity and class consistency, as demonstrated by substantial error reductions in both CD and P2M metrics (Edirimuni et al., 18 Jan 2026).
A plausible implication is that class-partitioned quantization with active code maintenance could generalize to other modalities and tasks suffering from latent-class inconsistency and codebook collapse. The CPVQ-VAE framework provides methodological advances for scene-level generative models in 3D computer vision, particularly where direct correspondence between latent codes and object classes is essential.