
CPVQ-VAE for Class-Consistent Point Clouds

Updated 25 January 2026
  • The paper introduces CPVQ-VAE, which partitions the latent codebook into class-specific bins to ensure geometrically and semantically accurate point cloud generation.
  • It employs class-aware maintenance and a dual-stage encoding-decoding process to counteract codebook collapse and misclassification errors.
  • Empirical results reveal significant reductions in Chamfer Distance and Point2Mesh Error compared to standard VAEs and diffusion-based methods.

The Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) is a generative model architecture developed to produce class-consistent point cloud objects from latent features, enabling direct point cloud scene generation without reliance on external object retrieval databases. CPVQ-VAE addresses failure modes observed when conventional autoencoders or diffusion-based latent decoding yield incorrect object geometries with mismatched classes. By explicitly partitioning the latent codebook into class-labeled bins and employing class-aware codebook maintenance, CPVQ-VAE reliably maps generated latents to point cloud shapes matching the intended object category, achieving significant reductions in geometrical and semantic reconstruction errors (Edirimuni et al., 18 Jan 2026).

1. Architecture and Codebook Partitioning

CPVQ-VAE extends the standard VQ-VAE framework through a dual mechanism: partitioning the codebook into class-specific bins and actively maintaining codebook utilization to counteract codebook collapse. The model operates over point cloud inputs $P \in \mathbb{R}^{N_P \times 3}$, with $N_P = 2025$ points. The encoder $\mathfrak{E}$ uses a PointNet++-style architecture consisting of three set-abstraction layers, feature MLPs, and global max-pooling, mapping $P$ to a 128-dimensional latent vector $z_e$. Quantization snaps $z_e$ to the nearest codevector within the designated class partition, yielding $z_q \in \mathbb{R}^{128}$. The decoder $\mathfrak{D}$ adopts a FoldingNet-inspired strategy: a fixed $45 \times 45$ 2D grid is concatenated with $z_q$ and passed through two “folding” MLP layers to generate the reconstructed point cloud $\tilde{P} \in \mathbb{R}^{N_P \times 3}$.

Let $N_c$ denote the number of object classes and $N_q$ the number of discrete codevectors per class, so the total codebook size is $N_K = N_c \times N_q$. Codevectors $e^k$ are assigned to contiguous class-specific blocks. During quantization, for an object of class $c$, only codevectors in block $c$ are considered.
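The class-restricted nearest-neighbor quantization can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the sizes below (other than the 128-dimensional latent) are placeholder assumptions.

```python
import numpy as np

# Placeholder sizes; the paper's codebook holds N_q codevectors per class in
# contiguous blocks, with 128-dimensional latents.
N_c, N_q, D = 4, 8, 128                    # classes, codevectors per class, latent dim
codebook = np.random.randn(N_c * N_q, D)   # rows are codevectors e^k

def quantize(z_e, c):
    """Snap latent z_e to its nearest codevector within class c's block."""
    block = codebook[c * N_q:(c + 1) * N_q]                 # class c's contiguous block
    k_local = int(np.argmin(np.sum((block - z_e) ** 2, axis=1)))
    k = c * N_q + k_local                                   # global codebook index
    return codebook[k], k

z_q, k = quantize(np.random.randn(D), c=2)
assert 2 * N_q <= k < 3 * N_q   # the chosen index always falls inside class 2's partition
```

Because the argmin never leaves the class block, a latent for class $c$ can only ever map to a class-$c$ codevector, which is what enforces class consistency at decode time.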

2. Training Objectives and Codebook Maintenance

CPVQ-VAE’s optimization minimizes a VQ-VAE loss comprising three terms:

  • Autoencoder reconstruction via Chamfer Distance:

$$L_{CD}(P,\tilde{P}) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in \tilde{P}} \| p - q \|^2 + \frac{1}{|\tilde{P}|} \sum_{q \in \tilde{P}} \min_{p \in P} \| q - p \|^2$$

  • Quantization loss: $\| \operatorname{sg}[z_e] - z_q \|_2^2$
  • Commitment loss: $\beta \| z_e - \operatorname{sg}[z_q] \|_2^2$, with experimental coefficients $\lambda_{CD} = 10$ (weighting the reconstruction term) and $\beta = 1$.
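The Chamfer Distance above can be computed directly from a pairwise squared-distance matrix. A minimal NumPy version (a sketch, not the paper's code):

```python
import numpy as np

def chamfer_distance(P, P_tilde):
    """Symmetric Chamfer Distance between two point sets of shape (N, 3)."""
    # d2[i, j] = ||P[i] - P_tilde[j]||^2 via broadcasting
    d2 = np.sum((P[:, None, :] - P_tilde[None, :, :]) ** 2, axis=-1)
    # mean of nearest-neighbor distances in both directions
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

P = np.random.rand(100, 3)
assert chamfer_distance(P, P) == 0.0   # identical clouds have zero CD
```

The $O(N^2)$ pairwise matrix is fine at $N_P = 2025$; larger clouds would call for a KD-tree nearest-neighbor query instead.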

To resolve codebook collapse (large swathes of codevectors left unused), CPVQ-VAE uses a class-aware running-average update. For each codevector $e^k$, a usage statistic $U_s^k$ is maintained:

$$U_s^k = \gamma U_{s-1}^k + (1-\gamma)\frac{u_s^k}{B}$$

where $u_s^k$ approximates the codevector's usage in a batch of size $B$, and $\gamma = 0.99$. If $U_s^k$ falls below a threshold, $e^k$ is reinitialized towards the nearest encoding in the batch:

$$\alpha_s^k = \exp\!\left(-10\, U_s^k \frac{N_q}{1-\gamma} - \epsilon\right)$$

$$e_s^k \leftarrow (1-\alpha_s^k)\, e_{s-1}^k + \alpha_s^k\, z_e^{i^*_c}$$

where $i^*_c$ identifies the closest encoding of class $c$ in the current batch. This procedure ensures each class partition remains populated with active codevectors.

3. Training and Inference Procedures

Training Algorithm

For each mini-batch of labeled point clouds $(P_i, c_i)$:

  • Encode $z_e^i = \mathfrak{E}(P_i)$
  • For each $i$, determine $k_i = \arg\min_{k \in \text{block}(c_i)} \| z_e^i - e^k \|^2$
  • Assign $z_q^i = e^{k_i}$, reconstruct $\tilde{P}_i = \mathfrak{D}(z_q^i)$
  • Compute $L_{AE}$ over the batch, backpropagate gradients, update codebook entries via the class-aware running average
  • Update $U_s^k$, reinitialize dead codevectors
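The control flow of one training step can be sketched structurally. The real encoder and decoder are the PointNet++- and FoldingNet-style networks described earlier; here they are replaced by trivial stand-ins so the sequence of operations is runnable:

```python
import numpy as np

N_c, N_q, D_lat, N_P = 4, 8, 128, 2025          # N_c, N_q are placeholder sizes
codebook = np.random.randn(N_c * N_q, D_lat)
encode = lambda P: np.resize(P.mean(axis=0), D_lat)   # stand-in for E(P_i)
decode = lambda z: np.resize(z, (N_P, 3))             # stand-in for D(z_q)

def forward_batch(batch):
    """batch: list of (P_i, c_i). Returns (index, reconstruction) per object."""
    out = []
    for P, c in batch:
        z_e = encode(P)
        block = codebook[c * N_q:(c + 1) * N_q]       # restrict to class c's block
        k = c * N_q + int(np.argmin(np.sum((block - z_e) ** 2, axis=1)))
        out.append((k, decode(codebook[k])))
        # Training would now compute the three-term loss, backpropagate with a
        # straight-through estimator, and run the class-aware usage update above.
    return out
```

This omits gradients entirely; it is only meant to show where the class-restricted lookup sits inside the per-batch loop.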

Inference Workflow (with LFMM)

Objects generated by the Latent-space Flow Matching Model (LFMM) provide a class label $\hat{c}^m$ and a 32-dimensional feature vector $\hat{F}^m$ per object. CPVQ-VAE applies a class-aware inverse lookup:

  • $\hat{F}^m$ is zero-padded to 128 dimensions.
  • For the class $\hat{c}$, choose $k^* = \arg\max_{k \in \text{block}(\hat{c})} \operatorname{cosine}(\hat{F}, e^k_{1:32})$
  • The quantized latent $z_q = e^{k^*}$ is decoded to $\tilde{P} = \mathfrak{D}(z_q)$

This results in direct generation of class-specific point clouds, bypassing retrieval from external databases.
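The inverse lookup reduces to a cosine-similarity argmax over the first 32 dimensions of one class block. A minimal sketch (codebook sizes are placeholder assumptions):

```python
import numpy as np

N_c, N_q, D_lat = 4, 8, 128
codebook = np.random.randn(N_c * N_q, D_lat)

def inverse_lookup(F_hat, c_hat):
    """Pick the codevector in class c_hat's block whose first 32 dims best match F_hat."""
    block = codebook[c_hat * N_q:(c_hat + 1) * N_q, :32]   # e^k_{1:32} for the class block
    sims = block @ F_hat / (np.linalg.norm(block, axis=1) * np.linalg.norm(F_hat))
    k = c_hat * N_q + int(np.argmax(sims))                 # class-restricted cosine argmax
    return codebook[k], k                                  # z_q = e^{k*}, then decode

z_q, k = inverse_lookup(np.random.randn(32), c_hat=1)
assert N_q <= k < 2 * N_q
```

Because the argmax is restricted to block $\hat{c}$, even a noisy generated feature can only select a codevector of the intended class.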

4. Integration with Latent-space Flow Matching Model (LFMM)

LFMM generates holistic scene layouts by producing object labels and features as inputs for CPVQ-VAE. Each object’s attributes $x = (T; R; S; C; F)$ (translation, rotation, size, class, feature) are vectorized. LFMM learns a vector field $v_\theta(x_t, t, \text{floorplan})$ that transports a Gaussian noise sample $x_0 \sim N(0, I)$ along a linear path $x_t = (1-t)x_0 + t x_1$ towards the data sample $x_1$. The velocity field $v_\theta$ is predicted by a U-Net with cross-attention to floorplan encodings and optimized using

$$L_{FM} = \mathbb{E}_{t \sim U[0,1]} \sum_{J \in \{T,R,S,C,F\}} \lambda_J \| v_\theta^J(x_t, t) - (x_1^J - x_0^J) \|_2^2$$

Sampling is performed via Euler integration with $N_t = 100$ steps:

$$x_{t+1} = x_t + \frac{1}{N_t} v_\theta(x_t, t)$$

LFMM thus yields box parameters, class probabilities, and latent features that drive the CPVQ-VAE decoding process.
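The Euler sampler above is a few lines of code. A sketch with a toy velocity field standing in for the learned U-Net:

```python
import numpy as np

N_t = 100                        # integration steps, as in the paper
v_theta = lambda x, t: -x        # toy stand-in for the learned velocity field

def sample(x0):
    """Euler integration: x_{t+1} = x_t + (1/N_t) * v_theta(x_t, t)."""
    x = x0
    for step in range(N_t):
        t = step / N_t
        x = x + (1.0 / N_t) * v_theta(x, t)
    return x

x1 = sample(np.random.randn(8))  # transports the noise sample along the learned field
```

With the toy field $v = -x$ each step scales $x$ by $(1 - 1/N_t)$, so the trajectory contracts toward the origin; the trained $v_\theta$ instead transports noise toward valid scene attribute vectors.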

5. Evaluation Metrics and Empirical Results

Quantitative evaluation employs:

  • Chamfer Distance (CD): as defined above, reported as $\times 10^3$.
  • Point2Mesh Error (P2M): the average $\ell_2$ distance from each generated point to the ground-truth mesh surface (an $\alpha$-wrapped mesh), also reported as $\times 10^3$.

On the 3D-FRONT living-room dataset:

  • CPVQ-VAE achieves a 70.4% reduction in Chamfer Distance and a 72.3% reduction in Point2Mesh Error relative to the Diffuscene baseline.
  • Compared to an LFMM+standard VAE, CPVQ-VAE yields a 63.2% reduction in CD and a 64.7% reduction in P2M.

Qualitative observations indicate that standard VAEs often decode latent codes into incorrect class geometries (e.g., chairs as sofas). CPVQ-VAE’s class-aware lookup mechanism consistently produces shapes matching the generated class, mitigating previous semantic inconsistencies.

6. Significance and Implications

CPVQ-VAE’s innovations—a labeled, class-partitioned codebook and class-aware maintenance—enable direct, semantically accurate point cloud generation for multi-object scenes, eliminating the necessity for pre-defined object databases. When paired with LFMM for scene layout generation, the approach constitutes the first system capable of pure point cloud synthesis of multi-class 3D indoor scenes with high geometric fidelity and class consistency, as demonstrated by substantial error reductions in both CD and P2M metrics (Edirimuni et al., 18 Jan 2026).

A plausible implication is that class-partitioned quantization with active code maintenance could generalize to other modalities and tasks suffering from latent-class inconsistency and codebook collapse. The CPVQ-VAE framework provides methodological advances for scene-level generative models in 3D computer vision, particularly where direct correspondence between latent codes and object classes is essential.
