Grouped Discrete Representations
- Grouped Discrete Representations are a technique that splits feature channels into independent attribute groups, each quantized with its own codebook.
- GDR utilizes tuple-index quantization to represent features, enhancing compositionality by separately encoding attributes like color, shape, and texture.
- This approach improves convergence speed, segmentation accuracy, and interpretability in VAE-based object-centric models, making it a robust alternative to scalar-index discretization.
Grouped Discrete Representations (GDR) are a class of discrete bottleneck techniques for object-centric learning (OCL) models, especially those leveraging discrete variational autoencoders (dVAEs) or vector quantized VAEs (VQ-VAEs) for dense-to-sparse abstraction. The central contribution of GDR is the decomposition of feature vectors at each spatial location into groups corresponding to underlying object attributes. Instead of assigning a scalar code index per spatial feature, GDR discretizes each attribute group with its own small codebook and represents features using a tuple of indices, enabling combinatorial attribute representation, improved compositionality, and more interpretable, generalizable object-centric abstractions (Zhao et al., 2024).
1. Motivation and Conceptual Foundations
Object-centric learning seeks to represent raw perceptual input (images, video) as structured, object-level features for use in downstream modeling and reasoning. Classical OCL pipelines (e.g., SLATE, STEVE, SlotAttention) initially compress dense pixel representations into spatial feature maps, which are then discretized using a codebook learned via a dVAE or VQ-VAE bottleneck. This quantization process maps each spatial feature vector to its single nearest code in a flat codebook, suppressing pixel-level noise and guiding slot attention toward object-level structure.
However, conventional scalar-index discretization entangles multiple visual attributes (such as color, shape, and texture) into single high-dimensional codes. This lack of compositionality leads to poor generalization on unseen attribute combinations and slows convergence, as there is no attribute-level regularity for decoders to exploit. The GDR insight is to explicitly group the channels of feature vectors into "attribute subspaces" and quantize each with an independent sub-codebook, resulting in tuple-index representations that render attribute commonalities and distinctions explicit (Zhao et al., 2024).
2. Mathematical Formulation
Let $x$ be an input image or frame, and $z \in \mathbb{R}^{H \times W \times D}$ its continuous feature map. GDR partitions the channel dimension as $D = G \times d$ for some group count $G$ and group size $d$. The feature map is sliced as $z = [z^{(1)}; z^{(2)}; \dots; z^{(G)}]$ with $z^{(g)} \in \mathbb{R}^{H \times W \times d}$.
For each group $g \in \{1, \dots, G\}$, a separate codebook $C^{(g)} = \{e^{(g)}_j\}_{j=1}^{K} \subset \mathbb{R}^{d}$ is learned. Each spatial feature is then discretized as follows:
- For each group $g$ and each spatial location $(h, w)$, compute the code assignment $k^{(g)}_{h,w} = \arg\min_{j \in \{1,\dots,K\}} \lVert z^{(g)}_{h,w} - e^{(g)}_j \rVert_2$.
- The complete discrete code at position $(h, w)$ is the tuple $(k^{(1)}_{h,w}, \dots, k^{(G)}_{h,w})$. The quantized feature is assembled by concatenating the selected codes, $\hat{z}_{h,w} = [\,e^{(1)}_{k^{(1)}_{h,w}}; \dots; e^{(G)}_{k^{(G)}_{h,w}}\,] \in \mathbb{R}^{D}$.
During training, a Gumbel-softmax or L2-distance-based soft assignment provides differentiability. To encourage balanced usage across all codes, a utilization regularizer penalizes codebook collapse, e.g. by minimizing the negative entropy of average code usage, $\mathcal{L}_{\text{util}} = \sum_{g=1}^{G} \sum_{j=1}^{K} \bar{p}^{(g)}_j \log \bar{p}^{(g)}_j$, where $\bar{p}^{(g)}_j$ is the (batch-averaged) soft assignment probability for code $j$ in group $g$.
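The group-wise hard assignment can be sketched in a few lines of plain Python. This is a minimal toy (tiny dimensions, random codebooks, hard nearest-neighbour lookup only); all names are illustrative, not from the papers:

```python
import random

random.seed(0)

G, d, K = 2, 3, 4  # number of groups, channels per group, codes per group


def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]


# One sub-codebook of K d-dimensional codes per group.
codebooks = [[rand_vec(d) for _ in range(K)] for _ in range(G)]


def quantize(feature):
    """Split a (G*d)-dim feature into G chunks, assign each chunk to its
    nearest code, and return the tuple index plus the quantized vector."""
    assert len(feature) == G * d
    indices, quantized = [], []
    for g in range(G):
        chunk = feature[g * d:(g + 1) * d]
        # Hard nearest-neighbour assignment under squared L2 distance.
        k = min(range(K), key=lambda j: sum(
            (a - b) ** 2 for a, b in zip(chunk, codebooks[g][j])))
        indices.append(k)
        quantized.extend(codebooks[g][k])  # concatenate the selected codes
    return tuple(indices), quantized


idx, z_hat = quantize(rand_vec(G * d))
print(idx)  # one tuple of G indices, each in [0, K)
```

A real implementation would apply this per spatial location over the whole feature map and replace the hard argmin with a Gumbel-softmax relaxation during training, as described above.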
The overall loss function for GDR in a VAE-based pipeline includes:
- Reconstruction loss,
- Commitment and codebook losses,
- Code utilization loss (Zhao et al., 2024).
For downstream OCL model training, tuple indices can be flattened to scalar indices for cross-entropy loss via $k_{h,w} = \sum_{g=1}^{G} k^{(g)}_{h,w} \cdot K^{\,g-1} \in \{0, \dots, K^G - 1\}$.
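One standard mixed-radix scheme for this tuple-to-scalar flattening, sketched as a small helper (function names are illustrative; assumes all groups share the same codebook size $K$):

```python
K = 8  # per-group codebook size (illustrative)


def flatten(tuple_idx, K=K):
    """(k1, ..., kG) with each kg in [0, K) -> one scalar in [0, K**G)."""
    scalar = 0
    for g, k in enumerate(tuple_idx):
        scalar += k * K ** g  # group g contributes at digit position g
    return scalar


def unflatten(scalar, G, K=K):
    """Inverse of flatten: recover the G per-group indices."""
    return tuple((scalar // K ** g) % K for g in range(G))


print(flatten((3, 0, 7)))        # 3 + 0*8 + 7*64 = 451
print(unflatten(451, G=3))       # (3, 0, 7)
```

Because the mapping is a bijection onto $\{0, \dots, K^G - 1\}$, a standard cross-entropy head over the flattened index is equivalent to classifying the full tuple.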
3. Integration with Object-Centric Learning Architectures
GDR augments transformer-based OCL models by replacing traditional scalar code assignment with group-wise, tuple-index quantization both in feature encoding and in reconstruction guidance.
Encoder pipeline:
- Feature maps from a dVAE are grouped and quantized as described, yielding tuple-index code maps per spatial location.
- These discrete maps are used both for reconstruction (VAE decoder) and to guide slot attention.
Decoder pipeline:
- Quantized features feed into Slot Attention modules and then into transformer-based or diffusion-based decoders, which are trained to predict the next timestep's tuple-index discrete representations (for autoregressive or diffusion-based OCL models) (Zhao et al., 2024).
- The loss combines reconstruction fidelity and discrete classification (or diffusion) objectives.
4. Empirical Evaluation and Performance
GDR has been validated on a suite of image and video object-centric segmentation and representation benchmarks:
| Base Model | Dataset/Domain | Metric (baseline → GDR) | Additional Benefit |
|---|---|---|---|
| SLATE | ClevrTex | ARI: 83.2% → 89.5% (+6.3%) | Faster convergence |
| SLATE | COCO | ARI: 68.1% → 73.5% (+5.4%) | Higher accuracy |
| STEVE | MOVi-C video | ARI_fg: 81.0% → 87.2% (+6.2%) | Improved transfer |
| SlotDiffusion | ClevrTex | ARI: 90.4% → 93.1% (+2.7%) | Stability |
| SLATE | PASCAL VOC | mIoU: +2–4 pts | Robust OOD |
In all cases, GDR consistently improved convergence speed, final ARI/mIoU segmentation metrics, and generalizability versus traditional scalar-index or “flat codebook” discretization. Performance gains were robust under both random and condition-based query initialization, and when GDR was integrated with transformer- or diffusion-based OCL decoders (Zhao et al., 2024).
5. Qualitative Interpretability and Attribute Manipulation
Qualitative analyses highlight several key properties of GDR:
- Attribute-level visualizations: Mapping tuple indices to color channels demonstrates that each group codebook specializes in different attributes. For example, one group may focus on color distinctions while another captures texture boundaries.
- Attribute manipulation: Modifying single group codes for selected objects changes their isolated visual attribute (e.g., color or texture) while leaving others invariant. This shows that tuple-indexing yields semantically meaningful, decomposable representations.
- Improved object separability: Distance-to-center and code-similarity heatmaps reveal that GDR yields more compact, object-aligned clusters than scalar-index quantization, suppressing spurious correlations across objects (Zhao et al., 2024).
6. Design Choices, Limitations, and Extensions
Group Count and Codebook Size
Empirical ablations indicate:
- Smaller group counts are optimal for simple, single-object scenes; moderate counts offer stability for multi-object datasets; larger counts provide the highest representational capacity but increased variance.
- Per-group codebook size $K$ must be chosen so that $K^G = N$, where $N$ is the intended effective total codebook size.
- Larger group size $d$ is beneficial up to a point, but with diminishing returns (Zhao et al., 2024).
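Choosing $K$ so that $K^G$ matches the intended effective total means the same combinatorial capacity can be reached with far fewer stored code vectors. A quick arithmetic check (the numbers are illustrative, not the papers' settings):

```python
# Same effective tuple-space capacity, shrinking storage cost.
N = 4096  # intended effective total codebook size (illustrative)
for G, K in [(1, 4096), (2, 64), (4, 8)]:
    assert K ** G == N          # capacity is identical in all three cases
    print(G, K, G * K)          # stored codes drop: 4096, then 128, then 32
```

The flat (scalar-index) baseline corresponds to the first row: one group storing all 4096 codes explicitly.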
Channel Grouping and Mis-grouping
Naive channel splitting along the feature channel dimension risks grouping unrelated semantic attributes. This leads to information loss and suboptimal quantization (Zhao et al., 2024). Organized GDR (OGDR) addresses this by introducing an invertible projection that “organizes” channels before group quantization. The projection collects semantically related channels together, performs group-wise quantization, and projects back, ensuring true attribute decomposability while retaining expressivity. OGDR yields higher template diversity (confirmed by codebook PCA) and improved object-invariance in code assignments (Zhao et al., 2024).
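A toy version of OGDR's organize-quantize-restore round trip can use a channel permutation as the simplest possible invertible projection (the actual method learns a general invertible linear map; everything below, including the rounding "quantizer", is a stand-in for illustration):

```python
# Invertible channel reorganization before group-wise quantization.
perm = [0, 2, 4, 1, 3, 5]                # gather related channels together
inv = [perm.index(i) for i in range(6)]  # inverse permutation


def organize(z):
    """Apply the invertible projection (here: a permutation)."""
    return [z[p] for p in perm]


def restore(z):
    """Undo the projection after quantization."""
    return [z[p] for p in inv]


def quantize_groups(z):
    # Stand-in for per-group nearest-code lookup: round each channel.
    return [round(v) for v in z]


z = [0.9, 0.1, 1.1, -0.2, 0.8, 0.05]
z_hat = restore(quantize_groups(organize(z)))
print(z_hat)  # quantization happened in the organized channel basis
```

The key property is that `restore(organize(z)) == z` for any `z`, so the reorganization costs no information; only the group-wise quantization in between is lossy.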
Hyperparameter Recommendations
Empirical recommendations include:
- Total effective codebook size $N$, split into $G$ groups with per-group codebook size $K = N^{1/G}$ and group channel size $d = D/G$,
- Invertible projection with channel expansion ratio 8 (or 4),
- Utilization regularizer with a small loss weight,
- Gumbel-softmax temperature annealed from a high initial value to a low final value over training,
- Layer normalization after group quantization (Zhao et al., 2024).
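The temperature anneal in the list above can follow any standard schedule; a minimal exponential-decay sketch (the endpoint values and step counts are illustrative, not the papers' exact settings):

```python
def gumbel_tau(step, total_steps, tau_start=1.0, tau_end=0.1):
    """Exponentially decay the Gumbel-softmax temperature from
    tau_start to tau_end over total_steps training steps."""
    frac = min(step / total_steps, 1.0)
    return tau_start * (tau_end / tau_start) ** frac


print(gumbel_tau(0, 1000))     # 1.0 at the start (soft, high-entropy codes)
print(gumbel_tau(1000, 1000))  # 0.1 at the end (nearly hard assignments)
```

A high early temperature keeps gradients flowing through all codes; the low final temperature pushes assignments toward the hard tuple indices used at inference.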
7. Significance and Impact
Grouped Discrete Representations inject explicit, combinatorial attribute structure into discrete VAE bottlenecks for OCL. This provides:
- Stronger semantic guidance for object slot assignment and decoding,
- Vastly increased code reuse through the compositional tuple space,
- Substantial improvements in generalization, sample efficiency, and interpretability,
- A drop-in enhancement to any VQ-VAE or dVAE-based transformer or diffusion OCL pipeline.
OGDR further demonstrates that semantically organizing channel groupings before quantization eliminates redundancy, preserves fine-grained attribute information, and translates into higher segmentation accuracy across both transformer- and diffusion-based architectures (Zhao et al., 2024). These findings suggest a new standard for discrete representation learning in object-centric modeling, with implications for both theory and practical systems.