Flow-Based Affordance Grounding in 3D Robotics
- The paper demonstrates that flow-based affordance grounding successfully generates probabilistic 3D affordance maps by learning coupled structure and affordance flows.
- It fuses sparse voxel features from multiple RGB-D views and processes them with a transformer whose token count stays constant, keeping 3D reconstruction efficient regardless of the number of views.
- The approach enables active view selection to iteratively improve reconstruction and affordance prediction under occlusion, outperforming baseline methods.
Flow-based affordance grounding refers to a paradigm in 3D vision and robotics that models the prediction of functional regions on an object’s surface—as specified by a natural language action query—using generative flow-based methods. Unlike standard discriminative predictors, this approach can output entire distributions over affordance maps, thereby capturing inherent ambiguities (e.g., multiple plausible grasp or cut locations). Contemporary methods ground affordances only on observable surfaces; in contrast, flow-based grounding, as instantiated in frameworks like AffordanceDream (also referred to as Affostruction), produces multimodal affordance distributions on the full, reconstructed 3D geometry, including parts unobserved in the sensor data (Park et al., 14 Jan 2026).
1. Problem Definition and Mathematical Formulation
Flow-based affordance grounding is defined on inputs of $N$ RGB-D views, each comprising an image $I_i$, a depth map $D_i$, camera intrinsics $K_i$, and extrinsics $T_i$. Given a query in natural language (e.g., "where to grasp?"), embedded via CLIP, the task is to output a probability distribution across the entire object's 3D surface, identifying likely affordance locations, even for regions never directly observed.
The complete geometry is represented as sparse 3D voxels $\mathcal{V} = \{(p_k, z_k)\}_{k=1}^{K}$, where $z_k$ contains latent structure features at voxel position $p_k$. A binary mask $a \in \{0, 1\}^{K}$ with entries $a_k$ denotes the ground-truth affordance at voxel $k$.
AffordanceDream learns two coupled flow-based models:
- Structure flow: Denoises a dense noise tensor $x_0$ to the reconstructed voxels $\hat{\mathcal{V}}$, representing the full shape.
- Affordance flow: Denoises a sparse noise tensor to the affordance logits $\hat{a}$, conditioned on the reconstructed structure and the CLIP text embedding.
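The coupled denoising can be pictured as Euler integration of two learned velocity fields. The sketch below is a minimal NumPy illustration, not the paper's implementation: `structure_velocity` is a placeholder standing in for the trained transformer, wired so that the straight-line (rectified) flow property is visible.

```python
import numpy as np

def euler_sample(velocity_fn, x, cond, steps=20):
    """Integrate dx/dt = v(x, t, cond) from t = 0 (noise) to t = 1 (data)."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

# Placeholder velocity field standing in for the trained structure transformer:
# for a straight-line (rectified) flow the ideal velocity is target - noise.
def structure_velocity(x, t, cond):
    return cond["target"] - cond["noise"]

rng = np.random.default_rng(0)
noise = rng.normal(size=(16, 16, 16))   # dense structure noise tensor
target = np.ones((16, 16, 16))          # stand-in "clean" voxel structure
cond = {"noise": noise, "target": target}

recon = euler_sample(structure_velocity, noise.copy(), cond)
print(np.allclose(recon, target, atol=1e-6))  # → True
```

The same integration loop would drive the affordance flow, with the sparse noise tensor and a velocity field conditioned on the reconstructed structure and text embedding.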
2. Generative Multi-View Reconstruction via Sparse Voxel Fusion
Central to the framework is the conversion of multiple partial RGB-D observations into a unified, complete 3D object representation while maintaining constant computational cost per view.
2.1 Sparse Voxel Fusion
Dense visual features are extracted from each RGB image using DINOv2. Back-projection via depth and camera geometry yields 3D points

$$p = T_i\,\big(D_i(u, v)\, K_i^{-1} [u, v, 1]^{\top}\big)$$

for each pixel $(u, v)$, with $T_i$ the camera-to-world transform applied in homogeneous coordinates. Features are aggregated at overlapping voxels across all views (average pooling), with a 3D positional encoding appended.
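The fusion step above can be sketched in a few lines of NumPy. This is an illustrative pinhole-model implementation under assumed conventions (camera-to-world extrinsics, a unit-extent voxel grid); the helper names `backproject` and `fuse_voxels` are hypothetical, and the constant features stand in for DINOv2 outputs.

```python
import numpy as np

def backproject(depth, K, T):
    """Lift a depth map to world-space 3D points (pinhole model).

    depth: (H, W); K: (3, 3) intrinsics; T: (4, 4) camera-to-world extrinsics.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T            # camera-frame rays, z = 1
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ T.T)[:, :3]                # transform to world frame

def fuse_voxels(points, feats, grid_res=64, extent=1.0):
    """Average-pool per-point features into occupied sparse voxels."""
    idx = np.clip(((points / extent + 0.5) * grid_res).astype(int), 0, grid_res - 1)
    keys = idx[:, 0] * grid_res**2 + idx[:, 1] * grid_res + idx[:, 2]
    uniq, inv = np.unique(keys, return_inverse=True)
    pooled = np.zeros((len(uniq), feats.shape[1]))
    counts = np.zeros(len(uniq))
    np.add.at(pooled, inv, feats)              # unbuffered scatter-add
    np.add.at(counts, inv, 1)
    return uniq, pooled / counts[:, None]

# Toy view: a 4x4 depth map of a flat surface one unit from the camera.
K = np.array([[100.0, 0, 2], [0, 100.0, 2], [0, 0, 1]])
depth = np.ones((4, 4))
pts = backproject(depth, K, np.eye(4))
feats = np.ones((pts.shape[0], 8))             # stand-in DINOv2 features
keys, pooled = fuse_voxels(pts, feats)
print(pts.shape, pooled.shape)
```

Multi-view fusion follows by concatenating the back-projected points and features of all views before calling `fuse_voxels`, so points from different views landing in the same voxel are averaged together.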
2.2 Constant Token Complexity
By fusing all observed points into a single sparse voxel set of bounded size and running the dense flow transformer on a fixed number of tokens, the cross-attention and overall computational complexity remain $O(1)$ in the number of input views. Only the fusion step scales with view count.
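A quick sanity check of this property, under the same toy voxelization assumptions as above: the occupied-voxel (token) count is bounded by grid occupancy, so adding redundant views leaves it unchanged.

```python
import numpy as np

def fused_token_count(view_points, grid_res=64):
    """Occupied-voxel count after fusing any number of views' point clouds."""
    pts = np.concatenate(view_points, axis=0)
    idx = np.clip(((pts + 0.5) * grid_res).astype(int), 0, grid_res - 1)
    keys = idx[:, 0] * grid_res**2 + idx[:, 1] * grid_res + idx[:, 2]
    return len(np.unique(keys))

rng = np.random.default_rng(1)
surface = rng.uniform(-0.4, 0.4, size=(5000, 3))   # stand-in object surface
one_view = fused_token_count([surface])
five_views = fused_token_count([surface] * 5)      # fully redundant views
print(one_view, five_views)  # identical: tokens bounded by occupancy, not views
```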
2.3 Flow Transformer Architecture and Objective
The structure reconstruction uses a DiT backbone transformer (768 channels, 12 heads/blocks), cross-attending to the sparse voxel features. Conditional flow matching (CFM) is used as the loss:

$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\,x_0,\,x_1}\big[\lVert v_{\theta}(x_t, t, c) - (x_1 - x_0)\rVert^{2}\big], \qquad x_t = (1 - t)\,x_0 + t\,x_1,$$

where the time-dependent noisy input $x_t$ anneals from random noise $x_0$ at $t = 0$ to the true structure $x_1$ at $t = 1$, and $c$ denotes the fused sparse voxel conditioning.
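A minimal sketch of this training objective, assuming the standard linear (rectified-flow) interpolation path; the `oracle` velocity field is a stand-in for the DiT, constructed so the loss is exactly zero by definition.

```python
import numpy as np

def cfm_loss(velocity_fn, x0, x1, t, cond=None):
    """Flow-matching regression loss along a linear (rectified-flow) path.

    Along x_t = (1 - t) * x0 + t * x1 the velocity target is x1 - x0.
    """
    xt = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = velocity_fn(xt, t, cond)
    return float(np.mean((v_pred - v_target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8, 8))          # noise endpoint
x1 = rng.normal(size=(8, 8, 8))          # stand-in clean structure
oracle = lambda xt, t, cond: x1 - x0     # ideal velocity field for this pair
print(cfm_loss(oracle, x0, x1, t=0.3))   # → 0.0
```

In actual training, `x0`, `x1`, and `t` are resampled every step and the expectation is taken over the batch.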
3. Flow-Based Modeling of Affordance Grounding
Affordance grounding is implemented as a sparse flow-matching process over the reconstructed voxels.
3.1 Flow Field Representation and Conditional Generation
A sparse noise vector $a_0$ evolves under a learned velocity field toward the clean affordance logits $a_1$ (one per voxel). The model is cross-conditioned on the CLIP text embedding of the query.
3.2 Architecture and Loss Design
Affordance flow uses a transformer analogous to the structure stage, but with a single input channel and latent resolution 64, outputting the affordance logits $\hat{a}$. The supervision loss, combining binary cross-entropy and Dice on the decoded probabilities $p = \sigma(\hat{a})$, is

$$\mathcal{L}_{\mathrm{aff}} = \mathcal{L}_{\mathrm{BCE}}(p, a) + \mathcal{L}_{\mathrm{Dice}}(p, a).$$

The full flow-matching objective augments the velocity regression with this supervision:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CFM}} + \lambda\,\mathcal{L}_{\mathrm{aff}},$$

with weighting coefficient $\lambda$. Sampling from this generative model produces diverse, multimodal affordance heatmaps, reflecting the ambiguity intrinsic to the affordance grounding task.
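A self-contained sketch of a combined BCE + Dice supervision term on per-voxel logits, using standard formulations of both losses (the exact weighting and reductions in the paper may differ).

```python
import numpy as np

def bce_dice_loss(logits, targets, eps=1e-6):
    """Per-voxel binary cross-entropy plus Dice loss on sigmoid probabilities."""
    p = 1.0 / (1.0 + np.exp(-logits))
    bce = -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    dice = 1.0 - (2.0 * np.sum(p * targets) + eps) / (np.sum(p) + np.sum(targets) + eps)
    return float(bce + dice)

logits = np.array([10.0, -10.0, 10.0])   # confident, correct logits
targets = np.array([1.0, 0.0, 1.0])
good = bce_dice_loss(logits, targets)
bad = bce_dice_loss(-logits, targets)    # same confidence, flipped predictions
print(good < bad)  # → True: matching predictions score a much lower loss
```

The Dice term counteracts class imbalance (affordance regions typically cover a small fraction of the voxels), while BCE supervises each voxel independently.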
4. Affordance-Driven Active View Selection
Instead of passively acquiring additional views, the system leverages its predicted affordances to drive exploration, maximizing informativeness under a limited view budget.
4.1 View Scoring and Policy
The reconstructed mesh, colored by predicted affordances, is rendered from candidate poses. For each pose $\pi$, the 2D projection of the affordance map is scored by summing visible activations:

$$S(\pi) = \sum_{u \in \Omega(\pi)} \hat{A}_{\pi}(u),$$

where $\hat{A}_{\pi}$ is the rendered affordance image and $\Omega(\pi)$ its set of visible pixels. The next-best view is chosen by

$$\pi^{*} = \arg\max_{\pi}\, S(\pi).$$

The loop (acquire → fuse → reconstruct → re-ground affordances) is iterated up to 8 times; since each fusion step is fixed in token size, the system operates in real time.
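The scoring-and-selection policy reduces to a sum and an argmax. The sketch below substitutes a toy dictionary lookup for the actual mesh renderer; `score_view` and `next_best_view` are hypothetical helper names.

```python
import numpy as np

def score_view(render_fn, pose):
    """Sum visible affordance activations in a 2D render from the given pose."""
    heatmap = render_fn(pose)   # (H, W) rendered affordance, 0 where occluded
    return float(heatmap.sum())

def next_best_view(render_fn, candidate_poses):
    scores = [score_view(render_fn, p) for p in candidate_poses]
    return candidate_poses[int(np.argmax(scores))]

# Toy "renderer": each candidate pose exposes a different amount of affordance.
renders = {0: np.full((4, 4), 0.1), 1: np.full((4, 4), 0.9), 2: np.full((4, 4), 0.5)}
best = next_best_view(lambda p: renders[p], [0, 1, 2])
print(best)  # → 1 (the pose exposing the most affordance mass)
```

In the full loop, the chosen pose supplies the next RGB-D acquisition, which is fused into the voxel set before reconstruction and affordance grounding are rerun.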
5. Empirical Evaluation and Comparative Analysis
5.1 Datasets and Benchmarks
AffordanceDream trains on 3D-FUTURE, HSSD, and ABO (plus Affogato's train split for affordance), and evaluates on Toys4k for held-out 3D reconstruction and on the full test split of Affogato for affordance prediction.
5.2 Metrics
- Reconstruction: volumetric intersection-over-union (IoU), Chamfer Distance (CD), F-score @ 0.05, PSNR/LPIPS for color and normal images
- Affordance (Complete): average IoU (aIoU), AUC, similarity (SIM), mean absolute error (MAE)
- Affordance (Partial): multi-threshold aIoU/aCD
5.3 Results and Ablations
AffordanceDream records a 67.7% gain in 3D reconstruction IoU (32.67 vs 19.49 for TRELLIS) and a 40.4% improvement in complete-shape aIoU (19.1 vs 13.6 for Espresso-3D). In the partial-view regime, aIoU increases substantially with active view addition, outperforming random and sequential view policies.
| Method (Toys4k, 51 views) | IoU | CD | F-score |
|---|---|---|---|
| TRELLIS | 19.49 | 0.3694 | 0.0496 |
| MCC (depth-based) | 21.11 | 0.3299 | 0.0648 |
| AffordanceDream (ours) | 32.67 | 0.2427 | 0.0997 |
Ablations show that stochastic multi-view training (sampling 1–8 views per iteration) is essential for test-time robustness, and that affordance-driven view selection achieves monotonic aIoU gains, outperforming both random and sequential strategies.
5.4 Qualitative Analysis
- Reconstruction: Handles, holes, and complex thin structures are more accurately and smoothly completed than competing methods.
- Affordance Diversity: Multiple samples from the affordance flow yield distinctly valid affordance regions for the same object and action, evidencing the model’s capacity to capture uncertainty and ambiguity.
- Partial Scenes: Even from a single (visible) view, the method plausibly hallucinates unobserved but functional structures and places affordances appropriately.
- Active Loop Dynamics: Progressive accumulation of views measurably improves both geometry and affordance map quality.
6. Current Limitations and Prospects
The method operates on isolated objects. Extension to multi-object, cluttered, or articulated scenes is a clear future direction. Affordance-driven view selection employs affordance-rendered meshes for policy, rather than a fully learned planner; integrating active exploration and uncertainty estimation into an end-to-end differentiable system is a targeted area for progress. Finally, combining the affordance flow with downstream manipulation primitives is a prospective avenue for closing the perception-to-action loop in robotic systems.
7. Significance and Broader Implications
Generative completion of unseen geometry is shown to be essential for reliable affordance grounding under occlusion and sparse observations. The proposed sparse voxel fusion maintains constant transformer complexity per view, enabling practical scaling. Flow-based affordance modeling permits generation of multimodal, plausible affordance regions consistent with real-world ambiguity. Coupling with an affordance-driven active viewpoint policy yields significant efficiency gains, especially under constrained sensor budgets. These advances collectively contribute toward robust real-world 3D scene understanding and manipulation under occlusion (Park et al., 14 Jan 2026).