
Flow-Based Affordance Grounding in 3D Robotics

Updated 21 January 2026
  • The paper demonstrates that flow-based affordance grounding successfully generates probabilistic 3D affordance maps by learning coupled structure and affordance flows.
  • It fuses sparse voxel features across multiple RGB-D views using a transformer with a constant token count, enabling efficient 3D reconstruction.
  • The approach enables active view selection to iteratively improve reconstruction and affordance prediction under occlusion, outperforming baseline methods.

Flow-based affordance grounding refers to a paradigm in 3D vision and robotics that models the prediction of functional regions on an object’s surface—as specified by a natural language action query—using generative flow-based methods. Unlike standard discriminative predictors, this approach can output entire distributions over affordance maps, thereby capturing inherent ambiguities (e.g., multiple plausible grasp or cut locations). Contemporary methods ground affordances only on observable surfaces; in contrast, flow-based grounding, as instantiated in frameworks like AffordanceDream (also referred to as Affostruction), produces multimodal affordance distributions on the full, reconstructed 3D geometry, including parts unobserved in the sensor data (Park et al., 14 Jan 2026).

1. Problem Definition and Mathematical Formulation

Flow-based affordance grounding is defined on inputs of $N$ RGB-D views, each comprising an image $I_i \in \mathbb{R}^{H \times W \times 3}$, a depth map $D_i \in \mathbb{R}^{H \times W}$, camera intrinsics $K_i$, and extrinsics $T_i$. Given a query in natural language (e.g., "where to grasp?"), embedded via CLIP, the task is to output a probability distribution across the entire object's 3D surface, identifying likely affordance locations, even for regions never directly observed.

The complete geometry is represented as sparse 3D voxels $\{(p_m, z_m)\}_{m=1}^M$, with $p_m \in \mathbb{R}^3$ and $z_m \in \mathbb{R}^C$, where $z_m$ contains latent structure features. A binary mask $\{(p_m, a_m)\}$ with $a_m \in \{0, 1\}$ denotes the ground-truth affordance at voxel $p_m$.
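The sparse voxel representation above can be sketched as plain arrays; the values of $M$, $C$, and $r$ below are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

M, C = 2048, 8  # number of occupied voxels, feature channels (illustrative)
r = 16          # dense grid resolution, so the full grid has r**3 cells

rng = np.random.default_rng(0)

# Sparse structure: integer voxel coordinates plus a latent feature per voxel.
positions = rng.integers(0, r, size=(M, 3))             # p_m in {0..r-1}^3
features = rng.standard_normal((M, C)).astype(np.float32)  # z_m in R^C

# Ground-truth affordance mask: one binary label per occupied voxel.
affordance = rng.integers(0, 2, size=M).astype(np.float32)  # a_m in {0, 1}

print(f"{M} sparse voxels vs {r**3} dense grid cells")
```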

AffordanceDream learns two coupled flow-based models:

  • Structure flow: denoises a dense noise tensor $\mathbf{X} \in \mathbb{R}^{r^3 \times C}$ into the reconstructed voxels $\{p_m\}$, representing the full shape.
  • Affordance flow: denoises a sparse noise tensor $\mathbf{A} \in \mathbb{R}^M$ into affordance logits $\{a_m\}$, conditioned on the structure and the text embedding.

2. Generative Multi-View Reconstruction via Sparse Voxel Fusion

Central to the framework is the conversion of multiple partial RGB-D observations into a unified, complete 3D object representation while maintaining constant computational cost per view.

2.1 Sparse Voxel Fusion

Dense visual features $\mathbf{f}_i(u, v)$ are extracted from each RGB image using DINOv2. Back-projection via depth and camera geometry yields 3D points $p_{i,j}$:

$$p = T_i K_i^{-1} [u, v, 1]^\top \cdot d$$

Features are aggregated at overlapping voxels across all views (average pooling), with 3D positional encoding appended.
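The back-projection step can be sketched with NumPy. The pinhole intrinsics and identity extrinsics below are illustrative placeholders, not the paper's calibration; the extrinsics are applied as a 4×4 homogeneous camera-to-world transform.

```python
import numpy as np

def backproject(u, v, d, K, T):
    """Lift pixel (u, v) with depth d to a world-space point: p = T · K^{-1} [u, v, 1]^T · d."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # camera-space ray direction
    p_cam = ray * d                                 # scale by depth -> camera-space point
    p_hom = T @ np.append(p_cam, 1.0)               # apply 4x4 camera-to-world extrinsics
    return p_hom[:3]

# Illustrative pinhole intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
T = np.eye(4)  # identity extrinsics: camera frame == world frame

p = backproject(320.0, 240.0, 2.0, K, T)
print(p)  # ray through the principal point at depth 2 -> [0, 0, 2]
```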

2.2 Constant Token Complexity

By fusing all observed points into a sparse set ($M \ll r^3$, e.g., $M = 1024$–$4096$) and running a dense flow transformer on $r^3$ tokens (with $r = 16$), the cross-attention and computational complexity remain $O(1)$ in the number of input views. Only the fusion step scales with view count.
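The constant-token property can be illustrated with a toy fusion: however many views arrive, features are average-pooled into shared grid cells, so the transformer always sees $r^3 = 4096$ dense tokens. The point counts and feature sizes below are arbitrary, and this naive per-point loop stands in for the real fusion.

```python
import numpy as np

r = 16  # dense grid resolution; transformer token count is always r**3

def fuse_views(view_points, view_feats, r=16):
    """Average-pool per-view point features into shared voxel cells (toy fusion)."""
    cells = {}
    for pts, feats in zip(view_points, view_feats):    # fusion cost scales with #views...
        idx = np.clip((pts * r).astype(int), 0, r - 1)  # quantize points into the r^3 grid
        for (i, j, k), f in zip(idx, feats):
            cells.setdefault((i, j, k), []).append(f)
    return {c: np.mean(fs, axis=0) for c, fs in cells.items()}

rng = np.random.default_rng(0)
for n_views in (1, 4, 8):
    pts = [rng.random((500, 3)) for _ in range(n_views)]
    feats = [rng.standard_normal((500, 8)) for _ in range(n_views)]
    fused = fuse_views(pts, feats)
    # ...but the sparse set is capped by the grid, and the dense token count is fixed:
    assert len(fused) <= r**3
    print(n_views, "views ->", r**3, "transformer tokens")
```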

2.3 Flow Transformer Architecture and Objective

The structure reconstruction uses a DiT backbone transformer (768 channels, 12 heads and blocks), cross-attending to the sparse voxel features. Conditional flow matching (CFM) is used as the loss:

$$\mathcal{L}_\mathrm{CFM} = \mathbb{E}_{t, \mathbf{X}_0, \epsilon} \left\| v_\theta(\mathbf{X}_t, C_{\text{voxel}}, t) - (\epsilon - \mathbf{X}_0) \right\|_2^2$$

where the time-dependent noisy input $\mathbf{X}_t$ anneals from random noise to the true structure.
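A minimal NumPy sketch of the flow-matching target, assuming the standard linear interpolation path $\mathbf{X}_t = (1-t)\mathbf{X}_0 + t\,\epsilon$, under which the regression target is $\epsilon - \mathbf{X}_0$. The zero "network" below is a placeholder, not the paper's DiT, and the voxel conditioning is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4096, 8))  # clean structure tensor X_0 (r^3 = 4096 tokens, C = 8 illustrative)
eps = rng.standard_normal(x0.shape)  # Gaussian noise sample
t = 0.3                              # flow time in [0, 1]

x_t = (1 - t) * x0 + t * eps         # noisy input on the linear path
target = eps - x0                    # CFM regression target: d/dt of the path

def v_theta(x_t, t):
    """Placeholder velocity network; the real model also cross-attends to C_voxel."""
    return np.zeros_like(x_t)

# L_CFM for one (t, X_0, eps) sample, before the outer expectation.
loss = np.mean((v_theta(x_t, t) - target) ** 2)
print(f"flow-matching loss: {loss:.4f}")
```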

3. Flow-Based Modeling of Affordance Grounding

Affordance grounding is implemented as a sparse flow-matching process over the reconstructed voxels.

3.1 Flow Field Representation and Conditional Generation

A sparse noise vector $\mathbf{A}_t \in \mathbb{R}^M$ evolves under a learned velocity field $v_\phi(\mathbf{A}_t, C_{\text{text}}, t)$ toward the clean affordance logits $\mathbf{A}_0$ (one per voxel). The model is cross-conditioned on the CLIP text embedding of the query.
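Sampling from such a flow can be sketched as Euler integration of the velocity field from noise at $t = 1$ back to clean logits at $t = 0$. The velocity function below is a stand-in: it is the exact constant target $\epsilon - \mathbf{A}_0$ of the linear path, so the integration recovers $\mathbf{A}_0$ and the sketch can be checked; a learned $v_\phi$ would only approximate this.

```python
import numpy as np

def euler_sample(v_field, a1, n_steps=8):
    """Integrate dA/dt = v from t = 1 (noise) down to t = 0 (clean logits)."""
    a, dt = a1.copy(), 1.0 / n_steps
    for step in range(n_steps):
        t = 1.0 - step * dt
        a = a - dt * v_field(a, t)  # Euler step backwards in time
    return a

rng = np.random.default_rng(1)
a0_true = rng.standard_normal(256)  # hypothetical clean affordance logits (M = 256)
eps = rng.standard_normal(256)      # initial noise A_1

# Stand-in velocity: the constant CFM target along the linear path.
v_field = lambda a, t: eps - a0_true

a0_hat = euler_sample(v_field, eps)
print("max reconstruction error:", np.abs(a0_hat - a0_true).max())
```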

3.2 Architecture and Loss Design

The affordance flow uses a transformer analogous to the structure stage, but with a single input channel and latent resolution 64, outputting $\mathbf{A}_0$. The mask loss combines binary cross-entropy and Dice:

$$\mathcal{L}_\mathrm{mask}(A', A) = \mathrm{BCE}(A', A) + \mathrm{Dice}(A', A)$$

The full flow-matching objective is:

$$\mathcal{L}_\mathrm{CFM}^{\mathrm{aff}} = \mathbb{E}_{t, A_0, \epsilon} \left[ \mathcal{L}_\mathrm{mask}\bigl(\epsilon - v_\phi(A_t, C_{\text{text}}, t),\ A_0\bigr) \right]$$

Sampling from this generative model produces diverse, multimodal affordance heatmaps, reflecting the ambiguity intrinsic to the affordance grounding task.
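A NumPy sketch of the mask loss, assuming per-voxel probabilities (sigmoided logits) and a standard soft-Dice formulation; the smoothing constant and clipping epsilon are common defaults, not values from the paper.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy over per-voxel affordance probabilities."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|), smoothed for empty masks."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + smooth) / (np.sum(pred) + np.sum(target) + smooth)

def mask_loss(pred, target):
    """L_mask = BCE + Dice, as in the combined objective above."""
    return bce(pred, target) + dice_loss(pred, target)

target = np.array([1.0, 1.0, 0.0, 0.0])
good = np.array([0.9, 0.8, 0.1, 0.2])  # prediction close to the mask
bad = np.array([0.1, 0.2, 0.9, 0.8])   # prediction inverted relative to the mask
print(mask_loss(good, target), "<", mask_loss(bad, target))
```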

4. Affordance-Driven Active View Selection

Instead of passively acquiring additional views, the system leverages its predicted affordances to drive exploration, maximizing informativeness under a limited view budget.

4.1 View Scoring and Policy

The reconstructed mesh, colored by predicted affordances, is rendered from $K$ candidate poses. For each pose $\pi_k$, the 2D projection of affordance is scored by summing visible activations:

$$S(\pi_k; \mathcal{M}) = \sum_{u,v} A^{(k)}_{\mathrm{render}}(u, v)$$

The next-best view is chosen as:

$$\pi^* = \arg\max_{k = 1 \ldots K} S(\pi_k; \mathcal{M})$$

The loop (acquire → fuse → reconstruct → re-ground affordances) is iterated up to 8 times, with each fusion step fixed in token size, enabling real-time operation.
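The scoring rule reduces to an argmax over summed rendered activations. The "rendered" maps below are synthetic placeholders standing in for actual mesh renders, with one pose constructed to expose the most predicted affordance.

```python
import numpy as np

def next_best_view(rendered_affordances):
    """Pick the candidate pose whose rendered affordance map has the largest summed activation."""
    scores = [float(a.sum()) for a in rendered_affordances]  # S(pi_k; M) per pose
    return int(np.argmax(scores)), scores                    # pi* = argmax_k S(pi_k; M)

rng = np.random.default_rng(2)
K = 5
# Synthetic rendered affordance maps; pose 3 is made to dominate by construction.
renders = [rng.random((64, 64)) * 0.1 for _ in range(K)]
renders[3] = renders[3] + 0.5

best, scores = next_best_view(renders)
print("next-best view:", best)
```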

5. Empirical Evaluation and Comparative Analysis

5.1 Datasets and Benchmarks

AffordanceDream trains on 3D-FUTURE, HSSD, and ABO (plus Affogato's train split for affordance), evaluates held-out 3D reconstruction on Toys4k, and tests affordance prediction on the full Affogato test split.

5.2 Metrics

  • Reconstruction: volumetric intersection-over-union (IoU), Chamfer Distance (CD), F-score @ 0.05, PSNR/LPIPS for color and normal images
  • Affordance (Complete): average IoU (aIoU), AUC, similarity (SIM), mean absolute error (MAE)
  • Affordance (Partial): multi-threshold aIoU/aCD

5.3 Results and Ablations

AffordanceDream records a 67.7% gain in 3D reconstruction IoU (32.67 vs 19.49 for TRELLIS) and a 40.4% improvement in complete-shape aIoU (19.1 vs 13.6 for Espresso-3D). In the partial-view regime, aIoU increases substantially with active view addition, outperforming random and sequential view policies.

| Method (Toys4k, 51 views) | IoU ↑ | CD ↓ | F-score ↑ |
|---|---|---|---|
| TRELLIS | 19.49 | 0.3694 | 0.0496 |
| MCC (depth-based) | 21.11 | 0.3299 | 0.0648 |
| AffordanceDream (ours) | 32.67 | 0.2427 | 0.0997 |

Ablations show that stochastic multi-view training (sampling 1–8 views per training iteration) is essential for test-time robustness, and that affordance-driven view selection achieves monotonic aIoU gains, outperforming both random and sequential strategies.

5.4 Qualitative Analysis

  • Reconstruction: Handles, holes, and complex thin structures are more accurately and smoothly completed than competing methods.
  • Affordance Diversity: Multiple samples from the affordance flow yield distinctly valid affordance regions for the same object and action, evidencing the model’s capacity to capture uncertainty and ambiguity.
  • Partial Scenes: Even from a single (visible) view, the method plausibly hallucinates unobserved but functional structures and places affordances appropriately.
  • Active Loop Dynamics: Progressive accumulation of views measurably improves both geometry and affordance map quality.

6. Current Limitations and Prospects

The method operates on isolated objects. Extension to multi-object, cluttered, or articulated scenes is a clear future direction. Affordance-driven view selection employs affordance-rendered meshes for policy, rather than a fully learned planner; integrating active exploration and uncertainty estimation into an end-to-end differentiable system is a targeted area for progress. Finally, combining the affordance flow with downstream manipulation primitives is a prospective avenue for closing the perception-to-action loop in robotic systems.

7. Significance and Broader Implications

Generative completion of unseen geometry is shown to be essential for reliable affordance grounding under occlusion and sparse observations. The proposed sparse voxel fusion maintains constant transformer complexity per view, enabling practical scaling. Flow-based affordance modeling permits generation of multimodal, plausible affordance regions consistent with real-world ambiguity. Coupling with an affordance-driven active viewpoint policy yields significant efficiency gains, especially under constrained sensor budgets. These advances collectively contribute toward robust real-world 3D scene understanding and manipulation under occlusion (Park et al., 14 Jan 2026).
