Instance-Aware NBV Policy

Updated 10 February 2026

Instance-aware NBV is a framework that guides sensor viewpoint selection using semantic labels and spatial masks for targeted tasks.
It employs attention mechanisms such as spatial masking and object confidence weighting to prioritize regions, reducing the number of sensor views required and improving reconstruction accuracy.
The approach uses closed-loop routines with candidate view sampling and can be integrated with manipulation affordances, supporting efficient active perception on real robots.

An instance-aware Next Best View (NBV) policy is a class of active perception frameworks in which sensor viewpoint selection is explicitly guided by knowledge about discrete objects or task-specific regions of interest (ROIs) in a scene. Unlike scene-agnostic NBV planners that treat all spatial volumes as equally important, instance-aware methods bias the information gain or utility metric toward those regions or object instances that are most salient for reconstruction, recognition, or manipulation objectives. This paradigm has become influential in robotics and computer vision applications such as targeted 3D reconstruction, category-specific scene understanding, and instance-driven manipulation under severe occlusions, with notable developments spanning structured occupancy maps, implicit representations, and 3D Gaussian splatting.

1. Problem Formulation and Foundational Principles

Instance-aware NBV planners maintain a scene representation, often a volumetric structure such as an octree (OctoMap) or a truncated signed distance function (TSDF), augmented with semantic or instance labels to reflect task-centric priorities. At each acquisition step, the system samples a set of candidate camera poses $\mathcal{V}$ and, given a goal-directed region $\mathcal{B}$ (e.g., a bounding box around a target object or plant part), selects the next viewpoint $\xi^*\in\mathcal{V}$ with maximal expected utility, typically formalized as information gain or uncertainty reduction focused on $\mathcal{B}$.

For example, in targeted 3D plant reconstruction, the planner defines a spatial mask $w(x) = \mathbf{1}_{x\in\mathcal{B}}$ and restricts the information gain

$\mathrm{IG}_\mathrm{att}(\xi) = \sum_{x\in \mathcal{X}_\xi} w(x)\, I_v(x) = \sum_{x\in (\mathcal{X}_\xi\cap \mathcal{B})} I_v(x)$

to only those voxels relevant to the reconstruction objective. Here $\mathcal{X}_\xi$ is the set of voxels observed from viewpoint $\xi$, and $I_v(x) = -p_o(x)\log p_o(x) - (1-p_o(x))\log(1-p_o(x))$ is the Shannon entropy of the posterior occupancy $p_o(x)$ (Burusa et al., 2022). This instance-driven formulation generalizes across domains by substituting other utility criteria or representation forms (e.g., Gaussian instance probabilities or grasp affordances).
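The masked information gain can be illustrated with a small sketch. It assumes a sparse voxel map stored as a dictionary from integer voxel coordinates to occupancy probabilities, and an axis-aligned box as the region $\mathcal{B}$. The function names and the data layout are hypothetical and not taken from the cited work.

```python
import math

def occupancy_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli occupancy probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully known voxel carries no remaining uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def in_box(x, box):
    """True if voxel coordinate x lies inside the axis-aligned box (lo, hi)."""
    lo, hi = box
    return all(l <= c <= h for c, l, h in zip(x, lo, hi))

def attention_gain(visible_voxels, occupancy, box):
    """Sum voxel entropies over visible voxels that fall inside the ROI box.

    This is the masked sum: voxels outside the box get weight w(x) = 0.
    """
    return sum(
        occupancy_entropy(occupancy[x])
        for x in visible_voxels
        if in_box(x, box)
    )

# Toy map (assumed values): three voxels near the target, one far away.
occupancy = {(0, 0, 0): 0.5, (1, 0, 0): 0.5, (0, 1, 0): 0.9, (5, 5, 5): 0.5}
box = ((0, 0, 0), (1, 1, 1))            # region of interest B
visible = [(0, 0, 0), (0, 1, 0), (5, 5, 5)]  # voxels seen from one candidate view

gain = attention_gain(visible, occupancy, box)
print(f"masked information gain: {gain:.3f} bits")
```

The voxel at `(5, 5, 5)` is maximally uncertain but lies outside the box, so it contributes nothing. A scene-agnostic planner would count it.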

2. Information Gain Metrics and Attention Mechanisms

The core distinction of instance-aware NBV is the integration of attention mechanisms—explicit or implicit weighting of the utility metric by instance or region relevance. In the classical volumetric NBV setting, expected utility is computed as a global sum of per-voxel information measures (entropy, occupancy variance, or similar), which the attention mechanism modifies to prioritize instance-specific goals.

Different formulations include:

Spatial Masking: As in (Burusa et al., 2022), attention is applied by setting information gain weights to zero outside the designated ROI, rendering the NBV computation instance-centric.
Object Confidence Weighting: In object-centric 3D Gaussian Splatting scenarios, per-Gaussian confidence weights $c_g$ are computed based on object vector probabilities and the opacity of each Gaussian, strongly amplifying the contribution of underexplored or uncertain object instances. The weighted Fisher information matrix

$H_V = \sum_{v\in V} J_v^\top C J_v$

accumulates the per-view Jacobians $J_v$ over a view set $V$ under the confidence weighting $C$, and thereby selectively increases information gain for Gaussians assigned to the target object (Jeong et al., 9 Feb 2026).

Recognition-Driven and Detected-Instance Priors: In scene autoscanning, instance segmentation and objectness scores direct exploration and subsequent NBV planning toward the object of interest, as in the object-aware framework that alternates Next-Best-Object (NBO) and local NBV selection (Liu et al., 2018).
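The weighted accumulation $H_V = \sum_{v\in V} J_v^\top C J_v$ can be sketched numerically. The text does not specify how the per-Gaussian confidences $c_g$ populate $C$, so this sketch assumes that $C$ is diagonal with one weight per Jacobian row. The function name and the toy numbers are hypothetical.

```python
def weighted_information(jacobians, weights):
    """Accumulate H = sum_v J_v^T C J_v with C = diag(weights).

    Each J_v is a list of rows (one row per residual, one column per
    parameter). ASSUMPTION: row i of every J_v is scaled by weights[i].
    """
    n_params = len(jacobians[0][0])
    H = [[0.0] * n_params for _ in range(n_params)]
    for J in jacobians:
        for row, c in zip(J, weights):
            # Add the weighted outer product c * row^T row.
            for a in range(n_params):
                for b in range(n_params):
                    H[a][b] += c * row[a] * row[b]
    return H

# Two toy views with two residual rows and two parameters each.
J1 = [[1.0, 0.0], [0.0, 2.0]]
J2 = [[1.0, 1.0], [0.0, 0.0]]
confidence = [1.0, 0.5]   # second row is down-weighted

H = weighted_information([J1, J2], confidence)
print(H)
```

Raising a weight scales the corresponding rows' contribution to $H_V$, which is how a confidence term can steer the gain toward a target object. How the cited planner reduces $H_V$ to a scalar view score is not covered here.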

3. Algorithmic Frameworks and Computational Aspects

Instance-aware NBV policies are instantiated as closed-loop routines that iteratively update the scene and instance representations and greedily select the next camera pose. Algorithmic templates across domains share several features:

Candidate View Sampling: Views are generated on geometric shells or hemispheres around the instance, constrained by robot reachability, collision checks, and desired coverage granularity ($N$ = 16 to 100 candidates across studies).
Greedy One-Step Lookahead: Each step scores every candidate and selects the best one. The computational cost per step scales with the number of candidate views $N$, the number of traced rays $R$ per candidate, and the number of voxels $D$ per ray, giving $O(NRD)$ for volumetric ray tracing (Burusa et al., 2022), or with the cost of projected information gain in Gaussian parameter space for 3DGS-based planners (Jeong et al., 9 Feb 2026).
Instance Conditioning: In all methods, the information gain (either directly via masks or indirectly via instance weights/confidences) is conditioned on the current segmentation, instance assignment, or affordance prediction to ensure task relevance.
Integration with Downstream Tasks: In manipulation, view selection is tied to predicted grasp success rates or other manipulation-oriented affordances (Zhang et al., 2023, Breyer et al., 2022).
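Candidate view sampling on a hemisphere can be sketched as follows. The Fibonacci-lattice spacing is one common way to spread points evenly and is an assumption here, not the method of any cited paper. `is_reachable` is a hypothetical stand-in for real reachability and collision checks.

```python
import math

def sample_hemisphere_views(center, radius, n_views):
    """Return n_views camera positions on the upper hemisphere around center."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    views = []
    for i in range(n_views):
        z = (i + 0.5) / n_views       # normalized height in (0, 1): upper half only
        r = math.sqrt(1.0 - z * z)    # ring radius at that height on the unit sphere
        theta = golden_angle * i      # rotate each ring by the golden angle
        views.append((
            center[0] + radius * r * math.cos(theta),
            center[1] + radius * r * math.sin(theta),
            center[2] + radius * z,
        ))
    return views

def is_reachable(position, min_height=0.05):
    """Hypothetical filter: reject views below a minimum height.

    A real system would query inverse kinematics and a collision checker.
    """
    return position[2] >= min_height

target = (0.0, 0.0, 0.2)   # assumed instance centre, in metres
candidates = [v for v in sample_hemisphere_views(target, 0.5, 16) if is_reachable(v)]
print(len(candidates), "candidate views")
```

Each surviving position would then be paired with an orientation that looks at the instance centre before being scored by the information gain.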

The major references provide pseudocode to support reproduction, and the NBV cycles are designed for low-latency execution so that they can run inside a robot's control loop.
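A closed-loop greedy routine of this kind can be sketched in a few lines. The sketch is a toy under stated assumptions and does not reproduce the pseudocode of the cited references: visibility is precomputed as a set of voxel ids per view, and observing a voxel is assumed to remove all of its uncertainty. All names are hypothetical.

```python
def greedy_nbv(candidates, visible_from, uncertainty, roi, max_views=10, min_gain=1e-6):
    """Greedy one-step lookahead restricted to a region of interest.

    candidates:   list of view ids
    visible_from: dict mapping view id -> set of voxel ids seen from that view
    uncertainty:  dict mapping voxel id -> current uncertainty (e.g. entropy)
    roi:          set of voxel ids belonging to the target instance
    """
    uncertainty = dict(uncertainty)   # work on a copy; the caller's map is untouched
    remaining = list(candidates)
    chosen = []

    def gain(view):
        # Instance-aware utility: only ROI voxels visible from this view count.
        return sum(uncertainty[x] for x in visible_from[view] & roi)

    while remaining and len(chosen) < max_views:
        best = max(remaining, key=gain)
        if gain(best) <= min_gain:
            break                     # nothing left to learn about the ROI
        chosen.append(best)
        remaining.remove(best)
        for x in visible_from[best]:
            uncertainty[x] = 0.0      # toy sensor model: observed voxels become certain
    return chosen

visible_from = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
uncertainty = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
roi = {1, 2, 3}

plan = greedy_nbv(["a", "b", "c"], visible_from, uncertainty, roi)
print(plan)
```

View `"c"` is never selected because the only voxel it sees lies outside the region of interest, and the loop stops as soon as no candidate offers gain above `min_gain`.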

4. Representative Domains and Applications

Instance-aware NBV has been successfully applied in several challenging robotics and vision contexts:

Targeted Plant Phenotyping: View selectors explicitly focus on reconstructing specific organs (e.g., main stem, leaf nodes) in occlusion-rich environments, yielding a reduction in required sensor movements (1–3 fewer views) and increases in F1-score by 6–26% for targeted parts (Burusa et al., 2022).
Object-centric 3D Scanning: NBV and NBO cycles are tightly interleaved, segmenting scenes into instances and adapting both object selection and local view planning to maximize reconstruction completeness and recognition accuracy (Liu et al., 2018).
Manipulation in Clutter: Confidence-weighted gain metrics in object-aware 3DGS reduce depth error for specific objects by roughly 25–77% compared to scene-agnostic approaches, and these gains carry over to grasp success when the reconstruction is used as input for manipulation policies (Jeong et al., 9 Feb 2026). Affordance-driven NBV further optimizes for task-oriented grasp feasibility, integrating implicit neural scene representations and predicted grasp qualities per candidate view (Zhang et al., 2023).
Reactive Grasp-Driven NBV: Closed-loop policies for robot manipulation integrate TSDF-based instance awareness and NBV planning driven by occluded voxel counts to achieve high grasp success rates and reduced execution times under occlusions (Breyer et al., 2022).

5. Evaluation Protocols and Quantitative Performance

Instance-aware NBV frameworks are evaluated on two primary axes: reconstruction accuracy (point cloud F1, depth MAE, PSNR/SSIM/LPIPS) and task efficiency (number of views acquired, success rate in actual manipulation or recognition tasks).

Notable results include:

Instance-centric NBV in greenhouse plant phenotyping achieves F1 gains up to 25.9% and reduces necessary views by 1–3, both in simulation and complex real-world settings (Burusa et al., 2022).
Object-centric NBV in 3DGS reduces depth MAE by up to 77.14% on synthetic data and 34.10% on real-world data compared to geometric or random baselines. Object-targeted NBV further improves depth accuracy on the selected object by 25.60% relative to whole-scene NBV (Jeong et al., 9 Feb 2026).
Affordance-driven policies find higher-quality grasps with fewer views compared to geometry-driven or fixed-view baselines, completing cluttered manipulation tasks with fewer aborts and faster cycle times (Zhang et al., 2023, Breyer et al., 2022).
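Two of the reconstruction metrics named above, point cloud F1 and depth MAE, can be computed as in the following sketch. It uses a common distance-threshold definition of F1 and ignores pixels without valid ground-truth depth. The exact protocols of the cited papers are not given here, so the threshold, the validity rule, and the function names are assumptions.

```python
import math

def point_cloud_f1(reconstructed, reference, threshold):
    """F1 score between two point clouds at a given distance threshold.

    Precision: fraction of reconstructed points within threshold of the reference.
    Recall:    fraction of reference points within threshold of the reconstruction.
    Brute-force nearest-neighbour search; fine for small illustrative clouds.
    """
    if not reconstructed or not reference:
        return 0.0

    def near(p, cloud):
        return any(math.dist(p, q) <= threshold for q in cloud)

    precision = sum(near(p, reference) for p in reconstructed) / len(reconstructed)
    recall = sum(near(q, reconstructed) for q in reference) / len(reference)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def depth_mae(predicted, ground_truth):
    """Mean absolute depth error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(predicted, ground_truth) if g > 0.0]
    if not pairs:
        return float("nan")
    return sum(abs(p - g) for p, g in pairs) / len(pairs)

reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reconstructed = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.01), (5.0, 5.0, 5.0), (9.0, 9.0, 9.0)]

f1 = point_cloud_f1(reconstructed, reference, threshold=0.05)
mae = depth_mae([1.0, 2.5, 3.0], [1.0, 2.0, 0.0])   # third pixel has no valid depth
print(f"F1 = {f1:.3f}, depth MAE = {mae:.3f}")
```

For an instance-aware evaluation, both clouds and both depth maps would first be cropped to the target instance, so the score reflects the region the planner was asked to prioritize.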

6. Connections, Variations, and Future Prospects

Contemporary trends in instance-aware NBV reflect a convergence between explicit geometric attention (volumetric masking, instance segmentation) and learned, representation-driven approaches that embed instance cues in generative models (Gaussian splatting, neural SDFs). A key trajectory is the increasing coupling of NBV with semantic understanding, manipulation affordances, and reinforcement objectives optimized for specific tasks beyond general 3D completeness.

Remaining challenges include scaling to dynamic multi-object scenes, incorporating open-world instance recognition (e.g., CLIP-driven NBV (Jeong et al., 9 Feb 2026)), and extending existing frameworks to real-time, closed-loop settings with tight computational and latency constraints. The domain is marked by a strong interplay between algorithmic efficiency, task specificity, and the robustness of attention mechanisms in diverse physical environments.