One-Shot Multimodal Active Perception
- One-shot multimodal active perception is a single-step method that fuses 2D silhouette and 3D geometric cues to predict optimal camera viewpoints.
- The framework decouples the evaluation of viewpoint quality from the learning model, enabling versatile application across various robotic manipulation tasks.
- A cross-attention transformer architecture integrates multimodal inputs to directly regress six-DoF camera adjustments, significantly boosting grasp success rates.
A one-shot multimodal active perception framework for robotic manipulation refers to the approach wherein a robot, from a single observation, predicts in one inference step the camera motion required to reach an optimal, task-informative viewpoint—eschewing multi-step, closed-loop next-best-view strategies. The multimodal aspect denotes the integration of distinct perceptual inputs, typically a 2D object mask and a 3D point cloud, enabling exploitation of both silhouette and geometric cues. Decoupling the viewpoint quality evaluation from the learning architecture confers broad applicability across manipulation tasks and allows task-agnostic dataset generation and training. A cross-attention-based multimodal network architecture is used to directly regress the required camera pose adjustment, and the approach is validated on robotic grasping in both simulation and real-world environments, exhibiting substantial improvements in task success rates and robust sim-to-real transfer without further tuning (Qin et al., 20 Jan 2026).
1. Formal Problem and Approach Overview
One-shot active perception reduces the problem of viewpoint planning to single-step inference: given a current input, the system predicts a six-DoF camera adjustment, avoiding iterative candidate evaluation or sequential next-best-view methods. Multimodal fusion is adopted to leverage object geometry (from depth-based point clouds) and appearance or segmentation cues (from mask images).
Data Collection Pipeline
The pipeline is organized in two primary phases:
- Systematic Candidate Viewpoint Sampling: For each object, legal camera poses are sampled on an admissible viewing manifold (e.g., side views).
- For each pose, the robot collects RGB-D, precise camera pose, and a semantic mask computed via a detector/segmenter.
- Viewpoint Quality Evaluation & Domain Randomization: Each observation is assigned a viewscore by a task-specific function, commonly instantiated (for grasping) as the average grasp quality metric $S(v) = \frac{1}{K}\sum_{k=1}^{K} q_k(v)$, where $q_k(v)$ is the grasp model's output for the $k$-th grasp candidate from view $v$.
- The top-$N$ viewpoints by viewscore are clustered (e.g., via DBSCAN) and the cluster centroid is annotated as the optimal viewpoint $v^{*}$.
- This is repeated under heavy domain randomization (varying object type, pose, scale, illumination, background, and camera placement), yielding a synthetic dataset of (observation, optimal-viewpoint) tuples.
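The labeling phase above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: function and parameter names are assumptions, and a naive single-pass density grouping stands in for DBSCAN.

```python
import numpy as np

def mine_optimal_viewpoint(view_positions, grasp_scores, top_n=10, eps=0.15):
    """Label one sample: average per-view grasp quality, keep the top-N views,
    group nearby survivors (a crude stand-in for DBSCAN), and return the
    densest group's centroid as the optimal viewpoint v*.
    view_positions: (V, 3) candidate camera positions.
    grasp_scores:   (V, K) per-view scores of K grasp candidates."""
    viewscore = grasp_scores.mean(axis=1)          # S(v) = (1/K) sum_k q_k(v)
    top = np.argsort(viewscore)[-top_n:]           # indices of the top-N views
    pts = view_positions[top]
    # Naive density grouping: each unlabeled point within eps of point i
    # inherits i's group label.
    labels = -np.ones(len(pts), dtype=int)
    next_label = 0
    for i in range(len(pts)):
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
        close = np.linalg.norm(pts - pts[i], axis=1) < eps
        labels[close & (labels == -1)] = labels[i]
    densest = np.bincount(labels).argmax()         # largest group wins
    return pts[labels == densest].mean(axis=0)     # its centroid -> v*
```

Because the quality scores, not the network, define the label, the same mining routine can be reused for any task by swapping the score source.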
2. Optimal Viewpoint: Definition and Quality Metrics
The framework defines the optimal viewpoint relative to a set of sampled candidates $\{v_i\}$. Each candidate is ranked using the viewscore
$S(v_i) = \frac{1}{K} \sum_{k=1}^{K} q_k(v_i),$
where the $q_k(v_i)$ are model outputs (e.g., grasp scores). The optimal viewpoint $v^{*}$ is then selected as the centroid of the cluster formed by the top-$N$ candidates, with $N$ and the clustering parameters chosen for robustness to sampling noise.
Crucially, the viewscore $S(\cdot)$ is external to the perception network. For different downstream requirements (e.g., maximizing visible object surface area, minimizing pose uncertainty), any deterministic or learned task metric can be substituted, yielding an optimal-view definition matched to arbitrary manipulation objectives. This modularity enables direct re-application of the pipeline to new tasks without architecture or dataset changes.
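The decoupling can be made concrete: the viewscore is just a callable handed to the labeling pipeline, so switching tasks means switching the callable. The names below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def grasp_viewscore(obs):
    """S(v) for grasping: mean predicted grasp quality over candidates."""
    return float(np.mean(obs["grasp_scores"]))

def coverage_viewscore(obs):
    """S(v) for observability: fraction of the object's surface visible."""
    return obs["visible_points"] / obs["total_points"]

def rank_views(observations, viewscore):
    """Rank candidate views by any task-specific viewscore, highest first.
    The ranking machinery never needs to know which task S encodes."""
    return sorted(observations, key=viewscore, reverse=True)
```

The same `rank_views` (and, downstream, the same dataset-mining and network-training machinery) serves both tasks; only the metric changes.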
3. Multimodal Optimal Viewpoint Prediction Network (MVPNet)
Inputs and Feature Encoding
- 2D Input: Binary mask of the target object (segmentation via grounding detector plus segmenter).
- 3D Input: Masked point cloud from the depth image.
Separate encoders process each stream:
- Mask: ResNet-18 backbone producing a 2D token map.
- Point Cloud: Dual-stream PointNet++ and PointNeXt producing two sets of 3D tokens.
Tokens from both modalities are concatenated; a learnable [CLS] token is prepended.
Cross-Attention Transformer Fusion
The multimodal token sequence is processed through $L$ layers of a Transformer encoder. Within each layer, cross-attention aligns the 2D and 3D representations:
$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V,$
where $Q = XW_Q$, $K = XW_K$, $V = XW_V$, and $X$ contains concatenated tokens from all input modalities.
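A minimal single-head sketch of this fusion step, assuming random stand-ins for the learned projection matrices and omitting residual connections, LayerNorm, and the feed-forward sublayer:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_tokens(mask_tokens, pc_tokens, Wq, Wk, Wv):
    """One attention pass over the concatenated multimodal sequence.
    A zero [CLS] token is prepended (learnable in practice); Q, K, V all
    derive from the same concatenated sequence, so every token can attend
    across modalities."""
    d = mask_tokens.shape[1]
    cls = np.zeros((1, d))                                     # [CLS] placeholder
    X = np.concatenate([cls, mask_tokens, pc_tokens], axis=0)  # (1+N2+N3, d)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))                 # attention weights
    return A @ V                                               # fused tokens; row 0 is [CLS]
```

After $L$ such layers (with the usual residual/MLP machinery), row 0 of the output — the updated [CLS] feature — is what feeds the pose-regression head.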
Pose Regression
After $L$ Transformer layers, the updated [CLS] token feature is input to an MLP that regresses the six-DoF camera adjustment $\Delta T = (\Delta t, \Delta q)$, with the output parameterized as a relative translation $\Delta t \in \mathbb{R}^{3}$ and a unit-quaternion rotation $\Delta q$.
4. Training Strategy and Synthetic Dataset Construction
Loss Functions
Supervision jointly penalizes translation and rotation errors, $\mathcal{L} = \mathcal{L}_{\mathrm{trans}} + \lambda\,\mathcal{L}_{\mathrm{rot}}$, with an $L_2$ translation term $\mathcal{L}_{\mathrm{trans}} = \lVert \widehat{\Delta t} - \Delta t^{*} \rVert_2$ and a quaternion-distance rotation term $\mathcal{L}_{\mathrm{rot}}$, where $(\widehat{\Delta t}, \widehat{\Delta q})$ are the predicted camera adjustments and $(\Delta t^{*}, \Delta q^{*})$ the ground truth.
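One plausible instantiation of this joint loss (the paper's exact rotation term may differ): L2 on translation, and $1 - |\langle \widehat{\Delta q}, \Delta q^{*} \rangle|$ on unit quaternions, which is sign-invariant since $q$ and $-q$ encode the same rotation.

```python
import numpy as np

def pose_loss(t_pred, q_pred, t_gt, q_gt, lam=1.0):
    """Joint translation + rotation loss (a sketch, not the paper's exact form).
    Quaternions are normalized to unit length before comparison, and the
    absolute inner product makes the term invariant to quaternion sign."""
    l_trans = np.linalg.norm(t_pred - t_gt)           # L2 translation error
    q_pred = q_pred / np.linalg.norm(q_pred)
    q_gt = q_gt / np.linalg.norm(q_gt)
    l_rot = 1.0 - abs(np.dot(q_pred, q_gt))           # 0 when rotations agree
    return l_trans + lam * l_rot
```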
Dataset Generation
Synthetic data is rendered using NVIDIA Isaac Sim, with approximately 17,000 samples created with extensive domain randomization spanning object diversity, spatial configurations, illumination, backgrounds, and initial camera positions. For every sample:
- The optimal viewpoint is determined as per Section 2.
- The training target is the relative transform $\Delta T$ taking the current camera pose to the optimal pose $v^{*}$, with rotational component $\Delta q = q_{\mathrm{cur}}^{-1} \otimes q^{*}$, $\otimes$ denoting quaternion composition.
No human-annotated labels are necessary; all supervision arises through synthetic optimal-viewpoint mining.
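The relative-transform target reduces to elementary quaternion algebra. The sketch below assumes the Hamilton (w, x, y, z) convention and expresses the translation in the current camera frame — one plausible convention; the paper's exact frame convention is not restated here.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_conj(q):
    """Conjugate (= inverse for unit quaternions)."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def quat_rotate(q, v):
    """Rotate vector v by unit quaternion q via q * (0, v) * q^-1."""
    qv = np.concatenate([[0.0], v])
    return quat_mul(quat_mul(q, qv), quat_conj(q))[1:]

def relative_target(t_cur, q_cur, t_opt, q_opt):
    """Training target: transform from the current pose to the mined optimal
    pose, with translation re-expressed in the current camera frame."""
    dq = quat_mul(quat_conj(q_cur), q_opt)             # q_cur^-1 (x) q*
    dt = quat_rotate(quat_conj(q_cur), t_opt - t_cur)  # translation in current frame
    return dt, dq
```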
5. Implementation in Robotic Grasping Pipelines
Perception-to-Action Workflow
At deployment, the decision loop comprises:
- Acquire RGB-D from a random side-view.
- Segment the scene to generate the target mask (Grounding DINO, SAM 2).
- Either directly attempt grasping from this initial view (the baseline), recording its success rate, or:
- Process the mask and point cloud through MVPNet, predict , physically reposition the camera, reacquire perception.
- Rerun grasp detection on the optimal view and record the improved success rate.
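The decision loop above can be written down as a short script. Every function here is a hypothetical stub standing in for a real component (Grounding DINO + SAM 2 segmentation, MVPNet inference, camera motion control, a grasp detector); none of these signatures come from the paper.

```python
import numpy as np

# --- Stubs for the real perception/control components (all hypothetical) ---
def acquire_rgbd():             return np.zeros((480, 640, 4))            # camera read
def segment_target(rgbd):       return rgbd[..., 0] > -1.0                # binary mask
def mask_point_cloud(rgbd, m):  return np.zeros((1024, 3))                # back-projection
def mvpnet(mask, pc):           return np.zeros(3), np.array([1.0, 0, 0, 0])  # (dt, dq)
def move_camera(dt, dq):        pass                                      # controller
def detect_grasps(rgbd, m):     return [{"score": 0.9}]                   # grasp model

def one_shot_grasp():
    """Deployment loop: one view, one network call, one camera move, grasp."""
    rgbd = acquire_rgbd()                  # RGB-D from a random side view
    mask = segment_target(rgbd)
    pc = mask_point_cloud(rgbd, mask)
    dt, dq = mvpnet(mask, pc)              # single-step viewpoint prediction
    move_camera(dt, dq)                    # physically reposition the camera
    rgbd = acquire_rgbd()                  # reacquire from the optimal view
    mask = segment_target(rgbd)
    return max(detect_grasps(rgbd, mask), key=lambda g: g["score"])
```

Note that MVPNet is called exactly once per grasp attempt; there is no iterative view refinement, which is the defining contrast with next-best-view loops.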
Experimental Evaluation
- Simulation: Evaluated with a Franka arm and wrist-mounted Intel RealSense in Isaac Sim, over 250 randomized grasp trials with 65 similar and 18 novel objects. Metrics include first-try success (SR-1) and five-best average (SR-5). For the Economic Grasp baseline:
- SR-1 increases from 53.4% to 64.0% (similar objects), from 48.0% to 64.7% (novel objects).
- Mean SR-1 improvement is +10.2%, SR-5 improvement is +14.0%.
- Real-World: Without post-simulation fine-tuning, grasp success improves from 25.5% to 47.6%; declutter rate rises from 43.3% to 66.7%.
- Generality: Deploying MVPNet-predicted viewpoints in alternative grasp pipelines (GraspNet, TRG, GIGA) elicits consistent improvements, evidencing the framework's transferability and task-agnostic design (Qin et al., 20 Jan 2026).
6. Decoupled Design and Applicability
The explicit separation of viewpoint quality evaluation (the viewscore $S$) from the prediction architecture is central. By specifying $S$ externally, the system flexibly supports a wide family of downstream tasks: object silhouette coverage, geometric observability, uncertainty reduction, graspability, and more. Once a suitable $S$ is implemented, new manipulation or recognition objectives can leverage the same learned inference machinery and data pipeline without modification.
This suggests significant efficiency in both system engineering and retraining for new use cases. Furthermore, the heavy domain randomization in synthetic dataset generation directly supports robust sim-to-real transfer, as corroborated by the real-world gains achieved without any post-simulation fine-tuning.
The one-shot multimodal active perception framework formalizes an efficient, flexible approach to viewpoint planning in robotic manipulation, characterized by single-step camera adjustment prediction via cross-attentional fusion of segmentation and 3D cues. Its abstract decoupling of optimal-view criteria and demonstrated gains in empirical task performance establish its significance for perception-controlled robotic intelligence (Qin et al., 20 Jan 2026).