One-Shot Multimodal Active Perception
- One-shot multimodal active perception is a single-step method that fuses 2D silhouette and 3D geometric cues to predict optimal camera viewpoints.
- The framework decouples the evaluation of viewpoint quality from the learning model, enabling versatile application across various robotic manipulation tasks.
- A cross-attention transformer architecture integrates multimodal inputs to directly regress six-DoF camera adjustments, significantly boosting grasp success rates.
A one-shot multimodal active perception framework for robotic manipulation refers to the approach wherein a robot, from a single observation, predicts in one inference step the camera motion required to reach an optimal, task-informative viewpoint—eschewing multi-step, closed-loop next-best-view strategies. The multimodal aspect denotes the integration of distinct perceptual inputs, typically a 2D object mask and a 3D point cloud, enabling exploitation of both silhouette and geometric cues. Decoupling the viewpoint quality evaluation from the learning architecture confers broad applicability across manipulation tasks and allows task-agnostic dataset generation and training. A cross-attention-based multimodal network architecture is used to directly regress the required camera pose adjustment, and the approach is validated on robotic grasping in both simulation and real-world environments, exhibiting substantial improvements in task success rates and robust sim-to-real transfer without further tuning (Qin et al., 20 Jan 2026).
1. Formal Problem and Approach Overview
One-shot active perception reduces the problem of viewpoint planning to single-step inference: given a current input, the system predicts a six-DoF camera adjustment, avoiding iterative candidate evaluation or sequential next-best-view methods. Multimodal fusion is adopted to leverage object geometry (from depth-based point clouds) and appearance or segmentation cues (from mask images).
Data Collection Pipeline
The pipeline is organized in two primary phases:
- Systematic Candidate Viewpoint Sampling: For each object, legal camera poses are sampled on an admissible viewing manifold (e.g., side views).
- For each pose, the robot collects RGB-D, precise camera pose, and a semantic mask computed via a detector/segmenter.
- Viewpoint Quality Evaluation & Domain Randomization: Each observation is assigned a viewscore by a task-specific function, commonly instantiated (for grasping) as the average grasp quality metric $S(v) = \frac{1}{K}\sum_{k=1}^{K} q_k(v)$, where $q_k(v)$ is the grasp model's output for the $k$-th grasp candidate from view $v$.
- The top-$N$ viewpoints by viewscore are clustered (e.g., via DBSCAN) and the cluster centroid is annotated as the optimal viewpoint $v^{*}$.
- This is repeated under heavy domain randomization (varying object type, pose, scale, illumination, background, and camera placement), yielding a synthetic dataset of (observation, optimal-viewpoint) tuples.
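The labeling phase above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: function and parameter names are assumptions, and a naive single-pass density grouping stands in for DBSCAN.

```python
import numpy as np

def mine_optimal_viewpoint(view_positions, grasp_scores, top_n=10, eps=0.15):
    """Label one sample: average per-view grasp quality, keep the top-N views,
    group nearby survivors (a crude stand-in for DBSCAN), and return the
    densest group's centroid as the optimal viewpoint v*.
    view_positions: (V, 3) candidate camera positions.
    grasp_scores:   (V, K) per-view scores of K grasp candidates."""
    viewscore = grasp_scores.mean(axis=1)          # S(v) = (1/K) sum_k q_k(v)
    top = np.argsort(viewscore)[-top_n:]           # indices of the top-N views
    pts = view_positions[top]
    # Naive density grouping: each unlabeled point within eps of point i
    # inherits i's group label.
    labels = -np.ones(len(pts), dtype=int)
    next_label = 0
    for i in range(len(pts)):
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
        close = np.linalg.norm(pts - pts[i], axis=1) < eps
        labels[close & (labels == -1)] = labels[i]
    densest = np.bincount(labels).argmax()         # largest group wins
    return pts[labels == densest].mean(axis=0)     # its centroid -> v*
```

Because the quality scores, not the network, define the label, the same mining routine can be reused for any task by swapping the score source.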
2. Optimal Viewpoint: Definition and Quality Metrics
The framework defines the optimal viewpoint relative to a set of sampled candidates $\{v_i\}$. Each candidate is ranked using the viewscore
$S(v_i) = \frac{1}{K} \sum_{k=1}^{K} q_k(v_i),$
where the $q_k(v_i)$ are model outputs (e.g., grasp scores). The optimal viewpoint $v^{*}$ is then selected as the centroid of the cluster formed by the top-$N$ candidates, with $N$ and the clustering parameters chosen for robustness to sampling noise.
Crucially, the viewscore $S(\cdot)$ is external to the perception network. For different downstream requirements (e.g., maximizing visible object surface area, minimizing pose uncertainty), any deterministic or learned task metric can be substituted, yielding an optimal-view definition matched to arbitrary manipulation objectives. This modularity enables direct re-application of the pipeline to new tasks without architecture or dataset changes.
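The decoupling can be made concrete: the viewscore is just a callable handed to the labeling pipeline, so switching tasks means switching the callable. The names below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def grasp_viewscore(obs):
    """S(v) for grasping: mean predicted grasp quality over candidates."""
    return float(np.mean(obs["grasp_scores"]))

def coverage_viewscore(obs):
    """S(v) for observability: fraction of the object's surface visible."""
    return obs["visible_points"] / obs["total_points"]

def rank_views(observations, viewscore):
    """Rank candidate views by any task-specific viewscore, highest first.
    The ranking machinery never needs to know which task S encodes."""
    return sorted(observations, key=viewscore, reverse=True)
```

The same `rank_views` (and, downstream, the same dataset-mining and network-training machinery) serves both tasks; only the metric changes.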
3. Multimodal Optimal Viewpoint Prediction Network (MVPNet)
Inputs and Feature Encoding
- 2D Input: Binary mask of the target object (segmentation via grounding detector plus segmenter).
- 3D Input: Masked point cloud from the depth image.
Separate encoders process each stream:
- Mask: ResNet-18 backbone producing a 2D token map.
- Point Cloud: Dual-stream PointNet++ and PointNeXt producing two sets of 3D tokens.
Tokens from both modalities are concatenated; a learnable [CLS] token is prepended.
Cross-Attention Transformer Fusion
The multimodal token sequence is processed through $L$ layers of a Transformer encoder. Within each layer, cross-attention aligns the 2D and 3D representations:
$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V,$
where $Q = XW_Q$, $K = XW_K$, $V = XW_V$, and $X$ contains concatenated tokens from all input modalities.
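A minimal single-head sketch of this fusion step, assuming random stand-ins for the learned projection matrices and omitting residual connections, LayerNorm, and the feed-forward sublayer:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_tokens(mask_tokens, pc_tokens, Wq, Wk, Wv):
    """One attention pass over the concatenated multimodal sequence.
    A zero [CLS] token is prepended (learnable in practice); Q, K, V all
    derive from the same concatenated sequence, so every token can attend
    across modalities."""
    d = mask_tokens.shape[1]
    cls = np.zeros((1, d))                                     # [CLS] placeholder
    X = np.concatenate([cls, mask_tokens, pc_tokens], axis=0)  # (1+N2+N3, d)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))                 # attention weights
    return A @ V                                               # fused tokens; row 0 is [CLS]
```

After $L$ such layers (with the usual residual/MLP machinery), row 0 of the output — the updated [CLS] feature — is what feeds the pose-regression head.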
Pose Regression
After $L$ Transformer layers, the updated [CLS] token feature is input to an MLP that regresses the six-DoF camera adjustment $\Delta T = (\Delta t, \Delta q)$, with the output parameterized as a relative translation $\Delta t \in \mathbb{R}^{3}$ and a unit-quaternion rotation $\Delta q$.
4. Training Strategy and Synthetic Dataset Construction
Loss Functions
Supervision jointly penalizes translation and rotation errors, $\mathcal{L} = \mathcal{L}_{\mathrm{trans}} + \lambda\,\mathcal{L}_{\mathrm{rot}}$, with an $L_2$ translation term $\mathcal{L}_{\mathrm{trans}} = \lVert \widehat{\Delta t} - \Delta t^{*} \rVert_2$ and a quaternion-distance rotation term $\mathcal{L}_{\mathrm{rot}}$, where $(\widehat{\Delta t}, \widehat{\Delta q})$ are the predicted camera adjustments and $(\Delta t^{*}, \Delta q^{*})$ the ground truth.
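One plausible instantiation of this joint loss (the paper's exact rotation term may differ): L2 on translation, and $1 - |\langle \widehat{\Delta q}, \Delta q^{*} \rangle|$ on unit quaternions, which is sign-invariant since $q$ and $-q$ encode the same rotation.

```python
import numpy as np

def pose_loss(t_pred, q_pred, t_gt, q_gt, lam=1.0):
    """Joint translation + rotation loss (a sketch, not the paper's exact form).
    Quaternions are normalized to unit length before comparison, and the
    absolute inner product makes the term invariant to quaternion sign."""
    l_trans = np.linalg.norm(t_pred - t_gt)           # L2 translation error
    q_pred = q_pred / np.linalg.norm(q_pred)
    q_gt = q_gt / np.linalg.norm(q_gt)
    l_rot = 1.0 - abs(np.dot(q_pred, q_gt))           # 0 when rotations agree
    return l_trans + lam * l_rot
```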
Dataset Generation
Synthetic data is rendered using NVIDIA Isaac Sim, with approximately 17,000 samples created with extensive domain randomization spanning object diversity, spatial configurations, illumination, backgrounds, and initial camera positions. For every sample:
- The optimal viewpoint is determined as per Section 2.
- The training target is the relative transform $\Delta T$ taking the current camera pose to the optimal pose $v^{*}$, with rotational component $\Delta q = q_{\mathrm{cur}}^{-1} \otimes q^{*}$, $\otimes$ denoting quaternion composition.
No human-annotated labels are necessary; all supervision arises through synthetic optimal-viewpoint mining.
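The relative-transform target reduces to elementary quaternion algebra. The sketch below assumes the Hamilton (w, x, y, z) convention and expresses the translation in the current camera frame — one plausible convention; the paper's exact frame convention is not restated here.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_conj(q):
    """Conjugate (= inverse for unit quaternions)."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def quat_rotate(q, v):
    """Rotate vector v by unit quaternion q via q * (0, v) * q^-1."""
    qv = np.concatenate([[0.0], v])
    return quat_mul(quat_mul(q, qv), quat_conj(q))[1:]

def relative_target(t_cur, q_cur, t_opt, q_opt):
    """Training target: transform from the current pose to the mined optimal
    pose, with translation re-expressed in the current camera frame."""
    dq = quat_mul(quat_conj(q_cur), q_opt)             # q_cur^-1 (x) q*
    dt = quat_rotate(quat_conj(q_cur), t_opt - t_cur)  # translation in current frame
    return dt, dq
```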
5. Implementation in Robotic Grasping Pipelines
Perception-to-Action Workflow
At deployment, the decision loop comprises:
- Acquire RGB-D from a random side-view.
- Segment the scene to generate the target mask (Grounding DINO, SAM 2).
- Either directly attempt grasping from this initial view (the baseline), recording its success rate, or:
- Process the mask and point cloud through MVPNet, predict , physically reposition the camera, reacquire perception.
- Rerun grasp detection on the optimal view and record the improved success rate.
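The decision loop above can be written down as a short script. Every function here is a hypothetical stub standing in for a real component (Grounding DINO + SAM 2 segmentation, MVPNet inference, camera motion control, a grasp detector); none of these signatures come from the paper.

```python
import numpy as np

# --- Stubs for the real perception/control components (all hypothetical) ---
def acquire_rgbd():             return np.zeros((480, 640, 4))            # camera read
def segment_target(rgbd):       return rgbd[..., 0] > -1.0                # binary mask
def mask_point_cloud(rgbd, m):  return np.zeros((1024, 3))                # back-projection
def mvpnet(mask, pc):           return np.zeros(3), np.array([1.0, 0, 0, 0])  # (dt, dq)
def move_camera(dt, dq):        pass                                      # controller
def detect_grasps(rgbd, m):     return [{"score": 0.9}]                   # grasp model

def one_shot_grasp():
    """Deployment loop: one view, one network call, one camera move, grasp."""
    rgbd = acquire_rgbd()                  # RGB-D from a random side view
    mask = segment_target(rgbd)
    pc = mask_point_cloud(rgbd, mask)
    dt, dq = mvpnet(mask, pc)              # single-step viewpoint prediction
    move_camera(dt, dq)                    # physically reposition the camera
    rgbd = acquire_rgbd()                  # reacquire from the optimal view
    mask = segment_target(rgbd)
    return max(detect_grasps(rgbd, mask), key=lambda g: g["score"])
```

Note that MVPNet is called exactly once per grasp attempt; there is no iterative view refinement, which is the defining contrast with next-best-view loops.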
Experimental Evaluation
- Simulation: Evaluated with a Franka arm and wrist-mounted Intel RealSense in Isaac Sim, over 250 randomized grasp trials with 65 similar and 18 novel objects. Metrics include first-try success (SR-1) and five-best average (SR-5). For the Economic Grasp baseline:
- SR-1 increases from 53.4% to 64.0% (similar objects), from 48.0% to 64.7% (novel objects).
- Mean SR-1 improvement is +10.2%, SR-5 improvement is +14.0%.
- Real-World: Without post-simulation fine-tuning, grasp success improves from 25.5% to 47.6%; declutter rate rises from 43.3% to 66.7%.
- Generality: Deploying MVPNet-predicted viewpoints in alternative grasp pipelines (GraspNet, TRG, GIGA) elicits consistent improvements, evidencing the framework's transferability and task-agnostic design (Qin et al., 20 Jan 2026).
6. Decoupled Design and Applicability
The explicit separation of viewpoint quality evaluation (the viewscore $S$) from the prediction architecture is central. By specifying $S$ externally, the system flexibly supports a wide family of downstream tasks: object silhouette coverage, geometric observability, uncertainty reduction, graspability, and more. Once a suitable $S$ is implemented, new manipulation or recognition objectives can leverage the same learned inference machinery and data pipeline without modification.
This suggests significant efficiency in both system engineering and retraining for new use cases. Furthermore, the heavy domain randomization in synthetic dataset generation directly supports robust sim-to-real transfer, as corroborated by the real-world gains achieved without any post-simulation fine-tuning.
The one-shot multimodal active perception framework formalizes an efficient, flexible approach to viewpoint planning in robotic manipulation, characterized by single-step camera adjustment prediction via cross-attentional fusion of segmentation and 3D cues. Its abstract decoupling of optimal-view criteria and demonstrated gains in empirical task performance establish its significance for perception-controlled robotic intelligence (Qin et al., 20 Jan 2026).