Bimanual Active Perception in Robotics

Updated 9 February 2026
  • Bimanual Active Perception (BAP) is a robotic manipulation paradigm combining coordinated two-arm actions with active sensing to overcome occlusions in dynamic tasks.
  • It leverages human demonstrations and imitation learning to learn synchronized arm and camera motion for multi-stage, occlusion-prone tasks.
  • Empirical benchmarks reveal that BAP systems outperform static or wrist-only approaches in precision, generalization, and long-horizon task success.

Bimanual Active Perception (BAP) is an integrated paradigm in robotic manipulation wherein a robot with two arms combines coordinated bimanual actions with deliberate, on-demand active sensing (primarily visual), so as to resolve occlusions, gather missing task-relevant information, and achieve precise, robust manipulation in unstructured environments. BAP entails both the hardware capability of actively controlling a head-eye or eye-in-hand sensor, and learning policies that exploit this capability, typically via imitation learning from human demonstrations. Recent works formalize BAP as central for overcoming the partial observability and complexity inherent in multi-stage, occlusion-prone manipulation tasks, and empirically show that BAP-empowered systems outperform static-camera or wrist-eye-only approaches in generalization, precision, and long-horizon task success (Xiong et al., 18 Jun 2025, Zeng et al., 2 Oct 2025, He et al., 2 Feb 2026).

1. BAP Problem Formulation and Motivation

BAP is defined as the joint selection of bimanual end-effector actions and active perceptual actions (e.g., “gaze” direction) at every timestep, in order to maximize task completion under partial observability and physical occlusion. Modern formulations cast BAP as a partially observable Markov decision process (POMDP), where the robot policy $\pi_\theta$ must issue, at each time $t$, both arm manipulations and active-viewpoint decisions (a minimal data-structure sketch follows the list below):

  • State $s_t$: latent world state (object, robot arms, and camera/neck configuration, often including force).
  • Observation $o_t$: at minimum, RGB(-D) image(s) from head- or wrist-mounted cameras, proprioceptive state, and sometimes force/torque data.
  • Action $a_t$: includes end-effector commands for each arm ($a^L_t, a^R_t \in SE(3)$) and a separate command for the “head” or active camera ($a^H_t \in SE(3)$); in the arm-based BAP variant, the active camera is carried by the non-operating arm.
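
To make this interface concrete, here is a minimal Python sketch of the per-timestep observation $o_t$ and action $a_t$ containers the formulation implies; the field names, the 7-D position-plus-quaternion pose encoding, and the `policy_step` helper are illustrative assumptions rather than structures from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class BAPObservation:
    """Observation o_t actually available to the policy (not the latent state s_t)."""
    head_rgb: np.ndarray                   # (H, W, 3) image from the active head / camera-carrying arm
    wrist_rgb_left: Optional[np.ndarray]   # wrist views, when the platform provides them
    wrist_rgb_right: Optional[np.ndarray]
    proprio: np.ndarray                    # joint positions/velocities of both arms and the neck
    wrench: Optional[np.ndarray]           # (6,) force/torque reading, when available


@dataclass
class BAPAction:
    """Joint manipulation + active-perception action a_t."""
    left_ee: np.ndarray    # (7,) target left end-effector pose: xyz + quaternion (an SE(3) element)
    right_ee: np.ndarray   # (7,) target right end-effector pose
    head: np.ndarray       # (7,) target pose of the active camera (head, neck, or camera-carrying arm)
    grippers: np.ndarray   # (2,) left/right gripper commands


def policy_step(policy, obs: BAPObservation) -> BAPAction:
    """One closed-loop step: the policy outputs arm and viewpoint commands together."""
    return policy(obs)
```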

This approach directly addresses key shortcomings of prior static or wrist-camera systems, which (1) only provide viewpoint information implicitly coupled to the current manipulator pose, and (2) struggle in tasks requiring dynamic occlusion resolution, long-horizon search, and fine 3D alignment. In contrast, BAP enables the robot to “look where it needs to” as a human demonstrator would, making both active exploration (“searching” for hidden cues) and focused manipulation (“focusing” for precision tasks) tractable (Xiong et al., 18 Jun 2025, Zeng et al., 2 Oct 2025, He et al., 2 Feb 2026).

2. Hardware and Teleoperation Systems

Recent BAP systems employ distinct hardware strategies for enabling active perception:

  • Active Head/Neck Systems: ViA (Xiong et al., 18 Jun 2025) uses a 6-DoF ARX5 robot arm as a flexible “neck,” mounting an RGB-D sensor (iPhone 15 Pro) to mimic human head-torso kinematics. Bimanual ARX5 arms perform manipulation; all are kinematically coordinated via a global world frame. The teleoperation interface features joint-to-joint mapping for the arms and direct head-pose mapping for the camera, with intermediate point-cloud scene rendering for low-latency, high-fidelity VR feedback.
  • Egocentric VR Backpack Systems: ActiveUMI (Zeng et al., 2 Oct 2025) introduces a portable teleop kit (Meta Quest 3 HMD, sensorized controllers, wearable PC) capturing synchronized 6-DoF bimanual actions and head gaze in-the-wild, then replayed on robot hardware. Head pose and gripper controllers are rigidly calibrated and aligned, logging all kinematics and active view data for cross-domain imitation learning.
  • Arm-Based Active Vision: In settings lacking high-DoF necks, the BAP strategy (He et al., 2 Feb 2026) repurposes the non-operating arm as an “eye-in-hand” sensor, yielding two classes of end-effectors: one for manipulation (with force/torque sensing) and one for active vision. Teleoperation is performed via VR platforms (Pico Ultra 4), and scene views combine static head, wrist, and arm-mounted cameras.

These hardware paradigms support direct capture of human-performed gaze strategies and bimanual coordination, addressing latency, accuracy, and alignment challenges via world-frame synchronization, precise hand–eye calibration, and multi-rate control loops.
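
As a concrete illustration of the world-frame synchronization and hand–eye calibration these systems rely on, the sketch below composes 4×4 homogeneous transforms to place an arm-mounted camera and a teleoperator's head pose in one shared world frame; the identity matrices stand in for calibration and kinematics results and are not values from the cited systems.

```python
import numpy as np


def compose(*transforms: np.ndarray) -> np.ndarray:
    """Chain 4x4 homogeneous transforms left to right."""
    out = np.eye(4)
    for T in transforms:
        out = out @ T
    return out


# Placeholder calibration results (identity here); in practice these come from
# hand-eye calibration and a one-time world-frame registration.
T_world_robotbase = np.eye(4)   # robot base expressed in the shared world frame
T_base_ee = np.eye(4)           # forward kinematics of the camera-carrying arm
T_ee_cam = np.eye(4)            # camera pose relative to that end-effector (hand-eye result)
T_world_hmd = np.eye(4)         # teleoperator head pose reported by the VR headset

# Arm-mounted ("eye-in-hand") camera pose expressed in the world frame:
T_world_cam = compose(T_world_robotbase, T_base_ee, T_ee_cam)

# Target for the active camera: reproduce the demonstrator's head pose, expressed
# relative to the robot base so the low-level controller can track it.
T_base_head_target = compose(np.linalg.inv(T_world_robotbase), T_world_hmd)
```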

3. Learning Frameworks for Bimanual Active Perception

BAP is fundamentally a learning-from-demonstration problem: the robot must imitate not only arm actions but also actively select informative camera viewpoints, based on expert human demonstrations.

  • Diffusion Policy for Multimodal Action Modeling: ViA (Xiong et al., 18 Jun 2025) applies a diffusion-based behavior cloning scheme, learning a backward Gaussian denoising process over stacked arm and neck action sequences. Input observations combine a visual embedding (frozen DINOv2 ViT [Oquab et al. 2023]) with proprioception to predict multi-action “chunks,” with the first predicted chunk executed at each policy update (see the receding-horizon sketch at the end of this section).
  • Vision-Language-Action (VLA) Multi-Stream Models: ActiveUMI (Zeng et al., 2 Oct 2025) extends multi-modal transformer architectures to fuse three image streams (left/right wrist and head), multi-DoF proprioception, and explicit head-pose encoding. Late fusion and transformer blocks act on these features to output $\Delta SE(3)$ actions for each component (a hedged fusion sketch follows this list). The system can be trained with a joint mean squared error (MSE) loss across arm and head motion, using behavior cloning from synchronized VR-collected demonstrations.
  • Transformer and Diffusion-Based Single/Multitask Models: The BAP strategy (He et al., 2 Feb 2026) benchmarks several architectures (e.g., ACT, DP, GR-MG, $\pi_0$), all trained on the BAPData real-robot demonstration set, highlighting the necessity of both visual viewpoint and force feedback streams for complex task success.
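
To ground the multi-stream fusion and joint-loss idea from the ActiveUMI bullet above, here is a hedged PyTorch sketch: per-stream image features, proprioception, and an explicit head-pose encoding are projected to a common width, fused by a small transformer, and decoded into stacked $\Delta SE(3)$-style actions trained with a joint MSE behavior-cloning loss. All dimensions, layer counts, and the shared image projection are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn


class MultiStreamBAPPolicy(nn.Module):
    """Late fusion of head / left-wrist / right-wrist features, proprioception, and head pose."""

    def __init__(self, img_dim=512, prop_dim=32, head_pose_dim=7, d_model=256, action_dim=20):
        super().__init__()
        # One shared projection applied per image stream (assumption; per-stream encoders also work).
        self.img_proj = nn.Linear(img_dim, d_model)
        self.prop_proj = nn.Linear(prop_dim, d_model)
        self.head_pose_proj = nn.Linear(head_pose_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=2)
        # action_dim = 20 is illustrative: a 6-D delta pose per arm and for the head, plus 2 grippers.
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, img_feats, proprio, head_pose):
        # img_feats: (B, 3, img_dim) pre-extracted features for the head/left/right streams.
        tokens = torch.cat([
            self.img_proj(img_feats),                      # (B, 3, d_model)
            self.prop_proj(proprio).unsqueeze(1),          # (B, 1, d_model)
            self.head_pose_proj(head_pose).unsqueeze(1),   # (B, 1, d_model)
        ], dim=1)
        fused = self.fusion(tokens)                        # (B, 5, d_model)
        return self.action_head(fused.mean(dim=1))         # (B, action_dim)


# Behavior cloning with a joint MSE loss across arm and head action components:
policy = MultiStreamBAPPolicy()
img_feats = torch.randn(8, 3, 512)
proprio = torch.randn(8, 32)
head_pose = torch.randn(8, 7)
expert_actions = torch.randn(8, 20)          # synchronized demonstration labels
loss = nn.functional.mse_loss(policy(img_feats, proprio, head_pose), expert_actions)
loss.backward()
```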

Crucially, policies may be trained with or without explicit information gain objectives, but all successful BAP approaches clone head/active-viewpoint motion as performed by skilled demonstrators.
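
The diffusion-policy bullet above relies on receding-horizon action chunking: predict a short sequence of future arm-and-neck actions, execute its beginning, then re-plan from the new observation. Below is a minimal sketch of that control loop, with the chunk length, execution horizon, and the `policy`/`env` interfaces all assumed for illustration.

```python
def run_chunked_policy(policy, env, chunk_len=16, exec_steps=8, max_steps=400):
    """Receding-horizon execution: predict an action chunk, execute a prefix, re-plan.

    Assumed interfaces (illustrative): `policy(obs)` returns an array of shape
    (chunk_len, action_dim) covering arm and head/neck commands, e.g. sampled by
    iterative diffusion denoising; `env.step(action)` returns (obs, done).
    """
    obs = env.reset()
    for _ in range(0, max_steps, exec_steps):
        action_chunk = policy(obs)                # full chunk of future actions
        for action in action_chunk[:exec_steps]:  # execute only the first exec_steps of it
            obs, done = env.step(action)          # arms and active camera move together
            if done:
                return obs
    return obs
```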

4. Benchmarks and Evaluation

Three principal benchmarks and datasets substantiate the empirical benefits of BAP:

  • Multi-Stage Occlusion Tasks (ViA): The Bag, Cup, and Lime–Pot tasks require sequential exploration, tracking, handover, and high-precision placement under occlusion (Xiong et al., 18 Jun 2025). Results show ViA (active head camera only) achieves 95%, 92%, and 88% stage-wise success, compared to 70%, 55%, and 43% for chest+wrists. Notably, wrist cameras can degrade performance in low-data regimes due to increased view occlusion.
  • In-the-Wild Generalization Tasks (ActiveUMI): Six diverse tasks (e.g., shirt folding, rope boxing, in-bag drink retrieval) demonstrate ActiveUMI’s effectiveness: full-active-perception policies achieve 70% in-distribution and 56% on novel-object/environments, compared to only 26% and 6% for wrist-only systems (Zeng et al., 2 Oct 2025).
  • EFM-10 Benchmark and BAPData (EFM/BAP): Ten tasks spanning exploration, focus, and combined challenges quantify the effect of active-view visibility:

| Visibility      | Toy-Match (%) | Cup-Hang (%) | Nail-Knock (%) | Charger-Plug (%) |
|-----------------|---------------|--------------|----------------|------------------|
| None            | 20.0          | 23.3         | 6.7            | 0.0              |
| Area only       | —             | 76.7         | 26.7           | 13.3             |
| Area + Effector | 76.7          | 90.0         | 43.3           | 20.0             |

Modern policy architectures (GR-MG, $\pi_0$) outperform baselines, but delicate insertion/focus tasks remain difficult. Including real-time force/torque readings improves performance on tasks requiring compliance (He et al., 2 Feb 2026).

5. Emergent Strategies and Observed Behaviors

Trained BAP policies demonstrate the spontaneous emergence of several archetypal active perception strategies, mirroring human behaviors:

  • Searching: Head or active arm is dynamically swept to maximize coverage until task-relevant features are detected (e.g., peeking in bags, searching shelf tiers).
  • Tracking: Upon partial observation of a target, the camera follows object motion to maintain continuous visual input for downstream manipulation.
  • Focusing: In final stages, the camera (head or arm) is positioned close to interaction zones (e.g., insertion points, pot rims) for fine spatial detail, supporting sub-centimeter or millimeter-scale alignment.

These behaviors enable BAP systems to actively “peek” into occluded regions, unachievable with fixed or wrist-only cameras, and thus commit to arm actions with improved confidence and reliability (Xiong et al., 18 Jun 2025, Zeng et al., 2 Oct 2025, He et al., 2 Feb 2026).

6. Limitations and Future Directions

Documented limitations and proposed future research directions include:

  • Hardware constraints: Current systems require either a dedicated neck/arm for active vision or high-accuracy hand–eye calibration. Reduced hardware overhead (e.g., pan–tilt heads, omnidirectional cameras) and upper-body co-designs may enable richer gaze behaviors (Xiong et al., 18 Jun 2025, Zeng et al., 2 Oct 2025).
  • Visual fidelity: Single-frame depth noise and sparse point clouds in VR teleoperation limit realism; dynamic fusion (e.g., 4D Gaussian splatting) may improve feedback (Xiong et al., 18 Jun 2025).
  • Memory and language: Most published policies are memory-free and language-agnostic, thus limiting very long-horizon search and multi-stage instruction following. Augmenting models with recurrent memory and language conditioning is a priority for generalization (Xiong et al., 18 Jun 2025, He et al., 2 Feb 2026).
  • Learning objectives: Pure imitation learning suffices for current systems, but integrating reinforcement learning or explicit information gain objectives (e.g., maximizing the expected information gain $I(\text{View}; \text{State} \mid \text{History})$, written out after this list) could further optimize viewpoint selection (Zeng et al., 2 Oct 2025).
  • Task diversity: Scaling to truly open-ended, skill-diverse manipulation remains open; leveraging simulation and web-scale data may be required for the next leap in generalization coverage (Zeng et al., 2 Oct 2025, He et al., 2 Feb 2026).
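
For reference, one standard way to write the expected information gain of a candidate viewpoint $v$, given interaction history $h$ and latent state $s$, is as the expected reduction in posterior entropy; this is a textbook identity rather than a formulation taken from the cited papers:

$$I(o_v; s \mid h) \;=\; \mathbb{H}\big[\,s \mid h\,\big] \;-\; \mathbb{E}_{o_v \sim p(o_v \mid v, h)}\Big[\mathbb{H}\big[\,s \mid h, o_v\,\big]\Big], \qquad v^\star = \arg\max_{v}\, I(o_v; s \mid h)$$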

7. Broader Significance and Outlook

Bimanual Active Perception now stands as a central paradigm for embodied AI, explicitly linking low-level sensorimotor control with real-time information acquisition strategies. By tightly coupling human-inspired gaze and bimanual manipulation, BAP systems can robustly resolve occlusions, operate with minimal “missing information,” and execute complex, multi-stage, real-world tasks with higher reliability and adaptability than previous approaches. The release of curated benchmarks (EFM-10, BAPData), portable demonstration systems (ActiveUMI), and new policy frameworks (ViA, BAP) provides an empirical baseline and infrastructure for future investigation. Remaining challenges include automated viewpoint planning, advanced perception–language integration, dynamic adaptation to unexpected occlusions, and joint exploitation of neck- and arm-based active sensing (Xiong et al., 18 Jun 2025, Zeng et al., 2 Oct 2025, He et al., 2 Feb 2026).

BAP thus defines not just a toolkit or workflow, but a foundational capability for next-generation bimanual robotic autonomy under realistic perceptual constraints.
