
Explainable Object-Oriented HRI

Updated 28 January 2026
  • X-OOHRI is a paradigm where robot capabilities and limits are explained through object-centric, AR/VR interfaces, promoting transparent human–robot interactions.
  • The framework integrates perception, affordance mapping, and XAI attribution to project actionable explanations onto physical or simulated objects.
  • User studies indicate enhanced task performance and trust due to clear, real-time feedback on robot actions and constraints.

Explainable Object-Oriented Human–Robot Interaction (X-OOHRI) refers to a paradigm in human–robot interaction where the robot’s capabilities, limitations, and internal reasoning are surfaced to users through explicit, object-centric explanations. X-OOHRI merges principles of explainable AI (XAI) and object-oriented programming (OOP), leveraging immersive interfaces—primarily augmented reality (AR) and virtual reality (VR)—to anchor explanations in the physical or simulated scene. The goal is to bridge the gap between the high-dimensional latent spaces of robot policies and the human’s semantic understanding, enabling mixed-initiative control, transparent communication of constraints, and rapid resolution of failures in task execution (Wang et al., 21 Jan 2026, Heuvel et al., 1 Apr 2025).

1. Core Concepts and Objectives

X-OOHRI is defined by two central objectives: making robot action possibilities and limitations transparent at the object level, and enabling direct, semantically-grounded manipulation of both robot actions and environment state. Every relevant real-world or simulated entity—robot or object—is represented as a class instance (e.g., XObject or XRobot) encapsulating attributes (size, weight, position, reach limits, payload, etc.) and affordance methods (such as Pick(), Place(), Stack(), Vacuum()). This object-oriented affordance schema allows the system to expose for each entity the complete set of actions the robot can perform, the constraints influencing these affordances, and context-sensitive explanations when tasks are or are not feasible (Wang et al., 21 Jan 2026).
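The object-oriented affordance schema can be sketched as plain Python classes. The XObject/XRobot names and the Pick() affordance follow the paper; the specific attribute names, units, and default affordance list here are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class XRobot:
    """Robot entity with capability limits. Attribute names beyond the
    paper's examples (reach, aperture, payload) are illustrative."""
    name: str
    reach_m: float         # maximum reach radius from the base
    max_aperture_m: float  # gripper opening
    payload_kg: float      # maximum liftable weight

@dataclass
class XObject:
    """Object entity encapsulating attributes and affordance methods."""
    name: str
    size_m: float
    weight_kg: float
    position: tuple        # (x, y, z) in the shared frame
    affordances: list = field(default_factory=lambda: ["Pick", "Place", "Stack"])

    def pick(self, robot: XRobot) -> bool:
        """Pick() is feasible only if the object fits the gripper and is
        within the robot's payload (reachability is checked separately)."""
        return (self.size_m < robot.max_aperture_m
                and self.weight_kg <= robot.payload_kg)

robot = XRobot("arm1", reach_m=0.9, max_aperture_m=0.12, payload_kg=2.0)
book = XObject("book", size_m=0.05, weight_kg=0.4, position=(0.3, 0.1, 0.8))
print(book.pick(robot))  # → True
```

Attaching each affordance as a method keeps the feasibility logic bound to the entity it explains, which is what lets the interface enumerate per-object actions and constraints.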

This approach contrasts with traditional black-box robot controllers or purely task-centric planning interfaces. By projecting affordances, constraints, and reasoning onto tangible scene elements—chairs, books, doors—users are empowered to issue object-centric commands and receive immediate feedback on what is possible, what went wrong, and available recovery paths. The model thus supports shared autonomy and mixed-initiative task resolution.

2. System Architecture and Technical Frameworks

Modern X-OOHRI systems integrate several subsystems, each fulfilling a distinct role in the object-oriented explainability pipeline:

  • Perception and Semantic Scene Reconstruction: AR/VR headsets use stereo camera feeds to generate spatially-aligned meshes or 'virtual twins' (GhostObjects) for all relevant objects. Scene segmentation is handled by vision-language models (VLMs, e.g., GPT-5.1), which output object attributes and affordance methods by estimating P(attribute_k | o_i) via model logits. These estimates are thresholded to instantiate XObject class instances (Wang et al., 21 Jan 2026).
  • Affordance Mapping and Feasibility Checking: Given the current world state S, a mapping function f_affordance(o, r, S) → {allowed, disallowed} checks whether a robot r can execute a method m on object o, based on a cascade of parameterized constraints (e.g., reachability, graspability, manipulability):
    • c_1(o, r, S): Reachable ⇔ o.position ∈ ReachMesh(r, S)
    • c_2(o, r, S): Graspable ⇔ size(o) < MaxAperture(r)
    • c_3(o, r, S): Manipulable ⇔ weight(o) ≤ Payload(r)
    • If any condition fails, explanations are fetched from a human-readable tag lookup (e.g., “Too high to reach,” “Too heavy”) (Wang et al., 21 Jan 2026).
  • XAI Attribution and Semantic Projection: For RL-based navigation, a gradient-based XAI backend (specifically, Vanilla Gradient saliency) computes the gradient of the policy’s linear velocity head with respect to each lidar input at runtime: g = ∂[π_θ(s)]_v / ∂L, where g ∈ ℝ^15. After normalization to g* ∈ [0, 1], these attributions are projected via ray-casting onto scene objects, generating glowing outlines whose thickness encodes object-level importance (Heuvel et al., 1 Apr 2025).
  • Physical and Virtual Alignment: Virtual twins are spatially collocated with physical objects using spatial anchors and ROS2 transforms; mesh models can be hand-generated (e.g., via Blender) for high fidelity, and calibration updates maintain alignment throughout task execution (Wang et al., 21 Jan 2026).
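The constraint cascade c_1 → c_2 → c_3 with tag lookup might look like the sketch below. A Euclidean distance-to-base test stands in for the paper's ReachMesh, the dict-based entities are simplifications, and the graspability tag string is an assumed analogue of the paper's examples:

```python
import math

# Human-readable explanation tags indexed by the first failing constraint;
# "Too high to reach" and "Too heavy" appear in the paper, the
# graspability tag is an assumed analogue.
TAGS = {
    "reachable":   "Too high to reach",
    "graspable":   "Too large to grasp",
    "manipulable": "Too heavy",
}

def check_affordance(obj: dict, robot: dict):
    """Evaluate the cascade c1 -> c2 -> c3 in order; return ('allowed', None)
    or ('disallowed', tag) for the first failing constraint."""
    checks = [
        # c1: reachability -- distance to base stands in for ReachMesh(r, S)
        ("reachable",   math.dist(obj["position"], robot["base"]) <= robot["reach_m"]),
        # c2: graspability -- object must fit the gripper aperture
        ("graspable",   obj["size_m"] < robot["max_aperture_m"]),
        # c3: manipulability -- object weight within payload
        ("manipulable", obj["weight_kg"] <= robot["payload_kg"]),
    ]
    for name, ok in checks:
        if not ok:
            return "disallowed", TAGS[name]
    return "allowed", None

robot = {"base": (0.0, 0.0, 0.0), "reach_m": 1.0,
         "max_aperture_m": 0.10, "payload_kg": 2.0}
mug = {"position": (0.5, 0.0, 0.2), "size_m": 0.08, "weight_kg": 0.3}
print(check_affordance(mug, robot))  # → ('allowed', None)
```

Returning only the first failing constraint matches the cascade semantics: the user sees the single most upstream reason an action is disallowed rather than a list of every violated condition.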

3. Interface Modalities and Communication Channels

X-OOHRI systems employ multimodal AR/VR interfaces to maximize clarity and transparency:

  • Signifiers and Actionable Feedback: When a user’s controller ray intersects an actionable GhostObject, a white halo appears, enhancing affordance discoverability.
  • Radial Menus and Action Enumeration: Objects present radial menus displaying affordance methods M_i. Infeasible actions appear grayed out or omitted, directly reflecting constraint status.
  • Color Coding: Non-permitted objects highlight in red with associated tags. Green or native color indicates feasibility.
  • Explanation Tags and Previews: Floating, anchored labels explain the specific constraint causing action disallowance. Selecting an auto- or alternative-resolution animates a preview of the corrective plan.
  • Direct Manipulation: Users can “grab” or lasso GhostObjects, drag them to new positions, and observe immediate constraint checks and feedback, supporting both low-level and high-level commands (e.g., Pick-and-Place, Fill, region-specified Vacuum) (Wang et al., 21 Jan 2026).

For navigation, VR projections overlay lidar rays (green for free space, red for collisions) and gradient-inspired object outlines—jointly conveying what the robot perceives and what it reasons about (Heuvel et al., 1 Apr 2025).

4. Methodologies for Explanation Generation and Attribution

Explanation generation in X-OOHRI is tightly coupled to the underlying object-oriented schema and system constraints:

  • For manipulation tasks, the feasibility check traces the constraint tree. Upon denial, the system attaches the label from the tag lookup table indexed by the first failing constraint. A set of permitted resolutions—Auto (system executes a corrective plan), Alternative (switch target/object), Ignore (force execution), or none—is displayed as part of the interface for mixed-initiative correction.
  • For RL-based navigation, attributions derived from policy gradients are normalized and mapped onto scene objects via geometric ray-casting, ensuring that end-user explanations are always grounded in directly perceivable elements. The pipeline separates sensor-level saliency computation from geometric grounding and ultimately from affordance display.
  • All explanation pathways are designed for in-situ and immediate feedback, minimizing trial-and-error by surfacing potential failures, their causes, and possible recoveries pre-execution (Heuvel et al., 1 Apr 2025, Wang et al., 21 Jan 2026).
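As an illustration of the Vanilla Gradient step, the following sketch approximates g_i = ∂v/∂L_i per lidar beam with central finite differences and applies the min-max normalization to g*. In the paper this gradient comes from autodiff through the policy network, so the finite-difference scheme here is only a dependency-free stand-in:

```python
import numpy as np

def lidar_saliency(policy_v, state, n_beams=15, eps=1e-4):
    """Approximate the Vanilla Gradient attribution g_i = dv/dL_i for each
    lidar beam via central differences, then min-max normalize to g* in [0, 1].
    `policy_v` maps a state vector to the scalar linear-velocity output."""
    g = np.zeros(n_beams)
    for i in range(n_beams):
        s_plus, s_minus = state.copy(), state.copy()
        s_plus[i] += eps
        s_minus[i] -= eps
        g[i] = (policy_v(s_plus) - policy_v(s_minus)) / (2 * eps)
    g = np.abs(g)                                      # magnitude of influence
    return (g - g.min()) / (g.max() - g.min() + 1e-8)  # normalized g*

# Toy linear "policy head": the attribution recovers the absolute weights.
w = np.arange(15.0)
g_star = lidar_saliency(lambda s: float(w @ s), np.ones(15))
```

Each normalized g*_i can then be ray-cast along its beam direction to find the scene object it intersects, yielding the object-level importance that drives the outline thickness.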

5. Evaluation and Empirical Findings

Extensive user studies evaluate both objective task performance and subjective understanding:

| Modality | Kendall’s τ (Object Ranking) | SIPA (Transparency, Intelligibility, Predictability Increase) | Trust/Plausibility Effect |
|---|---|---|---|
| None | 0.42 (SD 0.12) | Baseline | Baseline |
| XAI only | 0.52 (SD 0.11) | +0.9 points, significant main effect (F = 5.39, p = .030) | Plausibility not shifted |
| Lidar only | 0.45 (SD 0.13) | +0.6 points | Strong plausibility effect |
| XAI + Lidar | 0.54 (SD 0.10) | Max effect (+1.0 points, F = 6.90, p = .015) | Greater subjective gain |

In AR object-manipulation studies, all participants (N=14) successfully completed tasks, reporting “good” usability (SUS 79.3), low workload (NASA-TLX), and high self-reported understanding of robot abilities and of when assistance would be necessary. Qualitative feedback highlighted the clarity and immediacy of AR-based explanation tags, the value of seeing unavailable actions (“grayed out” affordances), and a general preference for combining direct manipulation with additional modalities such as voice (Heuvel et al., 1 Apr 2025, Wang et al., 21 Jan 2026).

For VR-based explainability of RL navigation, users exposed to object-projected gradient saliencies achieved higher agreement with ground-truth policy priorities (Δτ ≈ +0.10 with XAI). Subjective ratings of transparency, predictability, and plausibility increased with richer modality combinations, though trust ratings were unaffected. Freeform responses suggested that richer explanation led to more accurate and detailed user mental models (Heuvel et al., 1 Apr 2025).

6. Example Applications

X-OOHRI has been demonstrated in a spectrum of HRI tasks:

  • Pick-and-Place: Users select, drag, and reposition GhostObjects in AR; infeasible movements trigger immediate, semantically-tagged feedback with suggestions for auto-resolutions (e.g., “Too high to reach—Auto: use a stool”).
  • High-Level Instructions: Parameterized affordances (e.g., Fill(glass, amount)) allow specification using direct manipulation and in-place previews, with all constraint checks and explanations surfaced in the user’s view.
  • Area-Specified Vacuuming/Room Cleanup: Region lasso selects zones; obstacles preclude actions until manually or automatically resolved, with AR tags (“Occluded,” “Closed door”) anchoring failure causes to physical objects (Wang et al., 21 Jan 2026).
  • RL Navigation Explainability: Semantic VR overlays ground neural saliencies in robot navigation, enabling scene-level understanding of why path deviations occur and which obstacles influence decisions (Heuvel et al., 1 Apr 2025).

7. Limitations and Future Research Directions

Several open challenges and limitations are identified:

  • VLM and Perception Errors: Unobservable or deceptive attributes (e.g., weight inside opaque containers) can cause mismatches between visual inference and true feasibility. Future work may integrate multi-view or sensor-based updates and closed-loop execution feedback.
  • Generalization Beyond Colocated AR: Current frameworks typically assume the user and robot share a physical workspace. Extending X-OOHRI to remote or teleoperated scenarios will require photorealistic VR scene reconstruction and robust spatial alignment.
  • Adaptive Explanation Granularity: While all resolutions currently require explicit user confirmation, adaptive delegation or full autonomy based on confidence thresholds presents an opportunity for efficiency gains.
  • Interactive Learning: Actions permitted or denied during interaction can compose a corpus of implicit training signals, enabling affordance models to improve over time based on user correction.

A plausible implication is that X-OOHRI, by tightly coupling object-oriented affordance schemas with XAI explainability and immersive visualization, constitutes a foundational step toward fluid, mixed-initiative collaboration with physical and virtual robots in both domestic and industrial domains (Wang et al., 21 Jan 2026, Heuvel et al., 1 Apr 2025).
