
ArtiSG: Functional 3D Scene Graph Construction via Human-demonstrated Articulated Objects Manipulation

Published 31 Dec 2025 in cs.RO (arXiv:2512.24845v1)

Abstract: 3D scene graphs have empowered robots with semantic understanding for navigation and planning, yet they often lack the functional information required for physical manipulation, particularly regarding articulated objects. Existing approaches for inferring articulation mechanisms from static observations are prone to visual ambiguity, while methods that estimate parameters from state changes typically rely on constrained settings such as fixed cameras and unobstructed views. Furthermore, fine-grained functional elements like small handles are frequently missed by general object detectors. To bridge this gap, we present ArtiSG, a framework that constructs functional 3D scene graphs by encoding human demonstrations into structured robotic memory. Our approach leverages a robust articulation data collection pipeline utilizing a portable setup to accurately estimate 6-DoF articulation trajectories and axes even under camera ego-motion. We integrate these kinematic priors into a hierarchical and open-vocabulary graph while utilizing interaction data to discover inconspicuous functional elements missed by visual perception. Extensive real-world experiments demonstrate that ArtiSG significantly outperforms baselines in functional element recall and articulation estimation precision. Moreover, we show that the constructed graph serves as a reliable functional memory that effectively guides robots to perform language-directed manipulation tasks in real-world environments containing diverse articulated objects.

Summary

  • The paper introduces ArtiSG, a framework that integrates human demonstration-derived kinodynamic priors to embed actionable functional information into 3D scene graphs.
  • It employs a three-stage process—semantic initialization, viewpoint-robust articulation estimation, and interaction-augmented graph refinement—to significantly boost functional element recall.
  • Experimental results demonstrate major performance gains, with functional recall improving from 55.8% to 88.5% and substantial reduction in trajectory estimation errors in dynamic conditions.

Functional 3D Scene Graph Construction from Human Demonstrations for Robotic Manipulation

Introduction

The construction of scene graphs for robotic applications has predominantly focused on semantic and geometric representations, often omitting the crucial functional information necessary for physically grounded interaction with articulated objects. The paper "ArtiSG: Functional 3D Scene Graph Construction via Human-demonstrated Articulated Objects Manipulation" (2512.24845) introduces a framework that directly addresses this gap. It proposes encoding human demonstration data—specifically, manipulation trajectories and articulation mechanisms—into hierarchical, open-vocabulary 3D scene graphs. This approach establishes a robust robotic memory for subsequent task planning and manipulation, effectively linking visual perception to actionable affordances (Figure 1).

Figure 1: Human demonstration-derived articulation trajectories are extracted and registered as functional elements, enabling open-vocabulary localization and action priors for manipulation in the constructed scene graph.

System Architecture

ArtiSG is structured in three sequential stages:

  1. Scene Graph Initialization: The system constructs an initial semantic and geometric scene graph from multi-view RGB-D mapping. Object-level nodes and candidate functional elements are identified via instance segmentation and clustering, complemented by top-k frame selection for optimal viewpoint aggregation. Semantic features are extracted with powerful vision-language encoders, and functional elements are detected with promptable object detectors followed by fine-grained mask segmentation models. Results from multiple views are fused to improve robustness and recall.
  2. Viewpoint-Robust Articulation Estimation: The crucial innovation is a hardware-assisted data collection pipeline. A head-mounted RGB-D camera tracks a UMI gripper fitted with a polyhedral ArUco marker sphere, enabling robust 6-DoF pose estimation even under dynamic operator ego-motion. The system fuses articulated manipulation trajectories with SLAM-derived camera poses and applies PCA/SVD-based routines to classify prismatic and revolute articulations and recover their respective axes (Figure 2).

    Figure 2: System overview of ArtiSG, depicting the staged progression from multi-view semantic aggregation to viewpoint-robust kinodynamic estimation and final interaction-driven graph refinement.

    Figure 3

    Figure 3: The marker-equipped UMI gripper and OptiTrack hardware foundation enable precise 6-DoF articulation trajectory acquisition and evaluation.

  3. Interaction-Augmented Graph Refinement: Manipulation-derived trajectories and articulation parameters are geometrically aligned with the graph nodes. When human interactions expose functional elements missed in visual initialization due to occlusion or implicitness, new nodes are dynamically instantiated. For elements already detected, articulation and kinematic priors are explicitly attached to the corresponding graph nodes, enhancing the graph's actionability.
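The PCA/SVD-based classification and axis recovery in stage 2 can be sketched with standard linear algebra. The following is a minimal, hypothetical reconstruction (not the paper's code): SVD of the centered trajectory separates near-linear motion (prismatic) from planar arcs (revolute), and a Kasa-style least-squares circle fit recovers the rotation center; the `ratio_thresh` value is an illustrative assumption.

```python
import numpy as np

def classify_articulation(points, ratio_thresh=0.1):
    """Classify a 3D demonstration trajectory as prismatic or revolute.

    points: (N, 3) array of end-effector positions.
    Returns (joint_type, axis_direction, axis_point).
    """
    centered = points - points.mean(axis=0)
    # SVD gives the principal directions; singular values measure spread.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    if s[1] / s[0] < ratio_thresh:
        # Essentially 1-D motion: prismatic joint along the dominant direction.
        return "prismatic", Vt[0], points.mean(axis=0)
    # Planar arc: project onto the best-fit plane and fit a circle (Kasa fit).
    plane_normal = Vt[2]                      # revolute axis = plane normal
    xy = centered @ Vt[:2].T                  # 2-D coordinates in the plane
    A = np.column_stack([2 * xy, np.ones(len(xy))])
    b = (xy ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center3d = points.mean(axis=0) + sol[:2] @ Vt[:2]
    return "revolute", plane_normal, center3d
```

On a perfectly straight drawer-like path this returns a prismatic label with the sliding direction; on a circular door-like arc it returns the hinge axis and center.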

Functional Element Detection and Scene Graph Quality

Evaluation in both simulation (BEHAVIOR-1K) and real-world (kitchen, pantry, tabletop) environments demonstrates ArtiSG's clear superiority in functional element recall and generalizability over baselines including OpenFunGraph and Lost&Found. Notably, introducing human demonstration data boosts functional element recall in real-world scenes from 55.8% to 88.5%, with a corresponding improvement in overall F1 score, reflecting effective compensation for the limitations of static vision-only approaches.

ArtiSG's open-vocabulary querying mechanism, supported by top-k frame aggregation and advanced vision-language models, yields high-fidelity node representations. This enables precise instance retrieval and supports complex language-driven manipulation tasks. Visualization results illustrate the enhanced localization accuracy and completeness achieved, particularly with inconspicuous or ambiguous elements (Figure 4).

Figure 4: Qualitative comparison shows ArtiSG's superior localization and recall of ground-truth functional elements over baselines in both real and simulated domains.
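The open-vocabulary retrieval described above can be illustrated with a toy routine. Cosine-similarity matching between a text embedding and per-node fused features is standard practice for such graphs, but the node layout and the idea of pre-fused features below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def query_graph(nodes, query_feat, k=1):
    """Return the names of the top-k graph nodes by cosine similarity.

    nodes: list of (name, feature) pairs, where each feature is a fused
    semantic embedding (e.g. averaged over a node's top-k views).
    query_feat: embedding of the language query from the same encoder.
    """
    q = query_feat / np.linalg.norm(query_feat)
    scored = []
    for name, feat in nodes:
        f = feat / np.linalg.norm(feat)
        scored.append((float(q @ f), name))   # cosine similarity, node name
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

With real encoders, `query_feat` would come from embedding a phrase such as "cabinet handle"; retrieval then reduces to the same nearest-neighbor lookup.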

Articulation Axis and Trajectory Estimation

Articulation parameter estimation is quantitatively validated using an OptiTrack motion capture system. Compared to vision-based hand/keypoint tracking (Mediapipe, CoTracker) and static vision approaches (GFlow), the marker-based trajectory acquisition delivers an order-of-magnitude improvement in trajectory and axis estimation error, especially in dynamic conditions and for revolute joints. The approach is robust to operator movement and viewpoint changes, outperforming state-of-the-art trackers even under challenging conditions, with trajectory RMSE reduced to 0.82 cm in dynamic scenarios (Figure 5).

Figure 5: Viewpoint-robust articulation tracking is visualized for both prismatic and revolute joints; the decoupled marker system enables high-precision 6-DoF pose recovery.

Robotic Manipulation via ArtiSG Memory

Downstream utility is demonstrated by integrating ArtiSG with a Franka Research 3 arm for language-guided opening tasks. The system is evaluated on objects with inconspicuous functional elements, atypical kinematics, or ambiguous mechanisms. While even advanced VLMs frequently mispredict affordance locations and articulation types under visual ambiguity or partial observability, ArtiSG leverages demonstration-derived graph memory to retrieve precise interaction trajectories and axis constraints. This enables successful, physically grounded execution of complex manipulation tasks (Figure 6).

Figure 6: Robots actuate accurately on objects with subtle or visually ambiguous mechanisms by following trajectories stored in ArtiSG, in contrast to pure VLM-based approaches that fail under limited visual cues.

Theoretical and Practical Implications

ArtiSG advances the state of embodied scene understanding in two significant directions:

  • The integration of human-interaction-derived kinodynamic priors explicitly augments perception-based scene representations, bridging the gap between static detection and actionable affordances.
  • Robustness to diverse viewpoints and unconstrained environments is realized via hardware decoupling and dynamic graph refinement, pushing the limits of manipulation policy generalizability and reliability.

Practically, ArtiSG supports a broader repertoire of manipulation tasks, especially in open, unstructured settings where static perception inevitably misses implicit or occluded affordances. The approach constitutes a step toward more universally applicable functional scene representations critical for lifelong robot learning and task transfer.

Future Work

Future prospects include removing dependence on marker-based hardware in favor of markerless, vision-based pose estimation to further facilitate in-the-wild deployment, and tighter integration of the functional prior graph into generalist robot manipulation policy learning. Such integration would allow explicit kinodynamic constraints and affordance knowledge stored in scene graphs to guide diffusion policies and task sequences in complex, dynamic human environments.

Conclusion

The ArtiSG framework systematically enhances 3D scene graphs by embedding actionable functional information derived from human demonstrations, offering a robust, interaction-augmented foundation for next-generation robot manipulation in complex real-world environments. Its design, combining scene semantics, viewpoint-robust kinodynamic capture, and open-vocabulary functional querying, enables precise, language-guided interventions and provides a scalable pathway for future advances in embodied AI and robotic autonomy.


Explain it Like I'm 14

What is this paper about?

This paper introduces ArtiSG, a way to build a smart 3D “map” of a room that doesn’t just know what objects are where, but also understands how their parts work. Think of things like doors, drawers, buttons, and handles. ArtiSG watches a person demonstrate how to use these objects and turns that into a kind of memory the robot can use later to manipulate the same objects on its own.

What questions did the researchers ask?

They focused on three simple questions:

  • How can a robot’s 3D map include not just object names and shapes, but also how their parts move (for example, sliding vs. rotating)?
  • How can we collect good movement data even if the camera or person is moving around?
  • How can we help the robot find small, hard-to-see parts (like tiny handles) that normal vision systems often miss?

How did they do it?

The approach has three main steps. You can imagine building a detailed guidebook for the robot: first you draw the map, then you add “how to use” tips, and finally you fill in any missing details.

Step 1: Build a smart 3D map

  • The system scans a room with a camera to make a 3D point cloud (a detailed set of dots representing surfaces).
  • It finds objects (like a cabinet or microwave) and also tries to spot their functional parts (like handles and knobs) using strong vision tools.
  • It picks the best camera views (“top-k frames”) where the parts are most visible, so it can understand them better and avoid confusion from bad angles.
  • It stores both geometry (shape and position) and open-vocabulary features (semantic clues from language, like “handle” or “knob”), so later you can search with natural words.
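One plausible way to realize the "top-k frames" idea is to score each camera frame by how well the element is seen and keep the best few. The scoring below (segmentation mask area times detector confidence) is an illustrative assumption; the paper's exact criterion is not reproduced here.

```python
import heapq

def select_top_k_frames(frames, k=3):
    """Keep the k frames where a functional element is most visible.

    frames: list of dicts with hypothetical keys 'mask_area' (pixels of the
    element's segmentation mask) and 'det_conf' (detector confidence).
    Score = mask_area * det_conf; larger means a clearer view.
    """
    return heapq.nlargest(k, frames, key=lambda f: f["mask_area"] * f["det_conf"])
```

The selected frames would then feed the semantic feature extraction, so a handle half-hidden in most views is still described from its clearest viewpoints.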

Step 2: Learn how parts move from human demos

  • A person uses a simple handheld gripper with a ball covered in special markers. A head-mounted camera watches this and tracks the gripper’s 3D position and orientation over time. This gives a clean “6-DoF” trajectory, which means the full pose in 3D (x, y, z position plus roll, pitch, yaw rotation).
  • As the person opens or closes something, the system records the motion, smoothing it with a filter so it’s not jittery.
  • It then figures out the type of movement:
    • Prismatic joint: slides along a straight line (like a drawer).
    • Revolute joint: rotates around an axis (like a door on hinges).
  • Using math tools that fit lines or circles, it estimates the movement axis direction and the center point. In everyday terms, it discovers the “track” the part moves on and where it rotates or slides.
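Here is a toy version of two of the ideas above: smoothing a jittery recorded path with a moving average, and a simple straightness test that tells sliding apart from rotating. The real system fits lines and circles more carefully; the 0.99 cutoff here is just an illustrative number.

```python
import numpy as np

def smooth(points, window=5):
    """Moving-average filter: remove jitter from a recorded (N, 3) path."""
    kernel = np.ones(window) / window
    return np.column_stack([np.convolve(points[:, i], kernel, mode="valid")
                            for i in range(points.shape[1])])

def guess_joint_type(points):
    """Toy heuristic: a drawer's path is nearly straight, a door's path curves.

    Compare the straight-line distance between the endpoints to the total
    distance traveled; a ratio near 1 means the part slid in a line.
    """
    steps = np.diff(points, axis=0)
    path_len = np.linalg.norm(steps, axis=1).sum()
    direct = np.linalg.norm(points[-1] - points[0])
    return "prismatic (slides)" if direct / path_len > 0.99 else "revolute (rotates)"
```

A drawer pulled 30 cm gives a ratio of about 1.0, while a door swept through a quarter circle gives about 0.9, so the heuristic labels them correctly.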

Step 3: Add missing details using interaction

  • The system matches the recorded motion to the map. If it finds the element already in the map, it attaches the movement info to that element. If it doesn’t, it creates a new “functional element” node (for example, a hidden latch it didn’t see before).
  • This turns the map into a functional memory: not just where parts are, but how to use them.
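The matching step can be sketched as a nearest-centroid lookup with a distance cutoff: attach the demonstration to the closest known element, or create a new node when nothing is nearby. The 0.15 m threshold and the naming scheme are illustrative assumptions, not values from the paper.

```python
import numpy as np

def attach_trajectory(nodes, traj, dist_thresh=0.15):
    """Attach a demo to an existing element node or instantiate a new one.

    nodes: dict mapping node name -> 3-D centroid.
    traj: recorded trajectory; traj[0] is taken as the first contact point.
    Returns the name of the node that now carries the motion info.
    """
    contact = np.asarray(traj[0])
    best, best_d = None, float("inf")
    for name, centroid in nodes.items():
        d = np.linalg.norm(contact - np.asarray(centroid))
        if d < best_d:
            best, best_d = name, d
    if best is not None and best_d <= dist_thresh:
        return best                          # existing node gets the kinematic prior
    new_name = f"element_{len(nodes)}"       # hypothetical naming scheme
    nodes[new_name] = contact                # hidden element discovered by interaction
    return new_name
```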

What did they find?

  • ArtiSG is better at finding small functional parts than other methods, especially in real-world scenes. Watching human demonstrations helps uncover handles or buttons that were too small, hidden, or hard to detect with visual tools alone.
  • It estimates how parts move more accurately than methods that only look at static images or rely on tracking texture in videos. The special gripper and camera setup stays reliable even if the person or camera moves around.
  • The resulting 3D scene graph is useful: when a robot is told in natural language to “Open the cabinet” or “Open the microwave,” ArtiSG can look up the right part and its movement trajectory and guide the robot to perform the action successfully.

Why does this matter?

Robots need more than names and locations—they need to know how things work to interact safely and effectively. With ArtiSG:

  • Robots can learn how to manipulate everyday objects by watching humans, much like how kids learn by observing.
  • The robot’s internal “map” becomes a practical how-to guide, not just a picture. This makes home, office, and kitchen tasks more reliable.
  • It handles tricky cases like unusual doors, flip-down panels, or tiny hidden elements that confuse vision-only methods.

Potential impact and future directions

This research could make robots better helpers in the real world—opening drawers, doors, and appliances, and following natural language instructions more reliably. In the future, the team plans to:

  • Make the setup even more portable by removing printed markers (markerless tracking).
  • Combine ArtiSG’s kinematic “know-how” with general robot skills, so robots can plan and execute tasks faster and with fewer mistakes.

Knowledge Gaps

Below is a consolidated list of concrete knowledge gaps, limitations, and open questions left unresolved by the paper. These items are intended to guide future research and engineering efforts.

  • Reliance on instrumented demonstrations: The approach depends on a UMI gripper with an ArUco-marked tracking sphere and a head-mounted RGB-D device with SLAM, limiting portability and scalability; it remains unclear how to replace markers and specialized hardware with robust markerless tracking while maintaining accuracy under ego-motion and occlusions.
  • Limited kinematic expressivity: Only prismatic and revolute joints are modeled; compound, coupled, or higher-DoF mechanisms (e.g., planar joints, screw joints, sliders with limits, linkages, compliant or elastic closures) are not supported or detected.
  • Missing joint limits and dynamics: Estimated models lack joint bounds, directionality (e.g., open vs. close), friction, damping, and torque/force profiles—information often needed for safe and robust execution.
  • Uncertainty quantification absent: Axis and trajectory estimation provide point estimates without uncertainty/confidence measures; there is no propagation of uncertainty into graph attributes or downstream planning.
  • Sensitivity to short or noisy demonstrations: The axis-fitting procedure (SVD + least squares) can be ill-conditioned for small motion arcs, partial trajectories, or demonstrations with slip; no analysis of minimal motion length, SNR requirements, or robustness to outliers (e.g., no RANSAC-based fitting).
  • Slippage and non-rigid contacts: The method assumes rigid coupling between gripper and functional element; no detection or correction for slippage, soft contacts, or deformable handles/panels.
  • Sparse association logic: Trajectory-to-node association uses nearest-centroid with a threshold, which may fail in dense or cluttered settings with multiple nearby elements; no probabilistic data association, geometric constraint checking, or learned association models are explored.
  • Incomplete element geometry for interaction-discovered nodes: When a functional element is missed by vision and later instantiated from demonstrations, its geometric representation may be poor or absent beyond a centroid; methods to reconstruct or refine geometry from limited views are not provided.
  • Narrow prompt set for element detection: Element discovery uses fixed prompts (e.g., “handle”, “knob”), likely missing buttons, latches, hinges, touch panels, sliders, magnetic catches, or push-to-open panels; the paper does not investigate systematic expansion to a richer, task-driven, or learned open-vocabulary for element classes.
  • Multi-view semantic consistency not enforced: While top-k frames are used, there is no explicit cross-view consistency checking or label fusion to resolve contradictory detections/segmentations from vision-language models.
  • SLAM dependence and drift: Robustness to SLAM drift, relocalization failures, rolling-shutter effects, and device variability is not measured; there is no loop-closure-informed correction or global alignment verification for the recorded manipulation trajectories.
  • Calibration assumptions untested: The method assumes fixed, pre-calibrated transforms (sphere-to-tip, camera intrinsics/extrinsics); procedures for on-the-fly calibration, drift detection, or auto-recalibration in the wild are not studied.
  • Handling occlusion and visibility loss: The ArUco pose estimation relies on marker visibility; failure modes under prolonged occlusions, fast motion blur, and adverse lighting are not quantified, nor are recovery strategies described.
  • Dataset scale and diversity: Real-world evaluation covers 79 articulated objects and 139 elements in a few scenes; there is no analysis across diverse object categories, materials (glossy/transparent), lighting, clutter levels, and cultural/industrial environments.
  • Baseline coverage and fairness: Comparisons omit interaction-centric baselines (e.g., Ditto-style before/after methods with mobile cameras) under comparable conditions; joint-type misclassification rates and ablations on the model selection penalty are not reported.
  • Limited downstream evaluation: Robot experiments are qualitative on a handful of objects without statistics on success rate, time-to-completion, safety incidents, or robustness to pose/placement variations, lighting shifts, or repeated trials.
  • Transfer across embodiments: The paper does not analyze how stored human-demonstrated trajectories transfer to robots with different kinematics, end-effectors, reachability constraints, or compliance properties; re-targeting strategies and feasibility checks are missing.
  • World-to-robot frame alignment: Application details for registering the scene graph/world frame to the robot base frame are not explained; calibration pipelines and their error impact remain unreported.
  • Real-time performance and resource use: There is no profiling of computation time, latency, and memory for top-k selection, open-vocabulary feature extraction, multi-view fusion, and tracking; real-time viability on embedded platforms is unknown.
  • Lifelong updates and change handling: The graph does not model or detect changes in articulation state, environment rearrangements, or element wear/failure over time; policies for updating, versioning, or forgetting stale functional information are not addressed.
  • Sequential affordances and dependencies: Multi-step interactions (e.g., unlatch before opening), inter-element constraints, or dependency graphs (e.g., which element enables another) are not modeled.
  • Symmetry and duplication: The system does not reason about symmetric or repeated elements (e.g., cabinet pairs), which could aid detection, association, and trajectory reuse.
  • Safety and compliance: The paper does not consider safety constraints, force control, compliance, or sensing for delicate operations; replaying trajectories without force feedback may risk damage.
  • Language grounding depth: Open-vocabulary retrieval is evaluated via R@k, but complex language referring to function, relational context, or instruction parsing (e.g., “press the release latch above the left hinge”) is not tested; multilingual robustness is unexplored.
  • Multi-user and inter-demonstration variability: The effect of different users’ demonstrations (speed, path style, noise) and methods for aggregating multiple demos into a canonical, uncertainty-aware model are not studied.
  • Active data collection strategies: No method is proposed to prioritize which elements to demonstrate, how to cover a space efficiently, or how to autonomously request demonstrations when uncertainty is high.
  • Ethical/privacy considerations: Head-mounted recording in human environments raises privacy concerns; policies for anonymization, on-device processing, or consent management are not discussed.
  • Reproducibility and release: There is no explicit commitment to release code, datasets, calibration files, or standardized evaluation protocols for the community to reproduce and extend results.

Practical Applications

Immediate Applications

The following applications can be deployed with the current ArtiSG framework and its portable hardware setup (head-mounted RGB-D camera with SLAM and UMI gripper + ArUco sphere), plus off-the-shelf VFM tools (Grounding DINO, SAM, SigLIP2) and standard robotics stacks (e.g., ROS/MoveIt).

  • Commissioning mobile manipulators by “teach-and-repeat”
    • Sector: robotics (service, household, hospitality), manufacturing (workcell setup), education/research labs
    • Application: Rapidly onboard robots in a new environment by scanning, demonstrating openings (doors, drawers, appliances), auto-fitting articulation axes, and storing trajectories to the functional scene graph. Language commands (“Open the microwave”) retrieve the right element and 6-DoF path.
    • Tools/Workflow: ArtiSG Capture Kit (head-mounted RGB-D + UMI gripper), Functional Scene Graph Server with open-vocabulary query, ROS2 node that publishes kinematic priors to controllers, top-k view selection for semantic robustness.
    • Assumptions/Dependencies: Human demonstrations available; robot has sufficient reach and compliance; prismatic/revolute motions cover the object; marker visibility for tracking; environment SLAM quality.
  • Facility functional twins for operations and maintenance
    • Sector: facilities management, construction/BIM, real estate
    • Application: Build a “functional memory” of buildings—doors, cabinets, panels—annotated with handle locations, articulation types/axes, and safe manipulation trajectories. Use for onboarding staff, maintenance scheduling, and asset documentation.
    • Tools/Workflow: Facility Functional Twin database, multi-room scanning + demonstration pass, open-vocabulary retrieval (“find flip-down panels”), integration with BIM viewers and CMMS.
    • Assumptions/Dependencies: Indoor geometry mapped with RGB-D; prompts cover typical functional elements (“handle”, “knob”); privacy/compliance for recording; compute for VFM inference.
  • Assistive and domestic robots: learn-by-demonstration for daily tasks
    • Sector: healthcare (assistive living), smart home
    • Application: Caregivers or residents demonstrate opening appliances, cabinets, medication drawers; robots reuse stored trajectories to perform tasks reliably despite visual ambiguity.
    • Tools/Workflow: Teach-at-home app to record demonstrations; on-device functional graph query; safe motion executor with grasp/force limits.
    • Assumptions/Dependencies: Clear handle access during demo; robot dexterity/gripper compatibility; safe demonstration protocols; marker tracking continuity.
  • AR guidance for technicians and users
    • Sector: education, facilities, industrial training
    • Application: Overlay element locations and articulation directions (e.g., “pull along axis,” “rotate around here”) to reduce training time and errors when operating unfamiliar devices.
    • Tools/Workflow: AR headset/app reading the functional scene graph; open-vocabulary search to highlight elements (e.g., “show all flip-down compartments”).
    • Assumptions/Dependencies: Accurate registration between AR view and mapped environment; dependable SLAM; multi-view feature aggregation to avoid occlusion-induced errors.
  • Data curation and QA for vision foundation models on functional elements
    • Sector: software/AI, academia
    • Application: Use interaction-augmented detection to find elements missed by VFM (handles/buttons), improving recall and building labeled 3D datasets for training/evaluation.
    • Tools/Workflow: Interaction-Augmented Labeler that fuses demonstrations with SAM/Grounding DINO detections; metrics for recall/precision and R@k retrieval; batch export to dataset formats.
    • Assumptions/Dependencies: Sufficient demo coverage; consistent world-frame registration; policy for data privacy and annotation quality.
  • Robotic manipulation research: kinematic priors for planning and policy learning
    • Sector: academia, robotics R&D
    • Application: Plug ArtiSG kinematic priors into planners and imitation/RL policies to reduce search and failure in articulated object manipulation; benchmark experiments on prismatic/revolute joints under ego-motion.
    • Tools/Workflow: Kinematic Prior Plugin for planners (MoveIt, TAMP), dataset and evaluation suite (trajectory RMSE, axis angle/position errors), integration with language-conditioned policies.
    • Assumptions/Dependencies: Access to the capture hardware; reproducible calibration; availability of challenging object sets (textureless/occluded).
  • Rapid changeover in flexible manufacturing fixtures
    • Sector: manufacturing
    • Application: Demonstrate how to operate new fixtures, clamps, and access panels; store articulated motions for robots to perform repeatable setup/tear-down without bespoke programming.
    • Tools/Workflow: Workcell scan + demonstration; scene graph-backed macro scripts for setup operations; safety checks with axis constraints.
    • Assumptions/Dependencies: Fixtures’ motions fit prismatic/revolute models; industrial safety compliance; minimal occlusions during capture.
  • Emergency and field robotics “just-in-time” operation mapping
    • Sector: public safety, utilities, maintenance
    • Application: Rapidly record critical operations (e.g., opening access cabinets, switching panels) via human demonstration; teleoperated/field robots replay trajectories reliably in visually degraded conditions.
    • Tools/Workflow: Portable capture kit, lightweight graph server, offline replay with path verification; language prompts to query critical elements.
    • Assumptions/Dependencies: Short capture windows; clear marker visibility; safe force application; compliance with site access/security.

Long-Term Applications

These applications require further research, scaling, standardization, or hardware advancements (e.g., markerless tracking, broader articulation types, large-scale deployments).

  • Markerless manipulation tracking and ultra-portable capture
    • Sector: robotics, software/AI
    • Application: Replace the ArUco sphere with markerless hands/gripper tracking and robust 3D keypoint recovery under occlusion for in-the-wild collection.
    • Tools/Workflow: Foundation models for 3D hand/gripper pose; temporal fusion with SLAM; uncertainty-aware filters; Markerless ArtiSG pipeline.
    • Assumptions/Dependencies: Reliable markerless tracking on textureless surfaces; occlusion resilience; high-fidelity calibration; compute on edge devices.
  • City- or campus-scale functional maps and interoperability standards
    • Sector: policy, smart cities, building automation
    • Application: Maintain standardized functional scene graphs across buildings for emergency response, accessibility audits, and automation; define APIs so robots/building systems can consume a common “functional twin.”
    • Tools/Workflow: Functional Graph Standard (schema, kinematic attributes, vocabularies), integration with BIM/IFC and automation standards (e.g., OPC UA), governance and privacy frameworks.
    • Assumptions/Dependencies: Stakeholder buy-in; data-sharing policies; versioning and change tracking; cybersecurity; multilingual open-vocabulary support.
  • Manufacturer-provided “function cards” for appliances
    • Sector: consumer electronics, industrial equipment
    • Application: Ship appliances with machine-readable articulation graphs (element locations, axes, recommended trajectories), enabling plug-and-play robotic operation without on-site demos.
    • Tools/Workflow: Function Card metadata embedded in QR/NFC or cloud; robot reads and adapts trajectory to local geometry; alignment procedures.
    • Assumptions/Dependencies: Industry adoption and standardization; calibration pipelines for geometric alignment; legal/safety liability clarity.
  • Autonomous discovery of functional elements via active exploration
    • Sector: robotics
    • Application: Robots plan exploratory interactions to discover handles/buttons and infer articulation axes without human demos, using ArtiSG priors to guide safe probing and learning.
    • Tools/Workflow: Active perception policies with uncertainty-driven actions; closed-loop tracking and axis fitting; risk-aware contact strategies.
    • Assumptions/Dependencies: Reliable contact sensing; safe exploration constraints; robust recovery from failed probes; more general articulation models.
  • Integrated task-and-motion planning with functional priors
    • Sector: robotics, software/AI
    • Application: Use ArtiSG’s kinematic memory within TAMP/MBRL pipelines to plan multi-step tasks (e.g., open door, retrieve item, close) with reduced search complexity and failure rates.
    • Tools/Workflow: Functional-TAMP integration layer; constraint solvers consuming axis/trajectory attributes; probabilistic reasoning over ambiguous elements.
    • Assumptions/Dependencies: Accurate priors under distribution shift; richer element semantics (stiffness, limits, required forces); multi-object coordination.
  • Compliance auditing and accessibility certification
    • Sector: policy, facilities, healthcare
    • Application: Automate verification of egress routes, opening directions, reachability, and assistive access (e.g., door forces, handle heights) using functional maps and measured trajectories.
    • Tools/Workflow: Accessibility Auditor that compares stored kinematics against local codes; standardized reporting; remediation guidance.
    • Assumptions/Dependencies: Codified thresholds for forces/angles; sensors for force/torque; regulatory acceptance; periodic re-validation workflows.
  • Industrial process safety and energy infrastructure operation
    • Sector: energy, utilities, process industries
    • Application: Map valves, breaker panels, and access mechanisms with articulation info for robots to safely operate critical controls during routine maintenance or emergencies.
    • Tools/Workflow: Hazard-aware capture protocols; high-precision kinematic fitting including more joint types; integration with SCADA/DCS systems for authorization and logging.
    • Assumptions/Dependencies: Joint models beyond prismatic/revolute (compound/continuous); strict safety certification; robust operation under harsh conditions.
  • Education and workforce development modules
    • Sector: education, training
    • Application: Curricula where students build functional scene graphs, study articulation estimation, and deploy robots that act from language queries and stored demonstrations.
    • Tools/Workflow: Classroom kits (RGB-D headset, UMI gripper), open-source datasets and benchmarks, scaffolded labs on kinematics and VFM integration.
    • Assumptions/Dependencies: Affordable hardware; institutional support; curated object sets; reproducible evaluation.
  • Retail, logistics, and micro-fulfillment automation
    • Sector: retail, logistics
    • Application: Robots learn from demonstrations to open storage units, micro-fulfillment cabinets, and packaging fixtures, reducing manual intervention and accelerating restocking.
    • Tools/Workflow: Fulfillment Functional Graph per site; batch demonstrations for high-frequency elements; scheduling and collision avoidance with human workers.
    • Assumptions/Dependencies: Reliable access paths; high-throughput coordination; resilience to layout changes; worker safety protocols.
  • Enhanced digital twin simulation for robot validation
    • Sector: simulation/virtualization, software/AI
    • Application: Simulators consume ArtiSG graphs to emulate realistic articulated interactions, enabling policy validation and what-if analyses before deployment.
    • Tools/Workflow: Import/export adapters for Isaac/Unity; stochastic models over axes and tolerances; automated scenario generation from functional graphs.
    • Assumptions/Dependencies: Fidelity of articulation models; alignment between sim and real kinematics; dataset breadth for generalization.
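Several scenarios above assume that downstream planners and simulators can consume stored articulation attributes (joint type, axis position, axis direction). As a minimal illustrative sketch of what such a functional-graph record could look like, with all names hypothetical and not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class ArticulationInfo:
    joint_type: str      # "revolute" or "prismatic"
    axis_origin: tuple   # a point p_c on the articulation axis
    axis_direction: tuple  # unit direction p_d of the axis

@dataclass
class FunctionalElement:
    label: str           # e.g. "drawer handle" (open-vocabulary in practice)
    articulation: ArticulationInfo

# A planner could filter elements by joint type before motion planning.
elements = [
    FunctionalElement("drawer handle",
                      ArticulationInfo("prismatic", (0.5, 0.2, 0.7), (1.0, 0.0, 0.0))),
    FunctionalElement("door handle",
                      ArticulationInfo("revolute", (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))),
]
prismatic = [e.label for e in elements if e.articulation.joint_type == "prismatic"]
print(prismatic)  # ['drawer handle']
```

In a real integration these records would carry additional attributes (motion limits, required forces) and semantic embeddings for open-vocabulary retrieval, as the Assumptions/Dependencies items above note.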

Glossary

  • 3D scene graph: A graph-based representation of a 3D environment where nodes (objects/elements) and edges (relationships) encode geometry and semantics for reasoning and planning. Example: "3D scene graphs have empowered robots with semantic understanding for navigation and planning"
  • 6-DoF: Six degrees of freedom describing a rigid body's pose in 3D (3 for position, 3 for orientation). Example: "estimate 6-DoF articulation trajectories and axes even under camera ego-motion."
  • Adaptive Kalman filter: A recursive state estimator that adaptively tunes its noise parameters to smooth and denoise measurements (e.g., poses) over time. Example: "using an adaptive Kalman filter"
  • ArUco marker: A square fiducial marker with a known ID used for robust camera-based pose estimation. Example: "we detect visible ArUco markers on the sphere."
  • Articulated object: An object composed of linked parts that move relative to each other via joints (e.g., doors, drawers). Example: "particularly regarding articulated objects."
  • Articulation axis: The line (position and direction) that defines how a part moves under a joint (e.g., hinge axis or slide direction). Example: "an articulation axis $\mathbf{A}_j = \{\mathbf{p}_c, \mathbf{p}_d\}$"
  • Back-projection: The process of lifting 2D image pixels (e.g., masks) into 3D coordinates using camera geometry and depth. Example: "These 2D part masks are back-projected into 3D space"
  • DBSCAN: Density-Based Spatial Clustering of Applications with Noise; a clustering algorithm that groups dense regions and filters outliers. Example: "utilize DBSCAN clustering to remove outliers"
  • Ego-motion: Motion of the camera (observer) itself, which complicates tracking and estimation. Example: "even under camera ego-motion."
  • Extrinsic parameters: Camera parameters that describe the pose (position and orientation) of the camera in the world or another frame. Example: "using the camera's extrinsic and intrinsic parameters."
  • Intrinsic parameters: Camera parameters that describe the internal imaging geometry (e.g., focal length, principal point). Example: "using the camera's extrinsic and intrinsic parameters."
  • Kinematic mechanism: The motion model of an articulated part, specifying how it can move under joint constraints. Example: "that defines its kinematic mechanism"
  • Kinematic priors: Prior knowledge or constraints about allowable motions (e.g., joint types/axes) used to guide perception or planning. Example: "We integrate these kinematic priors into a hierarchical and open-vocabulary graph"
  • Non-linear least squares: An optimization method that minimizes squared residuals in problems where the relation between parameters and observations is non-linear. Example: "via non-linear least squares optimization"
  • Open-vocabulary: A modeling approach that can recognize or retrieve concepts beyond a fixed label set using learned semantic embeddings. Example: "compute open-vocabulary features"
  • Perspective-n-Point (PnP): A pose estimation method that recovers camera-to-object transformation from 2D–3D point correspondences. Example: "via a Perspective-n-Point (PnP) solver"
  • Point cloud: A set of 3D points representing scene geometry, often derived from RGB-D sensing or reconstruction. Example: "generating the RGB point cloud of the scene."
  • Principal Component Analysis (PCA): A technique that finds orthogonal directions of maximum variance; used here to infer motion axes/planes. Example: "based on Principal Component Analysis (PCA) and non-linear optimization."
  • Prismatic joint: A joint allowing linear sliding motion along a single axis. Example: "For the prismatic joint where motion follows a 3D line"
  • R@k (Recall@k): A retrieval metric indicating whether the correct item appears within the top-k ranked results. Example: "We also utilize the query success rate R@k as defined in OpenFunGraph"
  • Reprojection error: The pixel-space error between observed 2D points and reprojected 3D points given an estimated pose; used to assess PnP quality. Example: "based on the PnP reprojection error."
  • Revolute joint: A joint allowing rotation around a fixed axis (e.g., a hinge). Example: "For the revolute joint where motion follows a circular arc"
  • Root Mean Squared Error (RMSE): A measure of average magnitude of error, computed as the square root of mean squared differences. Example: "We calculate the trajectory RMSE TerrT_\mathrm{err}"
  • Simultaneous Localization and Mapping (SLAM): Algorithms that estimate a sensor’s pose while building a map of the environment. Example: "the camera utilizes built-in SLAM"
  • Singular Value Decomposition (SVD): A matrix factorization used to find principal directions; here, to estimate motion axes and planes. Example: "we apply Singular Value Decomposition (SVD) to the centered points"
  • Top-k frames: The subset of frames with the highest utility (e.g., visibility) chosen for processing or aggregation. Example: "we select the top-k frames"
  • Vision foundation models: Large, general-purpose vision models used for tasks like detection or segmentation without task-specific training. Example: "turn to vision foundation models for functional element segmentation"
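Several of the entries above (back-projection, SVD, prismatic joint) combine into a single pipeline: part pixels are lifted into 3D with the camera intrinsics, and an axis is fit to the resulting trajectory. A minimal sketch of that combination, assuming an idealized pinhole camera and noise-free depth (not the paper's full implementation, which also handles ego-motion, filtering, and revolute joints):

```python
import numpy as np

def back_project(pixels, depths, K):
    """Lift 2D pixels with metric depth into 3D camera coordinates.

    pixels: (N, 2) array of (u, v) coordinates
    depths: (N,) array of metric depths
    K: 3x3 camera intrinsic matrix
    """
    uv1 = np.concatenate([pixels, np.ones((len(pixels), 1))], axis=1)
    rays = (np.linalg.inv(K) @ uv1.T).T  # normalized camera rays
    return rays * depths[:, None]        # scale each ray by its depth

def prismatic_axis(points):
    """Fit the sliding direction of a prismatic joint via SVD.

    The first right singular vector of the centered trajectory points
    is the direction of maximum variance, i.e. the motion axis.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    return centroid, vt[0] / np.linalg.norm(vt[0])

# Example: a drawer handle sliding along the camera x-axis at 1 m depth
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[320.0 + 10 * i, 240.0] for i in range(20)])
depths = np.full(20, 1.0)
traj = back_project(pixels, depths, K)
_, direction = prismatic_axis(traj)
print(abs(direction[0]))  # dominant x component of the recovered axis
```

A revolute joint would instead require fitting a circular arc (e.g., PCA to find the motion plane, then non-linear least squares for the circle center), which is the route the glossary's PCA and non-linear least squares entries describe.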

Open Problems

We found no open problems mentioned in this paper.
