- The paper introduces Scene Graph Memory (SGM) and Node Edge Predictor (NEP) to handle partial observability in dynamic object search tasks.
- The methodology employs transformer-based edge classification with GCN and HEAT operators for effective temporal link prediction.
- Results demonstrate that the NEP HEAT variant significantly outperforms baseline models in accuracy and adaptability.
Modeling Dynamic Environments with Scene Graph Memory
Introduction
The paper addresses object search by embodied AI agents in large, dynamic environments such as households. The authors frame the problem as temporal link prediction on dynamic, partially observable graphs and introduce two components: the Scene Graph Memory (SGM), which aggregates observations, and the Node Edge Predictor (NEP), which predicts object locations from that memory. The methods are evaluated in the Dynamic House Simulator, where they show better adaptability and predictive accuracy than existing methods.
Figure 1: Our problem setup and proposed method: an agent searches for objects in a dynamic household environment, using a Scene Graph Memory aggregated from partial observations (SGM) and a Node Edge Predictor model (NEP) to predict object locations.
Temporal link prediction traditionally involves estimating future edges in a dynamic graph based on complete past observations. This research introduces temporal link prediction under partial observability, a novel variant where the observed data is incomplete and evolving. The task involves predicting relationships between object nodes in dynamically changing scene graphs, a situation that is prevalent in embodied AI applications like object search and navigation.
Methodology
Scene Graph Memory (SGM)
SGM is a dynamic data structure that aggregates an agent's observations over time into a single scene graph. Nodes in this graph represent entities such as objects and rooms, and edges represent the relationships between them, covering both observed and hypothetical connections. The SGM thereby captures the temporal dynamics and semantic context needed for accurate link prediction.
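To make the idea of aggregating partial observations concrete, the following is a minimal sketch of such a memory. It is not the authors' implementation: the class name, the `last_seen` and `times_seen` fields, and the one-location-per-step observation format are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


def _new_edge():
    # Hypothetical per-edge record; the field names are illustrative only.
    return {"last_seen": None, "times_seen": 0}


@dataclass
class SceneGraphMemory:
    """Illustrative memory of (object, location) edges built from partial views."""
    edges: dict = field(default_factory=dict)
    step: int = 0

    def add_hypothetical(self, obj, loc):
        """Register a candidate edge that has not been observed yet."""
        self.edges.setdefault((obj, loc), _new_edge())

    def update(self, visited_loc, objects_seen):
        """Fold in one partial observation: the contents of a single location."""
        self.step += 1
        for obj in objects_seen:
            edge = self.edges.setdefault((obj, visited_loc), _new_edge())
            edge["last_seen"] = self.step
            edge["times_seen"] += 1

    def candidates(self, obj):
        """Return every known location edge for an object, observed or not."""
        return {loc: info for (o, loc), info in self.edges.items() if o == obj}


memory = SceneGraphMemory()
memory.add_hypothetical("mug", "kitchen_counter")
memory.add_hypothetical("mug", "desk")
memory.update("desk", ["mug", "laptop"])   # the agent looks at the desk
memory.update("kitchen_counter", [])       # the counter turns out to be empty
mug_edges = memory.candidates("mug")
```

Here the edge to `desk` carries a timestamp and a count, while the edge to `kitchen_counter` remains hypothetical. A predictor can use exactly this kind of difference as an input feature.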
Node Edge Predictor (NEP)
NEP is a neural architecture for inference on dynamic, partially observable scene graphs. It consists of node and edge embedding modules, a feature fusion step, and a transformer-based edge classifier. The architecture comes in two variants, built on GCN and HEAT graph operators respectively, and outputs the likelihood of each unobserved connection.
Figure 2: Node Edge Predictor (NEP) model architecture illustrating GCN and HEAT variants.
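The embed, fuse, and classify pipeline described above can be sketched in a few lines. This is a deliberately simplified stand-in, not the NEP itself: mean-neighbor aggregation replaces the GCN and HEAT operators, a fixed linear scorer followed by a softmax replaces the transformer-based classifier, and all feature values and weights are made-up toy numbers.

```python
import math


def mean_aggregate(features, neighbors):
    """One simplified message-passing step: average each node with its neighbors."""
    updated = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in neighbors.get(node, [])]
        updated[node] = [sum(vals) / len(group) for vals in zip(*group)]
    return updated


def score_edges(obj, locations, node_feats, edge_feats, weights):
    """Fuse node and edge features per candidate edge, then softmax over candidates."""
    logits = []
    for loc in locations:
        # Fusion by concatenation: object features, location features, edge features.
        fused = node_feats[obj] + node_feats[loc] + edge_feats[(obj, loc)]
        logits.append(sum(w * x for w, x in zip(weights, fused)))
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return {loc: e / total for loc, e in zip(locations, exps)}


# Toy inputs (assumed): 2-d node features and a 1-d "seen recently" edge feature.
features = {"mug": [1.0, 0.0], "desk": [0.0, 1.0], "counter": [0.5, 0.5]}
neighbors = {"mug": ["desk"], "desk": ["mug"], "counter": []}
edge_feats = {("mug", "desk"): [1.0], ("mug", "counter"): [0.0]}
weights = [0.1, 0.1, 0.1, 0.1, 2.0]  # hand-picked, not learned

node_feats = mean_aggregate(features, neighbors)
probs = score_edges("mug", ["desk", "counter"], node_feats, edge_feats, weights)
```

In a trained model the weights would be learned from simulated experience, and the scorer would attend over all candidate edges jointly rather than scoring each one independently.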
Experimental Setup
The study employs the Dynamic House Simulator to benchmark the methods' performance across various tasks, simulating diverse household environments with dynamic object locations. Key tasks evaluated include Predict Object Location, Predict Relative Location Likelihood, and Find Object, each designed to assess the model's ability to adapt and predict in evolving environments.
Figure 3: The household object placement probability priors highlighting room-object-furniture relationships with varying likelihoods.
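As an illustration of how a task such as Find Object could be scored, the sketch below visits candidate locations from most to least likely and counts the attempts needed. The function name, the scoring dictionary, and the attempts-based metric are assumptions for illustration; the summary does not specify the simulator's actual interface or metric.

```python
def find_object(scores, true_location):
    """Visit candidate locations from most to least likely; return attempts used."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    for attempts, loc in enumerate(ranked, start=1):
        if loc == true_location:
            return attempts
    return None  # the object was not at any candidate location


# Hypothetical predicted likelihoods for a single query object.
scores = {"desk": 0.6, "counter": 0.3, "shelf": 0.1}
attempts_needed = find_object(scores, "counter")
```

Under this reading, a better predictor ranks the true location earlier and therefore needs fewer attempts.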
Results
The NEP, particularly its HEAT variant, significantly outperforms the Random, Prior-based, and Bayesian baselines. Aggregating observations in the SGM lets the model adapt as it gathers more experience, which improves prediction accuracy and reduces decision latency. Leveraging both semantic and temporal features yields improved performance across all tasks.
Figure 4: The average accuracy and variance for the Predict Object Location task, demonstrating the NEP's learning efficiency over time.
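For context on what the baselines might look like, here is a rough sketch of the three baseline families named above. The summary does not describe their exact definitions, so these are assumed forms: a uniform random guess, an argmax over fixed placement priors, and a Dirichlet posterior mean that blends the prior with observed counts. The `prior_strength` parameter and all numbers are hypothetical.

```python
import random


def random_baseline(locations, rng):
    """Guess a location uniformly at random."""
    return rng.choice(locations)


def prior_baseline(prior):
    """Always pick the location with the highest fixed prior probability."""
    return max(prior, key=prior.get)


def bayesian_baseline(prior, counts, prior_strength=2.0):
    """Posterior mean of a Dirichlet prior updated with observed location counts."""
    total = sum(counts.values())
    posterior = {
        loc: (prior_strength * p + counts.get(loc, 0)) / (prior_strength + total)
        for loc, p in prior.items()
    }
    return max(posterior, key=posterior.get), posterior


# Hypothetical prior for a mug, plus counts of where the agent has actually seen it.
prior = {"desk": 0.2, "counter": 0.7, "shelf": 0.1}
counts = {"desk": 5, "counter": 1}

guess_random = random_baseline(list(prior), random.Random(0))
guess_prior = prior_baseline(prior)
guess_bayes, posterior = bayesian_baseline(prior, counts)
```

In this toy case the prior-based guess stays on `counter`, while the count-based update shifts to `desk` after repeated sightings there. This is the kind of adaptation over time that the accuracy curves measure.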
Conclusion
The study formulates object search in dynamic environments as temporal link prediction under partial observability and addresses it by combining the SGM and NEP models. The proposed solution shows potential for real-world object search, delivering better adaptability and predictive accuracy than the baselines. Future work could explore integration with reinforcement learning frameworks and application to more complex, realistic environments, improving the versatility and robustness of AI agents in dynamic settings.