- The paper introduces Graphical Contrastive Losses that specifically target errors like entity instance confusion and proximal relationship ambiguity.
- It details three loss functions—class agnostic, entity class aware, and predicate class aware—designed to disambiguate entities and relationships.
- Integration with the RelDN model yields a 4.7% absolute improvement (16.5% relative) on the OpenImages Relationship Detection Challenge, showcasing enhanced scene graph parsing performance.
Graphical Contrastive Losses for Scene Graph Parsing
The paper "Graphical Contrastive Losses for Scene Graph Parsing" presents a new approach to improve the accuracy of scene graph parsers by addressing two prevalent types of errors: Entity Instance Confusion and Proximal Relationship Ambiguity. It introduces novel contrastive loss functions specifically designed to tackle these issues, thereby enhancing the performance of scene graph models.
Scene graph parsing aims to extract a structured representation from an image, comprising entities and their relationships expressed as ⟨subject, predicate, object⟩ triplets. Traditional models, typically employing a two-stage pipeline, first detect entities and then attempt to classify relationships between these entities using a softmax distribution over predicates. However, this approach can suffer from ambiguities when an image contains multiple similar entities or multiple relationships with overlapping features. The paper identifies that these ambiguities lead to two primary errors: Entity Instance Confusion, where the model fails to distinguish between similar entities, and Proximal Relationship Ambiguity, where it struggles to pair entities correctly due to their close spatial proximity.
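To make the second stage concrete, here is a minimal sketch of predicate classification for one detected entity pair. The predicate vocabulary and logit values are hypothetical, purely for illustration; in a real two-stage parser the logits would come from a learned relationship head applied to the detected entities' features.

```python
import numpy as np

def softmax(x):
    """Normalize a vector of logits into a probability distribution."""
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical predicate vocabulary, including a "no relation" class.
PREDICATES = ["on", "holding", "near", "no_relation"]

# Illustrative logits for one (subject, object) pair.
logits = np.array([2.1, 0.3, 1.8, 0.5])
probs = softmax(logits)
best_predicate = PREDICATES[int(np.argmax(probs))]  # -> "on"
```

Note how "on" and "near" receive similar logits here: when multiple predicates (or multiple nearby entity pairs) produce overlapping features, a plain softmax gives no mechanism to push the correct choice decisively above its confusers, which is the gap the contrastive losses target.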
The authors propose a set of contrastive loss functions—termed Graphical Contrastive Losses—that specifically address these errors by incorporating margin constraints into the training process. These losses are divided into three categories:
- Class Agnostic Loss: Designed to generally separate related and unrelated entity pairs without considering specific classes.
- Entity Class Aware Loss: Focuses on disambiguating instances of the same entity class to resolve Entity Instance Confusion.
- Predicate Class Aware Loss: Targets the separation of proximate relationships involving the same predicate class to alleviate Proximal Relationship Ambiguity.
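The common mechanism behind all three losses is a hinge-style margin constraint: affinity scores for genuinely related pairs must exceed those of confusable unrelated pairs by at least a margin. The sketch below shows the class-agnostic flavor only, assuming scalar relatedness scores (e.g. 1 minus the predicted "no relation" probability); the function name, margin value, and averaging scheme are simplifications, not the authors' exact formulation.

```python
import numpy as np

def class_agnostic_contrastive(pos_affinity, neg_affinity, margin=0.2):
    """Hinge-style margin loss: the lowest-scoring related (positive)
    pair should beat the highest-scoring unrelated (negative) pair by
    at least `margin`. Sketch only; the paper aggregates such terms
    over subjects and objects in each image."""
    gap = np.min(pos_affinity) - np.max(neg_affinity)
    return max(0.0, margin - gap)

# Illustrative affinities in [0, 1]:
pos = np.array([0.9, 0.7])   # ground-truth related pairs
neg = np.array([0.4, 0.3])   # unrelated pairs
loss = class_agnostic_contrastive(pos, neg)  # gap 0.3 >= margin -> 0.0
```

The entity-class-aware and predicate-class-aware variants apply the same margin idea, but restrict the positive/negative sets to pairs sharing an entity class or a predicate class, which is what lets them target Entity Instance Confusion and Proximal Relationship Ambiguity specifically.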
To demonstrate the effectiveness of these proposed losses, the authors develop a relationship detection network named RelDN (Relationship Detection Network). This model incorporates a separate convolutional neural network branch for predicate recognition, allowing it to focus on areas indicating interactions rather than individual entities. The integration of Graphical Contrastive Losses with this architecture shows substantial improvements in benchmark performance over existing methods.
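During training, the contrastive terms are simply added to the standard cross-entropy objective. A minimal sketch of such a combined objective follows; the weights are purely illustrative placeholders, not the paper's actual hyperparameters.

```python
def total_loss(ce_loss, l_class_agnostic, l_entity_aware, l_pred_aware,
               w1=1.0, w2=1.0, w3=1.0):
    """Combine the base cross-entropy loss with the three Graphical
    Contrastive Losses. Weights w1-w3 are hypothetical and would be
    tuned per dataset in practice."""
    return ce_loss + w1 * l_class_agnostic + w2 * l_entity_aware + w3 * l_pred_aware

# Illustrative scalar loss values from one training step:
combined = total_loss(1.0, 0.1, 0.2, 0.3)
```

Because the contrastive terms only add margin constraints on already-computed affinity scores, they can be bolted onto an existing two-stage pipeline like RelDN without changing its inference path.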
The proposed solution outperforms previous state-of-the-art methods in various metrics. On the OpenImages Relationship Detection Challenge test set, the RelDN model surpasses the leading method by 4.7% (a 16.5% relative improvement). Similarly, it achieves enhanced results on the Visual Genome and Visual Relationship Detection datasets, thereby confirming the viability of the proposed losses in improving scene graph parsing.
The implications of this research are significant. By directly addressing the inherent limitations of existing scene graph parsers, this work opens avenues for more robust image analysis applications, potentially benefiting tasks such as visual question answering, image captioning, and autonomous navigation systems. Future developments could involve exploring additional error types and refining the loss functions to further optimize the disambiguation process in scene graph parsing.
In conclusion, the study demonstrates that specialized graphical contrastive losses can effectively reduce common errors in scene graph parsing. This advancement substantially improves visual relationship detection and may in turn benefit a wide range of visual understanding applications in artificial intelligence.