
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph

Published 30 Jun 2020 in cs.CV and cs.CL | (arXiv:2006.16934v3)

Abstract: We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured knowledge obtained from scene graphs to learn joint representations of vision-language. ERNIE-ViL tries to build the detailed semantic connections (objects, attributes of objects and relationships between objects) across vision and language, which are essential to vision-language cross-modal tasks. Utilizing scene graphs of visual scenes, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction and Relationship Prediction tasks in the pre-training phase. Specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can learn the joint representations characterizing the alignments of the detailed semantics across vision and language. After pre-training on large scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performances on all these tasks and ranks the first place on the VCR leaderboard with an absolute improvement of 3.7%.

Citations (358)

Summary

  • The paper introduces a novel scene graph-based pre-training framework that enriches vision-language representations by capturing fine-grained semantic details.
  • It employs prediction tasks for objects, attributes, and relationships, achieving improvements such as a 3.7% gain on Visual Commonsense Reasoning.
  • The results underscore the value of integrating structured knowledge to advance future models for complex cross-modal tasks.

An Overview of ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

The paper "ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs" introduces a novel approach in the domain of vision-language pre-training. The proposed model, ERNIE-ViL, leverages structured knowledge from scene graphs to enhance joint representations for cross-modal tasks. This contrasts with existing methods that predominantly rely on sub-word masking without sufficient emphasis on detailed semantic alignments across vision and language.

Core Contributions

  1. Incorporation of Structured Knowledge: ERNIE-ViL distinguishes itself by integrating structured knowledge derived from scene graphs, particularly focusing on objects, attributes of objects, and relationships between objects. This aids in accurately capturing the fine-grained semantic details necessary for a nuanced understanding of visual scenes.
  2. Scene Graph Prediction Tasks: The model constructs Scene Graph Prediction tasks during pre-training, which include Object Prediction, Attribute Prediction, and Relationship Prediction tasks. These tasks compel the model to enrich the vision-language representations by learning the semantic alignments at a granular level.
  3. Performance Across Tasks: When evaluated on five cross-modal downstream tasks, ERNIE-ViL delivers state-of-the-art results, notably achieving the top position on the Visual Commonsense Reasoning (VCR) leaderboard with a significant 3.7% improvement.
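The masking idea behind these three prediction tasks can be illustrated with a minimal sketch. The function name, the scene-graph representation (plain node lists), and the deterministic mask-everything policy below are all illustrative assumptions; in ERNIE-ViL itself the scene graph is parsed from the caption with an off-the-shelf parser, node tokens are masked probabilistically, and the model predicts the masked sub-words from the vocabulary conditioned on the image regions and the remaining text.

```python
# Hypothetical sketch of Scene Graph Prediction masking (not the authors' code).
# The scene graph parsed from the caption is given as three node lists:
# objects, (object, attribute) pairs, and (subject, relation, object) triples.

def build_sgp_targets(tokens, objects, attributes, relations, mask_token="[MASK]"):
    """Return (masked_tokens, targets), where targets maps each masked
    position back to the original scene-graph node token."""
    # Collect every word that realizes a scene-graph node in the caption.
    node_words = set(objects)
    node_words.update(attr for _, attr in attributes)
    node_words.update(rel for _, rel, _ in relations)

    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if tok in node_words:
            masked[i] = mask_token   # model must recover this node token
            targets[i] = tok
    return masked, targets

# Toy caption and its (assumed) parsed scene graph.
tokens = ["a", "brown", "dog", "chasing", "a", "white", "cat"]
objects = ["dog", "cat"]
attributes = [("dog", "brown"), ("cat", "white")]
relations = [("dog", "chasing", "cat")]

masked, targets = build_sgp_targets(tokens, objects, attributes, relations)
```

Here `masked` becomes `["a", "[MASK]", "[MASK]", "[MASK]", "a", "[MASK]", "[MASK]"]`: only the scene-graph nodes (objects, attributes, relations) are hidden, so recovering them forces the model to align fine-grained semantics with the image rather than relying on easy function words.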

Experimental Setup and Results

ERNIE-ViL is pre-trained on large-scale image-text datasets such as Conceptual Captions and SBU Captions, and its performance is validated across a suite of downstream tasks: Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), region-to-phrase grounding (RefCOCO+), and image-text retrieval. The model outperforms baseline models across all tasks, excelling particularly in those requiring fine-grained semantic understanding.

The effectiveness of the Scene Graph Prediction pre-training tasks is most evident in tasks that demand precise semantic alignment: on RefCOCO+, ERNIE-ViL improves by 2.4% on both test sets, and on VCR it gains 6.60% in the holistic Q→AR setting compared to previous models.

Implications and Future Directions

The introduction of scene graph-based pre-training tasks opens new avenues for enhancing cross-modal representations. By incorporating structured knowledge, ERNIE-ViL sets a precedent for future models aiming to capture detailed semantic alignments across modalities. Furthermore, expanding the scope to include scene graphs extracted directly from images and integrating advanced graph neural network techniques could further bolster the capabilities of such models.

In conclusion, ERNIE-ViL marks a substantial step forward in vision-language pre-training by effectively utilizing scene graphs. Its impact is underscored by superior performance in standard benchmarks, highlighting the value of detailed semantic alignments facilitated by structured knowledge integration. As vision-language tasks grow increasingly complex, the methods introduced in ERNIE-ViL are likely to serve as foundational elements in the development of future sophisticated models.
