- The paper proposes AREL, a framework that uses adversarial reward learning from human demonstrations to overcome the shortcomings of traditional metrics in visual storytelling.
- The paper's approach achieves marginal improvements on automatic metrics while significantly enhancing human evaluations of story relevance, expressiveness, and concreteness.
- The paper criticizes standard metrics like BLEU and METEOR for their inability to capture nuanced storytelling and advocates for human-like reward functions in narrative generation.
An Exposition on Adversarial Reward Learning for Visual Storytelling
The paper "No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling" presents an advanced framework for enhancing the quality of generated visual stories, leveraging the concept of Adversarial REward Learning (AREL). This methodology addresses the inherent limitations of current metrics to capture the nuanced storytelling capabilities that surpass mere image captioning, posing comprehensively structured narrative challenges.
The task of visual storytelling requires generating narratives that not only reflect specific visual cues from a photo stream but also incorporate imaginative concepts, emotions, and subjective expressions. Traditional approaches, largely inspired by visual captioning, fall short because of their inherent bias toward simple, descriptive language. Reinforcement learning (RL) strategies that rely on hand-crafted reward functions built from metrics like BLEU or METEOR also struggle in this domain, since such metrics cannot model the complex semantic and expressive qualities of human-like stories; the sketch below illustrates this baseline setup.
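For contrast, here is a minimal sketch (not from the paper) of the metric-driven RL baseline being critiqued: a REINFORCE-style objective where the reward is a sentence-level BLEU score. The `policy.sample` interface is a hypothetical placeholder, not an API from the paper's code.

```python
# Hypothetical sketch of the metric-driven RL baseline the paper critiques:
# REINFORCE where the reward is sentence-level BLEU against a reference story.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_smooth = SmoothingFunction().method1  # avoids zero scores on short outputs

def bleu_reward(generated_tokens, reference_tokens):
    """Hand-crafted string-matching reward: the kind of proxy the paper
    argues correlates poorly with human judgments of story quality."""
    return sentence_bleu([reference_tokens], generated_tokens,
                         smoothing_function=_smooth)

def reinforce_loss(policy, images, reference_tokens, baseline=0.0):
    # `policy.sample` is an assumed interface returning the sampled token
    # list and the summed log-probability of that sample.
    tokens, log_prob = policy.sample(images)
    reward = bleu_reward(tokens, reference_tokens)
    # Policy gradient: reinforce samples that score above the baseline.
    return -(reward - baseline) * log_prob
```

Swapping BLEU for METEOR or CIDEr changes only `bleu_reward`; the core limitation the paper identifies, optimizing a string-overlap score, remains.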
The AREL framework addresses this challenge by learning a reward function from human demonstrations, one intended to emulate human judgment rather than rely on conventional string-matching metrics. The framework combines ideas from inverse reinforcement learning (IRL) and adversarial training to learn this implicit reward function, which is then used to refine the storytelling policy. A simplified version of the resulting alternating optimization is sketched below.
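The following is a hedged sketch of that alternation in PyTorch, a paraphrase under assumed interfaces (`policy.sample`, a reward model callable on image and story inputs), not the paper's released implementation:

```python
import torch

def arel_step(policy, reward_model, policy_opt, reward_opt,
              images, human_story):
    """One alternating update in the spirit of AREL. `policy` and
    `reward_model` are assumed torch.nn.Modules with the interfaces
    shown here; this is an illustration, not the paper's code."""
    # --- Reward step: score human demonstrations above policy samples. ---
    with torch.no_grad():
        fake_story, _ = policy.sample(images)       # hypothetical API
    r_human = reward_model(images, human_story)     # scalar reward
    r_fake = reward_model(images, fake_story)
    reward_loss = -(r_human - r_fake).mean()        # GAN-style objective
    reward_opt.zero_grad()
    reward_loss.backward()
    reward_opt.step()

    # --- Policy step: REINFORCE against the learned reward. ---
    story, log_prob = policy.sample(images)
    with torch.no_grad():
        reward = reward_model(images, story)        # gradient via log_prob only
    policy_loss = -(reward * log_prob).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```

The key design point is the alternation itself: the reward model keeps adapting to distinguish human stories from the policy's current samples, so the policy cannot game a fixed, hand-crafted metric.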
Key contributions of the AREL approach are delineated as follows:
- The framework employs adversarial techniques to derive a robust reward model that learns directly from human-written stories, thereby overcoming the weaknesses of automatic metrics in capturing expressive narratives.
- Empirical evaluations indicate that while AREL offers marginal improvements over state-of-the-art methods on traditional automatic metrics, it significantly enhances human judgment of story quality. Specifically, comprehensive human evaluations show that AREL-generated stories are more relevant, expressive, and concrete.
- The paper argues that prevalent automatic metrics, such as BLEU and METEOR, lack the sensitivity to appraise the complex semantic layers found in storytelling. It offers a pointed critique of using these metrics as reward signals for training RL policies in sequential generation tasks, emphasizing their disconnect from human judgment.
- By relating the learned reward to a probability distribution through a Boltzmann formulation, AREL casts reward learning as distribution approximation, steering the policy toward more plausible, human-like stories (see the formulation sketched after this list).
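To make the last point concrete: in simplified notation (a paraphrase of the paper's formulation, with entropy and regularization terms omitted), the learned reward R_β induces a Boltzmann distribution over candidate stories W for an image stream I:

```latex
% Boltzmann distribution induced by the learned reward (notation paraphrased;
% entropy/regularization terms omitted for brevity).
p_\beta(W \mid I) = \frac{\exp\!\big(R_\beta(W \mid I)\big)}{Z_\beta},
\qquad
Z_\beta = \sum_{W'} \exp\!\big(R_\beta(W' \mid I)\big)
```

Training the reward model to raise the likelihood of human demonstrations under p_β, while approximating the intractable partition function Z_β with samples from the policy, is what produces the adversarial interplay between the two models.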
The paper elucidates the disconnect between automatic evaluation metrics and human perceptions of narrative quality, a notable contribution to the theoretical understanding of evaluating AI-generated stories. It also suggests directions for future research on methods that align more closely with human interpretation, for example through richer data collection and more diverse narrative datasets. This work is thus a meaningful step toward making machine-generated storytelling more akin to human creativity and cultural narration, with broader implications for deploying AI in creative and interactive systems.