Learning to Generalize from Sparse and Underspecified Rewards

Published 19 Feb 2019 in cs.LG, cs.AI, cs.CL, and stat.ML | arXiv:1902.07198v4

Abstract: We consider the problem of learning from sparse and underspecified rewards, where an agent receives a complex input, such as a natural language instruction, and needs to generate a complex response, such as an action sequence, while only receiving binary success-failure feedback. Such success-failure rewards are often underspecified: they do not distinguish between purposeful and accidental success. Generalization from underspecified rewards hinges on discounting spurious trajectories that attain accidental success, while learning from sparse feedback requires effective exploration. We address exploration by using a mode covering direction of KL divergence to collect a diverse set of successful trajectories, followed by a mode seeking KL divergence to train a robust policy. We propose Meta Reward Learning (MeRL) to construct an auxiliary reward function that provides more refined feedback for learning. The parameters of the auxiliary reward function are optimized with respect to the validation performance of a trained policy. The MeRL approach outperforms our alternative reward learning technique based on Bayesian Optimization, and achieves the state-of-the-art on weakly-supervised semantic parsing. It improves previous work by 1.2% and 2.4% on WikiTableQuestions and WikiSQL datasets respectively.

Citations (91)

Summary

The paper "Learning to Generalize from Sparse and Underspecified Rewards" addresses the challenge of training reinforcement learning (RL) agents when feedback is limited to sparse, binary success-failure signals that are often underspecified: they do not distinguish purposeful from accidental success. The research investigates how agents can be trained to generalize to new contexts, focusing on tasks such as semantic parsing and instruction following.

The approach centers on the exploration-exploitation trade-off in RL, particularly when traditional reward signals are insufficient for learning robust policies. To tackle sparse rewards, the authors advocate a two-phase KL-divergence strategy: first, a mode-covering direction of KL divergence is used to explore and collect a diverse set of successful trajectories; then, a mode-seeking direction is used to train a robust policy on them.
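As a toy illustration of why the two KL directions play different roles (this is not the paper's actual trajectory-level training objective; the distributions below are hypothetical), consider a target distribution that spreads mass over several successful trajectories and two candidate policies:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with matching support."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical target: (almost) uniform over three successful trajectories
# (indices 0-2), with negligible mass on two failures (indices 3-4).
target = np.array([1.0, 1.0, 1.0, 1e-6, 1e-6])
target /= target.sum()

# Two hypothetical policies: one spreads mass over all successes,
# the other commits to a single successful trajectory.
covering = np.array([0.25, 0.25, 0.25, 0.15, 0.10])
seeking  = np.array([0.85, 0.05, 0.05, 0.03, 0.02])

# Mode-covering direction KL(target || policy) penalizes a policy that
# misses any successful trajectory -- suited to diverse exploration.
assert kl(target, covering) < kl(target, seeking)

# Mode-seeking direction KL(policy || target) penalizes mass placed on
# failures and rewards concentration -- suited to training a robust policy.
assert kl(seeking, target) < kl(covering, target)
```

The asymmetry of KL divergence is what makes the two phases complementary: the covering direction collects modes, the seeking direction commits to them.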

A significant contribution of this work is Meta Reward Learning (MeRL), which constructs an auxiliary reward function that provides more refined feedback than the binary terminal signal. Unlike reward-learning approaches that rely on demonstrations or Bayesian Optimization, MeRL optimizes the parameters of the auxiliary reward with respect to the validation performance of the policy trained under it, which improves generalization to unseen contexts.
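A minimal sketch of the MeRL idea in a toy discrete setting, using finite-difference meta-gradients in place of the paper's actual gradient-based meta-optimization; the trajectories, features, and hyperparameters here are all hypothetical. The inner loop trains a policy on the auxiliary reward; the outer loop adjusts the reward parameters so that the trained policy performs well on held-out validation data, which pushes it away from spurious successes:

```python
import numpy as np

# Hypothetical setup: 4 candidate trajectories with 2 features each.
# Trajectories 0 and 1 both succeed on training data, but only 0 is
# "purposeful" and also succeeds on held-out validation data.
features = np.array([[1.0, 0.0],    # purposeful success
                     [0.0, 1.0],    # spurious success
                     [0.5, 0.5],    # failure
                     [0.2, 0.8]])   # failure
train_success = np.array([1.0, 1.0, 0.0, 0.0])
valid_success = np.array([1.0, 0.0, 0.0, 0.0])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def inner_train(phi, steps=200, lr=0.5):
    """Train policy logits on the auxiliary reward R_phi = phi . features,
    restricted to trajectories that succeed under the training signal."""
    theta = np.zeros(4)
    for _ in range(steps):
        pi = softmax(theta)
        r = train_success * (features @ phi)      # auxiliary reward
        theta += lr * pi * (r - pi @ r)           # exact grad of E_pi[r]
    return theta

def valid_perf(phi):
    """Validation success rate of the policy trained under R_phi."""
    pi = softmax(inner_train(phi))
    return float(pi @ valid_success)

# Outer (meta) loop: finite-difference ascent on validation performance.
phi, eps, meta_lr = np.array([0.5, 0.5]), 0.1, 0.5
for _ in range(50):
    grad = np.zeros_like(phi)
    for i in range(len(phi)):
        d = np.zeros_like(phi)
        d[i] = eps
        grad[i] = (valid_perf(phi + d) - valid_perf(phi - d)) / (2 * eps)
    phi += meta_lr * grad

# The learned auxiliary reward favors the purposeful trajectory.
print(valid_perf(phi))
```

Even though both successful trajectories look identical to the underspecified training signal, validation performance breaks the tie, and the outer loop learns reward weights that down-weight the spurious success.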

The paper presents experimental evidence on weakly-supervised semantic parsing benchmarks, where MeRL surpasses prior methods, improving on previous state-of-the-art results by 1.2% on WikiTableQuestions and 2.4% on WikiSQL. Experiments on a maze-based instruction-following task further validate that the learned auxiliary rewards guide learning robustly.

The study's implications are twofold: by handling underspecified reward functions, it offers a partial remedy for AI safety concerns such as reward hacking, and it provides insight into policy training that combines effective exploration with refined reward synthesis. The authors point to future work on more expressive reward models, possibly parameterized by neural networks, and on non-terminal rewards that provide richer feedback during learning.

In essence, the research advances RL by addressing fundamental limitations in reward specification, paving the way for more adaptable and intelligent systems capable of reasoning under uncertainty and limited supervision.
