Bayesian Nonparametrics for Offline Skill Discovery

Published 9 Feb 2022 in cs.LG, cs.AI, and stat.ML | (2202.04675v3)

Abstract: Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments. Our code is available at https://github.com/layer6ai-labs/BNPO .


Authors (5)

Citations (7)


Summary

The paper proposes a Bayesian nonparametric framework that eliminates the need to predefine the number of skills via a dynamic Dirichlet Process approach.
It employs variational inference with a Gumbel-Softmax reparameterization to optimize option policies and termination functions in hierarchical RL.
Empirical results in a proof-of-concept environment and on Atari games show that the model adapts the number of skills to the task and matches or outperforms baselines that use a fixed number of skills.


Introduction

The paper "Bayesian Nonparametrics for Offline Skill Discovery" addresses skill discovery in offline reinforcement learning (RL). In hierarchical RL, skills (low-level policies) provide temporal abstraction, which can speed up learning and enable complex behavior. Existing methods require the number of skills $K$ to be fixed in advance. This paper instead proposes a Bayesian nonparametric approach that removes the need to choose $K$ and adapts the number of skills as learning progresses. The approach combines variational inference with continuous relaxations and places a Dirichlet-process-style prior over skills. Empirically, the method can outperform state-of-the-art offline skill discovery algorithms across a variety of environments.

Methodology

Model Setup

Skills are represented within the options framework, in which each option comprises an initiation set, a policy, and a termination function. A high-level policy decides which option to run once the current one terminates. This work assumes that the high-level policy is shared across all expert-generated trajectories and does not depend on the current state. Under this simplification the policy over options reduces to a single probability vector $\eta$, which admits a Bayesian treatment: in the nonparametric model, $\eta$ is an infinite-dimensional vector generated by a stick-breaking process.
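The stick-breaking construction mentioned above can be illustrated with a short NumPy sketch. It draws the first few weights of an infinite probability vector by repeatedly breaking off a Beta(1, $\alpha$)-distributed fraction of the remaining stick. The function name, the truncation length `num_sticks`, and the value of `alpha` are assumptions chosen for illustration and are not taken from the paper's code.

```python
import numpy as np

def stick_breaking_weights(alpha, num_sticks, rng):
    """Draw the first `num_sticks` weights of a stick-breaking sample.

    Illustrative helper (hypothetical name), not the paper's implementation.
    """
    # beta_k ~ Beta(1, alpha): fraction broken off the remaining stick.
    betas = rng.beta(1.0, alpha, size=num_sticks)
    # Stick length left before break k: prod_{j<k} (1 - beta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

rng = np.random.default_rng(0)
eta = stick_breaking_weights(alpha=1.0, num_sticks=20, rng=rng)
# Mass assigned to the (infinitely many) options beyond the truncation.
leftover = 1.0 - eta.sum()
```

Smaller values of `alpha` concentrate most of the mass on the first few entries, while larger values spread it over more options.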

Variational Inference

Because the exact posterior is intractable, the paper uses variational inference. It maximizes the ELBO (evidence lower bound) jointly over the model parameters $\theta$ (option policies and termination functions) and the variational parameters $\phi$ of the approximate posterior. The approximate posterior is structured to respect the conditional independences of the graphical model. Its categorical latent variables are relaxed with the Gumbel-Softmax distribution, which makes samples reparameterizable so that gradients of the ELBO can be estimated by backpropagation.
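As a rough illustration of the Gumbel-Softmax relaxation, the sketch below draws a relaxed one-hot sample from a categorical distribution. It uses NumPy only to show the sampling arithmetic; an actual training loop would need an automatic-differentiation framework to obtain gradients. The function name, the temperature value, and the example probabilities are assumptions for illustration.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Return a relaxed one-hot sample (a point on the probability simplex)."""
    # Gumbel(0, 1) noise via the inverse transform of uniform samples.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Perturb the logits, scale by the temperature, then apply a softmax.
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=-1, keepdims=True)  # subtract the max for stability
    exp_y = np.exp(y)
    return exp_y / exp_y.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = np.log(np.array([0.7, 0.2, 0.1]))  # hypothetical option probabilities
sample = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
```

Lower temperatures push samples toward one-hot vectors, at the cost of higher-variance gradients; higher temperatures give smoother but more biased samples.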

Nonparametric Extension

The nonparametric extension places a $\operatorname{GEM}(\alpha)$ prior on $\eta$, the policy over options, so the model admits an unbounded number of potential skills. In practice only a finite number $K$ of options is instantiated at any time, and $K$ is incremented during training by a heuristic based on how much the current options are used across trajectories. Model capacity therefore grows with the complexity of the data, without $K$ having to be specified in advance.
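This summary does not give the exact usage statistic or threshold used to increment $K$. The sketch below is therefore only one plausible rule, stated as an assumption: add a new option when every currently instantiated option receives at least a minimum share of the usage. The function name and the `min_usage_fraction` parameter are hypothetical.

```python
import numpy as np

def maybe_grow_num_options(usage_counts, num_options, min_usage_fraction=0.05):
    """Hypothetical growth rule, NOT the paper's exact heuristic.

    Returns num_options + 1 if all current options are used at least
    `min_usage_fraction` of the time, otherwise returns num_options.
    """
    usage = np.asarray(usage_counts, dtype=float)[:num_options]
    total = usage.sum()
    if total == 0:
        return num_options  # no usage data yet, keep K unchanged
    fractions = usage / total
    if np.all(fractions >= min_usage_fraction):
        return num_options + 1
    return num_options

grown = maybe_grow_num_options([40, 35, 25], num_options=3)
unchanged = maybe_grow_num_options([60, 39, 1], num_options=3)
```

In the first call all three options are in active use, so a fourth is added; in the second call the third option is almost unused, so $K$ stays at 3.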

Results

Proof-of-Concept Environment

In a controlled environment designed to verify this capability, the model infers enough options to capture the expert behavior, and it continues to do so as the vocabulary size of the agent's inputs increases. Without manual tuning, the method consistently recovers the number of options the task requires, which indicates its potential for settings where the number of required skills is not known a priori.

Figure 1: Results for our proof-of-concept environment. The model successfully recovers and utilizes a sufficient number of options aligning with theoretical expectations.

Atari Environments

The model was also evaluated on several Atari games against existing baselines such as DDO and CompILE. Without pre-specifying the number of skills, the nonparametric model achieves comparable or superior performance to these fixed-$K$ baselines, and it adapts across games with differing gameplay dynamics.

Figure 2: Results for Atari environments highlighting the competitiveness of the nonparametric model against fixed $K$ baselines.

Conclusion

The paper presents an effective strategy for skill discovery employing a Bayesian nonparametric framework, which leads to a flexible and adaptive learning system that can dynamically adjust to complex environments. By eliminating the requirement to preset the number of skills, the method reduces the burden of hyperparameter tuning and adapts naturally to varying tasks. This work opens avenues for extending nonparametric techniques in hierarchical reinforcement learning settings, promoting the development of more generalized and autonomous systems.

Figure 3: The progression of mean rewards illustrating learning efficiency with and without entropy regularizers, showcasing the robustness and adaptability of the model.
