Optimism in Reinforcement Learning with Generalized Linear Function Approximation
Published 9 Dec 2019 in stat.ML and cs.LG | (1912.04136v1)
Abstract: We design a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation. We analyze the algorithm under a new expressivity assumption that we call "optimistic closure," which is strictly weaker than assumptions from prior analyses for the linear setting. With optimistic closure, we prove that our algorithm enjoys a regret bound of $\tilde{O}(\sqrt{d^3 T})$ where $d$ is the dimensionality of the state-action features and $T$ is the number of episodes. This is the first statistically and computationally efficient algorithm for reinforcement learning with generalized linear functions.
The paper presents a novel RL algorithm using optimistic closure to guarantee a regret bound of Õ(√(d³T)) in high-dimensional episodic environments.
It extends Q-learning through generalized linear models to efficiently manage exploration in complex, large state spaces.
The work establishes both theoretical and practical foundations for integrating GLMs into RL, highlighting significant advances in sample efficiency.
This paper introduces a novel reinforcement learning (RL) algorithm designed to operate efficiently under the framework of generalized linear models (GLMs) for function approximation. It focuses on episodic reinforcement learning problems with infinite or very large state spaces, a core challenge in contemporary deep RL applications that demands strategic exploration and robust sample efficiency.
Theoretical Framework and Assumptions
The paper pioneers a new expressivity assumption termed "optimistic closure," a strict relaxation of the conditions used in previous analyses of the linear setting. Under this assumption, the algorithm guarantees a regret bound of $\tilde{O}(\sqrt{d^3 T})$, where $d$ denotes the dimensionality of the state-action features and $T$ represents the number of episodes. Notably, this yields the first statistically and computationally efficient RL algorithm compatible with generalized linear function approximation.
Optimistic closure posits a closure property on the Bellman update operator $\mathcal{T}_h$: applying the backup to any Q-function in the optimistic class must yield a function that is still representable by the GLM class. This goes beyond the linear dynamics assumptions prevalent in prior RL research, such as those in linear MDP models, which require the environment dynamics themselves to have (near-)linear structure and thereby limit applicability in complex, real-world environments.
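Schematically, and simplifying the paper's notation, the Bellman backup and the closure requirement can be written as follows (the symbols $f$, $\phi$, and $\Theta$ stand for a generic link function, feature map, and parameter set, as in standard GLM notation):

```latex
% Bellman backup at step h; the expectation is over the next state s'.
(\mathcal{T}_h Q)(s, a) \;=\; \mathbb{E}\left[\, r_h(s, a) + \max_{a'} Q(s', a') \,\right]

% Optimistic closure (schematic): for every Q in the optimistic class --
% a GLM prediction f(<phi(s,a), theta>) plus a nonnegative exploration
% bonus -- the backup must remain a GLM:
\mathcal{T}_h Q \;\in\; \big\{\, (s,a) \mapsto f(\langle \phi(s,a), \theta \rangle) : \theta \in \Theta \,\big\}
```

The key point is that closure is required only for the *optimistic* Q-functions the algorithm actually produces, not for arbitrary functions, which is what makes the assumption weaker than linear-dynamics conditions.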
Algorithm Overview
The algorithm, LSVI-UCB, extends a variation of Q-learning by approximating the optimal Q-function with a generalized linear model. It is notable for its simplicity and computational feasibility: in each episode it selects actions, gathers a trajectory, and updates the model via dynamic programming with optimistic backups. By always overestimating the optimal Q-values, the algorithm adheres to the optimism principle, which is crucial for sample efficiency in the presence of uncertainty.
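The action-selection step can be sketched in a few lines. This is an illustrative sketch, not the paper's exact algorithm: the logistic link, the bonus coefficient `beta`, and the placeholder weights `theta` are assumptions for the example, while the elliptical bonus form is the standard one from the UCB literature.

```python
import numpy as np

def sigmoid(z):
    """Logistic link, one example of a GLM link function."""
    return 1.0 / (1.0 + np.exp(-z))

def ucb_bonus(phi, Lambda_inv, beta):
    """Elliptical exploration bonus: beta * sqrt(phi^T Lambda^{-1} phi)."""
    return beta * np.sqrt(phi @ Lambda_inv @ phi)

def optimistic_q(phi, theta, Lambda_inv, beta):
    """GLM prediction plus a nonnegative bonus, so the estimate
    overestimates the fitted value (the optimism principle)."""
    return sigmoid(phi @ theta) + ucb_bonus(phi, Lambda_inv, beta)

def select_action(candidate_phis, theta, Lambda_inv, beta):
    """Act greedily with respect to the optimistic Q-values."""
    values = [optimistic_q(p, theta, Lambda_inv, beta) for p in candidate_phis]
    return int(np.argmax(values))

# Toy setup: d = 3 features; ridge-regularized design matrix
# Lambda = I + sum_i phi_i phi_i^T built from past observations.
rng = np.random.default_rng(0)
d = 3
past = rng.normal(size=(10, d))
Lambda = np.eye(d) + past.T @ past
Lambda_inv = np.linalg.inv(Lambda)
theta = rng.normal(size=d)                      # stands in for fitted GLM weights
candidate_phis = [rng.normal(size=d) for _ in range(4)]
action = select_action(candidate_phis, theta, Lambda_inv, beta=1.0)
```

Because `Lambda_inv` is positive definite, the bonus is always nonnegative, so the optimistic estimate never falls below the plain GLM prediction; the bonus shrinks as more data accumulates along a feature direction.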
Result Implications
The results signify substantial progress, particularly in environments where linear assumptions are impractical or overly restrictive. The implications span both theoretical validation and practical application, highlighting the capacity to use generalized linear models, such as logistic-link models, in sample-efficient RL algorithms without relying on specific structural properties of the environment dynamics.
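To make the logistic-link case concrete, here is a minimal sketch of fitting GLM weights to regression targets, the kind of step an LSVI-style update performs each episode. The gradient-descent solver, learning rate, and noiseless synthetic targets are assumptions for illustration; the paper's analysis does not prescribe this particular solver.

```python
import numpy as np

def sigmoid(z):
    """Logistic link function."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_glm_weights(Phi, targets, lr=0.1, steps=3000, ridge=1e-3):
    """Fit theta by gradient descent on the ridge-regularized mean
    squared error: mean_i (sigmoid(phi_i . theta) - y_i)^2."""
    n, d = Phi.shape
    theta = np.zeros(d)
    for _ in range(steps):
        pred = sigmoid(Phi @ theta)
        # chain rule: d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
        grad = Phi.T @ ((pred - targets) * pred * (1.0 - pred)) / n + ridge * theta
        theta -= lr * grad
    return theta

rng = np.random.default_rng(1)
theta_true = np.array([1.0, -2.0, 0.5])        # hypothetical ground-truth weights
Phi = rng.normal(size=(200, 3))                # state-action features
targets = sigmoid(Phi @ theta_true)            # noiseless targets for illustration
theta_hat = fit_glm_weights(Phi, targets)
```

In the algorithm, the targets would instead be optimistic backup values computed from the next step's Q-estimates, rather than synthetic labels.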
Future Directions
Anticipated advancements include extending beyond GLMs to broader function classes while preserving regret guarantees and sample efficiency. While optimistic closure offers a useful generalization, finding weaker assumptions that broaden algorithmic applicability across diverse settings remains an open research question. Additionally, translating these advances to higher-dimensional, dynamic environments, where traditional RL methods face scalability issues, could catalyze further innovation.
In summary, this paper represents a cogent step toward integrating advanced function approximation methods into RL frameworks, setting the stage for more efficient learning in complex, large or infinite state spaces.