Papers
Topics
Authors
Recent
Search
2000 character limit reached

A nearly Blackwell-optimal policy gradient method

Published 28 May 2021 in cs.LG, cs.AI, cs.SY, and eess.SY | (2105.13609v3)

Abstract: For continuing environments, reinforcement learning (RL) methods commonly maximize the discounted reward criterion with discount factor close to 1 in order to approximate the average reward (the gain). However, such a criterion only considers the long-run steady-state performance, ignoring the transient behaviour in transient states. In this work, we develop a policy gradient method that optimizes the gain, then the bias (which indicates the transient performance and is important to capably select from policies with equal gain). We derive expressions that enable sampling for the gradient of the bias and its preconditioning Fisher matrix. We further devise an algorithm that solves the gain-then-bias (bi-level) optimization. Its key ingredient is an RL-specific logarithmic barrier function. Experimental results provide insights into the fundamental mechanisms of our proposal.

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.