Policy Eluder Dimension in RL
- Policy Eluder Dimension is a combinatorial complexity measure that quantifies the intrinsic difficulty of identifying the optimal policy in RL by counting the maximum number of distinct uncertainty queries.
- It extends the classical eluder dimension to policy spaces, enabling learning algorithms to achieve sample complexity and regret bounds that depend on the structure of the policy class rather than the state or action space sizes.
- This concept underpins policy elimination and optimistic algorithms, providing concrete guarantees that guide practical implementations in both model-free and model-based reinforcement learning.
The policy eluder dimension is a combinatorial complexity measure quantifying the intrinsic difficulty of learning an optimal policy within a given policy space in reinforcement learning (RL). It generalizes the classical eluder dimension of function classes to the setting of sequential decision-making, providing sample-complexity guarantees for learning algorithms whose performance scales with the policy space's structure rather than the size of the state or action spaces. The policy eluder dimension has become a central tool for characterizing the minimal structural assumptions necessary for sample-efficient RL and underpins tight instance-dependent bounds in both model-free and model-based regimes.
1. Definition and Formalization
Let $\mathcal{S}$ denote the state space, $\mathcal{A}$ the action space, $H$ the episode horizon, and $\Pi$ a parametrized family of deterministic policies, $\pi : \mathcal{S} \times [H] \to \mathcal{A}$. The policy eluder dimension, $\dim_E(\Pi)$, is defined via a notion of combinatorial distinguishability. A tuple $(s, h, a, a')$, with $a \neq a'$, is independent of a set of such tuples if there exist policies $\pi, \pi' \in \Pi$ that agree on all tuples in the set but differ on $(s, h, a, a')$. The policy eluder dimension is the length of the longest sequence of tuples such that each is independent of its predecessors. This construction counts the maximum number of "uncertainty" queries required to uniquely identify the optimal policy in $\Pi$, reflecting how many distinct queries an algorithm must make before all policies in the class become fully specified with respect to the task (Mou et al., 2020).
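As a concrete illustration, the following sketch brute-forces this quantity for a toy finite policy class. It simplifies the definition above: a query is a $(s, h)$ pair rather than a four-tuple of state, step, and action pair, and independence means two policies in the class agree on all earlier queries yet choose different actions at the new one. The names and this simplification are illustrative, not taken from Mou et al. (2020).

```python
from itertools import permutations

def independent(q, prefix, policies):
    """q is independent of prefix if two policies agree on every
    earlier query yet choose different actions at q."""
    return any(
        all(p1[r] == p2[r] for r in prefix) and p1[q] != p2[q]
        for p1 in policies for p2 in policies
    )

def policy_eluder_dim(policies, queries):
    """Brute-force the longest query sequence in which every query is
    independent of its predecessors (feasible only for tiny classes)."""
    best = 0
    for order in permutations(queries):
        length = 0
        for i, q in enumerate(order):
            if not independent(q, order[:i], policies):
                break
            length = i + 1
        best = max(best, length)
    return best

# Three deterministic policies on two states, horizon 1 (toy example).
policies = [
    {("s0", 0): "a", ("s1", 0): "a"},
    {("s0", 0): "b", ("s1", 0): "a"},
    {("s0", 0): "a", ("s1", 0): "b"},
]
queries = [("s0", 0), ("s1", 0)]
print(policy_eluder_dim(policies, queries))  # 2, i.e. |Pi| - 1
```

Here both queries can be asked in an order where each remains unresolved by its predecessors, so the dimension is 2, matching the finite-class bound of $|\Pi| - 1$.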
The definition is not restricted to tabular policies: it extends to any statistical model of policies, including function approximators such as neural networks, provided the notion of distinguishability is defined for the policy class.
2. Comparison with Classical Eluder Dimension
The classical eluder dimension, introduced by Russo and Van Roy (2013), is defined for a (possibly real-valued) function class $\mathcal{F} \subseteq \{f : \mathcal{X} \to \mathbb{R}\}$. A point $x \in \mathcal{X}$ is $\epsilon$-independent of a sequence $x_1, \dots, x_n$ if there exist $f, g \in \mathcal{F}$ that approximately agree on the previous points, $\sqrt{\sum_{i=1}^{n} (f(x_i) - g(x_i))^2} \le \epsilon$, but differ at $x$, $|f(x) - g(x)| > \epsilon$. The eluder dimension $\dim_E(\mathcal{F}, \epsilon)$ is the maximal length of such an independent sequence.
The policy eluder dimension retains this spirit but adapts it to policy classes, formalizing independence in terms of distinguishability at pairs of actions for each state and time. Unlike the classical eluder dimension, which relates to function evaluation uncertainty at individual points, the policy eluder dimension operates over policy decision boundaries, focusing on action selection rather than function values. As a result, it is closely aligned with the specific demands of RL, where the agent must resolve action preferences at encountered states and time steps by discriminatively learning over the policy class (Mou et al., 2020).
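The classical $\epsilon$-independence test can be sketched directly for a tiny real-valued class. This is a minimal illustration of the Russo–Van Roy definition, with a hypothetical one-dimensional linear class; it is not drawn from any cited paper.

```python
import math

def eps_independent(x, prefix, funcs, eps):
    """x is eps-independent of the prefix if some pair f, g is close
    (in l2) on the prefix yet differs by more than eps at x."""
    return any(
        math.sqrt(sum((f(p) - g(p)) ** 2 for p in prefix)) <= eps
        and abs(f(x) - g(x)) > eps
        for f in funcs for g in funcs
    )

# Toy class of 1-d linear functions f_theta(x) = theta * x.
funcs = [lambda x, t=t: t * x for t in (0.0, 0.5, 1.0)]
print(eps_independent(1.0, [], funcs, 0.1))     # True: class unresolved
print(eps_independent(2.0, [1.0], funcs, 0.1))  # False: x=1 resolves it
```

The second query fails to be independent because, for linear functions, any pair that is $\epsilon$-close at $x=1$ cannot separate by more than $2\epsilon$ at $x=2$, and no pair in this class falls in that window.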
3. Role in Complexity Bounds and Algorithms
The policy eluder dimension serves as the central combinatorial quantity controlling the sample complexity of policy identification. In finite-horizon settings, a policy-elimination algorithm that explores according to unresolved tuples and eliminates inconsistent policies achieves sample complexity scaling as $\tilde{O}\!\big(\mathrm{poly}(H) \, D / \Delta^2\big)$, up to logarithmic factors in $1/\epsilon$, where $D = \dim_E(\Pi)$ is the policy eluder dimension, $\Delta$ is the minimal value-gap between optimal and suboptimal actions, and $\epsilon$ is the target approximation error (Mou et al., 2020).
For deterministic systems without a simulator, the policy eluder dimension bounds the regret: after at most $\dim_E(\Pi)$ exploratory episodes, all policy uncertainties are eliminated and optimal behavior is achieved, giving a regret bound of order $R_{\max} \, H \, \dim_E(\Pi)$ for per-step reward bound $R_{\max}$ and time horizon $H$.
In more general settings, the eluder dimension of the induced $Q$-function class associated with the policy class $\Pi$ serves an analogous role. In both model-free and model-based RL, regret and adaptivity can be controlled in terms of this dimension: gap-dependent regret over $T$ episodes scales with $\log T$ and inversely with $\Delta_{\min}$, and $N_{\mathrm{switch}}$ scales with the eluder dimension and $\log T$, where $N_{\mathrm{switch}}$ is the number of policy switches and $\Delta_{\min}$ is the minimal value gap (Velegkas et al., 2022).
4. Algorithmic Techniques Leveraging Policy Eluder Dimension
The principal algorithmic approach exploiting the policy eluder dimension is policy elimination. Such algorithms maintain an active set of policies (or an equivalent oracle) and iteratively query at tuples maximizing policy disagreement, using sampled transitions and rollouts to distinguish between candidates. Each episode either reduces the unresolved uncertainty (via elimination) or executes the current best estimate, ensuring that the total number of exploratory episodes is tightly controlled by the combinatorics of the policy class (Mou et al., 2020).
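The elimination loop can be sketched as follows. The `best_action_oracle` callable stands in for the rollout-based comparisons an actual algorithm would perform to resolve the optimal action at a queried (state, step) pair; it is a hypothetical interface, and the toy policies are illustrative.

```python
def policy_elimination(policies, queries, best_action_oracle):
    """Sketch of policy elimination: repeatedly pick a (state, step)
    pair where the active set disagrees, resolve the optimal action
    there, and drop inconsistent policies."""
    active = list(policies)
    episodes = 0
    while True:
        disputed = [q for q in queries if len({p[q] for p in active}) > 1]
        if not disputed:
            return active, episodes
        q = disputed[0]
        a_star = best_action_oracle(q)  # one exploratory episode (rollouts)
        active = [p for p in active if p[q] == a_star]
        episodes += 1

pi_star = {("s0", 0): "a", ("s1", 0): "a"}
policies = [
    pi_star,
    {("s0", 0): "b", ("s1", 0): "a"},
    {("s0", 0): "a", ("s1", 0): "b"},
]
queries = [("s0", 0), ("s1", 0)]
survivors, episodes = policy_elimination(policies, queries, lambda q: pi_star[q])
print(survivors == [pi_star], episodes)  # True 2
```

Each exploratory episode resolves at least one disputed query, so the number of such episodes is bounded by the combinatorics of the class, here two episodes for a class of three policies.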
Optimistic approaches, such as GOLF and OLIVE, utilize the eluder dimension of $Q$-value classes in their confidence-set and bonus designs, with sample complexity and regret bounds depending only on the policy or value-function eluder dimension, even under general function approximation (Jin et al., 2021, Velegkas et al., 2022).
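The confidence-set idea behind such optimistic methods can be caricatured in a few lines. This is a heavily simplified, one-step, finite-class sketch of a GOLF-style construction (keep all functions whose empirical loss is within a slack $\beta$ of the best fit, then act optimistically); it is not the authors' implementation, and the toy functions are invented for illustration.

```python
def confidence_set(funcs, data, beta):
    """Keep candidate value functions whose squared error on observed
    (x, y) pairs is within beta of the best fit."""
    loss = lambda f: sum((f(x) - y) ** 2 for x, y in data)
    best = min(loss(f) for f in funcs)
    return [f for f in funcs if loss(f) <= best + beta]

def optimistic_action(funcs, data, beta, actions):
    """Act greedily w.r.t. the most optimistic surviving value function."""
    surviving = confidence_set(funcs, data, beta)
    return max(actions, key=lambda a: max(f(a) for f in surviving))

# Toy one-step example with two candidate value functions.
funcs = [lambda a: 0.5 * a, lambda a: 1.0 - a]
data = [(0.0, 0.0)]  # observed value 0 at action 0 rules out the second
print(optimistic_action(funcs, data, beta=0.5, actions=[0.0, 1.0]))  # 1.0
```

The eluder dimension of the value class bounds how many times such optimistic choices can be badly wrong before the confidence set pins down the true function.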
5. Worked Examples and Scaling Laws
Typical values of the policy eluder dimension are as follows:
| Policy/Class Type | Policy Eluder Dimension | Scaling Interpretation |
|---|---|---|
| Tabular policies ($\vert\mathcal{S}\vert$ states, $\vert\mathcal{A}\vert$ actions, horizon $H$) | $O(\vert\mathcal{S}\vert H (\vert\mathcal{A}\vert - 1))$ | Grows quadratically with $n$ when $\vert\mathcal{S}\vert = \vert\mathcal{A}\vert = n$ |
| Finite policy class $\Pi$ | At most $\vert\Pi\vert - 1$ for $\vert\Pi\vert$ policies | Linear in the size of the class |
| Linear threshold policies (worst-case) | Infinite | Can be adversarially large |
| Linear threshold on random features | Controlled by the $\epsilon$-packing number | Finite under random-feature assumptions |
| GF(2) linear functions | Equal to the dimension $d$ | Linear in the number of parameters |
This tabulation demonstrates that policy eluder dimension collapses standard sample complexity scaling for tabular and finite policy classes but can take markedly higher values for complex policy models unless additional constraints are imposed (Mou et al., 2020).
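The GF(2) row can be verified directly by brute force: for linear functionals $f_w(x) = \langle w, x \rangle \bmod 2$ over $\{0,1\}^d$, the $d$ standard basis vectors form an independent sequence, and once all of them are resolved every input is determined. This check is a self-contained illustration, not code from the cited work.

```python
from itertools import product

d = 3
weights = list(product([0, 1], repeat=d))  # all of GF(2)^d

def f(w, x):
    """Linear functional f_w(x) = <w, x> mod 2."""
    return sum(wi * xi for wi, xi in zip(w, x)) % 2

def independent(x, prefix):
    """x is independent of prefix if some pair of functionals agrees
    on the prefix but disagrees at x."""
    return any(
        all(f(w1, p) == f(w2, p) for p in prefix) and f(w1, x) != f(w2, x)
        for w1 in weights for w2 in weights
    )

basis = [tuple(int(i == j) for j in range(d)) for i in range(d)]
# Each standard basis vector is independent of its predecessors...
assert all(independent(basis[i], basis[:i]) for i in range(d))
# ...and after querying all d of them, every input is fully determined.
assert not any(independent(x, basis) for x in product([0, 1], repeat=d))
print("eluder dimension of GF(2) linear class =", d)
```

Agreement on the basis vectors forces $w_1 = w_2$ coordinate by coordinate, which is exactly why no further query can be independent.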
6. Interpretations, Limitations, and Relationships
The policy eluder dimension measures the maximal number of "hard queries" an agent must make before all uncertainty in the policy space is resolved. It quantifies the intrinsic complexity of policy learning for a given function or parameter class, generalizing classical measures such as VC or Littlestone dimension to the RL setting.
It strictly refines these earlier complexity measures in the sequential, action-selection context, capturing the unique demands of RL. Notably, in adversarially chosen feature spaces or for infinitely rich policy classes, the policy eluder dimension may be infinite, indicating that efficient learning is impossible without further structural assumptions (Mou et al., 2020).
7. Practical Computation and Implementation Guidelines
Policy-elimination algorithms based on the policy eluder dimension can be implemented either directly (in small finite classes) or via oracles maintaining the set of unresolved uncertainties as forbidden pairs or combinatorial constraints. In tabular settings, this reduces to set-difference computation; for linear or algebraic policy classes, it reduces to solving linear programs or linear systems.
In infinite policy classes, discretization via $\epsilon$-nets or other covering arguments is standard, allowing the effective policy eluder dimension to be controlled as a function of the desired accuracy. Practical viability hinges on efficiently representing and updating the active policy set and resolving uncertainty queries in a computationally tractable manner (Mou et al., 2020).
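A minimal sketch of the discretization step, assuming a hypothetical one-dimensional class of threshold policies $\pi_t(s) = \mathbf{1}[s \ge t]$ with $t \in [0, 1)$: covering the thresholds by a grid of spacing $\epsilon$ yields a finite class, whose policy eluder dimension is then bounded by the finite-class rule, at the cost of an $\epsilon$-sized disagreement region.

```python
def eps_net_thresholds(eps, lo=0.0, hi=1.0):
    """Cover the infinite class of threshold policies pi_t(s) = 1[s >= t]
    by a finite grid of thresholds (an eps-net over [lo, hi))."""
    n = round((hi - lo) / eps)
    return [lo + i * eps for i in range(n)]

net = eps_net_thresholds(0.1)
# The discretized class is finite, so its policy eluder dimension is
# at most len(net) - 1 by the finite-class bound.
print(len(net))  # 10

# Any true threshold is within eps of its nearest net point, so the
# two policies disagree only on states between the two thresholds.
t_true = 0.234
t_net = min(net, key=lambda t: abs(t - t_true))
print(abs(t_net - t_true) <= 0.1)  # True
```

The grid spacing trades off the size (and hence eluder dimension) of the discretized class against the approximation error incurred on the sliver of states where the snapped policy disagrees with the true one.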