Policy Eluder Dimension in RL
- Policy Eluder Dimension is a combinatorial complexity measure that quantifies the intrinsic difficulty of identifying the optimal policy in RL by counting the maximum number of distinct uncertainty queries.
- It extends the classical eluder dimension to policy spaces, enabling learning algorithms to achieve sample complexity and regret bounds that depend on the structure of the policy class rather than the state or action space sizes.
- This concept underpins policy elimination and optimistic algorithms, providing concrete guarantees that guide practical implementations in both model-free and model-based reinforcement learning.
The policy eluder dimension is a combinatorial complexity measure quantifying the intrinsic difficulty of learning an optimal policy within a given policy space in reinforcement learning (RL). It generalizes the classical eluder dimension of function classes to the setting of sequential decision-making, providing sample-complexity guarantees for learning algorithms whose performance scales with the policy space's structure rather than the size of the state or action spaces. The policy eluder dimension has become a central tool for characterizing the minimal structural assumptions necessary for sample-efficient RL and underpins tight instance-dependent bounds in both model-free and model-based regimes.
1. Definition and Formalization
Let $\mathcal{S}$ denote the state space, $\mathcal{A}$ the action space, $H$ the episode horizon, and $\Pi$ a parametrized family of deterministic policies, $\pi : \mathcal{S} \times [H] \to \mathcal{A}$. The policy eluder dimension, $\dim_E(\Pi)$, is defined via a notion of combinatorial distinguishability. A tuple $(s, h, a, a')$, with $a \neq a'$, is independent of a set of such tuples if there exist policies $\pi, \pi' \in \Pi$ that agree on all tuples in the set but differ on $(s, h, a, a')$. The policy eluder dimension is the length of the longest sequence of tuples such that each is independent of its predecessors. This construction counts the maximum number of "uncertainty" queries required to uniquely identify the optimal policy in $\Pi$, reflecting how many distinct queries an algorithm must make before all policies in the class become fully specified with respect to the task (Mou et al., 2020).
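As a concrete illustration, the following sketch brute-forces this quantity for a toy finite policy class. It simplifies the definition above: a query is a $(s, h)$ pair rather than a four-tuple of state, step, and action pair, and independence means two policies in the class agree on all earlier queries yet choose different actions at the new one. The names and this simplification are illustrative, not taken from Mou et al. (2020).

```python
from itertools import permutations

def independent(q, prefix, policies):
    """q is independent of prefix if two policies agree on every
    earlier query yet choose different actions at q."""
    return any(
        all(p1[r] == p2[r] for r in prefix) and p1[q] != p2[q]
        for p1 in policies for p2 in policies
    )

def policy_eluder_dim(policies, queries):
    """Brute-force the longest query sequence in which every query is
    independent of its predecessors (feasible only for tiny classes)."""
    best = 0
    for order in permutations(queries):
        length = 0
        for i, q in enumerate(order):
            if not independent(q, order[:i], policies):
                break
            length = i + 1
        best = max(best, length)
    return best

# Three deterministic policies on two states, horizon 1 (toy example).
policies = [
    {("s0", 0): "a", ("s1", 0): "a"},
    {("s0", 0): "b", ("s1", 0): "a"},
    {("s0", 0): "a", ("s1", 0): "b"},
]
queries = [("s0", 0), ("s1", 0)]
print(policy_eluder_dim(policies, queries))  # 2, i.e. |Pi| - 1
```

Here both queries can be asked in an order where each remains unresolved by its predecessors, so the dimension is 2, matching the finite-class bound of $|\Pi| - 1$.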
The definition is not restricted to tabular policies: it extends to any statistical model of policies, including function approximators such as neural networks, provided the notion of distinguishability is defined for the policy class.
2. Comparison with Classical Eluder Dimension
The classical eluder dimension, introduced by Russo and Van Roy (2013), is defined for a (possibly real-valued) function class $\mathcal{F} \subseteq \{f : \mathcal{X} \to \mathbb{R}\}$. A point $x \in \mathcal{X}$ is $\epsilon$-independent of a sequence $x_1, \dots, x_n$ if there exist $f, g \in \mathcal{F}$ that approximately agree on the previous points, $\sqrt{\sum_{i=1}^{n} (f(x_i) - g(x_i))^2} \le \epsilon$, but differ at $x$, $|f(x) - g(x)| > \epsilon$. The eluder dimension $\dim_E(\mathcal{F}, \epsilon)$ is the maximal length of such an independent sequence.
The policy eluder dimension retains this spirit but adapts it to policy classes, formalizing independence in terms of distinguishability at pairs of actions for each state and time. Unlike the classical eluder dimension, which relates to function evaluation uncertainty at individual points, the policy eluder dimension operates over policy decision boundaries, focusing on action selection rather than function values. As a result, it is closely aligned with the specific demands of RL, where the agent must resolve action preferences at encountered states and time steps by discriminatively learning over the policy class (Mou et al., 2020).
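The classical $\epsilon$-independence test can be sketched directly for a tiny real-valued class. This is a minimal illustration of the Russo–Van Roy definition, with a hypothetical one-dimensional linear class; it is not drawn from any cited paper.

```python
import math

def eps_independent(x, prefix, funcs, eps):
    """x is eps-independent of the prefix if some pair f, g is close
    (in l2) on the prefix yet differs by more than eps at x."""
    return any(
        math.sqrt(sum((f(p) - g(p)) ** 2 for p in prefix)) <= eps
        and abs(f(x) - g(x)) > eps
        for f in funcs for g in funcs
    )

# Toy class of 1-d linear functions f_theta(x) = theta * x.
funcs = [lambda x, t=t: t * x for t in (0.0, 0.5, 1.0)]
print(eps_independent(1.0, [], funcs, 0.1))     # True: class unresolved
print(eps_independent(2.0, [1.0], funcs, 0.1))  # False: x=1 resolves it
```

The second query fails to be independent because, for linear functions, any pair that is $\epsilon$-close at $x=1$ cannot separate by more than $2\epsilon$ at $x=2$, and no pair in this class falls in that window.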
3. Role in Complexity Bounds and Algorithms
The policy eluder dimension serves as the central combinatorial quantity controlling the sample complexity of policy identification. In finite-horizon settings, a policy-elimination algorithm that explores according to unresolved tuples and eliminates inconsistent policies achieves sample complexity scaling as $\tilde{O}\!\big(\mathrm{poly}(H) \, D / \Delta^2\big)$, up to logarithmic factors in $1/\epsilon$, where $D = \dim_E(\Pi)$ is the policy eluder dimension, $\Delta$ is the minimal value-gap between optimal and suboptimal actions, and $\epsilon$ is the target approximation error (Mou et al., 2020).
For deterministic systems without a simulator, the policy eluder dimension bounds the regret: after at most $\dim_E(\Pi)$ exploratory episodes, all policy uncertainties are eliminated and optimal behavior is achieved, giving a regret bound of order $R_{\max} \, H \, \dim_E(\Pi)$ for per-step reward bound $R_{\max}$ and time horizon $H$.
In more general settings, the eluder dimension of the induced $Q$-function class associated with the policy class $\Pi$ serves an analogous role. In both model-free and model-based RL, regret and adaptivity can be controlled in terms of this dimension: gap-dependent regret over $T$ episodes scales with $\log T$ and inversely with $\Delta_{\min}$, and $N_{\mathrm{switch}}$ scales with the eluder dimension and $\log T$, where $N_{\mathrm{switch}}$ is the number of policy switches and $\Delta_{\min}$ is the minimal value gap (Velegkas et al., 2022).
4. Algorithmic Techniques Leveraging Policy Eluder Dimension
The principal algorithmic approach exploiting the policy eluder dimension is policy elimination. Such algorithms maintain an active set of policies (or an equivalent oracle) and iteratively query at tuples maximizing policy disagreement, using sampled transitions and rollouts to distinguish between candidates. Each episode either reduces the unresolved uncertainty (via elimination) or executes the current best estimate, ensuring that the total number of exploratory episodes is tightly controlled by the combinatorics of the policy class (Mou et al., 2020).
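The elimination loop can be sketched as follows. The `best_action_oracle` callable stands in for the rollout-based comparisons an actual algorithm would perform to resolve the optimal action at a queried (state, step) pair; it is a hypothetical interface, and the toy policies are illustrative.

```python
def policy_elimination(policies, queries, best_action_oracle):
    """Sketch of policy elimination: repeatedly pick a (state, step)
    pair where the active set disagrees, resolve the optimal action
    there, and drop inconsistent policies."""
    active = list(policies)
    episodes = 0
    while True:
        disputed = [q for q in queries if len({p[q] for p in active}) > 1]
        if not disputed:
            return active, episodes
        q = disputed[0]
        a_star = best_action_oracle(q)  # one exploratory episode (rollouts)
        active = [p for p in active if p[q] == a_star]
        episodes += 1

pi_star = {("s0", 0): "a", ("s1", 0): "a"}
policies = [
    pi_star,
    {("s0", 0): "b", ("s1", 0): "a"},
    {("s0", 0): "a", ("s1", 0): "b"},
]
queries = [("s0", 0), ("s1", 0)]
survivors, episodes = policy_elimination(policies, queries, lambda q: pi_star[q])
print(survivors == [pi_star], episodes)  # True 2
```

Each exploratory episode resolves at least one disputed query, so the number of such episodes is bounded by the combinatorics of the class, here two episodes for a class of three policies.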
Optimistic approaches, such as GOLF and OLIVE, utilize the eluder dimension of $Q$-value classes in their confidence-set and bonus designs, with sample complexity and regret bounds depending only on the policy or value-function eluder dimension, even under general function approximation (Jin et al., 2021, Velegkas et al., 2022).
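The confidence-set idea behind such optimistic methods can be caricatured in a few lines. This is a heavily simplified, one-step, finite-class sketch of a GOLF-style construction (keep all functions whose empirical loss is within a slack $\beta$ of the best fit, then act optimistically); it is not the authors' implementation, and the toy functions are invented for illustration.

```python
def confidence_set(funcs, data, beta):
    """Keep candidate value functions whose squared error on observed
    (x, y) pairs is within beta of the best fit."""
    loss = lambda f: sum((f(x) - y) ** 2 for x, y in data)
    best = min(loss(f) for f in funcs)
    return [f for f in funcs if loss(f) <= best + beta]

def optimistic_action(funcs, data, beta, actions):
    """Act greedily w.r.t. the most optimistic surviving value function."""
    surviving = confidence_set(funcs, data, beta)
    return max(actions, key=lambda a: max(f(a) for f in surviving))

# Toy one-step example with two candidate value functions.
funcs = [lambda a: 0.5 * a, lambda a: 1.0 - a]
data = [(0.0, 0.0)]  # observed value 0 at action 0 rules out the second
print(optimistic_action(funcs, data, beta=0.5, actions=[0.0, 1.0]))  # 1.0
```

The eluder dimension of the value class bounds how many times such optimistic choices can be badly wrong before the confidence set pins down the true function.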
5. Worked Examples and Scaling Laws
Typical values of the policy eluder dimension are as follows:
| Policy/Class Type | Policy Eluder Dimension | Scaling Interpretation |
|---|---|---|
| Tabular policies ($\vert\mathcal{S}\vert$ states, $\vert\mathcal{A}\vert$ actions, horizon $H$) | $O(\vert\mathcal{S}\vert H (\vert\mathcal{A}\vert - 1))$ | Grows quadratically with $n$ when $\vert\mathcal{S}\vert = \vert\mathcal{A}\vert = n$ |
| Finite policy class $\Pi$ | At most $\vert\Pi\vert - 1$ for $\vert\Pi\vert$ policies | Linear in the size of the class |
| Linear threshold policies (worst-case) | Infinite | Can be adversarially large |
| Linear threshold on random features | Controlled by the $\epsilon$-packing number | Finite under random-feature assumptions |
| GF(2) linear functions | Equal to the dimension $d$ | Linear in the number of parameters |
This tabulation demonstrates that policy eluder dimension collapses standard sample complexity scaling for tabular and finite policy classes but can take markedly higher values for complex policy models unless additional constraints are imposed (Mou et al., 2020).
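The GF(2) row can be verified directly by brute force: for linear functionals $f_w(x) = \langle w, x \rangle \bmod 2$ over $\{0,1\}^d$, the $d$ standard basis vectors form an independent sequence, and once all of them are resolved every input is determined. This check is a self-contained illustration, not code from the cited work.

```python
from itertools import product

d = 3
weights = list(product([0, 1], repeat=d))  # all of GF(2)^d

def f(w, x):
    """Linear functional f_w(x) = <w, x> mod 2."""
    return sum(wi * xi for wi, xi in zip(w, x)) % 2

def independent(x, prefix):
    """x is independent of prefix if some pair of functionals agrees
    on the prefix but disagrees at x."""
    return any(
        all(f(w1, p) == f(w2, p) for p in prefix) and f(w1, x) != f(w2, x)
        for w1 in weights for w2 in weights
    )

basis = [tuple(int(i == j) for j in range(d)) for i in range(d)]
# Each standard basis vector is independent of its predecessors...
assert all(independent(basis[i], basis[:i]) for i in range(d))
# ...and after querying all d of them, every input is fully determined.
assert not any(independent(x, basis) for x in product([0, 1], repeat=d))
print("eluder dimension of GF(2) linear class =", d)
```

Agreement on the basis vectors forces $w_1 = w_2$ coordinate by coordinate, which is exactly why no further query can be independent.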
6. Interpretations, Limitations, and Relationships
The policy eluder dimension measures the maximal number of "hard queries" an agent must make before all uncertainty in the policy space is resolved. It quantifies the intrinsic complexity of policy learning for a given function or parameter class, generalizing classical measures such as VC or Littlestone dimension to the RL setting.
It strictly refines these earlier complexity measures in the sequential, action-selection context, capturing the unique demands of RL. Notably, in adversarially chosen feature spaces or for infinitely rich policy classes, the policy eluder dimension may be infinite, indicating that efficient learning is impossible without further structural assumptions (Mou et al., 2020).
7. Practical Computation and Implementation Guidelines
Policy-elimination algorithms based on the policy eluder dimension can be implemented either directly (in small finite classes) or via oracles maintaining the set of unresolved uncertainties as forbidden pairs or combinatorial constraints. In tabular settings, this reduces to set-difference computation; for linear or algebraic policy classes, it reduces to solving linear programs or linear systems.
In infinite policy classes, discretization via $\epsilon$-nets or other covering arguments is standard, allowing the effective policy eluder dimension to be controlled as a function of the desired accuracy. Practical viability hinges on efficiently representing and updating the active policy set and resolving uncertainty queries in a computationally tractable manner (Mou et al., 2020).
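A minimal sketch of the discretization step, assuming a hypothetical one-dimensional class of threshold policies $\pi_t(s) = \mathbf{1}[s \ge t]$ with $t \in [0, 1)$: covering the thresholds by a grid of spacing $\epsilon$ yields a finite class, whose policy eluder dimension is then bounded by the finite-class rule, at the cost of an $\epsilon$-sized disagreement region.

```python
def eps_net_thresholds(eps, lo=0.0, hi=1.0):
    """Cover the infinite class of threshold policies pi_t(s) = 1[s >= t]
    by a finite grid of thresholds (an eps-net over [lo, hi))."""
    n = round((hi - lo) / eps)
    return [lo + i * eps for i in range(n)]

net = eps_net_thresholds(0.1)
# The discretized class is finite, so its policy eluder dimension is
# at most len(net) - 1 by the finite-class bound.
print(len(net))  # 10

# Any true threshold is within eps of its nearest net point, so the
# two policies disagree only on states between the two thresholds.
t_true = 0.234
t_net = min(net, key=lambda t: abs(t - t_true))
print(abs(t_net - t_true) <= 0.1)  # True
```

The grid spacing trades off the size (and hence eluder dimension) of the discretized class against the approximation error incurred on the sliver of states where the snapped policy disagrees with the true one.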