
Fitted Q-Evaluation (FQE) Overview

Updated 1 January 2026
  • Fitted Q-Evaluation (FQE) is an offline policy evaluation method that estimates target policy performance using regression on logged off-policy data.
  • It iteratively minimizes a mean-squared Bellman error by regression; its estimates can remain reliable under distribution shift when the function class satisfies completeness and smoothness conditions.
  • FQE offers theoretical guarantees through finite-sample error bounds and demonstrates efficient performance in high-dimensional, continuous-action domains.

Fitted Q-Evaluation (FQE) is an offline policy evaluation algorithm that estimates the expected cumulative reward of a target policy π using a fixed dataset of transitions generated by a possibly different (behavior) policy in a Markov Decision Process (MDP). FQE iteratively fits the Q-function of the target policy by minimizing the mean-squared projected Bellman error under the empirical distribution of the logged data. The fit uses regression with function approximation, such as linear models or neural networks, and its analysis often assumes Bellman-completeness. Its main significance is that it enables sample-efficient off-policy evaluation that tolerates distribution mismatch without importance weighting, thereby avoiding the high variance that importance weights incur in high-dimensional or continuous-action domains (Ji et al., 2022, Zhang et al., 2022).

1. Algorithmic Structure and Core Principles

FQE proceeds by backward recursion on the time steps (or iteratively in the discounted infinite-horizon case). At each iteration $h$ (for finite horizon $H$), given the estimated Q-function for the next time step, FQE fits the current Q-function $\widehat Q_h$ by minimizing the squared Bellman error

$$\widehat Q_h = \arg\min_{f\in \mathcal{V}} \sum_{k=1}^K \left[ f(s_{h,k}, a_{h,k}) - r_{h,k} - \mathbb{E}_{a\sim\pi_h(\cdot | s'_{h,k})}[\widehat Q_{h+1}(s'_{h,k}, a)] \right]^2$$

with $\widehat Q_{H+1} \equiv 0$ and function class $\mathcal{V}$ (linear, kernel, or neural). For discounted infinite-horizon MDPs, the update is

$$Q_{k+1} = \arg\min_{Q\in \widehat{\mathcal{Q}}} \frac{1}{N}\sum_{i=1}^N \left( Q(s_i, a_i) - \left[ r_i + \gamma \mathbb{E}_{a'\sim\pi(\cdot|s'_i)} Q_k(s_i', a') \right] \right)^2$$

The final policy-value estimate is

$$\widehat v^\pi = \mathbb{E}_{s\sim\xi, a\sim\pi_1(\cdot|s)}[\widehat Q_1(s, a)]$$

FQE's distinguishing feature is that it replaces explicit importance sampling with regression. This makes it robust to distribution shift between the behavior and target policies, as long as certain completeness/smoothness conditions are met (Ji et al., 2022).
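The finite-horizon backward recursion can be illustrated with a minimal NumPy sketch. It assumes a tabular (one-hot) function class, for which the least-squares fit reduces to a per-cell average of the regression targets, and a stationary target policy. The toy MDP, the uniform behavior policy, and all names (`fqe_finite_horizon`, `P`, `R`, `xi`) are hypothetical and chosen only for illustration; states are 0-indexed, so `Q[H]` plays the role of the zero terminal Q-function.

```python
import numpy as np

def fqe_finite_horizon(data, pi, n_states, n_actions, H):
    """Backward-recursion FQE with a tabular (one-hot) function class.

    data[h] = (s, a, r, s_next): arrays of transitions logged at step h.
    pi: (n_states, n_actions) target-policy probabilities (stationary here).
    Returns Q with shape (H + 1, n_states, n_actions); Q[H] is identically 0.
    """
    Q = np.zeros((H + 1, n_states, n_actions))
    for h in reversed(range(H)):
        s, a, r, s_next = data[h]
        v_next = (pi * Q[h + 1]).sum(axis=1)   # E_{a' ~ pi} Q_{h+1}(s', a')
        y = r + v_next[s_next]                 # regression targets
        cell = s * n_actions + a
        sums = np.bincount(cell, weights=y, minlength=n_states * n_actions)
        counts = np.bincount(cell, minlength=n_states * n_actions)
        # Least squares on one-hot features is the per-cell mean of the targets.
        Q[h] = (sums / np.maximum(counts, 1)).reshape(n_states, n_actions)
    return Q

# Hypothetical toy problem: random MDP, uniform behavior policy.
rng = np.random.default_rng(0)
S, A, H, K = 4, 2, 5, 20000
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] = next-state distribution
R = rng.uniform(size=(S, A))                   # deterministic rewards
pi = rng.dirichlet(np.ones(A), size=S)         # target policy
xi = np.full(S, 1.0 / S)                       # initial-state distribution

data = []
for h in range(H):
    s = rng.integers(S, size=K)                # assumption: states logged uniformly
    a = rng.integers(A, size=K)                # behavior policy differs from pi
    u = rng.uniform(size=K)
    s_next = (u[:, None] > np.cumsum(P[s, a], axis=1)).sum(axis=1)
    s_next = np.minimum(s_next, S - 1)         # guard against round-off in cumsum
    data.append((s, a, R[s, a], s_next))

Q_hat = fqe_finite_horizon(data, pi, S, A, H)
v_hat = xi @ (pi * Q_hat[0]).sum(axis=1)

# Exact value by dynamic programming, for comparison.
Q_true = np.zeros((H + 1, S, A))
for h in reversed(range(H)):
    Q_true[h] = R + P @ (pi * Q_true[h + 1]).sum(axis=1)
v_true = xi @ (pi * Q_true[0]).sum(axis=1)
print(f"FQE estimate {v_hat:.3f} vs exact value {v_true:.3f}")
```

No importance weights appear anywhere in the sketch: the behavior policy only determines which cells receive data, and the target policy enters solely through the expectation inside the regression target.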

2. Theoretical Guarantees and Error Bounds

Finite-sample and asymptotic analyses show that FQE achieves minimax-optimal or near-optimal error rates under a variety of modeling assumptions:

Finite-Horizon Nonparametric Regime

For data generated under an unknown behavior policy, with per-step sample size $K$ and horizon $H$, and under the assumption that the state-action space $\mathcal{X}$ lies on a compact $d$-dimensional Riemannian manifold, the policy-value estimation error satisfies (Ji et al., 2022):

[error bound not recoverable from the source text]

where the bound depends on:

  • the smoothness of the Bellman operator,
  • the intrinsic dimension $d$ of the manifold,
  • a distribution shift factor, defined via a function-class-restricted $\chi^2$ divergence.

[The displayed equations defining this factor and stating the resulting simplified rate are not recoverable from the source text.]

Parametric and Ratio-Realizability Regimes

Given function completeness and realizability, the parametric model achieves a finite-sample error rate; with the additional assumption that the marginal density ratios are realizable in the model class, the rate improves and matches the sharpest known tabular bounds (Wang et al., 2024). The explicit rates are stated in (Wang et al., 2024).

Z-Estimation and Statistical Efficiency

For general differentiable function approximators, the FQE estimator is asymptotically normal:

[asymptotic distribution not recoverable from the source text]

where the asymptotic variance is governed by the local curvature of the function class, Bellman-residual noise, and a function-class $\chi^2$ divergence that measures the interplay between off-policy distribution shift and function approximation (Zhang et al., 2022).

FQE attains the Cramér–Rao lower bound for asymptotic variance, i.e., no other unbiased estimator can achieve a lower asymptotic variance under these regimes (Zhang et al., 2022, Hao et al., 2021).

3. Distribution Shift and Restricted $\chi^2$ Divergence

The primary challenge in FQE is controlling the statistical error incurred from estimating the target policy under data collected from a potentially very different policy. Instead of direct density ratios, FQE's sample complexity and error bounds depend on a function-class-restricted Pearson $\chi^2$ divergence defined as

[definition not recoverable from the source text]

In practice, for smooth function classes or sufficiently regular neural networks (e.g., CNNs as in (Ji et al., 2022)), the restricted divergence remains controlled even when the density ratio is unbounded, because the function class has limited expressivity with respect to the 'difference' between the sampling and target distributions. This is critical for high-dimensional applications where naive density ratio estimation is intractable.
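A small numerical sketch can make this contrast concrete. It assumes one form of the restricted divergence that is common in this literature, $\sup_{f\in\mathcal{F}} (\mathbb{E}_p f)^2 / \mathbb{E}_q[f^2] - 1$; this form is an assumption here and may differ in detail from the definition used in the cited papers. For a linear class with features $\phi$, the supremum has the closed form $\nu^\top \Sigma^{-1} \nu - 1$ with $\nu = \mathbb{E}_p[\phi]$ and $\Sigma = \mathbb{E}_q[\phi\phi^\top]$. The distributions `p` and `q` and the helper name are hypothetical.

```python
import numpy as np

def restricted_chi2_linear(feats, p, q):
    """sup over f(x) = feats[x] @ w of (E_p f)^2 / E_q[f^2], minus 1.

    Assumed definition of the restricted divergence; closed form for a linear class.
    """
    nu = feats.T @ p                          # E_p[phi]
    Sigma = feats.T @ (feats * q[:, None])    # E_q[phi phi^T]
    return nu @ np.linalg.solve(Sigma, nu) - 1.0

# 50 support points; the sampling distribution q almost never visits point 25,
# while the target distribution p puts 20% of its mass there.
x = np.linspace(0.0, 1.0, 50)
q = np.ones(50)
q[25] = 1e-6
q /= q.sum()
p = np.full(50, 0.8 / 49)
p[25] = 0.2

max_ratio = np.max(p / q)                     # pointwise density ratio
chi2_full = np.sum(p ** 2 / q) - 1.0          # unrestricted Pearson chi^2

feats = np.stack([np.ones_like(x), x], axis=1)  # linear class: f(x) = w0 + w1 * x
chi2_lin = restricted_chi2_linear(feats, p, q)

print(f"max density ratio      {max_ratio:.3g}")
print(f"unrestricted chi^2     {chi2_full:.3g}")
print(f"restricted chi^2       {chi2_lin:.3g}")
```

In this toy setting the unrestricted divergence is dominated by the single poorly covered point, whereas the linear class cannot isolate that point, so its restricted divergence stays small.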

4. Practical Implementation and Variants

FQE is agnostic to the choice of function class $\mathcal{V}$; it is commonly instantiated with deep neural networks (convolutional or feed-forward), linear regression, or trees, using standard supervised regression infrastructure. For practical scaling:

  • Network design: In visually complex domains, CNNs can exploit low intrinsic manifold dimension for sample-efficient learning, with architecture hyperparameters tuned to balance approximation and estimation error (Ji et al., 2022).
  • Training protocol: Mean-squared regression is the standard learning objective, usually optimized with Adam or SGD, with the number of regression steps chosen by monitoring convergence (Kondrup et al., 2022).
  • Regularization: No additional importance weighting or correction is typically used; the method is purely regression-based.
  • Hyperparameter selection: Recent work proposes distribution-mismatch aware criteria for choosing function class, regularization strength, and iteration count, with explicit control on suboptimality (Miyaguchi, 2022).
  • Representation learning: OPE-specific encoders based on behavioral similarity metrics can significantly reduce divergence and boost data efficiency by clustering state-actions with similar Q-value contributions (Pavse et al., 2023).

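A high-level outline, following the discounted update in Section 1:

  1. Initialize $Q_0$ (for example, $Q_0 \equiv 0$).
  2. For $k = 0, 1, \dots$ until the fit stabilizes: form the targets $y_i = r_i + \gamma \mathbb{E}_{a'\sim\pi(\cdot|s'_i)} Q_k(s'_i, a')$ for every logged transition, then fit $Q_{k+1}$ by least-squares regression of $y_i$ on $(s_i, a_i)$ over the function class.
  3. Return the average of the final Q-function over initial states $s\sim\xi$ and actions $a\sim\pi(\cdot|s)$.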
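The iterative discounted procedure can be sketched in a few lines of NumPy. This is a minimal illustration with a linear function class $Q(s,a) = \phi(s,a)^\top w$ fitted by ridge-stabilized least squares, not a reference implementation; the function name `fqe_linear`, the feature table `phi`, the toy MDP, and the small `ridge` term are all assumptions made for the example. One-hot features are used in the demo so that the class is Bellman-complete.

```python
import numpy as np

def fqe_linear(phi, s, a, r, s_next, pi, gamma, n_iters=200, ridge=1e-8):
    """Iterative FQE with a linear class Q(s, a) = phi[s, a] @ w.

    phi: (n_states, n_actions, d) feature table; pi: (n_states, n_actions).
    """
    d = phi.shape[-1]
    X = phi[s, a]                                        # (N, d)
    # Expected next features under the target policy: sum_a' pi(a'|s') phi(s', a')
    X_next = np.einsum("na,nad->nd", pi[s_next], phi[s_next])
    G = X.T @ X + ridge * np.eye(d)
    w = np.zeros(d)                                      # Q_0 = 0
    for _ in range(n_iters):
        y = r + gamma * X_next @ w                       # Bellman targets built from Q_k
        w = np.linalg.solve(G, X.T @ y)                  # least-squares fit gives Q_{k+1}
    return w

# Hypothetical toy problem: random MDP, uniform behavior policy.
rng = np.random.default_rng(1)
S, A, N, gamma = 4, 2, 50000, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)
xi = np.full(S, 1.0 / S)
phi = np.eye(S * A).reshape(S, A, S * A)                 # one-hot features

s = rng.integers(S, size=N)
a = rng.integers(A, size=N)
u = rng.uniform(size=N)
s_next = np.minimum((u[:, None] > np.cumsum(P[s, a], axis=1)).sum(axis=1), S - 1)
r = R[s, a]

w = fqe_linear(phi, s, a, r, s_next, pi, gamma)
Q_hat = phi @ w                                          # (S, A) table of fitted values
v_hat = xi @ (pi * Q_hat).sum(axis=1)

# Exact value from the Bellman equation Q = R + gamma * P_pi Q, for comparison.
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
Q_true = np.linalg.solve(np.eye(S * A) - gamma * P_pi, R.reshape(-1)).reshape(S, A)
v_true = xi @ (pi * Q_true).sum(axis=1)
print(f"FQE estimate {v_hat:.3f} vs exact value {v_true:.3f}")
```

With a neural function class, the `np.linalg.solve` line would be replaced by several gradient steps on the same squared loss, with the targets `y` held fixed during each fit.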

5. Distributional and Robust Extensions

Distributional FQE

FQE has been extended to estimation of the full return distribution (“fitted distributional evaluation”, FDE). Instead of minimizing a squared Bellman error on scalars, FDE minimizes statistical divergences (e.g., Cramér, MMD, KL) between return distributions under empirical Bellman backups and model predictions (Hong et al., 24 Jun 2025). Population-level contraction and error bounds, analogous to scalar FQE, are established under suitable divergences.

Robustness to Confounding

Robust FQE methods have been proposed for settings with sequentially exogenous unobserved confounders, combining robust Bellman operators (for rectangular ambiguity sets) and orthogonalized estimation to ensure consistency under policy evaluation with unmeasured confounding. Algorithmic implementations use two-stage regression (quantile estimation for adverse transitions, followed by bias-corrected Bellman targets) to maintain the convergence rates reported in (Bruns-Smith et al., 2023).

Elimination of Bellman Completeness

A major limitation of classical FQE is the need for Bellman-completeness, i.e., closure of the function class under the Bellman operator. Recent advances propose stationary weighting (SW-FQE), where each regression step is reweighted by an estimate of the stationary density ratio between the target policy and the behavior policy, thus restoring geometric contraction without any Bellman completeness or dual/primal realizability assumptions. Error bounds avoid geometric blow-up, being additive rather than multiplicative in function-approximation and statistical error (Laan et al., 29 Dec 2025).
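The reweighting idea can be sketched as a weighted least-squares version of a single regression step. The sketch below takes the per-transition weights as given; how SW-FQE estimates the stationary density ratio is not shown, and the function name `weighted_fqe_step`, the random features, and the placeholder `weights` are hypothetical.

```python
import numpy as np

def weighted_fqe_step(X, X_next, r, w_prev, weights, gamma, ridge=1e-8):
    """One linear FQE regression step with per-transition weights.

    X, X_next: (N, d) current features and expected next features under the
    target policy; weights: (N,) nonnegative weights (assumed given here).
    """
    y = r + gamma * X_next @ w_prev              # Bellman targets from the previous iterate
    Xw = X * weights[:, None]                    # rows scaled by their weights
    G = X.T @ Xw + ridge * np.eye(X.shape[1])    # weighted Gram matrix
    return np.linalg.solve(G, Xw.T @ y)          # weighted least-squares solution

# Hypothetical data, only to exercise the step.
rng = np.random.default_rng(3)
N, d, gamma = 500, 3, 0.9
X = rng.normal(size=(N, d))
X_next = rng.normal(size=(N, d))
r = rng.normal(size=N)
w_prev = rng.normal(size=d)
weights = rng.uniform(0.5, 2.0, size=N)          # stand-in for estimated density ratios

y = r + gamma * X_next @ w_prev
w_unit = weighted_fqe_step(X, X_next, r, w_prev, np.ones(N), gamma)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

w_weighted = weighted_fqe_step(X, X_next, r, w_prev, weights, gamma)
sq = np.sqrt(weights)
w_ref = np.linalg.lstsq(X * sq[:, None], y * sq, rcond=None)[0]
```

With unit weights the step coincides with the ordinary FQE regression; with non-unit weights it changes the norm in which the Bellman target is projected onto the function class, which is the mechanism the paragraph above refers to.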

6. Empirical Performance and Application Domains

FQE has shown strong empirical performance across high-dimensional continuous and discrete domains, including control benchmarks (e.g., CartPole, MountainCar, MuJoCo), and health-care settings with sparse or high-stakes rewards (e.g., mechanical ventilation and sepsis management in clinical datasets) (Kondrup et al., 2022, Hao et al., 2021). Empirical evidence indicates:

  • Policy-value estimates converge to true Monte Carlo returns as the data size increases.
  • Performance depends sharply on intrinsic state-action manifold dimension and smoothness, not ambient dimension (Ji et al., 2022).
  • FQE is robust to moderate-to-severe distribution shift when using smooth function classes or OPE-tailored representations (Pavse et al., 2023).
  • Bootstrapped FQE CIs provide tight, reliable coverage relative to importance-sampling or concentration-based methods, with computational acceleration feasible via subsampling (Hao et al., 2021).
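A percentile-bootstrap interval around a linear FQE estimate can be sketched as follows. This is a simplified illustration that resamples individual transitions with replacement, assuming they are independent; it is not necessarily the resampling scheme of (Hao et al., 2021). The helper names, the toy MDP, and the one-hot features are hypothetical.

```python
import numpy as np

def fqe_value(X, X_next, r, x0, gamma, n_iters=300, ridge=1e-8):
    """Linear FQE value estimate; x0 is the expected initial feature under xi and pi."""
    d = X.shape[1]
    G = X.T @ X + ridge * np.eye(d)
    c = np.linalg.solve(G, X.T @ r)            # regression of rewards
    T = np.linalg.solve(G, X.T @ X_next)       # regression of expected next features
    w = np.zeros(d)
    for _ in range(n_iters):
        w = c + gamma * T @ w                  # one least-squares FQE step in closed form
    return x0 @ w

def bootstrap_ci(X, X_next, r, x0, gamma, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap over transitions (assumes independent transitions)."""
    rng = np.random.default_rng(seed)
    n = len(r)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(n, size=n)          # resample with replacement
        stats[b] = fqe_value(X[idx], X_next[idx], r[idx], x0, gamma)
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# Hypothetical toy problem: random MDP, uniform behavior policy, one-hot features.
rng = np.random.default_rng(2)
S, A, N, gamma = 4, 2, 8000, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)
xi = np.full(S, 1.0 / S)
phi = np.eye(S * A).reshape(S, A, S * A)

s = rng.integers(S, size=N)
a = rng.integers(A, size=N)
u = rng.uniform(size=N)
s_next = np.minimum((u[:, None] > np.cumsum(P[s, a], axis=1)).sum(axis=1), S - 1)

X = phi[s, a]
X_next = np.einsum("na,nad->nd", pi[s_next], phi[s_next])
r = R[s, a]
x0 = np.einsum("s,sa,sad->d", xi, pi, phi)

v_hat = fqe_value(X, X_next, r, x0, gamma)
lo, hi = bootstrap_ci(X, X_next, r, x0, gamma)
print(f"FQE estimate {v_hat:.3f}, 95% bootstrap interval [{lo:.3f}, {hi:.3f}]")
```

Each bootstrap replicate reruns the full FQE fit, so the cost grows linearly with `n_boot`; drawing fewer than `n` transitions per replicate is one way to trade accuracy for speed, in the spirit of the subsampling acceleration mentioned above.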

An illustrative evaluation from (Kondrup et al., 2022) (ventilator management):

Policy              FQE-estimated return (mean ± SE)
Physician           0.502 ± 0.007
Behavior Cloning    0.572 ± 0.002
DeepVent⁻           0.729 ± 0.002
DeepVent            0.743 ± 0.005

7. Challenges, Hyperparameter Selection, and Open Directions

The main technical and practical bottlenecks are:

  • Sensitivity to function-class mismatch when distribution shift is severe or sample sizes are small, motivating restricted divergence-based error control (Ji et al., 2022).
  • Hyperparameter selection for capacity, regularization, and iterations can have significant impact on both estimation and generalization. Recent frameworks provide distribution-mismatch-tolerant, criterion-based selection procedures with explicit error bounds and computational tradeoffs (Miyaguchi, 2022).
  • In high-stakes or confounded domains, robust and orthogonalized variants of FQE yield valid uncertainty quantification and reduced sensitivity to unmeasured biases (Bruns-Smith et al., 2023).
  • Ongoing research extends FQE/FDE to richer, structured returns and nonstationary domains, and explores adaptive representations and kernel-based regression to further improve statistical and computational performance.

FQE thus provides a theoretically principled and empirically validated methodology for offline evaluation in reinforcement learning. Ongoing developments aim to relax completeness requirements, strengthen robustness, and automate hyperparameter selection to improve reliability and interpretability in practical deployment (Ji et al., 2022, Wang et al., 2024, Laan et al., 29 Dec 2025, Miyaguchi, 2022).
