Maximum Likelihood Reinforcement Learning
- MaxRL is a reinforcement learning framework that directly maximizes the likelihood of desired outcomes using sample-based or model-based maximum likelihood estimation.
- It employs statistical techniques such as logistic likelihood modeling and reward/value-biased MLE to improve exploration and policy robustness.
- Empirical results show that MaxRL methods can yield significant gains, including up to 20× improvements in compute efficiency for reaching correct solutions and lower regret.
Maximum Likelihood Reinforcement Learning (MaxRL) is a statistical and computational framework for reinforcement learning (RL) that aims to directly maximize the likelihood of observed or desired outcomes, typically through sample-based or model-based maximum likelihood estimation. In contrast to traditional RL objectives that maximize expected cumulative reward, MaxRL leverages probabilistic residual modeling, maximum likelihood criteria, and principled optimism to enhance learning efficiency, robustness, and adaptivity across diverse settings, including discrete, continuous-time, and high-dimensional environments.
1. Conceptual Foundations and Formal Objective
MaxRL formalizes the learning task as maximizing the likelihood of successful or desired trajectories under a parameterized model. For "correctness-based" RL, this corresponds to maximizing the marginal probability that a policy-controlled rollout yields a correct solution. Concretely, let $\pi_\theta(z \mid x)$ denote the model policy over latent rollouts $z$ given input $x$, and let $f$ map the rollout to an output $f(z)$ compared to target $y$. The pass rate is

$$p_\theta(y \mid x) = \Pr_{z \sim \pi_\theta(\cdot \mid x)}\left[f(z) = y\right].$$

The maximum-likelihood objective is

$$\max_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\log p_\theta(y \mid x)\right],$$

where $\mathcal{D}$ is the task distribution. This contrasts with standard policy-gradient RL, which maximizes the expected pass rate $\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[p_\theta(y \mid x)\right]$. Since $\nabla_\theta \log p_\theta = \nabla_\theta p_\theta / p_\theta$, MaxRL re-weights gradient updates by $1/p_\theta(y \mid x)$, emphasizing rare or hard-to-reach successes (Tajwar et al., 2 Feb 2026).
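The effect of the inverse-pass-rate reweighting described above can be illustrated with a minimal numeric sketch; the pass rates and raw gradients below are purely illustrative:

```python
import numpy as np

# Toy illustration: two tasks with pass rates p under the current policy.
# Standard RL ascends E[p]; MaxRL ascends E[log p], which re-weights each
# task's gradient contribution by 1/p.
p = np.array([0.9, 0.01])          # pass rates: an easy vs. a rare-success task
grad_p = np.array([1.0, 1.0])      # assume equal raw pass-rate gradients

grad_standard = grad_p             # gradient of E[p]: treats both tasks alike
grad_maxrl = grad_p / p            # gradient of E[log p] = (1/p) * dp/dtheta

print(grad_standard)  # [1. 1.]
print(grad_maxrl)     # [  1.11 100.  ] -- the hard task dominates the update
```

The 100× weight on the hard task is exactly the mechanism that emphasizes rare successes.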
MaxRL principles are also instantiated in model-based RL, where unknown Markov Decision Process (MDP) dynamics are estimated by maximizing the likelihood of observed transitions, often with additional exploration or reward biases to resolve identifiability and exploration-exploitation trade-offs (Mete et al., 2020, Hung et al., 2023).
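In the model-based setting, the unbiased MLE of a tabular transition kernel is just the normalized transition counts; a minimal sketch (the function name is ours):

```python
import numpy as np

# Sketch: maximum-likelihood estimation of an unknown tabular MDP
# transition kernel from observed (s, a, s') triples. For a categorical
# model, the MLE is the normalized count matrix.
def mle_transitions(transitions, n_states, n_actions):
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=-1, keepdims=True)
    # Uniform fallback for unvisited (s, a) pairs -- the identifiability /
    # exploration gap that reward- or value-biased variants address.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

data = [(0, 0, 1), (0, 0, 1), (0, 0, 0), (1, 0, 1)]
P_hat = mle_transitions(data, n_states=2, n_actions=1)
print(P_hat[0, 0])  # [0.333... 0.666...]
```

The plain MLE says nothing about state-action pairs the current policy never visits, which is precisely why the biased variants below add an optimism term.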
2. Methodological Variants and Statistical Modeling
MaxRL admits multiple variants depending on the statistical model choice and problem structure. A central methodological dimension is the explicit modeling of residuals or system errors with appropriate probability distributions, followed by likelihood maximization:
- Bellman Error Distribution Modeling: In Q-learning, the temporal-difference (TD) or Bellman error
$$\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$$
can be empirically observed to follow a specific distribution. For instance, "Logistic Likelihood Q-Learning" (LLQL) models $\delta$ with a Logistic law, replacing the typical MSE loss with the negative log-likelihood under the Logistic pdf:
$$\mathcal{L}_{\mathrm{LL}}(\delta) = \frac{\delta}{s} + 2\log\!\left(1 + e^{-\delta/s}\right) + \log s,$$
where $s$ is a scale parameter (Lv et al., 2023). This improves robustness over traditional Gaussian-based MSE, particularly in the presence of heavy-tailed errors.
- Reward- or Value-Biased Maximum Likelihood: In unknown MDPs, standard maximum likelihood estimation of transition parameters tends to stall at suboptimal policies due to closed-loop identifiability. MaxRL variants such as Reward-Biased MLE (RBMLE) and Value-Biased MLE (VBMLE) add an explicit optimism bias. For instance, RBMLE augments the log-likelihood by a reward-bias term:
$$\theta_t \in \arg\max_\theta \left[\log \mathcal{L}_t(\theta) + \alpha(t)\, J^*(\theta)\right],$$
where $J^*(\theta)$ is the optimal average reward under parameter $\theta$ (Mete et al., 2020). VBMLE adapts this for linearly parameterized transition models, leveraging
$$\theta_t \in \arg\max_\theta \left[\log \mathcal{L}_t(\theta) + \beta(t)\, V^*_\theta\right],$$
where $V^*_\theta$ is the optimal value function given $\theta$ (Hung et al., 2023).
- Continuous-Time MaxRL: In continuous-time RL (CTRL), MaxRL estimates the conditional marginal transition densities of the underlying stochastic differential equations by MLE, often via surrogate objectives such as score matching, which fits the gradient of the log transition density rather than the density itself. Actions are then chosen by optimism over the confidence set of plausible models (Zhao et al., 4 Aug 2025).
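The Logistic negative log-likelihood used by LLQL above can be sketched numerically; the function names and the unit scale are ours, and the location is taken as zero:

```python
import numpy as np

# Sketch of the LLQL-style loss: negative log-likelihood of the TD error
# under a zero-mean Logistic distribution with scale s, used in place of
# the Gaussian-implied MSE.
def logistic_nll(delta, s=1.0):
    z = delta / s
    # -log[ exp(-z) / (s * (1 + exp(-z))^2) ], written stably via logaddexp
    return np.mean(z + 2.0 * np.logaddexp(0.0, -z) + np.log(s))

def mse(delta):
    return np.mean(delta ** 2)

td_errors = np.array([0.1, -0.2, 3.0])   # one heavy-tailed outlier
print(logistic_nll(td_errors), mse(td_errors))
# The logistic NLL grows roughly linearly in |delta|, so the outlier is
# penalized far less aggressively than under the quadratic MSE.
```

The linear (rather than quadratic) tail of the loss is what buys the robustness to heavy-tailed Bellman errors.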
3. Algorithmic Implementations
Algorithmic realization of MaxRL is tightly coupled to the statistical modeling and computational properties of the domain:
- Truncated Compute-Indexed Objectives: In correctness-based RL, MaxRL deploys a compute-indexed surrogate objective defined over a finite per-task rollout budget, which converges to the full maximum-likelihood objective $\mathbb{E}\left[\log p_\theta(y \mid x)\right]$ as the budget grows (Tajwar et al., 2 Feb 2026). Unbiased policy-gradient estimators are constructed by normalizing gradient contributions by the number of successful rollouts.
- Logistic Bellman Error Regression (LLQL): For LLQL, standard Q-learning is modified by simply swapping the MSE loss for the Logistic negative log-likelihood (LLoss), with no change to the Bellman backup or target-network updates. Full pseudocode clarifies the batching, target computation, and gradient steps, and the batch size is set by a bias-variance analysis of the empirical logistic CDF (Lv et al., 2023).
- Optimism in MLE Planning: RBMLE and VBMLE select parameters and policies by maximizing a likelihood-plus-reward or likelihood-plus-value criterion, efficiently trading off model fit against optimism. Algorithmic schedules typically involve episodic updates, an adaptive bias weight that grows slowly with time, and convex or dual-coordinate ascent solves.
- Continuous-Time MaxRL with Randomized Measurement: The confidence set over transition models is periodically updated as more data is collected, and a randomized measurement grid is introduced to improve trajectory integral estimates while maintaining sample efficiency (Zhao et al., 4 Aug 2025).
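The success-normalized gradient estimator mentioned above can be sketched as a plug-in score-function estimator; note this simple ratio form is our illustration, not necessarily the paper's exact unbiased construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: plug-in score-function estimator of grad log p(success) from n
# rollouts. c[i] marks rollout i's success; g[i] is its score term
# grad log pi(trajectory_i). Normalizing by the success count (rather
# than by n) targets grad log p instead of grad p.
def grad_log_pass_rate(c, g):
    n_success = c.sum()
    if n_success == 0:
        return np.zeros_like(g[0])     # no learning signal this batch
    return (c[:, None] * g).sum(axis=0) / n_success

c = (rng.random(64) < 0.1).astype(float)   # ~10% pass rate
g = rng.normal(size=(64, 3))               # stand-in per-rollout score gradients
print(grad_log_pass_rate(c, g))
```

Dividing by the number of successes rather than the batch size is what realizes the $1/p$ reweighting at the estimator level.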
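The likelihood-plus-reward selection rule of RBMLE can be sketched over a finite candidate set; the candidate values and the $\log t$ bias schedule below are illustrative assumptions:

```python
import numpy as np

# Sketch of the RBMLE selection rule: pick the model maximizing
# log-likelihood plus a growing reward bias alpha(t) * J*(theta), where
# J*(theta) is the optimal average reward under candidate theta.
def rbmle_select(log_liks, opt_rewards, t, alpha=np.log):
    scores = log_liks + alpha(t) * opt_rewards
    return int(np.argmax(scores))

log_liks = np.array([-10.0, -12.0])    # candidate 1 fits the data worse...
opt_rewards = np.array([1.0, 2.0])     # ...but promises higher reward
print(rbmle_select(log_liks, opt_rewards, t=2))    # 0: early on, fit wins
print(rbmle_select(log_liks, opt_rewards, t=100))  # 1: later, optimism wins
```

The slowly growing bias lets early decisions follow the data while still forcing eventual exploration of high-reward plausible models.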
4. Theoretical Guarantees and Statistical Analysis
MaxRL frameworks have established stringent regret bounds and statistical convergence results:
- Regret in Model-Based MaxRL: RBMLE achieves $\tilde{O}(\sqrt{T})$ regret for ergodic finite-state MDPs, outperforming UCRL2 and Thompson Sampling in tabular benchmarks (Mete et al., 2020). VBMLE attains $\tilde{O}(d\sqrt{T})$ regret for linear MDPs, with $d$ the feature dimension, verified by supermartingale arguments and value-difference lemmas (Hung et al., 2023).
- Distributional Validation in Bellman Error Modeling: LLQL demonstrates through Kolmogorov-Smirnov tests that the empirical Bellman error is much better fit by a Logistic distribution than by Gaussian or Gumbel alternatives, yielding lower SSE/RMSE and higher $R^2$ (Lv et al., 2023).
- Instance-Dependent Bounds in Continuous-Time Domains: In continuous-time RL, the regret of CT-MLE scales with the sum of reward variances across executed policies and is adaptive to measurement grids, with an explicit measurement penalty that can be neutralized by matching sampling granularity to stochasticity (Zhao et al., 4 Aug 2025).
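The Kolmogorov-Smirnov comparison above can be sketched with synthetic data; here logistic draws stand in for real TD residuals, and the fits are moment-matched rather than full MLE (a simplification):

```python
import numpy as np
from math import erf

# Sketch: one-sample KS statistic comparing Logistic vs. Gaussian fits.
def ks_statistic(samples, cdf):
    x = np.sort(samples)
    n = len(x)
    F = cdf(x)
    return max(np.max(np.arange(1, n + 1) / n - F),
               np.max(F - np.arange(n) / n))

rng = np.random.default_rng(1)
errors = rng.logistic(loc=0.0, scale=1.0, size=20000)  # synthetic "Bellman errors"

mu, sd = errors.mean(), errors.std()
s = sd * np.sqrt(3) / np.pi                            # logistic sd = s*pi/sqrt(3)
logistic_cdf = lambda x: 1.0 / (1.0 + np.exp(-(x - mu) / s))
normal_cdf = lambda x: np.array(
    [0.5 * (1.0 + erf(v)) for v in (x - mu) / (sd * np.sqrt(2.0))])

d_logistic = ks_statistic(errors, logistic_cdf)
d_normal = ks_statistic(errors, normal_cdf)
print(d_logistic, d_normal)   # the better-fitting family has the smaller D
```

On logistic-distributed errors the logistic fit yields the smaller KS distance, mirroring the validation LLQL reports on real Bellman errors.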
5. Empirical Performance and Practical Impact
Across a variety of environments and evaluation criteria, MaxRL and its instantiations demonstrate robust improvements:
- LLQL leads to systematic gains in both online (Gym MuJoCo) and offline (D4RL) RL regimes, with consistent mean-reward improvements over SAC online and over CQL and IQL in offline tasks. LLoss consistently outperforms MSELoss over a wide range of hyperparameters and maintains stability throughout training (Lv et al., 2023).
- MaxRL for correctness-based rollouts yields up to 20× test-time scaling efficiency compared to GRPO, attains higher pass@$k$ in maze navigation and mathematical problem solving, and matches or exceeds cross-entropy supervised baselines in differentiable control (Tajwar et al., 2 Feb 2026).
- RBMLE and VBMLE substantially outperform UCRL2 and regression-based methods in cumulative regret and computation time for tabular and linear MDPs, scaling favorably to large state-action spaces and featuring robust parameter estimation (Mete et al., 2020, Hung et al., 2023).
- Continuous-Time CT-MLE automatically adapts measurement effort to environment stochasticity: the empirically optimal grid size scales with the noise variance, and the episode count needed to reach a given suboptimality decays as predicted by the theoretical analysis (Zhao et al., 4 Aug 2025).
6. Generalization, Connections, and Outlook
MaxRL unifies several lines of research in RL, statistics, and online learning:
- Connection to Exponential Families and Distributional RL: MaxRL can be viewed as fitting parametric exponential-family models (e.g., Gaussian, Logistic) to TD errors or transition statistics, providing a pathway to principled generalization in heavy-tailed or non-Gaussian regimes (Lv et al., 2023).
- Optimism in the Face of Uncertainty: The reward/value bias in MaxRL formalizes optimism-based exploration as an instance of maximum-likelihood preference for high-reward (or high-value) plausible models, aligning with OFU principles in finite- and linear-MDP control (Mete et al., 2020, Hung et al., 2023).
- Online Learning and Portfolio Theory Links: VBMLE's MLE update in linear MDPs parallels Follow-the-Leader portfolio updates, with empirical regret controlled via supermartingale-based deviation bounds (Hung et al., 2023).
- Adaptivity and Scalability: The MaxRL approach in continuous-time settings provides an adaptive allocation of sample complexity by matching observation density to observed reward variance, resulting in instance-optimal learning curves (Zhao et al., 4 Aug 2025).
- Potential for Large-Scale RLHF: MaxRL's alignment of learning signals with true log success-probabilities allows deployment in RLHF pipelines, supporting robust training even with black-box, non-differentiable success signals, while preventing mode collapse and improving scaling with compute (Tajwar et al., 2 Feb 2026).
Further research directions include scalable MaxRL under function approximation, adaptive bias scheduling in highly nonstationary environments, and principled combination with posterior/post-hoc heuristics in deep and offline RL settings.