Follow-The-Regularized-Leader (FTRL)

Updated 27 January 2026

Follow-The-Regularized-Leader (FTRL) is an online convex optimization paradigm that minimizes cumulative loss through regularized empirical risk minimization.
The framework employs adaptive learning-rate strategies like Stability–Penalty Matching to balance exploration and exploitation in adversarial, stochastic, and game-theoretic settings.
FTRL’s robust design uses carefully chosen regularizers and geometry-aware updates to achieve both minimax and instance-optimal regret guarantees in diverse applications.

Follow-The-Regularized-Leader (FTRL) is a foundational paradigm in online learning and online convex optimization, providing a unifying framework for algorithms that achieve low regret in adversarial, stochastic, and hybrid environments. Its modern theoretical and algorithmic development is characterized by advanced learning-rate adaptation, competitive analysis, and applicability to a wide range of sequential decision problems including multi-armed bandits, linear and contextual bandits, and learning in games.

1. Formal Framework and Algorithmic Structure

The FTRL methodology operates over a convex decision set $\mathcal{K} \subseteq \mathbb{R}^d$ by repeatedly solving a regularized empirical risk minimization problem. After observing the losses of rounds $1, \dots, t$, the algorithm selects the next iterate

$x_{t+1} \in \arg\min_{x \in \mathcal{K}} \left\{ \sum_{s=1}^t \langle g_s, x \rangle + \frac{1}{\eta_t} R(x) \right\}$

where $g_s$ is the subgradient or loss vector revealed at round $s$, $R : \mathcal{K} \to \mathbb{R}$ is a strongly convex regularizer (e.g., negative entropy, squared Euclidean norm), and $\eta_t > 0$ is a learning rate that may vary over time (Ito et al., 2024). This generic template admits a broad spectrum of instantiations:

Full-information: $g_s$ is the loss or gradient over the decision variable.
Bandit feedback: $g_s$ is constructed via unbiased importance-weighted estimators.
Game-theoretic settings: $g_s$ may represent payoffs in adversarial or competitive games.

The overall update encapsulates both a cumulative loss minimization (via $\sum_s \langle g_s, x \rangle$) and an exploratory/stabilizing regularization (via $R(x)/\eta_t$), allowing explicit control over the exploitation-exploration trade-off and adaptation to problem geometry and feedback structure (Ahn et al., 2024, Moridomi et al., 2017).
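As a concrete instance of the template, the following minimal sketch takes $\mathcal{K}$ to be the probability simplex, $R$ the negative Shannon entropy, and a fixed learning rate. Under those assumptions the minimization has the well-known closed form $x_i \propto \exp(-\eta \sum_s g_{s,i})$. The function names and the toy loss sequence are illustrative choices, not part of any cited algorithm.

```python
import math

def ftrl_entropy_step(cum_loss, eta):
    """Closed-form FTRL iterate on the simplex with R(x) = sum_i x_i log x_i.

    Solves argmin_x <cum_loss, x> + R(x) / eta, whose solution satisfies
    x_i proportional to exp(-eta * cum_loss_i).
    """
    shift = min(cum_loss)  # subtract the minimum for numerical stability
    weights = [math.exp(-eta * (c - shift)) for c in cum_loss]
    total = sum(weights)
    return [w / total for w in weights]

def run_ftrl(loss_vectors, eta):
    """Play FTRL against a sequence of loss vectors; return iterates and total loss."""
    d = len(loss_vectors[0])
    cum_loss = [0.0] * d
    iterates, total = [], 0.0
    for g in loss_vectors:
        x = ftrl_entropy_step(cum_loss, eta)  # depends on earlier rounds only
        iterates.append(x)
        total += sum(gi * xi for gi, xi in zip(g, x))
        cum_loss = [c + gi for c, gi in zip(cum_loss, g)]
    return iterates, total

# Toy full-information run: coordinate 1 always has zero loss.
losses = [[1.0, 0.0, 0.5]] * 50
iterates, total = run_ftrl(losses, eta=0.5)
```

The first iterate is uniform because no loss has been observed yet, and later iterates concentrate on the coordinate with the smallest cumulative loss. Larger `eta` weakens the regularization and makes that concentration faster.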

2. Regret Decomposition, Competitive Analysis, and Learning-rate Adaptation

Standard FTRL analysis yields regret bounds of the form

$\operatorname{Regret}_T \leq \sum_{t=1}^T [\eta_t z_t + (1/\eta_t - 1/\eta_{t-1}) h_t]$

where $z_t$ measures stability (the sensitivity of the iterates to the incremental loss) and $h_t$ is the penalty component, which is weighted by the change in regularization strength between consecutive steps. Defining $\beta_t = 1/\eta_t$, this can be written as

$F(\beta_{1:T}; z_{1:T}, h_{1:T}) = \sum_{t=1}^T \left[ \frac{z_t}{\beta_t} + (\beta_t - \beta_{t-1}) h_t \right]$

and the optimal "offline" learning-rate schedule is obtained by minimizing $F$ over nondecreasing sequences $\beta_{1:T}$. The "competitive ratio" framework compares the performance of an online learning-rate policy $\pi$ against the offline optimum, with the ratio

$\mathrm{CR}(\pi; z_{1:T}, h_{1:T}) = F^{\pi} / F^*$

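where $F^{\pi}$ is the value of $F$ under the schedule chosen by $\pi$ and $F^*$ is the offline minimum. The analysis characterizes how hard adaptation becomes as the penalty terms $h_t$ depart from monotonicity (Ito et al., 2024).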
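A short worked case may help calibrate the offline benchmark. Assume, purely for illustration, constant sequences $z_t = z > 0$ and $h_t = h > 0$ together with the convention $\beta_0 = 0$ (the text above does not fix $\beta_0$, so this is an assumption). The penalty terms then telescope, and for any nondecreasing schedule

$F(\beta_{1:T}) = \sum_{t=1}^T \frac{z}{\beta_t} + h \beta_T \geq \frac{T z}{\beta_T} + h \beta_T \geq 2 \sqrt{T z h}$

where the first inequality uses $\beta_t \leq \beta_T$ and the second is the AM-GM inequality. Both hold with equality for the constant schedule $\beta_t = \sqrt{T z / h}$, so in this special case $F^* = 2 \sqrt{T z h}$, which recovers the familiar $\sqrt{T}$ scaling of a well-tuned fixed learning rate $\eta = \sqrt{h / (T z)}$.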

A critical advance is the development of adaptive learning-rate update rules such as Stability–Penalty Matching (SPM), which recursively select $\beta_t$ so that the stability and penalty contributions of round $t$ are equal:

$\frac{z_t}{\beta_t} = (\beta_t - \beta_{t-1}) h_t \quad \text{or equivalently} \quad \beta_t = \frac{\beta_{t-1} + \sqrt{\beta_{t-1}^2 + 4 z_t / h_t}}{2}$

In terms of the learning rate $\eta_t = 1/\beta_t$, the same condition reads $\eta_t z_t = (\eta_t^{-1} - \eta_{t-1}^{-1}) h_t$, that is, $\eta_t = \eta_{t-1} \cdot \frac{2}{1 + \sqrt{1 + 4 z_t \eta_{t-1}^2 / h_t}}$.

This matching principle yields a competitive ratio within a constant factor of the lower bound for any sequence $h_{1:T}$ with bounded "approximate monotonicity," characterized by a parameter $\xi$ such that $h_1 \ge \xi h_t$ for all $t$ (Ito et al., 2024).

If $h_t$ is nonincreasing ($\xi=1$), the competitive gap is constant; otherwise, it scales as $\Theta(\sqrt{\xi})$.
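The matching idea can be sketched numerically. The code below works in terms of $\beta_t = 1/\eta_t$ and, at each round, picks the positive root of the condition that the two summands of $F$ are equal, $z_t/\beta_t = (\beta_t - \beta_{t-1}) h_t$. It is an illustrative sketch rather than the exact rule analyzed by Ito et al. (2024): it assumes $z_t, h_t > 0$ are known when $\beta_t$ is chosen and uses the convention $\beta_0 = 0$. For the toy input $z_t = h_t = 1$, the best nondecreasing schedule is the constant $\beta_t = \sqrt{T}$ with $F^* = 2\sqrt{T}$, which serves as the offline baseline.

```python
import math

def spm_schedule(z, h, beta0=0.0):
    """Return beta_1..beta_T with z_t / beta_t = (beta_t - beta_{t-1}) * h_t each round."""
    betas, prev = [], beta0
    for z_t, h_t in zip(z, h):
        # Positive root of beta^2 - prev * beta - z_t / h_t = 0.
        beta = 0.5 * (prev + math.sqrt(prev * prev + 4.0 * z_t / h_t))
        betas.append(beta)
        prev = beta
    return betas

def objective(betas, z, h, beta0=0.0):
    """Evaluate F = sum_t [ z_t / beta_t + (beta_t - beta_{t-1}) * h_t ]."""
    total, prev = 0.0, beta0
    for beta, z_t, h_t in zip(betas, z, h):
        total += z_t / beta + (beta - prev) * h_t
        prev = beta
    return total

T = 1000
z = [1.0] * T  # hypothetical constant stability terms
h = [1.0] * T  # hypothetical constant penalty terms
betas = spm_schedule(z, h)
f_spm = objective(betas, z, h)
f_offline = 2.0 * math.sqrt(T)  # constant beta = sqrt(T) is optimal for this input
ratio = f_spm / f_offline
```

Because the schedule is nondecreasing, `ratio` cannot fall below 1, and for this monotone input it stays bounded by a small constant as `T` grows, in line with the constant competitive gap stated for nonincreasing $h_t$.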

3. Regularizer Construction and Geometry Adaptation

The choice of the regularizer $R$ is pivotal. For general online linear optimization (OLO), the optimal regret constant is governed not only by strong convexity of $R$ with respect to a dual norm (induced by the geometry of $\mathcal{K}$ and the loss set $L$ ), but also by tight control of the regularizer's range over the feasible set. Recent algorithmic techniques construct piecewise-quadratic or smoothed barriers, whose strong-convexity constant and upper bound are tailored to the action and loss sets, achieving regret within a universal constant of the minimax optimum (Gatmiry et al., 2024).

For certain structured problems (e.g., simplexes, Euclidean balls, positive semidefinite cones), analytic barriers such as negative entropy, Burg entropy, or log-determinant regularizers are deployed to exploit intrinsic geometry and sparsity of the losses (Moridomi et al., 2017).
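To illustrate why both the strong-convexity norm and the range of the regularizer matter, the sketch below compares two regularizers on the probability simplex in dimension $d$: negative entropy (1-strongly convex with respect to the $\ell_1$ norm) and half the squared Euclidean distance to the uniform point (1-strongly convex with respect to the $\ell_2$ norm). It plugs each into the standard fixed-learning-rate bound $D/\eta + \eta G^2 T / 2$, which equals $\sqrt{2 D G^2 T}$ at the best $\eta$, where $D$ is the regularizer's range and $G$ bounds the dual norm of the losses. The loss model ($|g_i| \le 1$ per coordinate) and all names are assumptions made for the example.

```python
import math

def neg_entropy(x):
    """Negative Shannon entropy, shifted by log d so its minimum on the simplex is 0."""
    d = len(x)
    return sum(xi * math.log(xi) for xi in x if xi > 0.0) + math.log(d)

def sq_euclid(x):
    """Half the squared Euclidean distance to the uniform distribution."""
    d = len(x)
    return 0.5 * sum((xi - 1.0 / d) ** 2 for xi in x)

def tuned_bound(reg_range, grad_sq_bound, T):
    """sqrt(2 * D * G^2 * T): the value of D/eta + eta*G^2*T/2 at its minimizing eta."""
    return math.sqrt(2.0 * reg_range * grad_sq_bound * T)

d, T = 100, 10_000
vertex = [1.0] + [0.0] * (d - 1)      # both regularizers are maximized at a vertex
entropy_range = neg_entropy(vertex)   # equals log d
euclid_range = sq_euclid(vertex)      # equals (1 - 1/d) / 2

# With |g_i| <= 1: ||g||_inf^2 <= 1 (dual of l1) and ||g||_2^2 <= d (dual of l2).
entropy_bound = tuned_bound(entropy_range, 1.0, T)
euclid_bound = tuned_bound(euclid_range, float(d), T)
```

Under these assumptions the entropic bound scales as $\sqrt{2 T \log d}$ while the Euclidean one scales as $\sqrt{(d-1) T}$, so matching the regularizer to the simplex geometry reduces the dimension dependence from polynomial to logarithmic.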

Self-concordant regularizers, notably those used in the SCRiBLe algorithm, yield dimension-tight $O(d \sqrt{n \log n})$ regret rates for adversarial bandits over polytopes and ellipsoids, with Fenchel conjugacy and local norm properties critical for controlling variance and boundary behavior (Lévy et al., 28 Oct 2025).

4. Applications: Bandits, Best-of-Both-Worlds, and Beyond

FTRL is instantiated in several canonical problems:

Multi-Armed Bandits (MAB): With Tsallis or Shannon entropy regularizers, FTRL with adaptive learning rates achieves adversarial $O(\sqrt{KT})$ regret over $K$ arms (up to logarithmic factors in the Shannon case) and stochastic, gap-dependent $O(\sum_{i : \Delta_i > 0} \frac{\log T}{\Delta_i})$ regret, where $\Delta_i$ is the suboptimality gap of arm $i$, thereby satisfying the Best-of-Both-Worlds (BOBW) property (Ito et al., 2024, Zhan et al., 26 Oct 2025, Jin et al., 2023).
Graph-structured Bandits: By leveraging the independence number of the feedback graph, FTRL achieves $O(\sqrt{\zeta T})$ adversarial and $O(\zeta \log T)$ stochastic regret, with the learning rate and regularizer geometry matched to feedback structure (Ito et al., 2024).
Linear and Contextual Bandits: FTRL equipped with self-concordant barriers and second-order estimators achieves nearly instance-optimal rates, with $O(d \sqrt{T})$ or $O(d \log T)$ regret depending on adversarial or stochastic regimes (Ito et al., 2024, Kong et al., 2023, Lévy et al., 28 Oct 2025).
Partial Monitoring: Stability–Penalty–Bias (SPB) matching and related learning-rate selection schemes extend BOBW guarantees to indirect-feedback settings whose minimax regret is $\Theta(T^{2/3})$ (Tsuchiya et al., 2024).
Bandits with Structural Priors: Game-dependency, sparsity, and other prior knowledge can be incorporated through regularizer and learning-rate adaptation, yielding instance-dependent and structure-exploiting guarantees (Tsuchiya et al., 2023).
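For the multi-armed bandit case, the interplay between FTRL and importance-weighted loss estimates can be sketched as follows. The example uses the negative-entropy regularizer with a fixed learning rate, which is essentially the classical Exp3 scheme, not the adaptive BOBW schedules cited above. The Bernoulli loss model, the arm means, and the function names are hypothetical choices for illustration.

```python
import math
import random

def softmax_neg(cum_est, eta):
    """FTRL iterate on the simplex with the negative-entropy regularizer."""
    shift = min(cum_est)  # numerical stability
    w = [math.exp(-eta * (c - shift)) for c in cum_est]
    s = sum(w)
    return [wi / s for wi in w]

def bandit_ftrl(mean_losses, T, seed=0):
    """Entropic FTRL with importance-weighted loss estimates on Bernoulli arms."""
    rng = random.Random(seed)
    K = len(mean_losses)
    eta = math.sqrt(2.0 * math.log(K) / (K * T))  # fixed rate tuned for the adversarial bound
    cum_est = [0.0] * K
    pulls = [0] * K
    for _ in range(T):
        p = softmax_neg(cum_est, eta)
        arm = rng.choices(range(K), weights=p)[0]
        loss = 1.0 if rng.random() < mean_losses[arm] else 0.0
        # Only the pulled arm is observed. Dividing by p[arm] makes the estimate
        # unbiased: E[loss * 1{arm = i} / p_i] is the expected loss of arm i.
        cum_est[arm] += loss / p[arm]
        pulls[arm] += 1
    return pulls, softmax_neg(cum_est, eta)

pulls, final_p = bandit_ftrl([0.1, 0.5, 0.9], T=5000)
```

The estimator's variance grows as `p[arm]` shrinks, which is one reason the choice of regularizer and learning-rate schedule matters more under bandit feedback than under full information.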

For closely related adversarial/optimistic settings, FTRL-type algorithms can be interpreted as gradient-based prediction algorithms or mapped to distributionally robust FTPL variants, connecting regret-optimal potential functions and efficiently implementing updates via bisection or sampling (Li et al., 2024).

5. Theoretical Guarantees and Lower Bounds

General FTRL regret bounds decompose into stability and penalty contributions; their optimization is fundamentally limited by the monotonicity of regularization parameters. The sharpest known lower bound, proved via competitive analysis, states that no online learning-rate policy can achieve better than $\Omega(\sqrt{\xi})$ times the offline optimum in the worst case of $\xi$-approximately monotone penalty terms. SPM-based rules attain $O(\sqrt{\xi})$-competitive regret, which is tight (Ito et al., 2024).

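For a fixed learning rate $\eta$ and a regularizer $R$ that is 1-strongly convex with respect to a norm $\|\cdot\|$ and normalized so that $\min_{x \in \mathcal{K}} R(x) = 0$, the standard regret bound with subgradients $g_t$ (measured in the dual norm $\|\cdot\|_*$) is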

$\operatorname{Regret}_T(x^*) \leq \frac{R(x^*)}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|_*^2$

Adapting $\eta$ via SPM or similar schemes refines the trade-off and yields data-dependent regret (McMahan, 2014).
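The fixed-learning-rate bound can be checked numerically in a simple case. The sketch below assumes the unit Euclidean ball as decision set and $R(x) = \frac{1}{2}\|x\|_2^2$, for which the FTRL iterate is the projection of $-\eta \sum_s g_s$ onto the ball and $R(x^*) \le \frac{1}{2}$. The learning rate is tuned in hindsight using $\sum_t \|g_t\|_2^2$, which a real online algorithm would not know in advance; the random gradients and all names are illustrative.

```python
import math
import random

def project_ball(v, radius):
    """Euclidean projection of v onto the ball of the given radius."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    if norm <= radius:
        return list(v)
    return [vi * radius / norm for vi in v]

def ftrl_ball_regret(grads, eta, radius=1.0):
    """Regret of FTRL with R(x) = ||x||^2 / 2 against the best fixed point in the ball."""
    d = len(grads[0])
    cum = [0.0] * d
    alg_loss = 0.0
    for g in grads:
        # argmin over the ball of <cum, x> + ||x||^2 / (2 * eta)
        x = project_ball([-eta * c for c in cum], radius)
        alg_loss += sum(gi * xi for gi, xi in zip(g, x))
        cum = [c + gi for c, gi in zip(cum, g)]
    # Best fixed comparator points opposite to the summed gradient.
    best_loss = -radius * math.sqrt(sum(c * c for c in cum))
    return alg_loss - best_loss

rng = random.Random(0)
T, d = 500, 5
grads = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(T)]
sum_sq = sum(gi * gi for g in grads for gi in g)
eta = 1.0 / math.sqrt(sum_sq)  # minimizes 1/(2*eta) + eta*sum_sq/2 (hindsight tuning)
regret = ftrl_ball_regret(grads, eta)
bound = 1.0 / (2.0 * eta) + 0.5 * eta * sum_sq  # equals sqrt(sum_sq) at this eta
```

With this tuning the bound collapses to $\sqrt{\sum_t \|g_t\|_2^2}$, a data-dependent quantity; adaptive schedules aim to approach it without knowing the gradient norms ahead of time.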

In complex feedback models (partial monitoring, time-varying constraints), penalized or modified FTRL schemes achieve $O(T^{2/3})$ or other problem-optimal rates, given structural properties or regularizer monotonicity (Tsuchiya et al., 2024, Leith et al., 2022).

6. Extensions, Variations, and Algorithmic Unification

The FTRL paradigm subsumes numerous variants via different choices of regularizer, linearization strategy, and implicit update mechanisms:

Generalized Implicit FTRL (GIFTRL) extends the update to interpolations between explicit (linearized) and implicit (full-loss) FTRL, capturing Mirror-Prox and aProx within the same duality framework via Fenchel-Young inequalities (Chen et al., 2023).
FTRL–Proximal and Centered FTRL correspond, respectively, to regularization anchored at the current iterate and regularization centered at a fixed point. This distinction reveals deep equivalences with mirror descent (for FTRL–Proximal) and dual averaging (for Centered FTRL), and enables adaptive and per-coordinate learning rates in diagonal or block-structured geometries (McMahan, 2014, Ahn et al., 2024).
Bandit-robust and last-iterate convergence: Recent theory addresses not only cumulative regret, but also convergence rates of sequence endpoints (last-iterate analysis), exploiting continuity and stability of FTRL trajectories, particularly with Tsallis regularization in bandit problems (Zhan et al., 26 Oct 2025, Abe et al., 2022).
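One of the equivalences mentioned above can be verified directly in the simplest setting. The sketch below assumes unconstrained decisions, linear losses, and per-coordinate learning rates of the form $\eta_{t,i} = \alpha / (c + \sqrt{\sum_{s \le t} g_{s,i}^2})$, where $\alpha$ and the offset $c$ are illustrative parameters. It compares per-coordinate gradient descent with an FTRL update whose quadratic regularizers $\frac{\sigma_s}{2}(x - x_s)^2$ are centered at the past iterates, with $\sigma_s = 1/\eta_s - 1/\eta_{s-1}$ and $1/\eta_0 = 0$. With constraints or non-quadratic regularizers the two methods can differ, so this is a sanity check rather than a general statement.

```python
import math
import random

def inv_rates(sq_sums, alpha, offset):
    """Per-coordinate 1/eta_t = (offset + sqrt(sum of squared gradients)) / alpha."""
    return [(offset + math.sqrt(s)) / alpha for s in sq_sums]

def ogd_per_coordinate(grads, alpha=1.0, offset=1.0):
    """Gradient descent x_{t+1} = x_t - eta_t * g_t with per-coordinate rates."""
    d = len(grads[0])
    x, sq, trajectory = [0.0] * d, [0.0] * d, []
    for g in grads:
        sq = [s + gi * gi for s, gi in zip(sq, g)]
        inv_eta = inv_rates(sq, alpha, offset)
        x = [xi - gi / ie for xi, gi, ie in zip(x, g, inv_eta)]
        trajectory.append(list(x))
    return trajectory

def ftrl_proximal(grads, alpha=1.0, offset=1.0):
    """FTRL with quadratic regularizers centered at past iterates (closed form)."""
    d = len(grads[0])
    x, sq = [0.0] * d, [0.0] * d
    cum_g, cum_sigma_x, inv_eta_prev = [0.0] * d, [0.0] * d, [0.0] * d
    trajectory = []
    for g in grads:
        sq = [s + gi * gi for s, gi in zip(sq, g)]
        inv_eta = inv_rates(sq, alpha, offset)
        # sigma_t = 1/eta_t - 1/eta_{t-1}; accumulate sigma_t * x_t per coordinate.
        cum_sigma_x = [c + (ie - ip) * xi
                       for c, ie, ip, xi in zip(cum_sigma_x, inv_eta, inv_eta_prev, x)]
        cum_g = [c + gi for c, gi in zip(cum_g, g)]
        # Minimizer of <cum_g, x> + sum_s sigma_s * (x - x_s)^2 / 2.
        x = [(cs - cg) / ie for cs, cg, ie in zip(cum_sigma_x, cum_g, inv_eta)]
        inv_eta_prev = inv_eta
        trajectory.append(list(x))
    return trajectory

rng = random.Random(1)
grads = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(200)]
gap = max(abs(a - b)
          for xa, xb in zip(ogd_per_coordinate(grads), ftrl_proximal(grads))
          for a, b in zip(xa, xb))
```

The two trajectories agree up to floating-point error because the proximal regularizers sum to a quadratic with curvature $1/\eta_t$ whose minimizer shifts by exactly $-\eta_t g_t$ each round.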

Optimal design of regularizers—potentially via high-dimensional convex programming exploiting action/loss set geometry—enables dimension-tight and instance-dependent regret, further strengthening FTRL universality (Gatmiry et al., 2024, Moridomi et al., 2017).

7. Insights, Implications, and Open Problems

FTRL enjoys multiple core advantages: universality with respect to problem structure, adaptability via learning-rate and regularizer tuning, and ability to yield both minimax and instance-optimal regret guarantees. Its deployment in bandit, contextual, and constrained online optimization covers settings with direct, partial, or ambiguous feedback. Key theoretical advances highlight:

The fundamental role of monotonicity in penalty coefficients for learning-rate adaptation, and competitive limits for online schedules.
The centrality of regularizer geometry, both for worst-case and data-dependent regret, and for algorithmic tractability in high dimensions.
The unification of last-iterate and cumulative-regret perspectives via Bregman divergence-based analysis.

Active research directions include (i) sharpening last-iterate and simple regret rates, (ii) automated selection or learning of optimal regularizer structures for non-canonical action/loss sets, (iii) extending FTRL unification to reinforcement learning and general non-convex settings, and (iv) efficient implementations bridging FTPL and FTRL in high-dimensional or bandit contexts (Ito et al., 2024, Gatmiry et al., 2024, Lévy et al., 28 Oct 2025).

FTRL remains a cornerstone of online learning theory and algorithm design, with ongoing impact across online decision problems involving adversarial dynamics, stochasticity, and adaptivity.