Infinite Learning Rate Schedules
- Infinite learning rate schedules are policies defined for arbitrarily long training durations, eliminating the need for a preset stopping horizon.
- They incorporate methods like generative latent ODEs, schedule-free averaging, and hyperbolic parametric forms to adaptively generate learning rates.
- These approaches enhance model performance by improving generalization, reducing hyperparameter tuning, and supporting continual and large-scale training.
An infinite learning rate schedule refers to a learning rate policy that is functionally unbounded in both representational capacity and practical duration: it either parameterizes the learning rate as a function defined for arbitrarily long training time, adaptively generates new schedules indefinitely, or removes the need for a predetermined stopping iteration altogether. Unlike classical schedules (step, polynomial, or cosine decays), which have a fixed, low-dimensional parameterization and often require tuning to a specific horizon, infinite learning rate schedules are designed for settings where training may be open-ended, the data distribution may change (as in continual learning), or schedule search is conducted over a genuinely infinite-dimensional space. These methods are motivated both by theoretical advances in online optimization and by empirical demands for flexibility and improved generalization in deep learning.
1. Foundational Types and Formalizations
Infinite learning rate schedules encompass several algorithmic and formal paradigms:
- Generative continuous-time models: Latent neural ODEs are learned from hyperparameter sweeps, mapping training-loss, validation-metric, and learning-rate trajectories into a smooth latent space. The resultant ODE defines a generator from which unboundedly many distinct, high-performing schedules can be sampled for new models or datasets. Crucially, the scheduling process is not constrained to any pre-defined family (cosine, exponential, etc.) and adapts to new data and history at inference time. This approach guarantees infinite schedule variety while maintaining stability and generalization (Sampson et al., 27 Sep 2025).
- Schedule-free and averaging-based approaches: By replacing explicit, horizon-dependent schedules with online-to-batch averaging of iterates, the effective learning rate for the output sequence decays as $1/t$ without a total number of steps ever being specified. The parameter update for the “slow” average is
$$x_{t+1} = (1 - c_{t+1})\,x_t + c_{t+1}\,z_{t+1}, \qquad c_{t+1} = \tfrac{1}{t+1},$$
with the “fast” sequence $z_t$ performing unconstrained constant-step updates. The approach offers convergence rates matching the optimal theory and does not require rescheduling or grid tuning across different horizons (Defazio et al., 2024).
- Hyperbolic and infinite-horizon parametric schedules: Explicit schedule functional forms, such as HyperbolicLR and ExpHyperbolicLR, possess asymptotics that are independent of training horizon. After setting hyperparameters from a short run, these schedules extend stably to arbitrary training lengths. In the infinite-epoch limit, the slope of HyperbolicLR approaches zero, producing an effectively infinite-horizon policy (Kim, 2024).
- Adaptive/local control and online hypergradient methods: Algorithms that adapt the learning rate at every step (“on the fly”) via local curvature estimation (e.g., quadratic fits, hypergradients), stochastic line searches, or discounted accumulation of gradients. These methods can construct a continually evolving, arbitrarily complex learning-rate function, without pre-committing to any finite parameterization (Iyer et al., 2021, Kafka et al., 2020, Donini et al., 2019).
- Theory-motivated, landscape-aware annealing: In high-dimensional, nonconvex optimization, theory predicts that to guarantee escape from saddles, the integral $\int^\infty \eta(t)\,dt$ must diverge (e.g., $\eta(t) \propto t^{-\beta}$ with $\beta \le 1$). The optimal schedule, proven in glassy and spiked models, combines a flat “exploration” phase followed by a transition to $1/t$ decay (“convergence”), with the switch time determined by properties of the loss landscape rather than an a priori horizon (d'Ascoli et al., 2022).
- Functional surrogate and schedule optimization frameworks: Functional Scaling Laws (FSL) model the effect of any (possibly infinite-dimensional) schedule on expected risk by analytically tractable convolution functionals. This reduces infinite-dimensional schedule optimization to variational problems and enables surrogate loss-based search over extremely rich schedule families (Li et al., 23 Sep 2025).
- Continual and open-ended pre-training schedules: Schedules such as “infinite cosine” are built to enable continual pre-training over indefinite time, eliminating rewarm phases and the need to predefine training duration. These phase-stitched schedules—warmup, cooldown, constant plateau, annealing—can be extended indefinitely as tasks and data arrive, maintaining performance without catastrophic forgetting (Singh et al., 4 Mar 2025).
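The schedule-free averaging idea above can be sketched compactly. The following is a minimal illustration of the fast/slow two-sequence structure (the default-style values `gamma=0.1`, `beta=0.9`, and the quadratic test objective are illustrative assumptions, not settings from the paper):

```python
import numpy as np

def schedule_free_sgd(grad, x0, gamma=0.1, beta=0.9, steps=100):
    """Sketch of schedule-free SGD (after Defazio et al., 2024):
    a 'fast' sequence z takes constant-step gradient steps, while the
    'slow' output x is the running (Polyak) average of the fast iterates,
    giving an effective 1/t learning-rate decay with no horizon T."""
    z = np.asarray(x0, dtype=float)   # fast sequence
    x = z.copy()                      # slow averaged sequence (the output)
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x  # gradient is evaluated at an interpolation
        z = z - gamma * grad(y)        # fast step with constant gamma
        c = 1.0 / t
        x = (1 - c) * x + c * z        # Polyak average: effective 1/t step
    return x

# Illustrative use: minimize f(x) = x^2, whose gradient is 2x.
x_out = schedule_free_sgd(lambda y: 2 * y, np.array([5.0]), steps=2000)
```

Note that the user never chooses a decay shape or stopping time: the averaging step alone produces the $1/t$-decayed effective updates.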
2. Mathematical Foundations and Schedule Families
Infinite learning rate schedules generalize or subsume several classical families:
- Classical schedules: Polynomial, exponential, and cosine decays are fixed, finite-dimensional functions requiring prior knowledge of run length. Their schedule functions typically take the form:
| Schedule Type | Formula | Horizon-Dependence |
|--------------------|------------------------------------------|--------------------|
| Polynomial (poly) | $\eta_t = \eta_0 (1 - t/T)^p$ | Yes |
| $1/t$ | $\eta_t = \eta_0 / t$ | Yes (singular at 0) |
| Exponential | $\eta_t = \eta_0 \gamma^t$ | Yes |
| Cosine | $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_0 - \eta_{\min})\bigl(1 + \cos(\pi t / T)\bigr)$ | Yes |
- HyperbolicLR: A hyperbolic functional form in the epoch index that exhibits a constant late-stage slope and is epoch-insensitive (Kim, 2024).
- Latent ODE schedules: The schedule emerges as the decoded output of a neural ODE latent variable system, sampled and averaged across an ensemble for future steps (Sampson et al., 27 Sep 2025).
- Functional optimization: Infinite-dimensional schedule optimization may be posed as minimizing the predicted risk under the FSL surrogate over schedules $\eta(\cdot)$, subject to a compute or data budget constraint (Li et al., 23 Sep 2025).
- Adaptive/local mechanisms: Online hypergradient schemes iteratively update the learning rate based on real-time hypergradient estimates, with discount factors enabling convergence over unbounded horizons (Donini et al., 2019).
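The classical finite-horizon families in the table above can be written out directly. A minimal sketch, where the constants ($\eta_0 = 0.1$, $T = 100$, etc.) are illustrative choices:

```python
import math

def poly_lr(t, eta0=0.1, T=100, p=2.0):
    """Polynomial decay: requires horizon T; lr reaches 0 at t = T."""
    return eta0 * (1.0 - t / T) ** p

def one_over_t_lr(t, eta0=0.1):
    """1/t decay: singular at t = 0, so step counting starts at t = 1."""
    return eta0 / t

def exp_lr(t, eta0=0.1, gamma=0.95):
    """Exponential decay: per-step factor gamma, tuned to the horizon."""
    return eta0 * gamma ** t

def cosine_lr(t, eta0=0.1, eta_min=0.0, T=100):
    """Cosine annealing from eta0 down to eta_min over horizon T."""
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * t / T))
```

Each function carries an explicit horizon or horizon-tuned constant, which is exactly the dependence that infinite-horizon schedules are built to remove.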
3. Empirical Behavior and Theoretical Guarantees
Empirical and theoretical results delineate key properties:
- Generalization and flat minima: Latent ODE–generated infinite schedules (LODE) consistently attain flatter loss landscape minima (lower maximum Hessian eigenvalue) and higher validation accuracy than cosine, OneCycle, or hand-tuned step schedules. For example, for ResNet-18 on CIFAR-100, LODE reaches a lower maximum Hessian eigenvalue than cosine decay, with improved final accuracy and lower variance across seeds (Sampson et al., 27 Sep 2025).
- Independence from training horizon: Infinite-horizon schedules (HyperbolicLR/ExpHyperbolicLR, schedule-free averaging) give nearly invariant performance on learning curves and test accuracy as the number of epochs doubles or quadruples, with deviations substantially lower than for polynomial or cosine-annealing schedules, which can drift markedly (Kim, 2024).
- No catastrophic forgetting in continual learning: In continual SSL pre-training, infinite schedules (inf-cosine) eliminate rewarm-induced forgetting observed in repeated cosine schedules. On CIFAR-10 continual tasks, inf-cosine achieves higher average accuracy (60.03% vs. 58.16%) and less negative backward transfer (–12.61% vs. –17.65%) (Singh et al., 4 Mar 2025).
- Exploration-convergence phase matching: For high-dimensional nonconvex landscapes, the optimal infinite-horizon policy is to anneal the learning rate in two distinct phases: retain a large learning rate (exploration) until a crossover time set by the landscape, then decay as $1/t$ for optimal convergence (d'Ascoli et al., 2022).
- No need to pre-specify the horizon: In schedule-free averaging, the user never sets a stopping time, and the output solution at any time has effectively received a $1/t$-decayed sequence of updates (Defazio et al., 2024).
- Surrogate-based optimal schedule search: The FSL surrogate model enables optimization over infinite-dimensional schedule families, leading to data-driven schedule shapes (typically warmup/stable/decay). Empirically, FSL-optimized schedules outperform classic cosine, exponential, and manually tuned schedules in LLM pre-training (Li et al., 23 Sep 2025).
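The exploration-convergence policy above reduces to a single piecewise function. A minimal sketch, where the crossover time `t_c` and rate `eta_large` are illustrative placeholders (in the theory, the crossover is determined by the loss landscape, not chosen in advance):

```python
def two_phase_lr(t, eta_large=0.5, t_c=1000):
    """Landscape-aware annealing (sketched after d'Ascoli et al., 2022):
    a flat 'exploration' phase at a large constant rate, then a 1/t
    'convergence' phase, matched continuously at the crossover t_c."""
    if t <= t_c:
        return eta_large           # exploration: integral of eta diverges
    return eta_large * t_c / t     # convergence: eta(t) proportional to 1/t
```

Because the second phase only rescales by `t_c / t`, the schedule is continuous at the crossover and remains defined for arbitrarily large `t`.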
4. Algorithms, Implementation, and Design Guidelines
The following table summarizes representative algorithms and their schedule parametrization properties:
| Approach | Schedule Family | Infinite/Adaptive | Requires T? | Implementation Complexity |
|---|---|---|---|---|
| Latent ODE (LODE) | Generated by neural ODE | Infinite, generative | No | Neural ODE model + experiment logs |
| Schedule-Free | 1/t via averaging | Infinite, online | No | As in AdamW/SGD, 1/t fallback |
| HyperbolicLR | Hyperbolic/exphyp parametric | Infinite-horizon | No | Few lines (closed form) |
| Online hypergradient | Any, via hypergradients | Infinite, adaptive | No | O(d) per step, discounted accum. |
| GOLS-I line search | Any, via sign change | Infinite, online | No | Per-step line search, all opt. |
| InfCosine | Phase-stitched (plateau/anneal) | Infinite, open-ended | No | Simple scheduler, open epochs |
Design and deployment considerations include:
- For latent ODE scheduling, experiment logs are used to train the ODE “offline,” and inference is performed live with a moderate wall-time overhead per step. The approach is optimizer-agnostic and slots into any modern metric-logging stack (Sampson et al., 27 Sep 2025).
- For averaging-based methods, implement the averaging step alongside the main optimizer; memory overhead is a single copy of the parameters (Defazio et al., 2024).
- HyperbolicLR requires only a one-time choice of its hyperparameters, including the max epoch horizon. Schedules remain well-conditioned and uni-modal regardless of epoch length (Kim, 2024).
- Local adaptive/online methods (LRTuner, GOLS-I, MARTHE) incur modest additional computation per step, but require no post hoc tuning, grid search, or schedule sweeps. Practical use recommends downweighting momentum terms in some optimizers to retain line search effectiveness (Iyer et al., 2021, Kafka et al., 2020, Donini et al., 2019).
- For surrogate FSL optimization, the recommended workflow is to fit the FSL parameters on a pilot loss curve, parametrize by trackable finite-difference coordinates, run projected/gradient optimization under the resource constraint, and deploy the resulting optimized schedule (Li et al., 23 Sep 2025).
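The phase-stitched “infinite cosine” family from the table (warmup, cosine cooldown, then an open-ended constant plateau) is simple to implement. A minimal sketch; the phase lengths and rates here are illustrative assumptions, not the values used by Singh et al.:

```python
import math

def inf_cosine_lr(t, eta_max=0.1, eta_const=0.01, warmup=100, cooldown=900):
    """Phase-stitched infinite schedule (sketch): linear warmup to eta_max,
    cosine cooldown to eta_const, then a constant plateau held indefinitely,
    so no total training duration ever needs to be fixed."""
    if t < warmup:                         # phase 1: linear warmup
        return eta_max * t / warmup
    if t < warmup + cooldown:              # phase 2: cosine cooldown
        frac = (t - warmup) / cooldown
        return eta_const + 0.5 * (eta_max - eta_const) * (1 + math.cos(math.pi * frac))
    return eta_const                       # phase 3: plateau, open-ended
```

When a deployable checkpoint is needed, a short annealing leg can be branched off the plateau while the main run continues at `eta_const`, which is what makes the schedule extensible as new data arrive.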
5. Theoretical Models, Surrogates, and Scaling Laws
Theoretical models for infinite learning rate schedules focus on:
- Functional scaling laws (FSL): The training loss curve, under SGD with an arbitrary schedule $\eta(\cdot)$, decomposes into bias, fitting, and noise terms, with the noise entering as a convolution of the schedule against a “forgetting” kernel that controls how quickly the effect of earlier updates decays. This enables explicit risk or scaling predictions for arbitrary $\eta(\cdot)$, including unparameterized, infinite-dimensional families (Li et al., 23 Sep 2025).
- High-dimensional annealing theory: For high-dimensional nonconvex models, keeping $\int \eta(t)\,dt$ divergent is necessary for saddle escape. The optimal policy is to run with a large constant learning rate until escaping, then transition to $\eta(t) \propto 1/t$ for fast convergence, matching the empirical behavior of teacher–student neural network regression and glassy spin models (d'Ascoli et al., 2022).
- Averaging-based optimality: Theoretical results demonstrate that schedule-free averaging methods achieve statistical efficiency rates without requiring any prior on the training horizon or explicit schedule decay tuning (Defazio et al., 2024).
- Empirical scaling: Infinite schedule policies (“stable plateau” or WSD) empirically achieve or surpass the predicted scaling exponents in LLMs and kernel regression, sometimes eliminating the logarithmic loss incurred by exponential-decay schedules (Li et al., 23 Sep 2025).
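The averaging-based optimality result rests on a simple identity worth making explicit: maintaining the uniform (Polyak) average of the fast iterates incrementally is the same as taking an update with an effective $1/t$ step size. A short numeric check of that equivalence:

```python
import numpy as np

def uniform_average(zs):
    """Plain uniform average of the fast iterates z_1..z_t."""
    return np.mean(zs, axis=0)

def incremental_average(zs):
    """Same quantity maintained online via
    x_t = x_{t-1} + (1/t)(z_t - x_{t-1}):
    each step is an effective 1/t-decayed move toward z_t."""
    x = np.zeros_like(zs[0])
    for t, z in enumerate(zs, start=1):
        x = x + (z - x) / t
    return x
```

This is why the output sequence of a schedule-free method behaves as if it were trained with a $1/t$ decay, even though no decay constant or horizon was ever set.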
6. Applications and Implications
Infinite learning rate schedules serve in diverse contexts:
- Continual/self-supervised pre-training: Infinite schedules allow continual model updates as new data arrive, avoiding catastrophic forgetting caused by traditional cosine annealing with restarts. This supports lifelong and uncapped pre-training workflows, which are increasingly common in open-ended, streaming-data settings (Singh et al., 4 Mar 2025).
- LLM and foundation model optimization: Surrogate schedule optimization via FSL and infinite schedule techniques are applied to large-scale LLMs (GPT-2, QwenMoE, LLaMA2/3), yielding improved final loss and compute/data efficiency relative to 8-1-1 and cosine baselines (Li et al., 23 Sep 2025).
- General modular scheduling: Approaches such as HyperbolicLR, schedule-free averaging, and local adaptive policies (LRTuner, GOLS-I, MARTHE) are compatible with a wide range of optimizers (SGD, AdamW, Adagrad) and problems (vision, language, operator learning, time series), providing universal plug-in replacements for parametric schedules (Kim, 2024, Iyer et al., 2021, Kafka et al., 2020, Donini et al., 2019).
- Resource-uncertain and long-horizon training: Infinite-horizon schedules reduce the need for repeated tuning and schedule restarts, ensuring consistent performance even when the eventual run length or data availability is uncertain (Defazio et al., 2024, Kim, 2024).
- Hyperparameter elimination: Methods such as GOLS-I and schedule-free averaging remove the most sensitive hyperparameter (the schedule itself) from the training workflow, auto-tuning over up to fifteen orders of magnitude in step size, drastically reducing meta-optimization cost (Kafka et al., 2020).
7. Challenges, Limitations, and Practical Guidance
While infinite learning rate schedules offer broad advantages, practical implementation faces specific challenges:
- For latent ODE and FSL surrogate-based approaches, initial pilot data for fitting is needed and may bias schedule shapes if not representative (Sampson et al., 27 Sep 2025, Li et al., 23 Sep 2025).
- Momentum-heavy optimizers can interfere with certain line search and hypergradient-based schedules, warranting careful momentum tuning or algorithm-specific adaptation (Kafka et al., 2020).
- Hyperparameter selection for the base optimizer and for the schedule's fixed shape parameters (as in HyperbolicLR) is still required, though not for the schedule horizon (Kim, 2024).
- Surrogate schedule optimization, while theoretically tractable, produces curve shapes that may need further empirical fine-tuning in practice (Li et al., 23 Sep 2025).
- Schedule adaptivity may introduce minor computational overhead (up to 25% for latent ODE schedule inference), though this is still lower than reinforcement learning or large grid searches (Sampson et al., 27 Sep 2025).
Current guidelines emphasize:
- Train schedule-free or infinite-schedule policies with minimal horizon dependence for open-ended and continual domains.
- For resource-constrained or non-stationary environments, prefer HyperbolicLR, schedule-free, or generative approaches over classical finite-horizon policies.
- Use FSL surrogates for functional optimization if extensive prior data are available, particularly for large-scale LLM training and scaling studies.
- When implementing in standard pipelines, online adaptive methods are likely to require only incremental modifications to existing code bases.
References:
- (Sampson et al., 27 Sep 2025) “Dynamics of Learning: Generative Schedules from Latent ODEs”
- (Kim, 2024) “HyperbolicLR: Epoch insensitive learning rate scheduler”
- (Iyer et al., 2021) “LRTuner: A Learning Rate Tuner for Deep Neural Networks”
- (Donini et al., 2019) “MARTHE: Scheduling the Learning Rate Via Online Hypergradients”
- (Kafka et al., 2020) “Gradient-only line searches to automatically determine learning rates…”
- (Defazio et al., 2024) “The Road Less Scheduled”
- (d'Ascoli et al., 2022) “Optimal learning rate schedules in high-dimensional non-convex optimization problems”
- (Li et al., 23 Sep 2025) “Unveiling the Role of Learning Rate Schedules via Functional Scaling Laws”
- (Singh et al., 4 Mar 2025) “Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training”