
Just-In-Time Reinforcement Learning

Updated 1 February 2026
  • Just-In-Time Reinforcement Learning (JitRL) is a framework that enables real-time policy adaptations to handle non-stationary and partially observed environments across diverse domains.
  • JitRL employs techniques like test-time policy optimization, gradient-based updates, and real-time model planning to ensure low-latency adaptation and operational safety.
  • JitRL has demonstrated improved performance in applications such as digital health, robotics, and edge computing by integrating continual learning with safety guarantees.

Just-In-Time Reinforcement Learning (JitRL) denotes a family of reinforcement learning systems and algorithmic frameworks designed to enable agents to optimize their behavior or policies during deployment, immediately and adaptively in response to non-stationary conditions, partial observability, or newly encountered experience. Unlike classical RL, which separates offline training from online application, JitRL architectures embed policy improvement, value estimation, or control-law adaptation within the real-time or test-time execution loop. JitRL methods have been developed for domains ranging from health interventions and robotic control to edge computing orchestration and LLM agent adaptation. Representative instantiations include continual learning in frozen LLM agents without parameter updates, policy adaptation of vision-language-action agents via test-time RL, model-based real-time planning in robotics, and robust RL in non-stationary systems (Karine et al., 2023, Li et al., 26 Jan 2026, Liu et al., 11 Jan 2026, Hester et al., 2011, Hamadanian et al., 2022).

1. Foundational Principles and Motivation

JitRL is motivated by the need for adaptivity in dynamic, partially observed, or otherwise non-stationary environments where pre-trained policies are insufficient. In such settings, agents must:

  • Continuously optimize decision-making in the presence of changing states, unknown contexts, or unforeseen feedback.
  • Achieve low-latency updates to the control or action-selection logic so as not to compromise real-time constraints or responsiveness.
  • Avoid catastrophic forgetting, instability, or performance degradation associated with classical, batch-based retraining or delayed-gradient-based policy updates.

JitRL frameworks explicitly address these requirements through (i) test-time policy modification (with or without gradient steps), (ii) memory-augmented inference and retrieval-based advantage estimation, (iii) just-in-time online learning of models or policies, and (iv) architecture-level support for parallelism and safety (Karine et al., 2023, Li et al., 26 Jan 2026, Hester et al., 2011).
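The defining pattern (i)–(iii) above can be sketched as a minimal deployment loop in which the update step lives inside action selection, rather than in a separate training phase. The class and names below are illustrative placeholders, not code from any cited paper:

```python
# Minimal sketch of a just-in-time RL execution loop: the policy adapts
# its action preferences immediately after each interaction, inside the
# deployment loop itself. All names here are illustrative.
import random

class JitPolicy:
    """Tabular policy that adapts during deployment, with no offline phase."""

    def __init__(self, n_actions, step_size=0.1, explore=0.1):
        self.n_actions = n_actions
        self.step_size = step_size
        self.explore = explore
        self.preferences = {}          # state -> per-action preference list

    def act(self, state):
        # Low-latency path: a dictionary lookup and an argmax, no planning.
        prefs = self.preferences.setdefault(state, [0.0] * self.n_actions)
        if random.random() < self.explore:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: prefs[a])

    def update(self, state, action, reward):
        # Just-in-time update: shift the preference toward the observed
        # reward immediately, rather than batching for later retraining.
        prefs = self.preferences.setdefault(state, [0.0] * self.n_actions)
        prefs[action] += self.step_size * (reward - prefs[action])
```

In a deployment loop, `act` and `update` are interleaved per step (`a = policy.act(s); r = env.step(a); policy.update(s, a, r)`), which is the structural difference from train-then-deploy RL.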

2. Core Algorithmic Mechanisms

JitRL implementations fall into several categories, including:

  • Test-time policy optimization without gradient updates: e.g., additive logit modulation in LLM agents using non-parametric memory of experience tuples (s, a, G), with retrieval-based estimation of action advantages and a closed-form KL-constrained policy improvement rule (Li et al., 26 Jan 2026).
  • Within-deployment gradient-based updates: e.g., lightweight PPO steps applied every few environment interactions to vision-language-action policies, constrained to remain close to pre-trained priors (Liu et al., 11 Jan 2026).
  • Real-time model-based planning: concurrent acting, model learning, and Monte Carlo tree search (MCTS) planning, with mutex-protected shared buffers to guarantee sub-millisecond action selection, as in the rtmba architecture (Hester et al., 2011).
  • RTI-NMPC in RL loops: embedding a single-step sequential quadratic programming (SQP) correction into policy evaluation and improvement, enabling sub-millisecond closed-loop receding horizon control for continuous domains (Zanon et al., 2020).
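The third mechanism, concurrent acting and planning with mutex-protected shared buffers, can be illustrated with a small threading sketch. This is not the rtmba implementation, only a schematic of the pattern it describes: action selection is a single locked read with fixed latency, while learning runs asynchronously in a background thread.

```python
# Illustrative sketch of the parallel real-time pattern: a fixed-latency
# action thread reads a lock-protected value table, while a background
# planner thread consumes the shared experience buffer and refines values.
import threading
import time

class RealTimeAgent:
    def __init__(self, n_actions):
        self.lock = threading.Lock()
        self.q = [0.0] * n_actions    # shared value estimates
        self.experience = []          # shared experience buffer

    def select_action(self, state):
        # Fixed-latency path: one locked read and an argmax, no planning.
        with self.lock:
            return max(range(len(self.q)), key=lambda a: self.q[a])

    def record(self, state, action, reward):
        with self.lock:
            self.experience.append((state, action, reward))

    def planner(self, stop_event):
        # Background thread: drain the buffer, update values asynchronously.
        while not stop_event.is_set():
            with self.lock:
                batch, self.experience = self.experience, []
            for _, a, r in batch:
                with self.lock:
                    self.q[a] += 0.1 * (r - self.q[a])
            time.sleep(0.001)         # yield; planning never blocks acting
```

The key design choice mirrored here is that the acting path holds the lock only briefly, so action-selection latency is bounded regardless of how expensive planning becomes.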

The table below summarizes principal mechanisms across representative works:

| JitRL Variant | Policy Update | Memory | Computational Mode |
|---|---|---|---|
| LLM-JitRL (Li et al., 26 Jan 2026) | Logit modulation | Non-parametric | Retrieval, no gradient |
| TT-VLA (Liu et al., 11 Jan 2026) | PPO-style gradient | Replay buffer | Test-time policy gradient |
| rtmba (Hester et al., 2011) | Tree search / Q-values | Experience buffer | Parallel, fixed-latency |
| RTI-NMPC RL (Zanon et al., 2020) | RTI SQP step | Past trajectories | Real-time optimization |

JitRL methods are characterized by the ability to rapidly integrate new feedback and adjust action selection pathways, typically modulating the influence of past priors via trust-region, KL, or memory-based constraints.
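The trust-region constraint mentioned above is commonly realized as a PPO-style clipped surrogate, which bounds how far a test-time update can move the policy from its prior. The following standalone function is a generic sketch of that objective, not code from any cited work:

```python
# Sketch of a trust-region constraint on test-time updates: the PPO-style
# clipped surrogate removes the incentive to push the policy ratio beyond
# [1 - eps, 1 + eps], limiting drift from the pre-trained prior.
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for one (state, action) sample.

    ratio: pi_new(a|s) / pi_prior(a|s); advantage: estimated A(s, a).
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

For positive advantages the objective saturates once the ratio exceeds 1 + eps; for negative advantages it saturates below 1 - eps, so a single aggressive sample cannot drag the policy arbitrarily far from the prior.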

3. Formal Problem Formulations

Several rigorous mathematical templates underpin JitRL systems:

  • JITAI as MDP/POMDP: In digital health, JITAIs are posed as MDPs or POMDPs with latent context, habituation, and disengagement risk components; observation errors and partial observability of the state are parametrically modeled, and distinct observation set configurations are considered to assess robustness of policy search algorithms (Karine et al., 2023).
  • Online POMDP adaptation: In vision-language-action tasks, the POMDP state comprises both proprioceptive and visual history with language instruction; reward is defined via dense progress signals, and policy parameters θ are updated within-episode under a KL-regularized surrogate objective (Liu et al., 11 Jan 2026).
  • Non-stationary MDP regime switches: JitRL in dynamic systems employs online clustering or change-point detection to index environments, maintaining per-environment expert policies, triggering targeted exploration, and reverting to safeguarded fallback policies upon detecting unsafe system states (Hamadanian et al., 2022).
  • KL-constrained policy optimization: In LLM-based JitRL, the optimization seeks the policy π* maximizing E_{a∼π′}[A(s,a)] − (1/β) D_KL(π′ ‖ π_θ), whose closed-form solution is delivered in logit space as z′(s,a) = z(s,a) + β Ã(s,a) (Li et al., 26 Jan 2026).
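The closed-form logit-space update in the last bullet can be checked with a small numeric example. The functions below are an illustrative rendering of that update, with the advantage values supplied directly (in the cited method they would come from retrieval over the episodic memory):

```python
# Worked instance of the KL-constrained update: the maximizer of
# E[A] - (1/beta) * KL(pi', pi_theta) shifts each candidate action's
# logit by beta times its estimated advantage. Names are illustrative.
import math

def modulate_logits(logits, advantages, beta):
    """Return z'(s, a) = z(s, a) + beta * A~(s, a) per candidate action."""
    return [z + beta * a for z, a in zip(logits, advantages)]

def softmax(logits):
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Two sanity properties follow directly: actions with positive estimated advantage gain probability mass, and as β → 0 (an infinitely tight KL constraint) the modulated policy recovers the prior exactly.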

These formulations enable principled analysis of sample efficiency, stability, and safety properties.

4. Representative Experimental Findings

JitRL frameworks have been empirically validated in diverse contexts:

Adaptive Intervention under Partial Observability (Karine et al., 2023):

  • Full propagation of context-inference uncertainty to the policy (inputting p_t rather than l_t) significantly boosts long-term rewards and intervention efficacy, with average returns up to +500 over classical approaches.
  • Policy-gradient methods exhibit higher robustness to state aliasing than value-function approaches (DQN), with statistical tests confirming superiority under hidden state conditions.

Test-Time VLA Adaptation (Liu et al., 11 Jan 2026):

  • TT-VLA test-time adaptation increases simulated robot task success rates by 2–6% (absolute), up to 44% (relative), and realizes consistent real-robot performance gains.
  • Ablations reveal the necessity of one-step dense progress signals and an update frequency tuned for stability.

Real-Time Model-Based Control & RTI-NMPC (Hester et al., 2011, Zanon et al., 2020):

  • rtmba achieves sub-millisecond action selection, fast convergence (20–30 episodes on Mountain Car), and high wall-clock efficiency relative to sequential planners.
  • RTI-NMPC RL obtains a 21% average economic cost improvement over naïve NMPC parameters on nonlinear process control benchmarks, while retaining strict constraint satisfaction.

Continual Learning in LLM Agents (No Gradient Updates) (Li et al., 26 Jan 2026):

  • JitRL surpasses training-free baselines in WebArena (final success: 51.4% vs. ≤44%) and outperforms expensive RL fine-tuning by 7–13% absolute success, at <3% of the computational cost.
  • In Jericho text games, JitRL delivers markedly higher game scores and learning speed compared to both prompt-based and gradient RL approaches.

5. Robustness, Safety, and System-Level Guarantees

Robustness in JitRL is achieved through:

  • Uncertainty propagation: feeding entire context posteriors into the policy network, leading to risk-averse but effective behaviors in uncertain or ambiguous situations (Karine et al., 2023).
  • Change-point detection and multi-expert retention: strict performance safety in non-stationary environments is enforced by maintaining a catalog of per-environment experts and controlling exploration phases, with bounded regret under fallback heuristics (Hamadanian et al., 2022).
  • Trust-region/surrogate objectives: PPO-style clipped losses and KL constraints prevent catastrophic policy drift during test-time adaptation (Liu et al., 11 Jan 2026).
  • Real-time thread separation: parallelism in rtmba ensures all inferences meet latency and real-time actuation windows (Hester et al., 2011).
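The change-point-plus-fallback pattern in the second bullet can be sketched as a small controller. The detector here (a windowed mean compared against a baseline) and all thresholds are illustrative placeholders, not the specific test used in the cited work:

```python
# Sketch of a safeguarded fallback: monitor recent rewards, and if a
# simple change-point test fires (windowed mean drops past a threshold),
# revert to a conservative fallback policy. Detector and thresholds are
# illustrative, not the method of any one cited paper.
from collections import deque

class SafeguardedController:
    def __init__(self, expert, fallback, window=50, drop_threshold=0.5):
        self.expert = expert                 # current per-environment expert
        self.fallback = fallback             # conservative safeguard policy
        self.rewards = deque(maxlen=window)
        self.baseline = None                 # reference mean reward
        self.drop_threshold = drop_threshold
        self.safe_mode = False

    def act(self, state):
        policy = self.fallback if self.safe_mode else self.expert
        return policy(state)

    def observe(self, reward):
        self.rewards.append(reward)
        if len(self.rewards) < self.rewards.maxlen:
            return                           # not enough data yet
        mean = sum(self.rewards) / len(self.rewards)
        if self.baseline is None:
            self.baseline = mean             # first full window sets baseline
        elif mean < self.baseline - self.drop_threshold:
            self.safe_mode = True            # suspected regime switch
```

In a fuller system, leaving `safe_mode` would be coupled to identifying (or learning) a new per-environment expert, mirroring the multi-expert retention described above.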

6. Limitations and Future Directions

JitRL frameworks face several open challenges:

  • Memory and retrieval scaling: As episodic memory grows, efficient retrieval and compression strategies are required for LLM-based JitRL (Li et al., 26 Jan 2026).
  • Reliance on auxiliary estimators: Test-time RL for robot control is dependent on the quality of dense progress evaluators; incorrect estimation can destabilize adaptation (Liu et al., 11 Jan 2026).
  • Delayed adaptation in highly non-stationary, multi-agent contexts: Centralized controllers, as in edge inference orchestration, may become bottlenecks or introduce scalability limitations (Mounesan et al., 31 Jan 2025).
  • Evaluator and similarity metric robustness: Performance depends on the consistency of LLM-based reward evaluators and state abstraction fidelity.
  • Extension to continuous action/vocabulary spaces: Current JitRL for LLMs is limited to structured, discrete candidate spaces; generalizing to open-ended continuous or compositional action sets remains unresolved.

Research directions include improved non-parametric memory management, integration with attention-based or learned retrieval, richer safety-aware RL at deployment, and the fusion of JitRL with interpretable policy adjustment and meta-learning frameworks (Li et al., 26 Jan 2026, Liu et al., 11 Jan 2026).

7. Synthesis and Application Domains

JitRL is now applied across a spectrum of domains:

  • Digital health: Online adaptive intervention selection with safety, robustness to partial observability, and context uncertainty (Karine et al., 2023).
  • Autonomous robotics: Deployment-ready policy adaptation and real-time control with provable performance and deadlines (Hester et al., 2011, Liu et al., 11 Jan 2026).
  • Edge computing and AI orchestration: Dynamic, per-slot adaptation of DNN inference graphs in response to bandwidth, energy, and latency constraints (Mounesan et al., 31 Jan 2025).
  • Software agents and continual learning: Training-free, continual learning in LLM agents for web navigation and sequential decision making (Li et al., 26 Jan 2026).
  • Non-stationary systems: Robust RL capable of rapid context identification, knowledge retention, and safeguard-enforced exploration in live production systems (Hamadanian et al., 2022).

JitRL thus provides an architectural and algorithmic toolkit for on-the-fly RL adaptation, blurring the boundary between policy deployment and optimization and enhancing agent resilience across real-world tasks.
