Adaptive RL and MPC Switching (ARMS)

Updated 30 January 2026
  • ARMS is a hybrid control framework that fuses reinforcement learning with model predictive control to achieve adaptive performance and safety in dynamic environments.
  • It employs explicit RL-based, event-triggered, and neural soft switching mechanisms to reduce computational load while ensuring constraint satisfaction, cutting MPC invocations to 25% or less of an always-on MPC baseline.
  • Applications in embedded systems, mobile robotics, and safe human-robot interaction demonstrate the effectiveness of ARMS, lowering travel times and keeping collision rates below 1%.

Adaptive Reinforcement and Model Predictive Control Switching (ARMS) frameworks address the challenging integration of learning-based controllers—such as reinforcement learning (RL) agents—with classical model predictive control (MPC), to achieve a control policy that adaptively balances performance, constraint satisfaction, sample efficiency, and computational resource expenditure. ARMS approaches are motivated by the trade-offs between the computational overhead of real-time, online MPC, the adaptability and expressivity of RL, and the need for rigorous safety or stability guarantees in complex, partially observed, or stochastic dynamic environments. Multiple instantiations of ARMS architectures have been proposed, notably for embedded systems, mobile robotics, and safe human-robot interaction (Bøhn et al., 2020, Shin et al., 2021, Liu et al., 23 Jan 2026).

1. Core Principles and Hybrid Architecture

ARMS is characterized by a modular, hybrid architecture that fuses high-level planning or safety filtering from an MPC module with a data-driven, adaptive RL policy. In canonical formulations, each control cycle proceeds as follows:

  • The RL controller (policy $\pi_\theta$) operates as the "default" policy, exploiting prior experience to select actions or plans rapidly.
  • The MPC module, either as a full receding-horizon optimizer (high-dimensional, high-fidelity planning) (Bøhn et al., 2020, Shin et al., 2021) or a one-step quadratic program (QP) safety filter (Liu et al., 23 Jan 2026), serves as a backup, filter, or planner invoked under certain risk or uncertainty conditions.
  • A switching mechanism—implemented as an event-triggered scheme, a stochastic policy learned by RL, or a learned neural gate—decides whether to take the RL action, the MPC/safety-filter action, or a convex blend thereof, at each control tick.

This structure enables the controller to benefit from the sample efficiency and reactivity of RL while preserving the safety, constraint satisfaction, or robust stabilization properties guaranteed by MPC in critical or high-risk scenarios. At deployment, the proportion of computationally intensive MPC calls is regulated, yielding resource and energy savings (Bøhn et al., 2020, Liu et al., 23 Jan 2026).
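The canonical cycle above can be sketched as a single control tick. This is a minimal illustration, not an implementation from the cited papers; `rl_policy`, `mpc_solve`, and `switch` are hypothetical stand-ins for the respective modules.

```python
# Minimal sketch of one ARMS control tick: the RL policy is the fast
# default, and the expensive MPC backup is invoked only when the
# switching mechanism requests it.

def arms_step(x, rl_policy, mpc_solve, switch):
    """One control tick: take the RL action unless the switch requests MPC."""
    a_rl = rl_policy(x)          # fast, data-driven default action
    if switch(x, a_rl):          # risk/uncertainty-based switching decision
        return mpc_solve(x)      # expensive receding-horizon backup
    return a_rl

# Toy usage: invoke the "MPC" module (here, a zero action) whenever the
# scalar state leaves a nominal region |x| <= 1.
action = arms_step(
    x=1.5,
    rl_policy=lambda x: -0.5 * x,
    mpc_solve=lambda x: 0.0,
    switch=lambda x, a: abs(x) > 1.0,
)
```

The same skeleton accommodates all three switching mechanisms discussed below: `switch` may be a learned policy, an event trigger, or a soft gate that blends the two actions instead of selecting one.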

2. Switching Mechanisms and Policy Learning

The ARMS switching policy is central to maintaining a balance between performance and efficiency. Three main mechanisms have emerged:

  • Explicit RL-based switching: A parameterized policy $\pi_\theta(a \mid s)$ decides, at each step, whether to recompute the MPC plan or continue using a compensating controller. The augmented RL state incorporates the current and previous states, along with the elapsed duration since the last MPC update, yielding a Markovian decision process. The learning objective jointly optimizes control cost and computational burden:

$$r_k = -\ell(x_k, u_k) - \lambda\, \mathbb{I}\{a_k = 1\}$$

where $\lambda > 0$ penalizes expensive MPC solves and $a_k = 1$ denotes triggering an MPC recomputation (Bøhn et al., 2020).

  • Event-triggered stochastic switching: Events (e.g., imminent collision, goal proximity) are detected, and a Bernoulli variable with a tunable probability $\epsilon$ determines, upon an event, whether to invoke MPC or defer to the meta-learned policy. This mechanism interleaves high-quality MPC data into the RL policy's training, improving sample efficiency and robustness (Shin et al., 2021).
  • Neural soft switching: A context-aware switcher, implemented as a neural network, computes a gating parameter $\alpha_t \in [0, 1]$ as a function of risk and feasibility features, enabling convex blending of RL and QP actions:

$$a_t = (1 - \bar\alpha_t)\, a^{f}_t + \bar\alpha_t\, a^{qp}_t$$

where $a^{f}_t$ is the RL follower's action and $a^{qp}_t$ the QP safety filter's action. The switcher is adaptively trained to favor the QP module in low-risk regions and the RL policy under high risk or infeasible QPs (Liu et al., 23 Jan 2026).
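The three mechanisms reduce to small, testable primitives. The following sketch illustrates each in toy form; the function names are ours, and the learned components (switching policy, event detector, neural gate) are replaced by their scalar outputs.

```python
import random

def switching_reward(stage_cost, used_mpc, lam=0.1):
    """RL-based switching: r = -stage cost - lambda * 1{MPC was solved}."""
    return -stage_cost - lam * (1.0 if used_mpc else 0.0)

def event_triggered_switch(event, eps=0.5, rng=random.Random(0)):
    """Event-triggered stochastic switching: on an event, invoke MPC with
    probability eps; otherwise defer to the learned policy.
    (Fixed seed here only for reproducibility of the sketch.)"""
    return event and rng.random() < eps

def soft_blend(a_rl, a_qp, alpha):
    """Neural soft switching: convex blend a = (1-alpha)*a_rl + alpha*a_qp,
    where alpha would come from the learned gating network."""
    assert 0.0 <= alpha <= 1.0
    return (1.0 - alpha) * a_rl + alpha * a_qp
```

For instance, `switching_reward(2.0, True, lam=0.5)` returns `-2.5`, making the trade-off between control cost and MPC compute explicit, while `soft_blend(1.0, 3.0, 0.5)` returns the midpoint `2.0`.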

3. Controller and System Formulations

ARMS schemes admit distinct controller designs, unified by their modularity:

  • Full-horizon MPC: Solves a discrete-time optimal control problem over a finite horizon $N$, enforcing dynamics, input/state constraints, and possibly risk-aware or safety-constrained objectives. The first control is applied, with subsequent plans cached (Bøhn et al., 2020, Shin et al., 2021).
  • Linear state feedback compensator: Used between MPC solves to correct nominal model prediction errors and achieve local stabilization at low computational cost. Typically designed via LQR on linearized dynamics (Bøhn et al., 2020).
  • One-step QP-based safety filter: Solves a QP at each step to minimally alter the RL policy's action to satisfy safety, clearance, and actuator constraints. If infeasible, control reverts to the RL follower (Liu et al., 23 Jan 2026).
  • Reinforcement learning controllers: Policies are trained using PPO (robot follower), PEARL+SAC (meta-learning navigator), or other off-policy algorithms, leveraging both raw and MPC-augmented data (Shin et al., 2021, Liu et al., 23 Jan 2026).

MPC and RL modules may share or maintain separate state estimates. Perception modules commonly deploy VAE-based LiDAR encoders and LSTM-based temporal human-motion encoders to cope with partial observability and dynamic obstacles (Liu et al., 23 Jan 2026).
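The one-step QP filter admits a closed-form analogue in one dimension, which makes its semantics concrete: minimally modify the RL action subject to actuator bounds and a linear safety constraint, and revert to the RL action when the QP is infeasible. The following is a toy 1-D sketch (our construction, not code from the cited work); a real filter would solve a multi-dimensional QP over collision and clearance constraints.

```python
def one_step_safety_filter(a_rl, a_min, a_max, g, h):
    """Toy 1-D analogue of the one-step QP safety filter:
        minimize (a - a_rl)^2  s.t.  a_min <= a <= a_max,  g*a <= h.
    In 1-D the QP solution is a clip onto the feasible interval; if the
    interval is empty, fall back to the unfiltered RL action, mirroring
    the infeasibility fallback described above.
    Returns (action, qp_was_feasible)."""
    lo, hi = a_min, a_max
    if g > 0:
        hi = min(hi, h / g)          # g*a <= h tightens the upper bound
    elif g < 0:
        lo = max(lo, h / g)          # ... or the lower bound
    elif h < 0:
        return a_rl, False           # g == 0: constraint 0 <= h violated
    if lo > hi:
        return a_rl, False           # QP infeasible: revert to RL
    return min(max(a_rl, lo), hi), True

# Usage: an aggressive RL action a_rl = 2.0 is minimally clipped to the
# safe bound 0.5 imposed by the constraint a <= 0.5.
safe_action, feasible = one_step_safety_filter(2.0, -1.0, 1.0, g=1.0, h=0.5)
```

The minimal-modification objective is what distinguishes this from a hard override: the filter preserves the RL action whenever it is already safe.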

4. Applications and Empirical Performance

ARMS architectures have been extensively validated in both simulation and real-world settings, with application domains including:

  • Embedded/energy-constrained systems: Across tasks such as cart-pendulum swing-up and battery storage arbitrage, ARMS achieves near-optimal performance (<1% cost degradation) with considerably reduced MPC calls—only 25% or less compared to full-MPC, and up to 7% better profit than fixed-schedule baselines (Bøhn et al., 2020).
  • Mobile robot navigation in dynamic environments: In complex scenes, ARMS with event-triggered stochastic switching attains 0.92 success rate and reduces average travel time from 64.3 s (RL-only) to 38.6 s, while maintaining collision rates ≤1% and reducing meta-test compute by >80% compared to always-on MPC (Shin et al., 2021).
  • Safe human-robot cooperative navigation: With neural soft switching, ARMS yields 82.5% success in highly cluttered environments, outperforming Dynamic Window Approach (75.4%) and RL-only baselines (79.4%), and achieves 33% lower latency per control step (5.2 ms) compared to a multi-step MPC baseline (Liu et al., 23 Jan 2026).

Transfer to Gazebo simulation and initial real-world deployment (Clearpath Ridgeback with planar LiDAR and RGB-D camera) indicate robustness of the ARMS design under realistic sensor noise and communication latencies (Liu et al., 23 Jan 2026).

5. Theoretical Properties and Stability Considerations

ARMS designs harness several theoretical guarantees:

  • Markov Decision Process Guarantee: Augmenting the switching policy's state as $s_k = [x_k, x_{k_0}, k - k_0]$, where $k_0$ is the time of the last MPC recomputation, yields a Markovian system, admitting standard convergence analysis under policy-gradient RL (Bøhn et al., 2020).
  • Recursive Feasibility and Stability: Whenever the switching mechanism triggers a full MPC recomputation, recursive feasibility and stability are assured via proper terminal costs and constraints. Between recomputations, the compensating controller maintains bounded error dynamics (Bøhn et al., 2020).
  • Safety Filter Guarantees: In architectures with one-step QP filters, control actions always satisfy hard physical and safety constraints so long as the QP is feasible; fallback to RL only occurs under infeasibility, mitigating the risk of policy "freezing" in constrained passages (Liu et al., 23 Jan 2026).

The reward structures used in switching policy learning directly instantiate the performance–cost or safety–efficiency trade-offs, ensuring optimal recomputation or fusion decisions subject to user-tunable penalties.

6. Implementation and Computational Considerations

ARMS frameworks have been deployed in real-time control contexts with tight computational constraints:

  • Control Loop Rates: ARMS achieves stable closed-loop performance at rates of 10–20 Hz, with per-step latency for RL inference and neural switching under 0.15 ms; QP solves are performed within ∼1 ms using convex optimization libraries (e.g., OSQP) (Liu et al., 23 Jan 2026).
  • Action and Sensing Pipelines: Modular observation encoders, including VAE for LiDAR and LSTM for human–robot state, are decoupled and frozen during policy learning to accelerate downstream inference (Liu et al., 23 Jan 2026).
  • Deployment Platforms: Demonstrations encompass simulated physics, Gazebo environments, and physical robots with on-board sensors and computation. Open-source code is available for reproduction and further research (Liu et al., 23 Jan 2026).

Typical ARMS implementations leverage standard RL libraries (e.g., Stable Baselines PPO), real-time operating frameworks (e.g., ROS), and well-established MPC solvers.
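The quoted loop rates imply a hard per-tick budget (50–100 ms at 10–20 Hz). A minimal sketch of a fixed-rate loop with a deadline check is given below; this is illustrative only, and a real deployment would rely on a ROS rate object or an RTOS timer rather than `time.sleep`.

```python
import time

def run_loop(step_fn, rate_hz=20.0, n_ticks=5):
    """Run step_fn at a fixed rate, counting deadline overruns.
    step_fn stands in for one ARMS tick (RL inference, switching
    decision, and possibly a QP solve or MPC recomputation)."""
    period = 1.0 / rate_hz
    overruns = 0
    for _ in range(n_ticks):
        t0 = time.perf_counter()
        step_fn()
        elapsed = time.perf_counter() - t0
        if elapsed > period:
            overruns += 1                    # missed the control deadline
        else:
            time.sleep(period - elapsed)     # idle out the rest of the tick
    return overruns
```

Counting overruns this way is one simple way to verify that the sub-millisecond RL/switching path and the occasional MPC solve jointly fit the budget on target hardware.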

7. Context and Extensions

The ARMS paradigm generalizes to a range of hybrid control and safe learning problems, providing a flexible interface between classical model-based control and modern data-driven adaptation. ARMS complements and extends classical shared-authority and safety-filtering approaches by embedding adaptive, context-aware switching or blending, facilitating scalable deployment to realistic, resource-constrained, and uncertain environments. Comparative studies indicate that ARMS architectures consistently outperform non-adaptive MPC, RL-only, and heuristic switching baselines across a spectrum of domains and risk profiles (Bøhn et al., 2020, Shin et al., 2021, Liu et al., 23 Jan 2026). A plausible implication is that further enhancements to adaptive switching mechanisms (for example, via risk-sensitive or model-based RL) and tighter coupling between model learning and controller design could improve sample efficiency, transferability, and robustness of future ARMS systems.
