Transformer Surrogates for Partial Control
- Transformer surrogates for partial control are specialized models that predict control signals using limited, tokenized system data and innovative masking techniques.
- They leverage architectures like encoder-only transformers, masked autoencoders, and decision transformers to manage partial observability and selective actuation effectively.
- Recent frameworks incorporate low-rank adaptations and compute-adaptive modules, achieving reduced prediction errors and real-time control in applications such as PDE regulation and controllable image synthesis.
Transformer surrogates for partial control refer to transformer-based models specifically designed or adapted to serve as surrogates for control or inference tasks where only a subset of system inputs, states, or outputs are actuated, observed, or directly manipulated. These surrogates leverage transformer architectures—encoder-only, masked autoencoders, sequence models, or hybrid multimodal forms—to efficiently emulate the control-relevant dynamics, predict future behavior, or generate actuated outputs, often under constraints such as limited data, partial observability, or compute-adaptive inference. Recent advances have demonstrated transformer surrogates in domains including explicit model predictive control, control of PDEs, image synthesis with controllable conditions, and closed-loop system identification from sparse signals.
1. Theoretical Foundations and Motivation
Transformer surrogates for control tasks are motivated by the limitations of classical, mechanistic control frameworks, especially in settings with complex dynamics, variable horizons, large state/action spaces, and partial observability. Traditional online Model Predictive Control (MPC) faces excessive computational complexity due to the need for real-time optimization, and explicit MPC mitigates this by solving policies offline. However, explicit MPC methods rely on restrictive assumptions—linear systems, fixed horizons, and simple costs—leading to exponential region-partition growth and poor adaptability in nonlinear or high-dimensional regimes (Wu et al., 9 Sep 2025).
In partially actuated or observed systems (partial control), only selected components of the state or action vectors are accessible or relevant. The surrogate must efficiently generate control signals or predictions for these "partial" components as well as accommodate incomplete or noisy observations, requiring flexible tokenization and decoding strategies (Zhang et al., 2024).
2. Architectures of Transformer Surrogates for Partial Control
Transformer surrogates for partial control utilize several architectural paradigms to encode relevant quantities, handle variable input/output granularity, and manage partial observability:
- Encoder-only Transformer Policies: For control sequences, the optimal action sequence is approximated by a transformer policy using bidirectional self-attention. State, reference, and (for partial control) only selected components are tokenized via embedding layers; output decoding is adapted to emit only the required subset (Wu et al., 9 Sep 2025).
- Masked Autoencoder Transformers: In surrogate modeling for multi-modal systems (e.g., fused scalar/image outputs), masked autoencoder transformer backbones allow for reconstructing missing or partially observed outputs, enabling state estimation or control even under partial feedback. The masked reconstruction loss directly supports partial control by interpolating unobserved targets using available data (Olson et al., 2023).
- Decision Transformer Sequence Modeling: For partially observable systems, decision transformers (DT) encode histories of observations, actions, and (optionally) rewards-to-go as token sequences. The output head is designed to predict only the actuated parts or next actions, allowing direct deployment in partial-control closed-loops (Zhang et al., 2024).
- Plug-in Modules for Partial Control/Computation: Compute-adaptive patch-based surrogates employ lightweight modules—Convolutional Kernel Modulator (CKM) and Convolutional Stride Modulator (CSM)—wrapped around tokenization stages. These modulate the granularity of control or prediction (patch size) at inference without retraining, facilitating precise partial deployment on budget-constrained systems (Mukhopadhyay et al., 12 Jul 2025).
- Low-Rank Adaptation (LoRA) and Conditional Branching: LoRA modules introduce low-rank parameter updates at attention blocks, allowing efficient specialization for conditional or partial control signals (e.g., for conditioning image synthesis on ancillary modalities or actuators) (Liu et al., 14 Aug 2025).
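To make the encoder-only, token-subsetting paradigm concrete, the following is a minimal numpy sketch; the two-token layout, embedding matrices, and weight scales are illustrative assumptions, not the TransMPC implementation. State and reference vectors are embedded as tokens, passed through one layer of bidirectional self-attention, and decoded by an output head sized to only the actuated components:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """One head of bidirectional (unmasked) self-attention."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 16                       # embedding width (illustrative)
state = rng.normal(size=4)   # full state vector, 4 components
ref = rng.normal(size=4)     # reference to track
actuated = [0, 2]            # only these actuators are controllable

# Linear token embeddings for state and reference (hypothetical).
E_s = rng.normal(size=(4, d)) * 0.3
E_r = rng.normal(size=(4, d)) * 0.3
tokens = np.stack([state @ E_s, ref @ E_r])        # (2, d)

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
h = self_attention(tokens, Wq, Wk, Wv)             # (2, d)

# Output head dimensioned to the actuated subset only.
W_out = rng.normal(size=(d, len(actuated))) * 0.1
u_partial = h.mean(axis=0) @ W_out                 # actions for u_0 and u_2 only
print(u_partial.shape)
```

The key structural point is the last two lines: the decoder's output dimension equals the size of the actuated subset, so non-actuated components are never predicted at all.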
3. Training and Optimization Frameworks
Surrogate transformer training for partial control exploits diverse direct-optimization and adaptation schedules:
- Direct Policy Optimization: Unlike imitation-based approaches that depend on precomputed trajectories, direct policy optimization alternates sampling and learning phases. Horizons and initial states are drawn i.i.d., action sequences are produced by the surrogate, and gradients are computed via automatic differentiation of the rollout cost (Wu et al., 9 Sep 2025). Regularization (e.g., weight-decay penalties) stabilizes training.
- Masked Pretraining and Fine-tuning: Surrogate models pretrain on large simulation or synthetic datasets (multi-modal, multi-task) with both forward prediction and masked reconstruction objectives. Fine-tuning on limited real data is restricted to select decoder blocks, with post-hoc bias/variance correction applied to match target experimental distributions (Olson et al., 2023).
- Low-Rank Adaptation (LoRA) for Transfer: Transfer learning from large-scale language-model pretraining is accomplished by freezing main backbone weights and optimizing only low-rank adapter modules for the desired partial-control task, dramatically reducing sample complexity while retaining generalization (Zhang et al., 2024, Liu et al., 14 Aug 2025).
- Graph-Based Hyperparameter Optimization: Discrete hyperparameter grid searches and graph smoothing are used to choose optimal configurations for training/fine-tuning under sparse data regimes, particularly when cross-validation sets are small or partial (Olson et al., 2023).
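A toy version of the alternating sample-and-learn loop above can be sketched as follows. The double-integrator dynamics, linear feedback "surrogate," and finite-difference gradient (standing in for automatic differentiation) are all illustrative assumptions; the structure to note is i.i.d. horizon/state sampling, differentiation of the rollout cost, and a weight-decay term:

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # toy double-integrator dynamics (dt = 0.1)
B = np.array([[0.0], [0.1]])

def rollout_cost(K, x0, H):
    """Quadratic rollout cost under linear feedback u = K @ x."""
    x, cost = x0.copy(), 0.0
    for _ in range(H):
        u = K @ x
        cost += float(x @ x) + 0.1 * float(u @ u)
        x = A @ x + B @ u
    return cost

def batch_cost(K, n=8, seed=2):
    """Average rollout cost over i.i.d. horizons and initial states."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n):
        H = int(rng.integers(5, 15))
        x0 = rng.normal(size=2)
        total += rollout_cost(K, x0, H)
    return total / n

K = np.zeros((1, 2))                      # stand-in for surrogate parameters
lr, eps, wd = 1e-3, 1e-4, 1e-3
for step in range(150):
    base = batch_cost(K, seed=step)       # fresh i.i.d. samples each phase
    grad = np.zeros_like(K)
    for idx in np.ndindex(*K.shape):      # finite-difference gradient
        Kp = K.copy()
        Kp[idx] += eps
        grad[idx] = (batch_cost(Kp, seed=step) - base) / eps
    K -= lr * (grad + wd * K)             # gradient step with weight decay
```

In the transformer setting the finite-difference loop is replaced by backpropagation through the rollout, but the alternation of sampling and gradient phases is the same.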
4. Handling Partial Actuation, Observability, and Control Targeting
Partial control scenarios are accommodated via targeted tokenization, decoding, embedding strategies, and attention-masking:
- Token Subsetting and Masking: Transformers can tokenize only the relevant actuated or observed components and ignore the rest. Output heads or decoders are dimensionally reduced to emit actions for only the selected actuators, with corresponding adjustments to the cost function (Wu et al., 9 Sep 2025).
- Masked Reconstruction and Partial Feedback: In settings with partial output feedback, e.g., observing subsets of scalar/image channels, the surrogate uses masked transformer decoding to interpolate or reconstruct unobserved states, improving state estimation for control loops (Olson et al., 2023).
- Causal Attention and Patch Modulation: For temporally evolving systems, causal attention or dynamic patch/stride control enables partial rollout over time, with computation-adaptive granularity (e.g., via CKM/CSM modules) for budget-constrained inference (Mukhopadhyay et al., 12 Jul 2025).
- KV-Context Augmentation for Conditional Fusion: LoRA-style side branches at every attention block concatenate key/value streams for condition-specific features, enabling deep, persistent fusion of control signals into generative models with negligible overhead (Liu et al., 14 Aug 2025).
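A minimal sketch of the KV-context augmentation idea closes out this list; the dimensions, rank, and weight scales are illustrative assumptions, not the NanoControl implementation. Condition features are projected through a low-rank branch, and their keys/values are concatenated onto the backbone's attention context, so every backbone query can attend to the control signal:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, C, d, r = 5, 3, 16, 2         # backbone tokens, condition tokens, width, rank

x = rng.normal(size=(T, d))      # backbone token sequence
cond = rng.normal(size=(C, d))   # control-condition features

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.2 for _ in range(3))
# Low-rank (LoRA-style) side branch projecting conditions into K/V space:
Ak, Bk = rng.normal(size=(d, r)) * 0.2, rng.normal(size=(r, d)) * 0.2
Av, Bv = rng.normal(size=(d, r)) * 0.2, rng.normal(size=(r, d)) * 0.2

Q = x @ Wq
K = np.concatenate([x @ Wk, cond @ Ak @ Bk])   # keys:   (T + C, d)
V = np.concatenate([x @ Wv, cond @ Av @ Bv])   # values: (T + C, d)
out = softmax(Q @ K.T / np.sqrt(d)) @ V        # every query sees the conditions
print(out.shape)
```

Because only the rank-r factors are trained, the per-block overhead is the low-rank parameters plus a slightly longer attention context.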
5. Computational Efficiency and Generalization Properties
Transformer surrogates for partial control achieve computational efficiency and generalize effectively across horizon lengths, system parameters, and control subsets:
- Inference Complexity: A single forward pass generates the full action sequence for sequence transformers with bidirectional policies (Wu et al., 9 Sep 2025), while attention cost scales quadratically with token count for patch-based models; inference remains sub-millisecond even for long horizons.
- Parameter and FLOPs Overhead: Plug-in lightweight modules (CKM, CSM, LoRA control branches) add only a small fraction of the model's parameters or FLOPs, in contrast to duplication-based control methods (Liu et al., 14 Aug 2025).
- Generalization and Adaptation: Randomized horizon/state sampling and masking during training ensure uniform coverage over the full controllable state/action/horizon domain. In multi-task and zero/few-shot settings, transformer surrogates exhibit steep learning curves and rapid adaptation, surpassing traditional controller performance with only 5–10 adaptation trajectories (Zhang et al., 2024).
- Empirical Rollout Results: Compute-elastic surrogates (CKM/CSM + cyclic patch-size rollout) suppress long-horizon grid artifacts, reduce prediction error by 16–50% on 2D/3D PDE benchmarks, and enable flexible cost-accuracy tradeoff at inference (Mukhopadhyay et al., 12 Jul 2025).
| Model/Module | Partial Control Mechanism | Overhead |
|---|---|---|
| TransMPC (Wu et al., 9 Sep 2025) | Output/token subsetting, adjusted cost function | negligible |
| DT-LoRA (Zhang et al., 2024) | Sequence masking, LoRA head | low (LoRA adapters only) |
| NanoControl (Liu et al., 14 Aug 2025) | KV-context LoRA side branch | LoRA side-branch params only |
| CKM/CSM (Mukhopadhyay et al., 12 Jul 2025) | Encoder/decoder patch modulator | zero |
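The patch-size lever behind the compute-adaptive rows above can be illustrated with a plain tokenizer; the `patchify` helper and the quadratic-FLOPs proxy are illustrative assumptions rather than the CKM/CSM implementation. Coarsening the patch from 4 to 16 cuts the token count 16-fold and the dominant attention cost, quadratic in token count, by roughly 256x:

```python
import numpy as np

def patchify(field, p):
    """Split an (N, N) field into non-overlapping p x p patch tokens."""
    N = field.shape[0]
    assert N % p == 0, "patch size must divide the grid"
    t = field.reshape(N // p, p, N // p, p).transpose(0, 2, 1, 3)
    return t.reshape(-1, p * p)            # (num_tokens, patch_dim)

field = np.arange(64 * 64, dtype=float).reshape(64, 64)
for p in (4, 8, 16):                       # coarser patches -> fewer tokens
    tokens = patchify(field, p)
    attn_cost = tokens.shape[0] ** 2       # attention is quadratic in tokens
    print(p, tokens.shape, attn_cost)
```

Modulating `p` at inference time (as CKM/CSM do via kernel and stride changes) trades spatial resolution against attention cost without retraining the backbone.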
6. Domain-Specific Applications and Limitations
Transformer surrogates for partial control are deployed across a spectrum of domains:
- Explicit MPC and Partial Actuation: Real-time trajectory generation for vehicle tracking, robotic obstacle avoidance, and control problems with variable actuator sets are efficiently solved via encoder-only transformer surrogates (Wu et al., 9 Sep 2025).
- Surrogate Modeling in Physics/Simulation: Multi-modal surrogates for inertial confinement fusion predict both scalar burn metrics and X-ray images given arbitrary experimental/control design parameters. Partial feedback and masked reconstruction enable robust state estimation (Olson et al., 2023).
- PDE Control and Budget-Conscious Inference: Compute-adaptive surrogates for time-evolving PDEs dynamically modulate patch/stride size (CKM/CSM) for efficient, high-quality predictions tailored to available computational resources (Mukhopadhyay et al., 12 Jul 2025).
- Controllable Image Synthesis: Deep fusion mechanisms (KV-context augmentation), LoRA modules, and conditional-adaptive attention in diffusion transformers enable precise, selective control over text-to-image outputs without duplicating the backbone (Liu et al., 14 Aug 2025).
Limitations include the requirement for a sufficient validation signal (even a single shot) to select hyperparameters, architectural dependence on a tokenizable structure, and the need for further refinement when rolling out to continuous or multi-dimensional partial-control settings (e.g., pulse shaping).
7. Future Directions and Open Challenges
Current transformer surrogate frameworks for partial control highlight several open avenues:
- Extension to Infinite-Horizon/Real-Time Control: Fixed context length in decision transformers and inflexible horizon limits in explicit sequence models restrict deployment in infinite-horizon or high-frequency real-time control (Zhang et al., 2024).
- Handling Non-Stationarity and Dynamic Control Sets: Further architectural innovations are required to encode absolute time or adapt dynamically to changing control subsets in non-stationary policies (Wu et al., 9 Sep 2025).
- Scalability in High-Dimensional Partial Control: For scenarios with high-dimensional actions or observations, more sophisticated embedding and attention masking mechanisms—including causal attention and advanced conditional fusion—are needed (Liu et al., 14 Aug 2025).
- Integration of Physical Priors and Hybrid Models: Accelerating generalization and robustness may benefit from integrating transformer surrogates with physically-informed constraints or hybrid graph/neural architectures (Olson et al., 2023).
- Compute-Cost–Quality Tradeoff Automation: Further research is warranted on automated scheduling strategies that optimize patch/stride modulation and resource allocation, especially for applications in real-time scientific computing and hardware-constrained deployments (Mukhopadhyay et al., 12 Jul 2025).
These directions offer pathways to unify learning-based surrogates with tailored partial control for broad deployment in high-stakes scientific, industrial, and engineering contexts.