Scalable Operator Transformer (scOT)
- scOT is a neural operator framework that learns mappings between infinite-dimensional function spaces for complex PDEs.
- It incorporates advanced techniques such as inducing points, domain decomposition, and mixture-of-experts to efficiently handle irregular geometries and multi-scale data.
- Empirical benchmarks demonstrate that scOT variants achieve state-of-the-art accuracy and scalability in applications ranging from climate modeling to electronic structure.
Scalable Operator Transformer (scOT) refers to a class of neural operator architectures explicitly designed to learn mappings between infinite-dimensional function spaces—such as those arising in partial differential equations (PDEs)—with near-linear scalability in discretization size, geometric flexibility, and state-of-the-art computational efficiency. scOTs generalize the transformer paradigm to the operator-learning setting by introducing task-aligned architectural innovations, including locality/expert mixtures, geometric and multiscale processing, and bottlenecking strategies that decouple model complexity from discretization size. These models have established new benchmarks in operator learning for scientific computing, with applications spanning engineering, climate modeling, and electronic structure.
1. Architectural Principles and General Design
The scalable operator transformer concept encompasses several instantiations, including the Inducing Point Operator Transformer (IPOT), Mondrian, GAOT, Mixture-of-Experts Operator Transformers (MoE-POT), and Σ-Attention for quantum many-body systems. All variants retain the core operator-learning objective: approximating a solution operator $\mathcal{G}: \mathcal{A} \to \mathcal{U}$, typically with $\mathcal{A}$ and $\mathcal{U}$ Banach spaces of functions, where $\mathcal{G}$ is the forward or time-evolution map of a (possibly parametrized) PDE or integral equation.
Common defining features:
- Encode–Process–Decode Structure: Input field data are encoded as tokens using domain-aware embeddings/graph neural features, processed by variants of transformer blocks (attention and nonlinearities), and decoded at arbitrary output query locations for mesh-agnostic evaluation (Lee et al., 2023, Wen et al., 24 May 2025).
- Scalability: Linear or near-linear compute/memory growth with respect to input/output resolution, achieved via bottlenecking techniques (inducing points, domain decomposition, patching, expert sparsification, or subdomain-localization).
- Geometric and Multiscale Awareness: Direct handling of irregular geometries, point clouds, and multi-resolution data, enabled by neural graph operators, geometry embeddings, or subdomain operator blocks (Wen et al., 24 May 2025, Feeney et al., 9 Jun 2025).
- Operator-specific Attention Mechanisms: Attention layers generalize classical kernel integrals, acting over functions or subdomains, rather than sequence tokens only.
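The shared pattern—attention acting as a discretized kernel integral over sampled function values, evaluated at arbitrary query points—can be illustrated with a minimal NumPy sketch. The projection matrices, dimensions, and field below are illustrative placeholders, not details from any of the cited papers:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kernel_attention(query_coords, input_coords, values, Wq, Wk, Wv):
    """Attention as a discretized kernel integral:
    out(x_i) ~ sum_j softmax_j(<Q(x_i), K(y_j)>) V(u(y_j))."""
    q = query_coords @ Wq                                       # (M, d) queries from output coords
    k = input_coords @ Wk                                       # (N, d) keys from input coords
    v = values @ Wv                                             # (N, d) values from field samples
    kernel = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (M, N) row-stochastic kernel matrix
    return kernel @ v                                           # quadrature-like sum over input points

rng = np.random.default_rng(0)
d, N, M = 16, 2048, 64                          # latent width, input points, output queries
Wq, Wk = rng.normal(size=(2, d)), rng.normal(size=(2, d))
Wv = rng.normal(size=(1, d))
x_in = rng.uniform(size=(N, 2))                 # input sample coordinates (any point cloud)
x_out = rng.uniform(size=(M, 2))                # arbitrary output query coordinates
u_in = np.sin(6.0 * x_in[:, :1])                # sampled scalar input field u(x)
out = kernel_attention(x_out, x_in, u_in, Wq, Wk, Wv)
assert out.shape == (M, d)
```

Because the output is a weighted sum indexed by query coordinates rather than by token position, the same learned operator can be evaluated on any input discretization and at any output locations.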
2. Notable Instantiations
Inducing Point Operator Transformer (IPOT)
IPOT is a mesh-agnostic, attention-based neural operator for PDE solution operators. It uses:
- Encoder: Cross-attends learnable inducing queries over input fields and positional features, producing a fixed-size latent embedding.
- Processor: Self-attention blocks on the fixed $M$-token latent, decoupling processor cost from discretization size.
- Decoder: Cross-attends arbitrary output query locations to the latent to produce predictions.
The bottleneck size $M$ controls the compute/accuracy trade-off; typically $M$ is far smaller than the input/output resolution $N$ ($M \ll N$), enabling linear scaling in $N$. Positional encodings use Fourier features (Lee et al., 2023).
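The three-stage bottleneck can be sketched in a few lines of NumPy. Single-head attention and random embeddings stand in for IPOT's multi-head attention and Fourier positional features; all sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1) @ v

rng = np.random.default_rng(1)
d, M = 32, 64                                 # latent width, number of inducing points
N_in, N_out = 5000, 3000                      # input / output discretization sizes
inducing = rng.normal(size=(M, d))            # learnable inducing queries

# Encoder: cross-attend inducing queries over embedded input tokens -> fixed-size latent, O(N_in * M)
tokens = rng.normal(size=(N_in, 3)) @ rng.normal(size=(3, d))  # stand-in coord+field embedding
latent = attend(inducing, tokens, tokens)                      # (M, d)

# Processor: self-attention on the latent only, O(M^2) regardless of discretization
latent = latent + attend(latent, latent, latent)

# Decoder: cross-attend arbitrary output queries to the latent, O(N_out * M)
queries = rng.normal(size=(N_out, 2)) @ rng.normal(size=(2, d))
pred = attend(queries, latent, latent)                         # (N_out, d)
assert pred.shape == (N_out, d)
```

No step ever forms an $N_{in} \times N_{out}$ interaction, which is why cost grows linearly in the discretization sizes.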
Mondrian
Mondrian replaces quadratic-complexity attention over all input points with:
- Domain decomposition: Partitions the domain into subdomains.
- Subdomain-restricted attention: Each local function is mapped by neural operators (integral, mixture, spectral) to query, key, and value triplets; attention is computed over subdomains via inner products.
- Hierarchical/Neighborhood Extension: Windowed attention and local masking support multiscale operator learning and enable linear or near-linear scaling as the number of subdomains grows.
Performance is independent of discretization density, and the approach is suited to large, multiscale domains (Feeney et al., 9 Jun 2025).
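A toy version of subdomain-restricted attention is sketched below. Mean pooling stands in for Mondrian's learned integral/spectral operators, and the subdomain and channel counts are made-up values:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
S, N_per, c, d = 16, 512, 4, 32     # subdomains, points per subdomain, channels, latent width
u = rng.normal(size=(S, N_per, c))  # input function restricted to each subdomain
Wq, Wk, Wv = (rng.normal(size=(c, d)) for _ in range(3))

# Local operator per subdomain (mean pooling as a stand-in for a learned neural operator)
pooled = u.mean(axis=1)             # (S, c): one token per subdomain
q, k, v = pooled @ Wq, pooled @ Wk, pooled @ Wv

# Attention over S subdomain tokens: O(S^2), independent of the N_per points inside each
out = softmax(q @ k.T / np.sqrt(d), axis=-1) @ v
assert out.shape == (S, d)
```

Refining the mesh inside a subdomain changes only the cheap local pooling step, not the size of the attention computation, which is the sense in which performance is discretization-independent.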
Geometry Aware Operator Transformer (GAOT/scOT)
GAOT integrates:
- Multiscale Attentional Graph Neural Operator (MAGNO) encoders: Aggregate local graph-based features at several geometric scales, fused with domain geometry descriptors.
- Transformer Processor: Optionally grouped into spatial patches for efficient self-attention.
- Geometry Embeddings: Local descriptors, Fourier, or Laplace eigenfunctions, encoding domain shape.
- Decoder: Reverse MAGNO, aggregating and decoding latent features at query locations.
GAOT achieves low median relative errors (<1% on 2D time-independent PDEs) and superior throughput/memory efficiency in large-scale industrial and scientific tasks (Wen et al., 24 May 2025).
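The multiscale local aggregation in a MAGNO-style encoder can be illustrated as follows. Radius-based mean aggregation stands in for the paper's attentional aggregation, and the radii, point counts, and node counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, c = 4000, 256, 3              # input points, latent nodes, feature channels
pts = rng.uniform(size=(N, 2))      # point cloud on an irregular domain
nodes = rng.uniform(size=(K, 2))    # latent node positions
feats = rng.normal(size=(N, c))     # input features at each point

scales = []
for r in (0.05, 0.1, 0.2):          # aggregate at several geometric scales
    d2 = ((nodes[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # (K, N) squared distances
    nbr = (d2 < r * r).astype(float)                           # radius-r neighborhood mask
    w = nbr / np.maximum(nbr.sum(axis=1, keepdims=True), 1.0)  # mean-aggregation weights
    scales.append(w @ feats)                                   # (K, c) per-scale node features
latent = np.concatenate(scales, axis=-1)                       # (K, 3c) fused multiscale encoding
assert latent.shape == (K, 3 * c)
```

The fused per-node features would then be handed to the (optionally patched) transformer processor; the decoder runs the same aggregation in reverse, from latent nodes to arbitrary query locations.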
Mixture-of-Experts Operator Transformer (MoE-POT)
MoE-POT addresses parameter-efficiency and data heterogeneity by incorporating sparse Mixture-of-Experts layers:
- Expert Routing: A router network dynamically activates a small top-$k$ subset of the routed experts per sample, with always-active shared experts for universal priors, keeping the activated parameter count (and hence compute) at roughly 33% of total parameters.
- Specialization and Generalization: Sparse gating encourages experts to specialize for different PDE families/datasets while retaining shared structure for commonalities.
- Scaling Laws: Empirically, error decreases with total parameter count even as inference cost grows slowly; interpretability analysis reveals emergent dataset clustering by expert activation (Wang et al., 29 Oct 2025).
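A minimal sketch of top-$k$ expert routing with an always-active shared expert follows. The expert count, $k$, and widths are illustrative, and plain linear maps stand in for the operator-transformer expert blocks:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
E, k_active, d, n = 8, 2, 16, 10          # routed experts, active per token, width, tokens
Wr = rng.normal(size=(d, E))              # router weights
experts = rng.normal(size=(E, d, d)) * 0.1
shared = rng.normal(size=(d, d)) * 0.1    # always-active shared expert

def moe_layer(x):
    logits = x @ Wr                                            # (n, E) routing scores
    top = np.argsort(logits, axis=-1)[:, -k_active:]           # indices of top-k experts per token
    gates = softmax(np.take_along_axis(logits, top, -1))       # gates renormalized over the top-k
    out = x @ shared                                           # shared expert always contributes
    for j in range(k_active):                                  # only k of the E routed experts run
        out += gates[:, j:j+1] * np.einsum('nd,nde->ne', x, experts[top[:, j]])
    return out

y = moe_layer(rng.normal(size=(n, d)))
assert y.shape == (n, d)
```

Total capacity grows with E while per-token compute grows only with k, which is the decoupling that the scaling-law observations exploit.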
Σ-Attention
Σ-Attention specializes operator learning to the self-energy of strongly correlated electronic systems:
- Encoder-only Transformer: Processes input tensors over momentum–frequency slices.
- Loss and Training: A joint loss over composite datasets generated by perturbative and exact methods, providing accurate generalization across coupling regimes and system sizes.
Empirically, Σ-Attention achieves Green's-function errors significantly below perturbative baselines and enables extension to larger quantum systems (Zhu et al., 20 Apr 2025).
3. Mathematical Underpinnings and Complexity
At their core, attention layers in scOT approximate classical kernel integral operators,

$$(\mathcal{K} v)(x) = \int_{D} \kappa(x, y)\, v(y)\, \mathrm{d}y,$$

via multi-head attention with learned projections for queries, keys, and values. The design ensures that:
- Bottlenecking (e.g., IPOT): Main cost $O(NM)$ for input/output size $N$ and bottleneck size $M \ll N$.
- Domain Decomposition (Mondrian): Overall cost governed by the number of subdomains $S$ rather than the point count, balancing $O(S^2)$ inter-subdomain attention against local operator evaluations.
- MoE Layer Sparsity: Only a fraction of expert parameters are activated per input sample, decoupling maximal model capacity from inference FLOPs.
- Patch-based Processing (GAOT): Grouping latent nodes into patches divides the transformer token count by the patch size, keeping quadratic attention manageable.
Resulting architectures enable discretization invariance—for a fixed latent representation or decomposition, predictions remain valid across different input/output resolutions and domain discretizations.
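The asymptotic savings can be made concrete with back-of-the-envelope counts of pairwise attention interactions per layer. The latent, subdomain, and patch sizes below are illustrative choices, not values taken from the papers:

```python
# Pairwise attention interactions for one layer at a 50k-point discretization
N = 50_000        # input/output points
M = 256           # illustrative inducing-point / latent size
S = 100           # illustrative number of subdomains
P = N // 64       # illustrative token count after patching (patch size 64)

full_attention = N * N                  # vanilla transformer over all points
ipot = N * M + M * M + M * N            # encode + process + decode
mondrian = S * S                        # attention over subdomain tokens only
gaot_patched = P * P                    # attention over patched latent tokens

assert ipot < full_attention and mondrian < full_attention and gaot_patched < full_attention
print(full_attention // ipot)           # roughly two orders of magnitude fewer interactions
```

These counts are per attention layer; the local operator or pooling work that each scheme adds is linear in $N$ and does not change the asymptotic picture.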
4. Empirical Performance and Benchmarks
scOTs have consistently demonstrated competitive or superior accuracy and efficiency relative to prior neural operator methods:
| Benchmark | scOT Variant | Relative Error (%) | Throughput (samples/s) | Latency (ms) |
|---|---|---|---|---|
| Poisson (2D) | GAOT | 0.83 | >70 (up to 50k points) | 6.97 |
| Navier–Stokes | IPOT | 0.89 | 47.4 | 21.1 |
| ERA5 Forecast | IPOT | 0.66 | ~100 | 9.8 |
| Allen–Cahn 128x128 | Mondrian | 0.81 | N/A | N/A |
| NS (multiple datasets) | MoE-POT | ~5–5.5 | Matches dense model | 16.6 |
| 3D CFD (DrivAerNet++) | GAOT | 4.94 (MSE ×10⁻²) | N/A | N/A |
Significantly, scOTs achieve discretization invariance, adapt to irregular and complex geometries, and scale to samples with tens of thousands of points without loss in accuracy or tractability (Lee et al., 2023, Wen et al., 24 May 2025, Feeney et al., 9 Jun 2025, Wang et al., 29 Oct 2025).
5. Implementation, Training, and Practical Considerations
scOT implementations exploit standard deep learning infrastructure (PyTorch, CUDA, FlashAttention) and specialized routines for graph operations, neighbor searches, and patching in high-dimensional spaces. Recommended practices include:
- Precomputing Graph Edges: For geometric encoders, caching on disk accelerates training.
- Mixed-precision Training: Reduces peak memory, critical at large resolutions.
- Task-specific Hyperparameters: Adapting latent sizes, number of attention heads, and depth to the problem scale.
- Scalable Parallelism: Both sample-wise (across batch) and token-wise (within transformer) parallelization are effective.
Optimization strategies (AdamW, cosine annealing, auxiliary load-balancing for experts) and data augmentation (auto-regressive denoising, multi-dataset mixing) are typical. Scalability to 3D and fine-resolution benchmarks is empirically validated.
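For reference, the warmup-plus-cosine-annealing schedule typically paired with AdamW can be written in a few lines. The base rate, floor, and warmup length below are common illustrative defaults, not values from the cited papers:

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=1e-5, warmup=500):
    """Linear warmup followed by cosine decay from base_lr down to min_lr."""
    if step < warmup:
        return base_lr * step / warmup                      # linear ramp-up
    t = (step - warmup) / max(total_steps - warmup, 1)      # decay progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

assert cosine_lr(0, 10_000) == 0.0                          # starts at zero
assert abs(cosine_lr(500, 10_000) - 1e-3) < 1e-12           # peaks at base_lr after warmup
assert abs(cosine_lr(10_000, 10_000) - 1e-5) < 1e-12        # anneals to min_lr
```

Framework schedulers (e.g., PyTorch's CosineAnnealingLR) implement the same curve; the standalone form is useful when combining it with per-group rates for router and expert parameters.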
6. Significance and Impact on Operator Learning
scOTs mark a transition in operator learning methodology:
- They unify domain-aware representation learning (as in GNNs and geometric ML), scalable transformer architectures, and operator-theoretic kernel approximations.
- Effective generalization across discretizations, domains, and equation classes is empirically confirmed.
- Mixture-of-experts and local-global decompositions open paths to further scaling and transfer across scientific domains.
- A plausible implication is that scOTs can underpin new-generation neural surrogates for real-time, large-scale simulation in science and engineering.
The robust interpretability of expert gating and the transfer of learned features across datasets suggest continued advances in universality and specialization—a notable direction for pre-trained universal scientific operators.
7. Current Challenges and Outlook
Open technical questions include:
- Automated selection of bottleneck sizes for optimal accuracy/computation trade-off.
- Theoretical analysis of expressivity and convergence in function-space transformer architectures.
- Integration of physical priors, uncertainty quantification, and long-term stable rollouts in dynamical systems.
- Extension to higher-dimensional or multi-physics coupling, possibly via hierarchical or sparse two-step attention.
Continued comparative benchmarking and development of scalable toolkits for domain scientists remain priorities for the field.
References:
- "Inducing Point Operator Transformer: A Flexible and Scalable Architecture for Solving PDEs" (Lee et al., 2023).
- "Geometry Aware Operator Transformer as an Efficient and Accurate Neural Surrogate for PDEs on Arbitrary Domains" (Wen et al., 24 May 2025).
- "Mondrian: Transformer Operators via Domain Decomposition" (Feeney et al., 9 Jun 2025).
- "Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training" (Wang et al., 29 Oct 2025).
- "Σ-Attention: A Transformer-based operator learning framework for self-energy in strongly correlated systems" (Zhu et al., 20 Apr 2025).