Poseidon: Efficient Foundation Models for PDEs
Abstract: We introduce Poseidon, a foundation model for learning the solution operators of PDEs. It is based on a multiscale operator transformer with time-conditioned layer norms that enable continuous-in-time evaluations. We also propose a novel training strategy that leverages the semi-group property of time-dependent PDEs to significantly scale up the training data. Poseidon is pretrained on a diverse, large-scale dataset for the governing equations of fluid dynamics. It is then evaluated on a suite of 15 challenging downstream tasks spanning a wide variety of PDE types and operators. We show that Poseidon exhibits excellent performance across the board, significantly outperforming baselines in both sample efficiency and accuracy. Poseidon also generalizes very well to physics not seen during pretraining. Moreover, Poseidon scales with respect to model and data size, both for pretraining and for downstream tasks. Taken together, our results showcase the surprising ability of Poseidon to learn effective representations from a very small set of PDEs during pretraining and to generalize well to unseen and unrelated PDEs downstream, demonstrating its potential as an effective, general-purpose PDE foundation model. Finally, the Poseidon model as well as the underlying pretraining and downstream datasets are open sourced, with code available at https://github.com/camlab-ethz/poseidon and pretrained models and datasets at https://huggingface.co/camlab-ethz.
Knowledge Gaps
Unresolved Gaps, Limitations, and Open Questions
Below is a consolidated list of concrete gaps and open questions that the paper leaves unresolved, framed to guide future research.
- Theoretical guarantees: No formal analysis of approximation error, stability, or generalization for scOT (the scalable Operator Transformer backbone) or Poseidon (e.g., bounds on operator error under distribution shift, guarantees for continuous-in-time evaluation, or conditions ensuring semigroup consistency).
- Semigroup-based training validity: The all2all loss assumes a time-homogeneous semigroup; it is unclear how valid/effective it is for non-autonomous PDEs (time-dependent coefficients/forcing) where only an evolution family (two-parameter propagator) exists.
- Bias/variance in all2all: The quadratic reuse of within-trajectory pairs (every pair (t_i, t_j) with t_i < t_j from the same trajectory) may introduce strong correlations, potentially inflating the "effective" sample size; the impact on optimization stability and generalization is unquantified.
- Fairness of comparisons: Baselines (e.g., FNO/CNO trained from scratch) do not appear to leverage the same all2all objective; ablations that train baselines with identical loss and sampling protocols are missing.
- Temporal extrapolation: The model uses normalized lead time in [0,1] and linear time-conditioned layer norms; the ability to extrapolate beyond the training horizon, or to variable final times T, is not systematically assessed.
- Continuous-time fidelity: Although evaluations are “continuous-in-time,” there is no verification that learned operators approximately satisfy the semigroup property S(t+s, a) ≈ S(t, S(s, a)) or that rollouts are stable over long times.
- Autoregressive stability: No systematic study of error accumulation, stability regimes, or drift under autoregressive rollouts vs. direct prediction, especially for long horizons or chaotic regimes.
- Physical constraints: The architecture does not enforce conservation laws, divergence-free conditions, or boundary condition compliance; the magnitude and impact of physics violations are unreported.
- Boundary conditions and domains: Pretraining is on [0,1]² (mostly periodic); generalization to complex geometries, non-rectangular domains, curved boundaries, mixed BCs (Neumann/Robin/flux), and spatially varying BCs remains untested.
- Resolution and mesh generalization: It is unclear whether scOT generalizes across spatial resolutions, non-uniform grids, or mesh-based discretizations; explicit cross-resolution experiments are missing.
- Dimensionality scaling: All experiments appear to be 2D; scalability to 3D (memory/compute costs of windowed attention, accuracy, training stability) is not explored.
- Broader PDE coverage: Pretraining only on Euler/NS (fluids) leaves open generalization to fundamentally different PDE classes (e.g., Maxwell, elasticity, Schrödinger, reaction–diffusion, porous media, nonlocal/integral, stochastic PDEs).
- Stiffness and extreme regimes: Robustness in stiff regimes (e.g., combustion chemistry), high-Mach shocks, very high Reynolds, or multi-scale separation is not assessed.
- Noise robustness and real data: All training/evaluation uses clean numerically generated data; robustness to sensor noise, missing data, irregular sampling, and domain shift in real experiments is unaddressed.
- Uncertainty quantification: There is no predictive uncertainty, calibration, or OOD detection; methods to quantify/model uncertainty in operator predictions are absent.
- Interpretability and mechanism: Case studies are qualitative; a systematic, quantitative analysis of which representations transfer across PDEs (e.g., via probing, CKA, or mechanistic interpretability) is missing.
- Time embedding capacity: Time conditioning is linear (α(t), β(t)); the necessity of richer temporal embeddings (e.g., Fourier features, small MLPs, neural ODEs) and their effects on accuracy/stability remain unexplored.
- Cross-task channel alignment: The ad hoc augmentation with constant-one channels to match field dimensionalities may be suboptimal; principled cross-operator tokenization/conditioning strategies are needed.
- Finetuning protocol: There is no ablation on which layers to finetune vs. freeze, learning-rate schedules, or parameter-efficient finetuning (adapters/LoRA) to optimize data/computational efficiency.
- Scaling laws: While empirical scaling is shown, quantitative scaling laws (exponents, data/model sizes vs. error) are not fitted or compared across tasks, limiting predictability of returns to scale.
- Data diversity vs. quantity: The study tests one diversity ablation; a systematic mapping of “which operators and distributions” drive transfer (coverage metrics, diversity measures) is absent.
- Compute and efficiency: Training/inference wall-clock costs, memory footprints, and energy/carbon metrics are not reported, limiting assessments of practicality vs. alternatives.
- PDE parameterization: Generalization across continuous parameter spaces (e.g., viscosity, wave speed, permeability), and systematic extrapolation/interpolation in parameter ranges, is not evaluated.
- Boundary/operator conditioning: Methods to explicitly condition on boundary operators, geometry, and PDE coefficients to enable zero-shot domain/BC transfer are not developed.
- Multi-physics coupling: Only limited forcing/gravity/tracer additions are tested; transfer to tightly coupled multi-physics (e.g., FSI, MHD) is unassessed.
- Time-independent PDEs: Treating elliptic/steady problems as long-time limits is a heuristic; the lack of residual-based or steady-state constraints may limit performance on general steady problems.
- Data generation bias: Pretraining relies on solver-generated data; sensitivity to solver choice, discretization errors, and numerical diffusion/dispersion is not studied.
- Failure modes: The paper reports strong results but does not catalog failure cases (e.g., blow-ups, oscillations, boundary artifacts), making robustness limits opaque.
- Alternative objectives: The work uses L1 data loss; the benefits of physics-informed residual losses, multi-objective losses (e.g., spectral/gradient), or curriculum/self-training are not explored.
- Nonlocal and fractional operators: Generalization to nonlocal kernels or fractional derivatives (common in anomalous transport) is not addressed.
- Security/adversarial robustness: Susceptibility to adversarial or worst-case perturbations of inputs (initial/boundary data, coefficients) is unknown.
- Reproducibility of downstream splits: Although datasets are open, details ensuring strict no-leakage between pretraining and downstream distributions (e.g., seeds, generator parameters) should be formalized and audited.
- Formalizing transfer: The paper poses but does not resolve the core scientific question: under what structural similarities between PDE operators (e.g., shared invariances, spectral profiles, local interaction patterns) does cross-PDE transfer emerge, and how can this be predicted or optimized?
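Several of the gaps above lend themselves to concrete diagnostics. The all2all training objective discussed in the semigroup and bias/variance bullets can be sketched as a pair-expansion step; the function name, pairing convention (strictly t_i < t_j), and data layout below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def all2all_pairs(trajectory: np.ndarray, times: np.ndarray):
    """Expand one trajectory of n snapshots into (input, target, lead-time)
    training triples for every pair t_i < t_j, as justified by the semigroup
    property u(t_j) = S(t_j - t_i) u(t_i). Yields n*(n-1)/2 triples, which is
    the quadratic reuse (and potential correlation) the gaps list refers to."""
    n = len(times)
    triples = []
    for i in range(n):
        for j in range(i + 1, n):
            triples.append((trajectory[i], trajectory[j], times[j] - times[i]))
    return triples

# Toy trajectory: 11 snapshots of a scalar field on a 16x16 grid.
traj = np.random.rand(11, 16, 16)
t = np.linspace(0.0, 1.0, 11)
pairs = all2all_pairs(traj, t)
```

Note how 11 snapshots already produce 55 correlated training pairs, which is exactly why the "effective" sample size may be overstated.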
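The semigroup-consistency check asked for in the continuous-time fidelity bullet can be run post hoc on any learned operator. As a stand-in for a trained model, the sketch below uses an exact heat-equation propagator on periodic Fourier modes (an assumption, chosen so the defect is near zero by construction); for Poseidon one would substitute the model's forward call for `heat_propagator`.

```python
import numpy as np

def heat_propagator(t: float, u: np.ndarray, nu: float = 0.1) -> np.ndarray:
    """Exact solution operator S(t) for u_t = nu * u_xx on a periodic grid:
    each Fourier mode with wavenumber k decays as exp(-nu * k^2 * t)."""
    n = u.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / n)
    return np.fft.ifft(np.exp(-nu * k**2 * t) * np.fft.fft(u)).real

def semigroup_defect(S, u0: np.ndarray, t: float, s: float) -> float:
    """Relative defect ||S(t+s, u0) - S(t, S(s, u0))|| / ||S(t+s, u0)||.
    A learned operator that respects the semigroup keeps this small."""
    direct = S(t + s, u0)
    composed = S(t, S(s, u0))
    return float(np.linalg.norm(direct - composed) / np.linalg.norm(direct))

u0 = np.sin(2 * np.pi * np.linspace(0.0, 1.0, 64, endpoint=False))
defect = semigroup_defect(heat_propagator, u0, t=0.3, s=0.2)
```

Evaluating this defect over a grid of (t, s) pairs would make the paper's implicit semigroup assumption quantitatively checkable.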
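The autoregressive-stability bullet can likewise be probed with a minimal error-accumulation experiment: compose a perturbed one-step operator many times and compare against a single direct evaluation. The toy model below (exact exponential decay plus Gaussian per-step noise mimicking model error) is an assumption for illustration only.

```python
import numpy as np

def rollout_vs_direct(S, u0: np.ndarray, T: float, steps: int,
                      noise: float, rng) -> float:
    """Apply S over steps increments of dt = T/steps, perturbing each step to
    mimic per-step model error, and return the relative error of this
    autoregressive rollout against the direct prediction S(T, u0)."""
    dt = T / steps
    u = u0.copy()
    for _ in range(steps):
        u = S(dt, u) + noise * rng.standard_normal(u.shape)
    direct = S(T, u0)
    return float(np.linalg.norm(u - direct) / np.linalg.norm(direct))

# Exact semigroup for the scalar ODE u' = -0.5 u, vectorized over a state.
decay = lambda t, u: np.exp(-0.5 * t) * u
rng = np.random.default_rng(0)
u0 = np.ones(64)
err_rollout = rollout_vs_direct(decay, u0, T=1.0, steps=20, noise=1e-3, rng=rng)
```

Sweeping `steps` and `noise` (or the lead time T) would map out the stability regimes and drift the gaps list says are currently uncharacterized.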
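Finally, the time-embedding-capacity bullet concerns the linear time-conditioned layer norm with α(t), β(t). A minimal sketch of that mechanism follows; the parameter names, shapes, and default initialization are assumptions, and richer embeddings (Fourier features, small MLPs) would replace the affine maps in `__call__`.

```python
import numpy as np

class TimeConditionedLayerNorm:
    """Layer norm whose scale and shift are affine in the normalized lead
    time t in [0, 1]: gamma(t) = g0 + g1*t and beta(t) = b0 + b1*t."""
    def __init__(self, dim: int, eps: float = 1e-5):
        self.g0, self.g1 = np.ones(dim), np.zeros(dim)
        self.b0, self.b1 = np.zeros(dim), np.zeros(dim)
        self.eps = eps

    def __call__(self, x: np.ndarray, t: float) -> np.ndarray:
        # Normalize over the channel dimension, then modulate by lead time.
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        xhat = (x - mu) / np.sqrt(var + self.eps)
        return (self.g0 + self.g1 * t) * xhat + (self.b0 + self.b1 * t)

ln = TimeConditionedLayerNorm(dim=32)
y = ln(np.random.rand(4, 32), t=0.5)
```

Because the time dependence is confined to two affine maps per layer, ablating them against Fourier-feature or MLP embeddings would directly address the open question about temporal capacity.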